# samucoder

Samuel M.H. 's technological blog


# W02 - Chi-Square Test of Independence

## Instructions

Run a Chi-Square Test of Independence.

You will need to analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).

Note: although it is possible to run large Chi-Square tables (e.g. 5 x 5, 4 x 6, etc.), the test is really only interpretable when your response variable has only 2 levels (see Graphing decisions flow chart in Bivariate Graphing chapter).

## Dataset

In :
%matplotlib inline

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt



## Test of independence - $\chi^2$

I am testing whether race (explanatory variable, column ETHRACE2A) is related to having had problems with a neighbor, friend or relative in the last twelve months (response variable, column S1Q239).

In :
df1 = pandas.DataFrame()
problems = {1:'Y',2:'N'}
df1['PROBLEMS'] = data['S1Q239'].replace(9, numpy.nan).map(problems)
races = {1:'white',2:'black',3:'indian(US)',4:'asian',5:'latino'}
df1['RACE'] = data['ETHRACE2A'].map(races)
df1 = df1.dropna()
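The recoding step above can be tried on a small toy series. This is only a sketch: the values in `raw` are made up for illustration and are not the survey data. It shows that `Series.map` with a dict already yields `NaN` for codes missing from the dict, so the explicit `replace(9, numpy.nan)` mainly documents the intent.

```python
import numpy
import pandas

# Hypothetical raw codes for illustration only: 1 = yes, 2 = no, 9 = unknown.
raw = pandas.Series([1, 2, 9, 2, 1])
problems = {1: 'Y', 2: 'N'}

# Explicit version, as in the cell above: blank out 9 first, then map.
explicit = raw.replace(9, numpy.nan).map(problems)

# Series.map returns NaN for keys that are not in the dict, so 9 becomes NaN anyway.
implicit = raw.map(problems)

# Both versions leave the same rows once missing answers are dropped.
print(explicit.dropna().tolist())
print(implicit.dropna().tolist())
```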

In :
#Contingency table, observations
ct1 = pandas.crosstab(df1['PROBLEMS'],df1['RACE'])
print(ct1)

RACE      asian  black  indian(US)  latino  white
PROBLEMS
N          1277   7711         618    7846  22855
Y            46    447          79     399   1449


In :
#Percentages
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print (colpct)

RACE         asian     black  indian(US)    latino    white
PROBLEMS
N         0.965231  0.945207    0.886657  0.951607  0.94038
Y         0.034769  0.054793    0.113343  0.048393  0.05962



NOTE: the percentages are computed by column (explanatory variable) because I want to know how the problem rate varies across the different groups.
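To make the choice of direction concrete, here is a minimal sketch that rebuilds the table from the counts printed above and normalizes it both ways. The names `ct`, `col_pct` and `row_pct` are introduced here for illustration. Column proportions answer "within each race, what share reported problems?", while row proportions answer a different question: "among those who reported problems, what share belongs to each race?".

```python
import pandas

# Observed counts copied from the contingency table printed above.
ct = pandas.DataFrame(
    {'asian': [1277, 46], 'black': [7711, 447], 'indian(US)': [618, 79],
     'latino': [7846, 399], 'white': [22855, 1449]},
    index=['N', 'Y'])

# Column proportions: within each race, share of N and Y (each column sums to 1).
col_pct = ct / ct.sum(axis=0)

# Row proportions: within each answer, share of each race (each row sums to 1).
row_pct = ct.div(ct.sum(axis=1), axis=0)

print(col_pct.loc['Y'].round(4))
print(row_pct.loc['Y'].round(4))
```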

In :
# chi-square test: the result holds (statistic, p-value, dof, expected counts)
cs1 = scipy.stats.chi2_contingency(ct1)
print("X² Value = {0}".format(cs1[0]))
print("p-value = {0}".format(cs1[1]))

X² Value = 68.8411951763
p-value = 3.98639461627e-14



The $\chi^2$ test of independence gives a p-value below 0.05, so race and having had problems are significantly associated.

Not all problem rates are equal across race categories.
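As a cross-check on the result above, the statistic can be recomputed by hand from the printed counts, using $\chi^2 = \sum \frac{(O - E)^2}{E}$ with expected counts $E = \frac{\text{row total} \times \text{column total}}{n}$. The sketch below assumes only the counts shown in the contingency table; the variable names are illustrative.

```python
import numpy
import scipy.stats

# Observed counts from the contingency table (columns: asian, black, indian(US), latino, white).
observed = numpy.array([
    [1277, 7711, 618, 7846, 22855],  # N
    [46, 447, 79, 399, 1449],        # Y
])

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under independence of race and problems.
expected = row_tot * col_tot / n

# Pearson chi-square statistic, degrees of freedom and upper-tail p-value.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = scipy.stats.chi2.sf(chi2_manual, dof)

# Library result for comparison (no continuity correction is applied when dof > 1).
chi2_lib, p_lib, dof_lib, expected_lib = scipy.stats.chi2_contingency(observed)

print(chi2_manual, p_manual, dof)
print(chi2_lib, p_lib, dof_lib)
```

The manual and library values should agree, since the continuity correction in `chi2_contingency` only applies to tables with one degree of freedom.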

## Post hoc test - Bonferroni Adjustment

• Number of categories: 5
• Number of comparisons: $\binom{5}{2} = 10$
• Adjusted significance level: $\frac{\alpha}{\text{number of comparisons}} = \frac{0.05}{10} = 0.005$
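The arithmetic in the list above can be sketched in a few lines. The family-wise error figure is an added illustration and assumes the ten tests are independent, which is only an approximation here because the pairs share groups.

```python
from itertools import combinations

n_categories = 5
alpha = 0.05

# Number of unordered pairs of categories: C(5, 2).
n_comparisons = len(list(combinations(range(n_categories), 2)))

# Bonferroni: divide the significance level by the number of comparisons.
adjusted_alpha = alpha / n_comparisons

# Chance of at least one false positive without adjustment,
# under the simplifying assumption of independent tests.
fwer_unadjusted = 1 - (1 - alpha) ** n_comparisons

print(n_comparisons, adjusted_alpha, round(fwer_unadjusted, 4))
```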
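
In :
from itertools import combinations

# Bonferroni-adjusted significance level: 0.05 / 10 comparisons
ap_val = 0.05 / 10
comparison_pairs = list(combinations(races.values(), 2))

for (v1, v2) in comparison_pairs:
    # Keep only the rows of the two races being compared
    df2 = df1[(df1['RACE'] == v1) | (df1['RACE'] == v2)]
    ct2 = pandas.crosstab(df2['PROBLEMS'], df2['RACE'])
    # chi2_contingency returns (statistic, p-value, dof, expected counts)
    cs2 = scipy.stats.chi2_contingency(ct2)
    print("PAIR: {0}-{1}".format(v1, v2))
    print("\t p-value: {0}".format(cs2[1]))
    print("\t Reject: {0}".format(cs2[1] < ap_val))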


PAIR: white-black
p-value: 0.113799397668
Reject: False
PAIR: white-indian(US)
p-value: 8.5313833522e-09
Reject: True
PAIR: white-asian
p-value: 0.000219535658413
Reject: True
PAIR: white-latino
p-value: 0.000157439063204
Reject: True
PAIR: black-indian(US)
p-value: 5.88958554652e-10
Reject: True
PAIR: black-asian
p-value: 0.00291931342441
Reject: True
PAIR: black-latino
p-value: 0.0691126102707
Reject: False
PAIR: indian(US)-asian
p-value: 6.39541615117e-12
Reject: True
PAIR: indian(US)-latino
p-value: 4.7511807716e-13
Reject: True
PAIR: asian-latino
p-value: 0.0345111728606
Reject: False



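The following matrix marks with an "x" each pair for which the null hypothesis $H_0$ (independence) is rejected at the adjusted level of 0.005, meaning the problem rates of the two groups differ significantly, and with a "-" each pair for which it cannot be rejected.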

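|   | W | B | I | A | L |
|---|---|---|---|---|---|
| W |   |   |   |   |   |
| B | - |   |   |   |   |
| I | x | x |   |   |   |
| A | x | x | x |   |   |
| L | x | - | x | - |   |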

Each letter is the initial of a race category: W = white, B = black, I = indian(US), A = asian, L = latino.
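The matrix can also be rebuilt programmatically. The following is a sketch under the assumption that only the printed contingency counts are available rather than the original `df1`; the names `ct`, `order`, `labels` and `matrix` are introduced for illustration. Each pairwise test uses the default settings of `chi2_contingency`, which apply Yates' continuity correction to 2 x 2 tables.

```python
from itertools import combinations

import pandas
import scipy.stats

# Observed counts copied from the contingency table printed earlier.
ct = pandas.DataFrame(
    {'asian': [1277, 46], 'black': [7711, 447], 'indian(US)': [618, 79],
     'latino': [7846, 399], 'white': [22855, 1449]},
    index=['N', 'Y'])

alpha_adj = 0.05 / 10  # Bonferroni-adjusted significance level

order = ['white', 'black', 'indian(US)', 'asian', 'latino']
labels = {race: race[0].upper() for race in order}  # W, B, I, A, L
initials = [labels[race] for race in order]

# Empty lower-triangular matrix, filled in pair by pair.
matrix = pandas.DataFrame('', index=initials, columns=initials)

for a, b in combinations(order, 2):
    # 2 x 2 sub-table for this pair; index 1 of the result is the p-value.
    p = scipy.stats.chi2_contingency(ct[[a, b]])[1]
    matrix.loc[labels[b], labels[a]] = 'x' if p < alpha_adj else '-'

print(matrix)
```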