Samuel M.H. 's technological blog

Monday, January 25, 2016

W02 - Chi-Square Test of Independence



Run a Chi-Square Test of Independence.

You will need to analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).

Note: although it is possible to run large Chi-Square tables (e.g. 5 x 5, 4 x 6, etc.), the test is really only interpretable when your response variable has only 2 levels (see Graphing decisions flow chart in Bivariate Graphing chapter).


In [2]:
%matplotlib inline

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('../datasets/NESARC/nesarc_pds.csv', low_memory=False)

Test of independence - $\chi^2$

I am testing whether race (explanatory variable, ETHRACE2A) is related to having had problems with a neighbor, friend, or relative in the last twelve months (response variable, S1Q239).

In [3]:
df1 = pandas.DataFrame()
problems = {1:'Y',2:'N'}
df1['PROBLEMS'] = data['S1Q239'].replace(9, numpy.nan).map(problems)
races = {1:'white',2:'black',3:'indian(US)',4:'asian',5:'latino'}
df1['RACE'] = data['ETHRACE2A'].map(races)
df1 = df1.dropna()
In [4]:
#Contingency table, observations
ct1 = pandas.crosstab(df1['PROBLEMS'],df1['RACE'])
print(ct1)
RACE      asian  black  indian(US)  latino  white
N          1277   7711         618    7846  22855
Y            46    447          79     399   1449

In [5]:
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print (colpct)
RACE         asian     black  indian(US)    latino    white
N         0.965231  0.945207    0.886657  0.951607  0.94038
Y         0.034769  0.054793    0.113343  0.048393  0.05962

NOTE: percentages are computed by column (the explanatory variable) because I want to know how the problem rate varies across the different groups.
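The column-wise normalization can be checked on a small toy table, independent of the NESARC data (the counts below are made up for illustration):

```python
import pandas

# Toy observations: two response levels (rows) across three groups (columns)
ct = pandas.DataFrame({'g1': [90, 10], 'g2': [80, 20], 'g3': [70, 30]},
                      index=['N', 'Y'])

# Divide each column by its column sum, so every column adds up to 1:
# each entry becomes the share of that response level within the group.
colpct = ct / ct.sum(axis=0)
print(colpct)
```

Dividing a DataFrame by a Series broadcasts over columns, which is exactly the per-group normalization wanted here.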

In [6]:
# chi-square test
cs1 = scipy.stats.chi2_contingency(ct1)
print("X² Value = {0}".format(cs1[0]))
print("p-value = {0}".format(cs1[1]))
X² Value = 68.8411951763
p-value = 3.98639461627e-14

The $\chi^2$ test of independence gives a p-value less than 0.05, so race and having had problems are significantly associated.

Not all problem rates are equal across race categories.
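Note that `scipy.stats.chi2_contingency` returns four values: the test statistic, the p-value, the degrees of freedom, and the table of expected counts under the independence hypothesis (the code above only uses the first two). A minimal sketch on a toy 2x2 table (made-up counts):

```python
import scipy.stats

table = [[90, 60], [10, 40]]  # toy 2x2 contingency table
stat, p, dof, expected = scipy.stats.chi2_contingency(table)

# dof = (rows - 1) * (cols - 1) = 1 for a 2x2 table
print(dof)
# expected[i][j] = row_total * col_total / grand_total
print(expected)
```

Comparing the observed table with `expected` shows where the association comes from.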

Post hoc test - Bonferroni Adjustment

  • Number of categories: 5
  • Number of comparisons: $\binom{5}{2} = 10$
  • Adjusted significance level: $\frac{\alpha}{\text{number of comparisons}} = \frac{0.05}{10} = 0.005$
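The number of pairwise comparisons can be computed instead of counted by hand; a sketch using `scipy.special.comb`:

```python
from scipy.special import comb

k = 5                       # number of race categories
n_comparisons = comb(k, 2)  # 5 choose 2 = 10 unordered pairs
adjusted_alpha = 0.05 / n_comparisons
print(n_comparisons, adjusted_alpha)
```

This mirrors the `len(comparison_pairs)` computation in the cell below.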
In [37]:
from itertools import combinations
comparison_pairs = list(combinations(races.values(),2))
ap_val = 0.05/len(comparison_pairs) #Adjusted p-value

for (v1,v2) in comparison_pairs:
    df2 = df1[(df1['RACE']==v1) | (df1['RACE']==v2)]
    ct2 = pandas.crosstab(df2['PROBLEMS'],df2['RACE'])
    cs2 = scipy.stats.chi2_contingency(ct2)
    print("PAIR: {0}-{1}".format(v1,v2))
    print("\t p-value: {0}".format(cs2[1]))
    print("\t Reject: {0}".format(cs2[1]<ap_val))
PAIR: white-black
  p-value: 0.113799397668
  Reject: False
PAIR: white-indian(US)
  p-value: 8.5313833522e-09
  Reject: True
PAIR: white-asian
  p-value: 0.000219535658413
  Reject: True
PAIR: white-latino
  p-value: 0.000157439063204
  Reject: True
PAIR: black-indian(US)
  p-value: 5.88958554652e-10
  Reject: True
PAIR: black-asian
  p-value: 0.00291931342441
  Reject: True
PAIR: black-latino
  p-value: 0.0691126102707
  Reject: False
PAIR: indian(US)-asian
  p-value: 6.39541615117e-12
  Reject: True
PAIR: indian(US)-latino
  p-value: 4.7511807716e-13
  Reject: True
PAIR: asian-latino
  p-value: 0.0345111728606
  Reject: False

The following matrix marks with an "x" the pairs for which the null hypothesis $H_0$ is rejected, i.e. the pairs for which the variables are not independent.

  W B I A
B -
I x x
A x x x
L x - x -

Each letter is the first letter of a race category, e.g. W → white, A → asian.
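The matrix above was filled in by hand; it could also be built from the loop's results. A sketch, assuming the pairwise rejection flags are collected in a dict (values copied from the output above):

```python
import pandas

# Rejection flags from the post hoc loop (True = H0 rejected)
reject = {('white','black'): False, ('white','indian(US)'): True,
          ('white','asian'): True, ('white','latino'): True,
          ('black','indian(US)'): True, ('black','asian'): True,
          ('black','latino'): False, ('indian(US)','asian'): True,
          ('indian(US)','latino'): True, ('asian','latino'): False}

races = ['white', 'black', 'indian(US)', 'asian', 'latino']
matrix = pandas.DataFrame('', index=races, columns=races)
for (a, b), r in reject.items():
    # Fill both halves so the matrix is symmetric
    matrix.loc[b, a] = matrix.loc[a, b] = 'x' if r else '-'
print(matrix)
```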


Copyright © Samuel M.H. All rights reserved.