W02 - Chi-Square Test of Independence
Instructions
Run a Chi-Square Test of Independence.
You will need to analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).
Note: although it is possible to run large Chi-Square tables (e.g. 5 x 5, 4 x 6, etc.), the test is really only interpretable when your response variable has only 2 levels (see Graphing decisions flow chart in Bivariate Graphing chapter).
Dataset
- National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
- CSV file
- File description
%matplotlib inline
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('../datasets/NESARC/nesarc_pds.csv', low_memory=False)
Test of independence - $\chi^2$
I am testing whether race (explanatory variable, column ETHRACE2A) is related to having had problems with a neighbor, friend or relative in the last twelve months (response variable, column S1Q239).
df1 = pandas.DataFrame()
problems = {1:'Y',2:'N'}
df1['PROBLEMS'] = data['S1Q239'].replace(9, numpy.nan).map(problems)
races = {1:'white',2:'black',3:'indian(US)',4:'asian',5:'latino'}
df1['RACE'] = data['ETHRACE2A'].map(races)
df1 = df1.dropna()
#Contingency table, observations
ct1 = pandas.crosstab(df1['PROBLEMS'],df1['RACE'])
print(ct1)
#Percentages
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print (colpct)
NOTE: the percentages are computed by column (the explanatory variable) because I want to know how the rate of problems varies across groups.
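As a cross-check on the column percentages, the sketch below compares the manual division by column totals with the `normalize='columns'` option of `pandas.crosstab`. The data frame `toy` and its column names `GROUP` and `RESPONSE` are hypothetical and only stand in for the NESARC columns.

```python
import pandas

# Hypothetical toy data (not NESARC): a group label and a yes/no response.
toy = pandas.DataFrame({
    'GROUP':    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'RESPONSE': ['Y', 'N', 'N', 'N', 'Y', 'Y', 'N', 'N', 'N'],
})

# Counts: rows are the response, columns are the explanatory variable.
counts = pandas.crosstab(toy['RESPONSE'], toy['GROUP'])

# Manual column proportions: divide each column by its own total.
manual_pct = counts / counts.sum(axis=0)

# Same result in one call: normalize='columns' divides by column totals.
builtin_pct = pandas.crosstab(toy['RESPONSE'], toy['GROUP'], normalize='columns')

print(manual_pct)
print(builtin_pct)
```

Each column sums to 1, so the "Y" row can be read directly as the rate of problems within each group.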
# chi-square test
cs1 = scipy.stats.chi2_contingency(ct1)
print("X² Value = {0}".format(cs1[0]))
print("p-value = {0}".format(cs1[1]))
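To make the output of `scipy.stats.chi2_contingency` less opaque, here is a minimal sketch on a hypothetical 2 x 3 table (the counts are invented, not taken from NESARC). It unpacks all four returned values and recomputes the expected counts, the degrees of freedom, and the Pearson statistic by hand.

```python
import numpy
import scipy.stats

# Hypothetical 2 x 3 table of counts (rows: response Y/N, columns: three groups).
observed = numpy.array([[30, 20, 10],
                        [70, 80, 90]])

# The function returns the statistic, the p-value, the degrees of freedom,
# and the table of expected counts under independence.
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
manual_expected = numpy.outer(row_totals, col_totals) / observed.sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
manual_dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells.
manual_chi2 = ((observed - manual_expected) ** 2 / manual_expected).sum()

print("chi2 = {0}, manual = {1}".format(chi2, manual_chi2))
print("dof = {0}, p-value = {1}".format(dof, p_value))
```

Note that with its default `correction=True`, `chi2_contingency` applies Yates' continuity correction when the table has one degree of freedom (a 2 x 2 table), so the hand computation above matches only for larger tables.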
The $\chi^2$ test of independence gives a p-value less than 0.05, so race and having problems are significantly associated.
Not all problem rates are equal across race categories.
Post hoc test - Bonferroni Adjustment
- Number of categories: 5
- Number of comparisons: $\binom{5}{2} = 10$
- Adjusted significance level: $\frac{\alpha}{\text{number of comparisons}} = \frac{0.05}{10} = 0.005$
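The arithmetic in the list above can be checked with the standard library alone. In this sketch the names `groups`, `alpha`, and `adjusted_alpha` are illustrative, and 0.05 is assumed as the family-wise significance level.

```python
from itertools import combinations
from math import comb

groups = ['white', 'black', 'indian(US)', 'asian', 'latino']
alpha = 0.05  # assumed family-wise significance level

# Every unordered pair of groups is one post hoc comparison.
pairs = list(combinations(groups, 2))
n_comparisons = len(pairs)

# Bonferroni: divide the significance level by the number of comparisons.
adjusted_alpha = alpha / n_comparisons

print("comparisons: {0} (binomial coefficient: {1})".format(n_comparisons, comb(len(groups), 2)))
print("adjusted threshold: {0}".format(adjusted_alpha))
```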
from itertools import combinations
comparison_pairs = list(combinations(races.values(),2))
ap_val = 0.05/len(comparison_pairs) # Bonferroni-adjusted significance level
for (v1,v2) in comparison_pairs:
    # Keep only the two races being compared and rerun the test on the 2 x 2 table
    df2 = df1[(df1['RACE']==v1) | (df1['RACE']==v2)]
    ct2 = pandas.crosstab(df2['PROBLEMS'],df2['RACE'])
    cs2 = scipy.stats.chi2_contingency(ct2)
    print("PAIR: {0}-{1}".format(v1,v2))
    print("\t p-value: {0}".format(cs2[1]))
    print("\t Reject: {0}".format(cs2[1]<ap_val))
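The following matrix marks with an "x" each pair of races for which the null hypothesis $H_0$ can be rejected, i.e. the rate of problems differs significantly between the two. A "-" marks pairs for which $H_0$ cannot be rejected.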
|   | W | B | I | A | L |
|---|---|---|---|---|---|
| W |   |   |   |   |   |
| B | - |   |   |   |   |
| I | x | x |   |   |   |
| A | x | x | x |   |   |
| L | x | - | x | - |   |
Each letter is the first letter of a race category: W = white, B = black, I = indian(US), A = asian, L = latino.
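A matrix like the one above can also be built programmatically instead of by hand. The following is a minimal sketch, not the code used for this post: the helper `pairwise_rejection_matrix` and the toy data are hypothetical, and the function simply repeats the pairwise chi-square tests with a Bonferroni-adjusted threshold.

```python
from itertools import combinations

import pandas
import scipy.stats


def pairwise_rejection_matrix(df, group_col, response_col, alpha=0.05):
    """Return a lower-triangular matrix with 'x' where the pairwise chi-square
    test is significant at the Bonferroni-adjusted level, '-' otherwise."""
    groups = list(df[group_col].unique())
    pairs = list(combinations(groups, 2))
    adjusted_alpha = alpha / len(pairs)
    matrix = pandas.DataFrame('', index=groups, columns=groups)
    for g1, g2 in pairs:
        # Restrict the data to the two groups and test the resulting table.
        subset = df[df[group_col].isin([g1, g2])]
        table = pandas.crosstab(subset[response_col], subset[group_col])
        p_value = scipy.stats.chi2_contingency(table)[1]
        # g2 comes later in `groups`, so cell (g2, g1) lies below the diagonal.
        matrix.loc[g2, g1] = 'x' if p_value < adjusted_alpha else '-'
    return matrix


# Hypothetical toy data: groups 'a' and 'b' have similar rates, 'c' differs.
toy = pandas.DataFrame({
    'GROUP': ['a'] * 100 + ['b'] * 100 + ['c'] * 100,
    'RESPONSE': (['Y'] * 20 + ['N'] * 80
                 + ['Y'] * 22 + ['N'] * 78
                 + ['Y'] * 70 + ['N'] * 30),
})
result = pairwise_rejection_matrix(toy, 'GROUP', 'RESPONSE')
print(result)
```

With the notebook's data, a call such as `pairwise_rejection_matrix(df1, 'RACE', 'PROBLEMS')` should follow the same logic, although the row order would depend on the order in which the races first appear in `df1`.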