W01 - Assignment: Running an analysis of variance
Author: Samuel M.H. <samuel.mh@gmail.com>
Date: 17-01-2016
Assignment
The first assignment deals with analysis of variance. Analysis of variance assesses whether the means of two or more groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means (quantitative variables) of groups (categorical variables). The null hypothesis is that there is no difference in the mean of the quantitative variable across groups (categorical variable), while the alternative is that there is a difference. Note that if your research question does not include one quantitative variable, you can use one from your data set just to get some practice with the tool. If your research question does not include a categorical variable, you can categorize one that is quantitative.
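To make the comparison of group means concrete, here is a minimal sketch of the F statistic that a one-way ANOVA reports, computed by hand in plain Python. The function name `one_way_anova_f` and the toy data are made up for illustration; the actual analysis below relies on statsmodels instead.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups (lists of numbers)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations

    # Between-group sum of squares: distance of each group mean from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical toy data: three groups of three observations each
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(toy_groups)
print(round(f_stat, 2))  # 13.0
```

A large F means the group means are far apart relative to the spread inside each group, which is what leads to a small p-value.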
Instructions
Run an analysis of variance.
You will need to analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).
What to submit
Following completion of the steps described above, create a blog entry where you submit syntax used to run an ANOVA (copied and pasted from your program) along with corresponding output and a few sentences of interpretation.
Dataset
- National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
- CSV file
- File description
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
data = pandas.read_csv('../datasets/NESARC/nesarc_pds.csv', low_memory=False)
Tests
# Keep only adult respondents (ages 18 to 65)
data = data[(data['AGE']>=18) & (data['AGE']<=65)]
Weight VS Race
df1 = pandas.DataFrame()
df1['POUNDS'] = data['S1Q24LB'].replace(999, numpy.nan)
df1['RACE'] = data['ETHRACE2A']
df1 = df1.dropna()
# ANOVA
model1 = smf.ols(formula='POUNDS ~ C(RACE)', data=df1).fit()
print(model1.summary())
We see:
Prob (F-statistic): ~0
A p-value this close to zero means we reject the null hypothesis that all races have the same mean weight: at least one group mean differs from the others.
# post hoc paired comparisons
wr1 = multi.MultiComparison(df1['POUNDS'], df1['RACE'])
wr1_res = wr1.tukeyhsd()
print(wr1_res.summary())
In the post hoc paired comparisons, the Tukey HSD test finds statistically significant differences in weight between every pair of races except Blacks versus American Indians/Alaska Natives.
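A quick sketch of why a post hoc procedure such as Tukey HSD is used instead of a series of uncorrected pairwise tests. It assumes five race groups coded 1 to 5, a per-test significance level of 0.05, and independent tests; the independence assumption is a simplification made only for this illustration.

```python
from itertools import combinations

ALPHA = 0.05                      # per-comparison significance level (assumed)
race_codes = [1, 2, 3, 4, 5]      # assumed: five race categories coded 1 to 5

pairs = list(combinations(race_codes, 2))
n_pairs = len(pairs)              # k * (k - 1) / 2 pairwise comparisons

# Chance of at least one false positive across all pairs,
# under the simplifying assumption that the tests are independent
familywise_error = 1 - (1 - ALPHA) ** n_pairs

print(n_pairs, round(familywise_error, 3))  # 10 0.401
```

With ten uncorrected comparisons, the chance of at least one spurious "significant" pair would be roughly 40% under these assumptions, which is the inflation the HSD procedure is designed to control.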
B.M.I. (Body Mass Index) VS Race
# BMI formula: http://www.epic4health.com/bmiformula.html
df2 = pandas.DataFrame()
df2['POUNDS'] = data['S1Q24LB'].replace(999, numpy.nan)
df2['RACE'] = data['ETHRACE2A']
df2['INCHES'] = (
(data['S1Q24FT'].replace(99, numpy.nan)*12) +
data['S1Q24IN'].replace(99, numpy.nan)
)
# Imperial BMI formula: pounds / inches^2 * 703 (rows with missing values are dropped on the next line)
df2['BMI'] = df2['POUNDS'] / (df2['INCHES']**2) * 703
df2 = df2.dropna()
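As a sanity check on the BMI formula used above, here is a small worked example in plain Python. The helper `bmi_from_imperial` and the respondent's measurements are hypothetical and not taken from the dataset.

```python
def bmi_from_imperial(pounds, feet, inches):
    """BMI from weight in pounds and height in feet plus inches (imperial formula)."""
    total_inches = feet * 12 + inches
    return pounds / total_inches ** 2 * 703

# Hypothetical respondent: 5 ft 9 in, 160 lb
example_bmi = bmi_from_imperial(160, 5, 9)
print(round(example_bmi, 1))  # 23.6
```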
# ANOVA
model2 = smf.ols(formula='BMI ~ C(RACE)', data=df2).fit()
print(model2.summary())
We see:
Prob (F-statistic): ~0
That means the test is statistically significant: we reject the null hypothesis that mean BMI is the same across races.
wr2 = multi.MultiComparison(df2['BMI'], df2['RACE'])
wr2_res = wr2.tukeyhsd()
print(wr2_res.summary())
In the post hoc paired comparisons, the null hypothesis is rejected for every pair but one: groups 3 and 5 (American Indians/Alaska Natives and Hispanics/Latinos) do not differ significantly in Body Mass Index. A possible explanation is that both groups descend in part from the original inhabitants of the continent (pre-colonization era), though the test itself says nothing about the cause.
Alcohol: Quantity VS Type
# S2AQ4E LARGEST NUMBER OF COOLERS CONSUMED ON DAYS WHEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ4H TYPE OF COOLERS USUALLY CONSUMED IN LAST 12 MONTHS
df3 = pandas.DataFrame()
# convert_objects() was removed from pandas; to_numeric(errors='coerce') turns blanks into NaN
df3['QUANTITY'] = pandas.to_numeric(data['S2AQ4E'], errors='coerce').replace(99, numpy.nan)
df3['TYPE'] = pandas.to_numeric(data['S2AQ4H'], errors='coerce').replace(9, numpy.nan)
df3 = df3.dropna(axis=0)
# ANOVA
model3 = smf.ols(formula='QUANTITY ~ C(TYPE)', data=df3).fit()
print(model3.summary())
Again, the p-value is very small, so the test is statistically significant.
wr3 = multi.MultiComparison(df3['QUANTITY'], df3['TYPE'])
wr3_res = wr3.tukeyhsd()
print(wr3_res.summary())
The post hoc analysis finds no statistically significant difference in the largest number of coolers consumed between liquor-based coolers and cocktails/mixed drinks, nor between malt-based coolers and cocktails/mixed drinks. My guess is that calimocho hasn't reached the States yet, or that people just ignore wine when getting drunk ;) .