samucoder

Samuel M.H. 's technological blog

Notebook

W01 - Assignment: Running an analysis of variance¶

Author: Samuel M.H. <samuel.mh@gmail.com> Date: 17-01-2016

Assignment

The first assignment deals with analysis of variance. Analysis of variance assesses whether the means of two or more groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means (quantitative variables) of groups (categorical variables). The null hypothesis is that there is no difference in the mean of the quantitative variable across groups (categorical variable), while the alternative is that there is a difference. Note that if your research question does not include one quantitative variable, you can use one from your data set just to get some practice with the tool. If your research question does not include a categorical variable, you can categorize one that is quantitative.

Instructions

Run an analysis of variance.

You will need to analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).

What to submit

Following completion of the steps described above, create a blog entry where you submit syntax used to run an ANOVA (copied and pasted from your program) along with corresponding output and a few sentences of interpretation.

Dataset

In [1]:
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi



Tests

In [2]:
# Working with adult peopleb
data = data[(data['AGE']>=18) & (data['AGE']<=65)]


Weight VS RACE

In [3]:
df1 = pandas.DataFrame()
df1['POUNDS'] = data['S1Q24LB'].replace(999, numpy.nan)
df1['RACE'] = data['ETHRACE2A']
df1 = df1.dropna()

# ANOVA
model1 = smf.ols(formula='POUNDS ~ C(RACE)', data=df1).fit()
print(model1.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                 POUNDS   R-squared:                       0.029
Method:                 Least Squares   F-statistic:                     256.4
Date:                Sun, 17 Jan 2016   Prob (F-statistic):          1.88e-217
Time:                        23:14:43   Log-Likelihood:            -1.7596e+05
No. Observations:               34185   AIC:                         3.519e+05
Df Residuals:                   34180   BIC:                         3.520e+05
Df Model:                           4
Covariance Type:            nonrobust
================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept      173.3740      0.306    565.957      0.000       172.774   173.974
C(RACE)[T.2]     8.5190      0.594     14.338      0.000         7.355     9.684
C(RACE)[T.3]     5.1839      1.756      2.951      0.003         1.741     8.627
C(RACE)[T.4]   -27.8177      1.254    -22.189      0.000       -30.275   -25.360
C(RACE)[T.5]    -7.2965      0.576    -12.677      0.000        -8.425    -6.168
==============================================================================
Omnibus:                     5804.766   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12118.254
Skew:                           1.016   Prob(JB):                         0.00
Kurtosis:                       5.093   Cond. No.                         8.19
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



We see:

Prob (F-statistic): ~0


That means that all the means are not the same with a confidence close to 100%.

In [4]:
# post hoc paired comparisons
wr1 = multi.MultiComparison(df1['POUNDS'], df1['RACE'])
wr1_res = wr1.tukeyhsd()
print(wr1_res.summary())

Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
1      2     8.519    6.8983  10.1398   True
1      3     5.1839   0.3925   9.9753   True
1      4    -27.8177 -31.2375 -24.3978  True
1      5    -7.2965  -8.8666  -5.7265   True
2      3    -3.3352  -8.2533   1.583   False
2      4    -36.3367 -39.9319 -32.7415  True
2      5    -15.8156 -17.7379 -13.8933  True
3      4    -33.0015 -38.7684 -27.2347  True
3      5    -12.4804 -17.3821 -7.5788   True
4      5    20.5211  16.9485  24.0937   True
-----------------------------------------------



In the post hoc paired comparisons, the HSD test says there are statistical differences between races in terms of wheight, except when comparing blacks against american indians/Alaska natives.

B.M.I. (Body Mass Index) VS Race

In [5]:
# BMI formula: http://www.epic4health.com/bmiformula.html
df2 = pandas.DataFrame()
df2['POUNDS'] = data['S1Q24LB'].replace(999, numpy.nan)
df2['RACE'] = data['ETHRACE2A']
df2['INCHES'] = (
(data['S1Q24FT'].replace(99, numpy.nan)*12) +
data['S1Q24IN'].replace(99, numpy.nan)
)
df2['BMI'] = (df2['POUNDS']/(df2['INCHES']**2)*703).dropna(axis=0)
df2 = df2.dropna()

# ANOVA
model2 = smf.ols(formula='BMI ~ C(RACE)', data=df2).fit()
print(model2.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                    BMI   R-squared:                       0.027
Method:                 Least Squares   F-statistic:                     236.4
Date:                Sun, 17 Jan 2016   Prob (F-statistic):          1.44e-200
Time:                        23:14:43   Log-Likelihood:            -1.0806e+05
No. Observations:               34141   AIC:                         2.161e+05
Df Residuals:                   34136   BIC:                         2.162e+05
Df Model:                           4
Covariance Type:            nonrobust
================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       26.6246      0.042    630.786      0.000        26.542    26.707
C(RACE)[T.2]     1.9835      0.082     24.224      0.000         1.823     2.144
C(RACE)[T.3]     0.9798      0.242      4.049      0.000         0.506     1.454
C(RACE)[T.4]    -2.6754      0.173    -15.480      0.000        -3.014    -2.337
C(RACE)[T.5]     0.5422      0.079      6.830      0.000         0.387     0.698
==============================================================================
Omnibus:                     8374.623   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            24619.619
Skew:                           1.279   Prob(JB):                         0.00
Kurtosis:                       6.280   Cond. No.                         8.18
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



We see:

Prob (F-statistic): ~0


That means that the test has a statistical significance.

In [6]:
wr2 = multi.MultiComparison(df2['BMI'], df2['RACE'])
wr2_res = wr2.tukeyhsd()
print(wr2_res.summary())

Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
1      2     1.9835   1.7601  2.2069  True
1      3     0.9798   0.3198  1.6398  True
1      4    -2.6754  -3.1469 -2.2039  True
1      5     0.5422   0.3257  0.7588  True
2      3    -1.0037  -1.6812 -0.3262  True
2      4    -4.6589  -5.1545 -4.1633  True
2      5    -1.4413  -1.7064 -1.1762  True
3      4    -3.6552  -4.4498 -2.8606  True
3      5    -0.4376  -1.1128  0.2377 False
4      5     3.2176   2.725   3.7102  True
---------------------------------------------



In the post hoc paired comparisons, we see that the null hypothesis is rejected in all cases but one. Groups 3 and 5 are not very different (american indian - hispanos) in terms of Body Mass Index. That makes sense because they were the original inhabitants of the continent (pre-colonization era).

Alcohol: Quantity VS Type

In [9]:
# S2AQ4E LARGEST NUMBER OF COOLERS CONSUMED ON DAYS WHEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ4H TYPE OF COOLERS USUALLY CONSUMED IN LAST 12 MONTHS
df3 = pandas.DataFrame()
df3['QUANTITY'] =  data['S2AQ4E'].replace(' ', numpy.nan).replace('99',numpy.nan).convert_objects(convert_numeric=True)
df3['TYPE'] = data['S2AQ4H'].replace(' ', numpy.nan).replace('9',numpy.nan).convert_objects(convert_numeric=True)
df3 = df3.dropna(axis=0)

# ANOVA
model3 = smf.ols(formula='QUANTITY ~ C(TYPE)', data=df3).fit()
print(model3.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:               QUANTITY   R-squared:                       0.013
Method:                 Least Squares   F-statistic:                     34.44
Date:                Sun, 17 Jan 2016   Prob (F-statistic):           4.21e-22
Time:                        23:45:28   Log-Likelihood:                -18300.
No. Observations:                7568   AIC:                         3.661e+04
Df Residuals:                    7564   BIC:                         3.664e+04
Df Model:                           3
Covariance Type:            nonrobust
==================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          2.2146      0.050     44.583      0.000         2.117     2.312
C(TYPE)[T.2.0]     0.5129      0.069      7.470      0.000         0.378     0.648
C(TYPE)[T.3.0]     0.8952      0.115      7.811      0.000         0.671     1.120
C(TYPE)[T.4.0]     0.7687      0.122      6.325      0.000         0.530     1.007
==============================================================================
Omnibus:                    11607.693   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         15804723.726
Skew:                           9.176   Prob(JB):                         0.00
Kurtosis:                     226.123   Cond. No.                         4.86
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



Again, the p-value is very small, the statistical test is significant.

In [8]:
wr3 = multi.MultiComparison(df3['QUANTITY'], df3['TYPE'])
wr3_res = wr3.tukeyhsd()
print(wr3_res.summary())

Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
1.0    2.0    0.5129   0.3365 0.6893  True
1.0    3.0    0.8952   0.6007 1.1897  True
1.0    4.0    0.7687   0.4564 1.081   True
2.0    3.0    0.3823   0.0903 0.6743  True
2.0    4.0    0.2558  -0.0541 0.5657 False
3.0    4.0   -0.1265  -0.5159 0.2629 False
--------------------------------------------



The post hoc analysis indicates that there is no statistical difference between liquor-based coolers and cocktails/mixed drinks, and, malt-based coolers and cocktails when people drink in large quantities. My guess is the calimocho hasn't reached the States yet or they just ignore wine when getting drunk ;) .