Test a Multiple Regression Model
Author: Samuel M.H. <samuel.mh@gmail.com>
Date: 06-03-2016
Instructions
This week's assignment is to test a multiple regression model.
Data preparation for this assignment:
- If your response variable is categorical, you will need to identify a quantitative variable in the data set that you can use as a response variable for this assignment. Variables with response scales with 4-5 values that represent a change in magnitude (eg, "strongly disagree to strongly agree", "never to often") can be considered quantitative for the assignment.
The assignment:
Write a blog entry that summarize in a few sentences
- what you found in your multiple regression analysis. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary.
- Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable.
- Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables);
- generate the following regression diagnostic plots:
- q-q plot
- standardized residuals for all observations
- leverage plot
- Write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.
What to Submit
Submit the URL for your blog entry. The blog entry should include:
- the summary of your results that addresses parts 1-4 of the assignment,
- the output from your multiple regression model, and
- the regression diagnostic plots.
Intro
This week I will try to explain the average daily volume of ethanol consumed per person in past year with a multiple regression model.
Dataset
- National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
- CSV file
- File description
Variables
Response:
- ETOTLCA2 -> ETHANOL: average daily volume of ethanol consumed in past year (ounces).
Explanatory:
- AGE -> AGE: age (years).
- S1Q24LB -> WEIGHT: weight (pounds).
- NUMPERS -> HOUSE_PEOPLE: number of persons in household.
- S1Q4A -> MARRAIGE: age at first marriage (years).
- S1Q8D -> WORK: age when first worked full time, 30+ hours a week (years).
- S1Q12A -> INCOME: total household income in last 12 months (dolars).
All used variables are quantitative.
%pylab inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
pylab.rcParams['figure.figsize'] = (15, 8)
Data
# Load data
data = pd.read_csv('../datasets/NESARC/nesarc_pds.csv', usecols=['ETOTLCA2','AGE','S1Q24LB','NUMPERS','S1Q4A','S1Q8D','S1Q12A'])
# Custom dataframe
df = pd.DataFrame()
# Response variable
df['ETHANOL'] = data['ETOTLCA2'].replace(' ',np.NaN).astype(float)
# Explanatory variables
df['AGE'] = data['AGE'].replace(' ',np.NaN).replace('98',np.NaN).astype(float)
df['WEIGHT'] = data['S1Q24LB'].replace(' ',np.NaN).replace('999',np.NaN).astype(float)
df['HOUSE_PEOPLE'] = data['NUMPERS'].replace(' ',np.NaN).astype(float)
df['MARRIAGE'] = data['S1Q4A'].replace(' ',np.NaN).replace('99',np.NaN).astype(float)
df['WORK'] = data['S1Q8D'].replace(' ',np.NaN).replace('99',np.NaN).replace('0',np.NaN).astype(float)
df['INCOME'] = data['S1Q12A'].replace(' ',np.NaN).astype(float)
df = df.dropna()
df.describe()
Centered means
# Center means
COLS = ['AGE','WEIGHT','HOUSE_PEOPLE','MARRIAGE','WORK','INCOME']
for c in COLS:
df[c] = df[c]-df[c].mean()
df.describe()
NOTE: in this assignment I won't filter the outliers in the first step.
Multiple regression analysis
Model
model = smf.ols('ETHANOL ~ AGE + WEIGHT + HOUSE_PEOPLE + MARRIAGE + WORK + INCOME', data=df).fit()
print(model.summary())
The model responds to the following formula:
$$ \require{cancel} \widehat{ETHANOL} = 0.4950 - 0.0016*AGE + 0.0014*WEIGHT - 0.0526*HOUSE\_PEOPLE + 0.0051*MARRIAGE - 0.0173*WORK + \cancel{INCOME} $$
I have discarded the $INCOME$ explanatory variable as its p-value is 0.362 so the confidence for its coeficient is lower than the required 95%. It is possible to see the zero value falls into the coeficient margins.
The other coefficients are OK since their p-values are below 0.05. The most explanatory variable is HOUSE_PEOPLE followed by WORK, both with negative correlations with the result variable. The other variables affect less the predicted result by an order of magnitude.
The intercept means that the average person (if such person exists) drinks 0.5 oz. of ethanol a day.
The model is rubbish as it can only explain the 1,1% of the variance (R-squared value).
NOTE: I cannot identify the reason for the second warning. It may come from the data selection, but the explanatory variables seem not to be correlated (at first sight).
q-q plot
qq_plot = sm.qqplot(model.resid, line='r')
As seen in the plot, the residuals do not follow a normal distribution, they differ from the red line at both ends. This indicates the model is not correctly specified. If I could capture other relations (adding more explanatory variables) I could improve the variability explained and get the noise to look like a random signal (residuals would shape like a gaussian).
Standardized residuals
stdres=pd.DataFrame(model.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
The plot with the standardized residual shows that there are lot of values that fall more than 2 standard deviations away the mean.
Lets see with a histogram the shape of the residual distribution.
sns.distplot(model.resid_pearson, kde=False, bins=range(-5,25), axlabel='Deviation', label='Count')
The shape resembles a long tail rather than a gaussian. This means the errors are not randomly distributed. We have extreme outliers and the level of error is unacceptable, a poor fit (1.1% of variance captured).
Regression Plots
for c in COLS:
regress = sm.graphics.plot_regress_exog(model, c )
regress.show()
These graphs confirm the model cannot explain the ethanol ingest.
When the bigger residuals are closer to the mean and the shape resembles a gaussian, its time to admit the problem is far more complex (at this moment is randomness) and there are lurking variables involved which I should find to get better results.
For the primary explanatory variable HOUSE_PEOPLE, it seems the more people there are in the house, the lesser the predictiong error (residual) is, but when looking at the partial regression plot, all is messed up with that random pattern :( .
Influence plot
influence =sm.graphics.influence_plot(model,size=2)
There are a lot of extreme outliers influencing the model, this is the final evidence that there are confounding variables an relations not found.
I don't find viable to discard all the outliers in order to regularize the model. One good thing is that the observations with the biggest impact are not outliers.
No comments:
Post a Comment