04 Test a Logistic Regression Model
Author: Samuel M.H. <samuel.mh@gmail.com>
Date: 13-03-2016
This week's assignment is to test a logistic regression model.
Data preparation for this assignment:
- If your response variable is categorical with more than two categories, you will need to collapse it down to two categories, or subset your data to select observations from 2 categories.
- If your response variable is quantitative, you will need to bin it into two categories.
The assignment:
Write a blog entry that summarizes in a few sentences:
- what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary.
- Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.
- Discuss whether or not there was evidence of confounding for the association between your primary explanatory and the response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables).
What to Submit
Write a blog entry and submit the URL for your blog. Your blog entry should include
- the summary of your results that addresses parts 1-3 of the assignment,
- the output from your logistic regression model.
Intro
This week I will build a model that describes how probable it is that a person works 35 or more hours a week.
Dataset
- National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
- CSV file
- File description
Variables
Response:
- S1Q7A1 -> WORK35: present situation includes working full time (35+ hours a week). Categorical yes/no.
Explanatory:
- AGE -> AGE: age (years).
- S1Q24LB -> WEIGHT: weight (pounds).
- NUMPERS -> HOUSE_PEOPLE: number of persons in household.
- ETHRACE2A -> RACE: imputed race/ethnicity (5 groups, reference group=1,white).
- SEX -> MALE: gender (2 groups). Used in the confounding analysis.
- S10Q1A63 -> CHANGE_MIND: changes mind about things depending on the people they are with or what they read or saw on TV (2 groups). Used in the confounding analysis.
%pylab inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
pylab.rcParams['figure.figsize'] = (15, 8)
Data
# Load data
data = pd.read_csv('../datasets/NESARC/nesarc_pds.csv', usecols=['S1Q7A1','AGE','S1Q24LB','NUMPERS','ETHRACE2A','SEX','S10Q1A63'])
# Custom dataframe
df = pd.DataFrame()
# Response variable: working full time (35+ hours/week); '2' (no) recoded to 0
df['WORK35'] = data['S1Q7A1'].replace(' ',np.NaN).replace('2','0').astype(float)
# Explanatory variables (blanks and special codes treated as missing)
df['AGE'] = data['AGE'].replace(' ',np.NaN).replace('98',np.NaN).astype(float)
df['WEIGHT'] = data['S1Q24LB'].replace(' ',np.NaN).replace('999',np.NaN).astype(float)
df['HOUSE_PEOPLE'] = data['NUMPERS'].replace(' ',np.NaN).astype(float)
df['RACE'] = data['ETHRACE2A'].replace(' ',np.NaN)
# Gender: '2' (female) recoded to 0, so MALE is a 0/1 indicator
df['MALE'] = data['SEX'].replace(' ',np.NaN).replace('2','0').astype(float)
# Change mind: '9' (unknown) treated as missing, '2' (no) recoded to 0
df['CHANGE_MIND'] = data['S10Q1A63'].replace(' ',np.NaN).replace('9',np.NaN).replace('2','0').astype(float)
df = df.dropna()
Data summaries: counts and descriptions
df['WORK35'].value_counts()
df['RACE'].value_counts()
pd.crosstab(df['WORK35'],df['RACE'])
df[['AGE','WEIGHT','HOUSE_PEOPLE']].describe().round(3)
Centered means
# Center means
for c in ['AGE','WEIGHT','HOUSE_PEOPLE']:
    df[c] = df[c] - df[c].mean()
df[['AGE','WEIGHT','HOUSE_PEOPLE']].describe()
Logit regression model
Model
model = smf.logit(formula='WORK35 ~ AGE + WEIGHT + HOUSE_PEOPLE + C(RACE)',data=df).fit()
print(model.summary())
Odds ratios and p-values
# Odds ratios with 95% confidence intervals (exponentiated coefficients)
conf = model.conf_int()
conf.columns = ['Lower CI','Upper CI']
conf['OR'] = model.params
conf = np.exp(conf)
conf['p-val'] = model.pvalues.round(3)
print(conf)
Looking at the p-values, all the variables are statistically associated with the response variable except RACE=4 (Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino).
All the variables are negatively associated with the likelihood of working 35+ hours a week (relative to being white, the reference category for race), except weight, which is slightly positively associated. Weight could be a consequence of working full time rather than a cause.
Because most of the confidence intervals overlap, it is not possible to say which variable is most strongly associated with the response variable. Nevertheless, these are the intervals that do not overlap (larger means a stronger association):
- RACE=3 > WEIGHT > AGE
- RACE=2 > AGE
The strongest variable seems to be race. In terms of odds ratios, this means that a white person is 1.24 times more likely to work full time than an American Indian person, 1.13 times more likely than a black person, and 1.08 times more likely than a Hispanic person (these figures are the inverses of the fitted odds ratios; see the sketch below). Age and the number of people living in the household are also significant, but their effects are weak.
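These "times more likely" figures can be recovered from the fitted model by inverting the odds ratios of the non-reference race groups. A minimal sketch, assuming the patsy naming convention (C(RACE)[T.k]) that smf.logit uses for the categorical terms:
# Invert the odds ratios of the RACE dummies: how much more likely the
# reference group (white) is to work full time than each other group.
race_terms = [p for p in model.params.index if p.startswith('C(RACE)')]
print((1 / np.exp(model.params[race_terms])).round(2))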
Support
To evaluate whether the model is correctly specified, I should assess its goodness of fit. Since that exceeds the scope of the course, I will instead track the pseudo R² value and try to improve it by adding more explanatory variables.
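As a rough yardstick, statsmodels exposes McFadden's pseudo R² on the fitted Logit results as the prsquared attribute (the same value reported as "Pseudo R-squ." in the summary). A minimal sketch:
# McFadden's pseudo R-squared of the base model; used below as a rough
# benchmark when extra explanatory variables are added.
print('Pseudo R^2 (base model): %.4f' % model.prsquared)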
Improving the model
In this part I will add two categorical variables: whether the person is male, and whether the person is easily influenced by media or other people.
MALE variable
df['MALE'].value_counts()
model2 = smf.logit(formula='WORK35 ~ AGE + WEIGHT + HOUSE_PEOPLE + C(RACE) + MALE',data=df).fit()
print(model2.summary())
conf = model2.conf_int()
conf.columns = ['Lower CI','Upper CI']
conf['OR'] = model2.params
conf = np.exp(conf)
conf['p-val'] = model2.pvalues.round(3)
print(conf)
Now it is possible to see that RACE=2 (black) is no longer statistically associated with the response variable, and the odds ratios for the RACE categories are weaker. This is evidence that gender is a confounding variable, since adding it changes the estimated association between race and the response.
I can also say that a male is about 2 times more likely to have a full-time job.
There is also a gain in the pseudo R² value.
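To make the confounding check explicit, the RACE odds ratios can be compared before and after adjusting for gender. A small sketch, assuming the model and model2 results fitted above:
# Side-by-side RACE odds ratios: a noticeable shift when MALE is added
# is the sign of confounding discussed above.
race_terms = [p for p in model.params.index if p.startswith('C(RACE)')]
comparison = pd.DataFrame({
    'OR without MALE': np.exp(model.params[race_terms]),
    'OR with MALE': np.exp(model2.params[race_terms]),
})
print(comparison.round(3))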
CHANGE_MIND variable
df['CHANGE_MIND'].value_counts()
model3 = smf.logit(formula='WORK35 ~ AGE + WEIGHT + HOUSE_PEOPLE + MALE + CHANGE_MIND',data=df).fit()
print(model3.summary())
conf = model3.conf_int()
conf.columns = ['Lower CI','Upper CI']
conf['OR'] = model3.params
conf = np.exp(conf)
conf['p-val'] = model3.pvalues.round(3)
print(conf)
The addition of the new variable improves the pseudo R² value, and its coefficient is statistically significant.
The odds ratio says that a person who changes their mind about things depending on the people they are with or what they read or saw on TV is 17.5% less likely to work 35 or more hours a week. Could this result be biased by gender? Since the MALE odds ratio changes when CHANGE_MIND is added, there is evidence that the two groups overlap (the variables are not independent).
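The 17.5% figure is just 1 minus the CHANGE_MIND odds ratio, and the gain can be tracked through the pseudo R² of the three models. A small sketch, assuming the model, model2 and model3 results fitted above:
# Percent reduction in the odds of working full time when CHANGE_MIND = 1.
or_change = np.exp(model3.params['CHANGE_MIND'])
print('Reduction in odds: %.1f%%' % ((1 - or_change) * 100))

# Pseudo R-squared of the three models, to track the gain from each addition.
for name, m in [('model', model), ('model2', model2), ('model3', model3)]:
    print(name, round(m.prsquared, 4))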