Samuel M.H. 's technological blog

Sunday, March 13, 2016

Test a Logistic Regression Model

Notebook

04 Test a Logistic Regression Model

Author: Samuel M.H. <samuel.mh@gmail.com> Date: 13-03-2016

This week's assignment is to test a logistic regression model.

Data preparation for this assignment:

  1. If your response variable is categorical with more than two categories, you will need to collapse it down to two categories, or subset your data to select observations from 2 categories.
  2. If your response variable is quantitative, you will need to bin it into two categories.

The assignment:

Write a blog entry that summarize in a few sentences

  1. what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary.
  2. Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.
  3. Discuss whether or not there was evidence of confounding for the association between your primary explanatory and the response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables).

What to Submit

Write a blog entry and submit the URL for your blog. Your blog entry should include

  1. the summary of your results that addresses parts 1-3 of the assignment,
  2. the output from your logistic regression model.

Intro

This week I will make a model that describes how probable is a person works 35 or more hours a week.

Dataset

Variables

  • Response:

    • WORK35 -> S1Q7A1: present situation includes working full time (35+ hours a week). Categorical yes/no.
  • Explanatory:

    • AGE -> AGE: age (years).
    • S1Q24LB -> WEIGHT: weight (pounds).
    • NUMPERS -> HOUSE_PEOPLE: number of persons in household.
    • ETHRACE2A -> RACE: imputed race/ethnicity (5 groups, reference group=1,white).
    • SEX -> MALE: gender (2 groups). Used in confounding.
    • S10Q1A63 -> CHANGE_MIND: change mind about things depending on people you're with or what read or saw on tv (2 groups). Used in confounding.
In [16]:
%pylab inline

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

pylab.rcParams['figure.figsize'] = (15, 8)
Populating the interactive namespace from numpy and matplotlib

Data

In [2]:
# Load data
data = pd.read_csv('../datasets/NESARC/nesarc_pds.csv', usecols=['S1Q7A1','AGE','S1Q24LB','NUMPERS','ETHRACE2A','SEX','S10Q1A63'])
In [17]:
# Custom dataframe
df = pd.DataFrame()

# Response variable
df['WORK35'] = data['S1Q7A1'].replace(' ',np.NaN).replace('2','0').astype(float)

# Explanatory variables
df['AGE'] = data['AGE'].replace(' ',np.NaN).replace('98',np.NaN).astype(float)
df['WEIGHT'] = data['S1Q24LB'].replace(' ',np.NaN).replace('999',np.NaN).astype(float)
df['HOUSE_PEOPLE'] = data['NUMPERS'].replace(' ',np.NaN).astype(float)
df['RACE'] = data['ETHRACE2A'].replace(' ',np.NaN)
df['MALE'] = data['SEX'].replace(' ',np.NaN).replace('2','0').astype(float)
df['CHANGE_MIND'] = data['S10Q1A63'].replace(' ',np.NaN).replace('9',np.NaN).replace('2','0').astype(float)

df = df.dropna()

Data summaries: counts and descriptions

In [4]:
df['WORK35'].value_counts()
Out[4]:
1    20942
0    19462
Name: WORK35, dtype: int64
In [5]:
df['RACE'].value_counts()
Out[5]:
1    23023
5     7861
2     7628
4     1238
3      654
Name: RACE, dtype: int64
In [6]:
pd.crosstab(df['WORK35'],df['RACE'])
Out[6]:
RACE 1 2 3 4 5
WORK35
0 11406 3629 333 557 3537
1 11617 3999 321 681 4324
In [7]:
df[['AGE','WEIGHT','HOUSE_PEOPLE']].describe().round(3)
Out[7]:
AGE WEIGHT HOUSE_PEOPLE
count 40404.000 40404.000 40404.000
mean 46.277 170.754 2.521
std 18.157 41.376 1.495
min 18.000 62.000 1.000
25% 32.000 140.000 1.000
50% 44.000 165.000 2.000
75% 59.000 193.000 3.000
max 97.000 500.000 17.000

Centered means

In [8]:
# Center means
for c in ['AGE','WEIGHT','HOUSE_PEOPLE']:
    df[c] = df[c]-df[c].mean()

df[['AGE','WEIGHT','HOUSE_PEOPLE']].describe()
Out[8]:
AGE WEIGHT HOUSE_PEOPLE
count 4.040400e+04 4.040400e+04 4.040400e+04
mean -9.454207e-16 2.160962e-15 -3.657878e-17
std 1.815653e+01 4.137566e+01 1.495105e+00
min -2.827685e+01 -1.087541e+02 -1.521458e+00
25% -1.427685e+01 -3.075406e+01 -1.521458e+00
50% -2.276854e+00 -5.754059e+00 -5.214583e-01
75% 1.272315e+01 2.224594e+01 4.785417e-01
max 5.072315e+01 3.292459e+02 1.447854e+01

Logit regression model

Model

In [9]:
model = smf.logit(formula='WORK35 ~ AGE + WEIGHT + HOUSE_PEOPLE + C(RACE)',data=df).fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.616662
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 WORK35   No. Observations:                40404
Model:                          Logit   Df Residuals:                    40396
Method:                           MLE   Df Model:                            7
Date:                Sun, 13 Mar 2016   Pseudo R-squ.:                  0.1095
Time:                        22:35:26   Log-Likelihood:                -24916.
converged:                       True   LL-Null:                       -27979.
                                        LLR p-value:                     0.000
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept        0.1104      0.014      7.670      0.000         0.082     0.139
C(RACE)[T.2]    -0.1223      0.029     -4.257      0.000        -0.179    -0.066
C(RACE)[T.3]    -0.2179      0.085     -2.553      0.011        -0.385    -0.051
C(RACE)[T.4]     0.0345      0.063      0.545      0.586        -0.089     0.158
C(RACE)[T.5]    -0.0914      0.029     -3.155      0.002        -0.148    -0.035
AGE             -0.0467      0.001    -65.777      0.000        -0.048    -0.045
WEIGHT           0.0071      0.000     25.725      0.000         0.007     0.008
HOUSE_PEOPLE    -0.0753      0.008     -9.624      0.000        -0.091    -0.060
================================================================================

Odd ratios and p-values

In [10]:
conf = model.conf_int()
conf.columns = ['Lower CI','Upper CI']
conf['OR'] = model.params
conf = np.exp(conf)
conf['p-val'] = model.pvalues.round(3)
print(conf)
              Lower CI  Upper CI        OR  p-val
Intercept     1.085672  1.148696  1.116740  0.000
C(RACE)[T.2]  0.836475  0.936155  0.884912  0.000
C(RACE)[T.3]  0.680385  0.950637  0.804238  0.011
C(RACE)[T.4]  0.914488  1.171532  1.035061  0.586
C(RACE)[T.5]  0.862312  0.965978  0.912674  0.002
AGE           0.953006  0.955664  0.954334  0.000
WEIGHT        1.006547  1.007631  1.007089  0.000
HOUSE_PEOPLE  0.913405  0.941834  0.927510  0.000

Looking at the p-values, all the variables are statistically associated with the response variable except RACE=4 (Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino).

All the variables are negatively associated with the likelihood of working >35 hours a week (and being white, the reference for race) but the weight, which is slightly positively correlated. The weight could be the consequence instead of the cause.

Due to the fact that most of the confidence intervals overlap, it is not possible to say which variable is the most strongly associated with the explanatory variable. Nevertheless, these are the intervals that don't overlap (bigger means more explanatory power/association):

  • RACE=3 > WEIGHT > AGE
  • RACE=2 > AGE

The most strong variable seems to be race. In terms of odd ratios, this means that if you are white, you are 1.24 times more likely to work full time than an American indian, 1.13 times than a black or 1.08 times than hispanic. The age and the number of people living in the house are also significant but weak in the result.

Support

In order to evaluate if the specification of the model is correct, I should evaluate the goodness of the fit. Since this exceeds the scope of the course, I am taking the pseudo R² value and improving it by the addition of more explanatory variables.

Improving the model

In this part I will add 2 categorical variables: if the person is a male, and if the person is easily influenced by media or other people.

MALE variable

In [11]:
df['MALE'].value_counts()
Out[11]:
0    22860
1    17544
Name: MALE, dtype: int64
In [12]:
model2 = smf.logit(formula='WORK35 ~ AGE + WEIGHT + HOUSE_PEOPLE + C(RACE) + MALE',data=df).fit()
print(model2.summary())
Optimization terminated successfully.
         Current function value: 0.604865
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 WORK35   No. Observations:                40404
Model:                          Logit   Df Residuals:                    40395
Method:                           MLE   Df Model:                            8
Date:                Sun, 13 Mar 2016   Pseudo R-squ.:                  0.1265
Time:                        22:35:26   Log-Likelihood:                -24439.
converged:                       True   LL-Null:                       -27979.
                                        LLR p-value:                     0.000
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       -0.2162      0.018    -12.029      0.000        -0.251    -0.181
C(RACE)[T.2]    -0.0308      0.029     -1.058      0.290        -0.088     0.026
C(RACE)[T.3]    -0.1999      0.086     -2.320      0.020        -0.369    -0.031
C(RACE)[T.4]    -0.0546      0.064     -0.850      0.395        -0.180     0.071
C(RACE)[T.5]    -0.1088      0.029     -3.705      0.000        -0.166    -0.051
AGE             -0.0472      0.001    -65.469      0.000        -0.049    -0.046
WEIGHT           0.0035      0.000     12.043      0.000         0.003     0.004
HOUSE_PEOPLE    -0.0729      0.008     -9.204      0.000        -0.088    -0.057
MALE             0.7345      0.024     30.641      0.000         0.688     0.781
================================================================================

In [13]:
conf = model2.conf_int()
conf.columns = ['Lower CI','Upper CI']
conf['OR'] = model2.params
conf = np.exp(conf)
conf['p-val'] = model2.pvalues.round(3)
print(conf)
              Lower CI  Upper CI        OR  p-val
Intercept     0.777726  0.834487  0.805606  0.000
C(RACE)[T.2]  0.915778  1.026646  0.969628  0.290
C(RACE)[T.3]  0.691552  0.969484  0.818809  0.020
C(RACE)[T.4]  0.834961  1.073867  0.946909  0.395
C(RACE)[T.5]  0.846804  0.950072  0.896953  0.000
AGE           0.952596  0.955289  0.953942  0.000
WEIGHT        1.002966  1.004122  1.003544  0.000
HOUSE_PEOPLE  0.915326  0.944207  0.929655  0.000
MALE          1.988746  2.184676  2.084411  0.000

Now it is possible to see that RACE=2, being black is not statistically associated with the result variable, and the odd ratios for the RACE variable are less significant. This is due to the fact that the gender is a confounding variable as it moderates the impact of the race in the result.

I can also say that by being male, a person is 2 times more likely to have a full time job.

There is also a gain in the pseudo R² value.

CHANGE_MIND variable

In [24]:
df['CHANGE_MIND'].value_counts()
Out[24]:
0    34650
1     5754
Name: CHANGE_MIND, dtype: int64
In [14]:
model3 = smf.logit(formula='WORK35 ~ AGE + WEIGHT + HOUSE_PEOPLE + MALE + CHANGE_MIND',data=df).fit()
print(model3.summary())
Optimization terminated successfully.
         Current function value: 0.604756
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 WORK35   No. Observations:                40404
Model:                          Logit   Df Residuals:                    40398
Method:                           MLE   Df Model:                            5
Date:                Sun, 13 Mar 2016   Pseudo R-squ.:                  0.1267
Time:                        22:35:26   Log-Likelihood:                -24435.
converged:                       True   LL-Null:                       -27979.
                                        LLR p-value:                     0.000
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       -0.2225      0.016    -14.274      0.000        -0.253    -0.192
AGE             -0.0470      0.001    -65.665      0.000        -0.048    -0.046
WEIGHT           0.0036      0.000     12.657      0.000         0.003     0.004
HOUSE_PEOPLE    -0.0785      0.008    -10.045      0.000        -0.094    -0.063
MALE             0.7280      0.024     30.584      0.000         0.681     0.775
CHANGE_MIND     -0.1612      0.031     -5.157      0.000        -0.222    -0.100
================================================================================

In [15]:
conf = model3.conf_int()
conf.columns = ['Lower CI','Upper CI']
conf['OR'] = model3.params
conf = np.exp(conf)
conf['p-val'] = model3.pvalues.round(3)
print(conf)
              Lower CI  Upper CI        OR  p-val
Intercept     0.776395  0.825324  0.800486      0
AGE           0.952730  0.955408  0.954068      0
WEIGHT        1.003088  1.004222  1.003655      0
HOUSE_PEOPLE  0.910446  0.938769  0.924499      0
MALE          1.976625  2.169952  2.071034      0
CHANGE_MIND   0.800541  0.904897  0.851121      0

The addition of the new variable improves the R² value and falls into the confidence margins.

The odd ratio say that if you change your mind about things depending on people you are with or what you read or saw on TV, you are a 17,5% less likely to work more than 35 hours a week. Could this statement be biased by the gender? As the MALE odd ratio is changed, there is evidence to say they are no disjoint groups.

No comments:

Post a Comment

Copyright © Samuel M.H. All rights reserved. Powered by Blogger.