Samuel M.H. 's technological blog

Saturday, March 26, 2016


Running a Classification Tree

Author: Samuel M.H. <samuel.mh@gmail.com> Date: 27-03-2016

Instructions

This week’s assignment involves decision trees, and more specifically, classification trees. Decision trees are predictive models that allow for a data-driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. When the response variable is categorical (two levels), the model is called a classification tree. Explanatory variables can be quantitative, categorical, or both. Decision trees create segmentations or subgroups in the data by applying a series of simple rules or criteria over and over again, choosing the variable constellations that best predict the response (i.e. target) variable.

Run a Classification Tree.

You will need to perform a decision tree analysis to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable.

What to Submit

Following completion of the steps described above, create a blog entry where you submit syntax used to run a Classification Tree (copied and pasted from your program) along with corresponding output and a few sentences of interpretation. Please note that your reviewers should NOT be required to download any files in order to complete the review.


Intro

This week I will use decision trees to look for nonlinear relationships and predict whether a person works 35 or more hours a week.

Dataset

Variables

  • Response:

    • WORK35 -> S1Q7A1: present situation includes working full time (35+ hours a week). Categorical yes/no.
  • Explanatory:

    • AGE -> AGE: age (years).
    • S1Q24LB -> WEIGHT: weight (pounds).
    • NUMPERS -> HOUSE_PEOPLE: number of persons in household.
    • ETHRACE2A -> RACE: imputed race/ethnicity (5 groups, reference group=1,white).
    • SEX -> MALE: gender (2 groups).
    • S10Q1A63 -> CHANGE_MIND: changes mind about things depending on the people you're with or what you read or saw on TV (2 groups).
In [1]:
%pylab inline

import numpy as np
import pandas as pd

from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

# Visualization
from sklearn import tree            # Graphviz dot-format export
from IPython.display import Image   # Display the image inside the notebook
from io import BytesIO              # Temporary in-memory buffer; don't use StringIO due to unicode issues
import pydot                        # Interface to the dot tool

pylab.rcParams['figure.figsize'] = (15, 8)
Populating the interactive namespace from numpy and matplotlib

Data

In [2]:
# Load data
data = pd.read_csv('../datasets/NESARC/nesarc_pds.csv', usecols=['S1Q7A1','AGE','S1Q24LB','NUMPERS','ETHRACE2A','SEX','S10Q1A63'])
In [3]:
# Custom dataframe
df = pd.DataFrame()

# Response variable
df['WORK35'] = data['S1Q7A1'].replace(' ',np.NaN).replace('2','0').astype(float)

# Explanatory variables
df['AGE'] = data['AGE'].replace(' ',np.NaN).replace('98',np.NaN).astype(float)
df['WEIGHT'] = data['S1Q24LB'].replace(' ',np.NaN).replace('999',np.NaN).astype(float)
df['HOUSE_PEOPLE'] = data['NUMPERS'].replace(' ',np.NaN).astype(float)
df['RACE'] = data['ETHRACE2A'].replace(' ',np.NaN).astype('category')
df['MALE'] = data['SEX'].replace(' ',np.NaN).replace('2','0').astype('category')
df['CHANGE_MIND'] = data['S10Q1A63'].replace(' ',np.NaN).replace('9',np.NaN).replace('2','0').astype('category')

df = df.dropna()
/usr/lib/python2.7/dist-packages/pandas/core/internals.py:4417: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  comp = (nn == nn_at)
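
The FutureWarning above comes from pandas comparing string columns against replacement values of a different type. A minimal sketch of an alternative cleaning step that sidesteps it, assuming a pandas version that provides pd.to_numeric (the blank ' ' is coerced straight to NaN):

# Hypothetical alternative for the numeric columns: coerce
# non-numeric strings to NaN instead of chaining replaces.
df['AGE'] = pd.to_numeric(data['AGE'], errors='coerce').replace(98, np.nan)
df['WEIGHT'] = pd.to_numeric(data['S1Q24LB'], errors='coerce').replace(999, np.nan)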

Data summaries: counts and descriptions

In [4]:
df['WORK35'].value_counts()
Out[4]:
1    20942
0    19462
Name: WORK35, dtype: int64
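
Since 20942 of the 40404 rows are positive, a useful reference point when judging the tree later is the majority-class baseline; a minimal sketch:

# Accuracy obtained by always predicting the majority class (WORK35=1).
baseline = df['WORK35'].value_counts(normalize=True).max()
print('Baseline accuracy: {0:.3f}'.format(baseline))  # ~0.518

Any model should beat this ~0.518 figure, not just the 0.5 of a coin flip.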
In [5]:
pd.crosstab(df['WORK35'],df['RACE'])
Out[5]:
RACE            1     2    3    4     5
WORK35
0           11406  3629  333  557  3537
1           11617  3999  321  681  4324
In [6]:
pd.crosstab(df['WORK35'],df['MALE'])
Out[6]:
MALE            1      0
WORK35
0            6509  12953
1           11035   9907
In [7]:
pd.crosstab(df['WORK35'],df['CHANGE_MIND'])
Out[7]:
CHANGE_MIND     1      0
WORK35
0            2887  16575
1            2867  18075
In [8]:
df[['AGE','WEIGHT','HOUSE_PEOPLE']].describe().round(3)
Out[8]:
             AGE     WEIGHT  HOUSE_PEOPLE
count  40404.000  40404.000     40404.000
mean      46.277    170.754         2.521
std       18.157     41.376         1.495
min       18.000     62.000         1.000
25%       32.000    140.000         1.000
50%       44.000    165.000         2.000
75%       59.000    193.000         3.000
max       97.000    500.000        17.000

Split: train, test

In [9]:
TARGET = 'WORK35'
PREDICTORS = list(df.columns)
PREDICTORS.remove(TARGET)

df_target = df[TARGET]
df_predictors = df[PREDICTORS]

train_target, test_target, train_predictors, test_predictors = train_test_split(df_target, df_predictors, test_size=0.4, random_state=42)

print('Samples train: {0}'.format(len(train_target)))
print('Samples test:  {0}'.format(len(test_target)))
Samples train: 24242
Samples test:  16162
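
One optional refinement, not used here: passing stratify keeps the WORK35 class proportions identical in both halves. A sketch, assuming a scikit-learn version whose train_test_split accepts the stratify parameter:

# Stratified variant of the same 60/40 split.
train_target, test_target, train_predictors, test_predictors = train_test_split(
    df_target, df_predictors, test_size=0.4, random_state=42, stratify=df_target)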

Model

In [10]:
model1 = DecisionTreeClassifier()
model1 = model1.fit(train_predictors, train_target)

Metrics

In [11]:
predictions=model1.predict(test_predictors)

print('Confusion matrix [[TN, FP], [FN, TP]]')
print(sklearn.metrics.confusion_matrix(test_target,predictions))

print('\nAccuracy')
print(sklearn.metrics.accuracy_score(test_target, predictions))
Confusion matrix [[TN, FP], [FN, TP]]
[[5168 2570]
 [3036 5388]]

Accuracy
0.653136987997

$$ ACC = \frac{TP+TN}{P+N} $$
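
Accuracy alone hides the per-class behavior. The classification_report helper imported above (but not used so far) prints precision, recall, and F1 for each class:

# Per-class precision, recall, and F1 for the same predictions.
print(classification_report(test_target, predictions))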

Visualization

In [12]:
print('Tree depth: {0} levels'.format(model1.tree_.max_depth))
Tree depth: 35 levels
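
Depth is only part of the picture; the fitted tree also exposes its total node count, which gives a feel for how large the unpruned model is. A short sketch using the same tree_ attribute:

# Total number of nodes (internal nodes + leaves) in the unpruned tree.
print('Total nodes: {0}'.format(model1.tree_.node_count))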

In [13]:
out = BytesIO()
tree.export_graphviz(model1, out_file=out, max_depth=4, feature_names=PREDICTORS)
graph=pydot.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
Out[13]:
[Image: graph of the model1 decision tree, truncated at depth 4]
Thoughts

The accuracy of the model is only 0.653, barely better than a random guess (0.5) and the majority-class baseline. There is a clear problem: the tree grew to a depth of 35 levels, which is a sign of overfitting and hurts its generalization power. Let's try a model limiting the maximum depth to only 4 levels.
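
One way to pick that cutoff instead of guessing is a small sweep over candidate depths; a minimal sketch (the printed accuracies are illustrative, not outputs from this run):

# Sweep max_depth and report test accuracy for each setting.
for depth in range(2, 11):
    m = DecisionTreeClassifier(max_depth=depth, random_state=42)
    m.fit(train_predictors, train_target)
    acc = sklearn.metrics.accuracy_score(test_target, m.predict(test_predictors))
    print('max_depth={0}: accuracy={1:.3f}'.format(depth, acc))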

Improving the model

In [14]:
model2 = DecisionTreeClassifier(max_depth=4)
model2 = model2.fit(train_predictors, train_target)

Metrics

In [15]:
predictions = model2.predict(test_predictors)

print('Confusion matrix [[TN, FP], [FN, TP]]')
print(sklearn.metrics.confusion_matrix(test_target,predictions))

print('\nAccuracy')
print(sklearn.metrics.accuracy_score(test_target, predictions))
Confusion matrix [[TN, FP], [FN, TP]]
[[4954 2784]
 [1545 6879]]

Accuracy
0.73214948645
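
It is also worth checking which predictors the pruned tree actually relies on. The fitted model's feature_importances_ attribute gives each column's split-based importance:

# Importance of each predictor in the pruned tree (values sum to 1).
for name, imp in zip(PREDICTORS, model2.feature_importances_):
    print('{0}: {1:.3f}'.format(name, imp))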

Visualization

In [16]:
out = BytesIO()
tree.export_graphviz(model2, out_file=out, feature_names=PREDICTORS)
graph = pydot.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
Out[16]:
[Image: graph of the model2 decision tree (max_depth=4)]