Running a Lasso Regression Analysis
Author: Samuel M.H. <samuel.mh@gmail.com>
Date: 04-03-2016
Instructions
This week's assignment involves running a lasso regression analysis. Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model, while variables with non-zero regression coefficients are most strongly associated with the response variable. Explanatory variables can be quantitative, categorical, or both.
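To make the shrinkage idea concrete, here is a minimal illustrative sketch on synthetic data (not part of the assignment text): with a strong enough penalty, the lasso drives the coefficients of weak predictors exactly to zero.
# Illustrative sketch of lasso shrinkage (hypothetical synthetic data)
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                              # 5 standardized predictors
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.randn(200)   # only the first two matter

print(Lasso(alpha=0.5).fit(X, y).coef_)            # weak predictors end up at exactly 0.0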
Your assignment is to run a lasso regression analysis using k-fold cross validation to identify a subset of predictors from a larger pool of predictor variables that best predicts a quantitative response variable.
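For reference, the k-fold idea can be sketched by hand; LassoLarsCV, used later in this post, automates exactly this kind of search over the penalty strength (again a hypothetical synthetic-data snippet):
# Illustrative sketch: 10-fold cross-validation to compare penalty strengths
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3 * X[:, 0] + rng.randn(200)

for alpha in (0.01, 0.1, 1.0):
    mse = -cross_val_score(Lasso(alpha=alpha), X, y, cv=10,
                           scoring='neg_mean_squared_error').mean()
    print('alpha={0}: mean CV MSE={1:.3f}'.format(alpha, mse))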
What to Submit
Following completion of the steps described above, create a blog entry where you submit syntax used to run a lasso regression (copied and pasted from your program) along with corresponding output and a brief written summary. Please note that your reviewers should NOT be required to download any files in order to complete the review.
If your data set has a relatively small number of observations, you do not need to split into training and test data sets. You can provide your rationale for not splitting your data set in your written summary.
Intro
This week I will try to explain the average daily volume of ethanol consumed per person in the past year with a LASSO regression model.
Dataset
- National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
- CSV file
- File description
Variables
Response:
- ETOTLCA2 -> ETHANOL: average daily volume of ethanol consumed in the past year (ounces).
Explanatory:
- AGE -> AGE: age (years).
- S1Q24LB -> WEIGHT: weight (pounds).
- NUMPERS -> HOUSE_PEOPLE: number of persons in household.
- S1Q4A -> MARRIAGE: age at first marriage (years).
- S1Q8D -> WORK: age when first worked full time, 30+ hours a week (years).
- S1Q12A -> INCOME: total household income in the last 12 months (dollars).
- SEX -> MALE: gender (2 groups).
- S10Q1A63 -> CHANGE_MIND: changes mind about things depending on the people they are with or what they read or saw on TV (2 groups).
All variables are treated as quantitative; the two binary ones (MALE, CHANGE_MIND) are recoded as 0/1 dummies.
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
import sklearn.metrics
from sklearn.linear_model import LassoLarsCV
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (15, 8)
Data
# Load data
data = pd.read_csv('../datasets/NESARC/nesarc_pds.csv', usecols=['ETOTLCA2','AGE','S1Q24LB','NUMPERS','S1Q4A','S1Q8D','S1Q12A','SEX','S10Q1A63'])
# Custom dataframe
df = pd.DataFrame()
# Response variable
df['ETHANOL'] = data['ETOTLCA2'].replace(' ', np.nan).astype(float)
# Explanatory variables (blanks and special codes recoded as NaN)
df['AGE'] = data['AGE'].replace(' ', np.nan).replace('98', np.nan).astype(float)
df['WEIGHT'] = data['S1Q24LB'].replace(' ', np.nan).replace('999', np.nan).astype(float)
df['HOUSE_PEOPLE'] = data['NUMPERS'].replace(' ', np.nan).astype(float)
df['MARRIAGE'] = data['S1Q4A'].replace(' ', np.nan).replace('99', np.nan).astype(float)
df['WORK'] = data['S1Q8D'].replace(' ', np.nan).replace('99', np.nan).replace('0', np.nan).astype(float)
df['INCOME'] = data['S1Q12A'].replace(' ', np.nan).astype(float)
# Binary variables recoded to 0/1
df['MALE'] = data['SEX'].replace(' ', np.nan).replace('2', '0').astype(float)
df['CHANGE_MIND'] = data['S10Q1A63'].replace(' ', np.nan).replace('9', np.nan).replace('2', '0').astype(float)
df = df.dropna()
df.describe()
TARGET = 'ETHANOL'
PREDICTORS = list(df.columns)
PREDICTORS.remove(TARGET)
df_target = df[TARGET]
df_predictors = pd.DataFrame()
Standardize predictors
- 0 Mean
- 1 Standard deviation
for predictor in PREDICTORS:
    df_predictors[predictor] = (df[predictor] - df[predictor].mean()) / df[predictor].std()
df_predictors.describe()
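Equivalently, this step could be done with scikit-learn's StandardScaler (a sketch, not the original code; note that StandardScaler divides by the population standard deviation, ddof=0, while pandas' .std() uses the sample standard deviation, ddof=1, so the values differ slightly):
from sklearn.preprocessing import StandardScaler

# Alternative standardization (sketch); result differs slightly from the
# pandas version above because StandardScaler uses ddof=0 rather than ddof=1
scaled = StandardScaler().fit_transform(df[PREDICTORS])
df_predictors_alt = pd.DataFrame(scaled, columns=PREDICTORS, index=df.index)
df_predictors_alt.describe()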
Split: train, test
train_target, test_target, train_predictors, test_predictors = train_test_split(df_target, df_predictors, test_size=0.3, random_state=42)
print('Samples train: {0}'.format(len(train_target)))
print('Samples test: {0}'.format(len(test_target)))
Model
model1 = LassoLarsCV(cv=10, precompute=False)
model1.fit(train_predictors, train_target)
print('Alpha parameter: {0}'.format(model1.alpha_))
Regression coefficients
coefs = sorted(zip(df_predictors.columns, model1.coef_), key=lambda x: abs(x[1]), reverse=True)
print('\n'.join('{0}: {1}'.format(var, coef) for var, coef in coefs))
Plots
# Plot coefficient progression
m_log_alphas = -np.log10(model1.alphas_)
plt.plot(m_log_alphas, model1.coef_path_.T)
plt.axvline(-np.log10(model1.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model1.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model1.mse_path_, ':')
plt.plot(m_log_alphascv, model1.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model1.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
Metrics
# MSE from training and test data
print('MSE training: {0}'.format(sklearn.metrics.mean_squared_error(train_target, model1.predict(train_predictors))))
print('MSE testing: {0}'.format(sklearn.metrics.mean_squared_error(test_target, model1.predict(test_predictors))))
# R-square from training and test data
print('R-square training: {0}'.format(model1.score(train_predictors, train_target)))
print('R-square testing: {0}'.format(model1.score(test_predictors, test_target)))
Summary
In this assignment, the LASSO regression hasn't proved to be very valuable, as the model can only explain 4.8% of the variance (R-squared value). It is surprising that this value is higher on the test dataset than on the training one. The same happens with the mean squared error, which is lower on the testing dataset.
The prediction accuracy is fairly stable, as the metric values are similar across the training and testing datasets.
The most important variable for predicting alcohol intake is MALE (0.2551), followed by HOUSE_PEOPLE (-0.0664), CHANGE_MIND (0.0555), WEIGHT (-0.0551) and WORK (-0.0437). Surprisingly, INCOME is not as relevant as the others, but it is not discarded.
One bad thing about the results is getting an alpha value of 0, which means no regularization has been performed and the fit is an ordinary least squares regression. This can be verified by the fact that no predictor has been discarded: every one has a coefficient different from zero.
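A quick sanity check one could run (not part of the original analysis): with alpha equal to 0, the lasso fit should coincide with a plain ordinary least squares fit on the same standardized data.
# Sanity check (sketch): with alpha == 0 the lasso coefficients should match OLS
from sklearn.linear_model import LinearRegression

ols = LinearRegression().fit(train_predictors, train_target)
for var, l_coef, o_coef in zip(df_predictors.columns, model1.coef_, ols.coef_):
    print('{0}: lasso={1:.4f} ols={2:.4f}'.format(var, l_coef, o_coef))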
LASSO regression is most useful when there are few observations and a large number of predictors, so it can also serve as a dimensionality reduction technique.
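As a sketch of that use, the predictors that survive the shrinkage can be read directly off the fitted model:
# Dimensionality reduction (sketch): keep only predictors with non-zero coefficients
selected = [var for var, coef in zip(df_predictors.columns, model1.coef_) if coef != 0.0]
print(selected)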