Pearson Correlation
Author: Samuel M.H. <samuel.mh@gmail.com>
Date: 31-01-2016
Instructions
Generate a correlation coefficient.
Note 1: Two categorical variables with 3 or more levels can be used to generate a correlation coefficient if the categories are ordered and the average (i.e. mean) can be interpreted. The scatter plot, on the other hand, will not be useful. In general, the scatter plot is not useful for discrete variables (i.e. those that take on a limited number of values).
Note 2: When we square r, it tells us what proportion of the variability in one variable is explained by variation in the second variable (a.k.a. R-squared or coefficient of determination).
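To make Note 2 concrete, here is a minimal sketch that computes r from its definition and then squares it. The helper name `pearson_r` and the toy numbers are hypothetical and are not taken from NESARC.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations from each mean
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical toy data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = r ** 2  # share of the variability in ys described by xs
print('r = {0:.3f}, r^2 = {1:.3f}'.format(r, r_squared))
```

For this toy data r is about 0.775, so r squared is 0.6: roughly 60% of the variability in one variable is described by the other.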
What to submit:
Following completion of the steps described above, create a blog entry where you submit syntax used to generate a correlation coefficient (copied and pasted from your program) along with corresponding output and a few sentences of interpretation.
Dataset
- National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
- CSV file
- File description
%matplotlib inline
import pandas
import matplotlib.pyplot as plt
import scipy.stats  # importing the submodule explicitly makes scipy.stats.pearsonr available
import seaborn as sns
Test of correlation
I am testing whether there is a linear relationship between age (years) and average height (cm) in the population of white people between 20 and 50 years old.
Ingesting and curating the data
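# Load only the columns needed: age, height (feet and inches) and ethnicity
data = pandas.read_csv('../datasets/NESARC/nesarc_pds.csv', usecols=['AGE','S1Q24FT','S1Q24IN','ETHRACE2A'])
# Keep white respondents aged 20 to 50, then drop rows with missing values
data = data[(data['AGE']>=20) & (data['AGE']<=50) & (data['ETHRACE2A']==1)]
data = data.dropna()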
print(data.shape)
So far, we have 13,190 samples.
# Create the dataframe with the required features
df1 = pandas.DataFrame()
df1['AGE'] = data['AGE']
# Convert height from feet and inches to centimetres (1 in = 2.54 cm)
df1['HEIGHT'] = (data['S1Q24FT'] * 12 + data['S1Q24IN']) * 2.54
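As a quick sanity check of the unit conversion above, the same arithmetic can be written as a small standalone function. The name `height_cm` is hypothetical and only serves this illustration.

```python
def height_cm(feet, inches):
    """Convert a height given in feet and inches to centimetres."""
    total_inches = feet * 12 + inches  # 1 ft = 12 in
    return total_inches * 2.54         # 1 in = 2.54 cm

# A person who is 5 ft 10 in tall measures 70 in in total
print('{0:.1f} cm'.format(height_cm(5, 10)))
```

This prints 177.8 cm, a plausible adult height, which suggests the formula itself is not the source of any implausible values.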
Detecting outliers
df1.describe()
Obviously there are measurement errors (a person cannot be 32 m tall), so let's assume that nobody is taller than 250 cm.
df1 = df1[ df1['HEIGHT']<=250 ]
print('Corrected Max height: {0}cm'.format(df1['HEIGHT'].max()))
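The filter above can be sketched on a few toy rows. One plausible cause of the impossible heights, not verified here against the NESARC codebook, is that unknown answers are stored as a placeholder code such as 99: (99 * 12 + 99) * 2.54 is about 3269 cm, or roughly 32.7 m. The rows below are hypothetical, and the 99 values are only an assumed stand-in for such a code.

```python
import pandas

# Hypothetical toy rows; 99 is an ASSUMED 'unknown' code, not a confirmed one
toy = pandas.DataFrame({
    'S1Q24FT': [5, 6, 99, 5],
    'S1Q24IN': [10, 1, 99, 4],
})
toy['HEIGHT'] = (toy['S1Q24FT'] * 12 + toy['S1Q24IN']) * 2.54

# Same rule as above: discard anything taller than 250 cm
plausible = toy[toy['HEIGHT'] <= 250]
dropped = len(toy) - len(plausible)
print('Dropped {0} of {1} rows'.format(dropped, len(toy)))
```

Counting the dropped rows is a cheap way to confirm that the threshold removes only a small number of records rather than a large part of the sample.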
Aggregating by year
Now I compute the average height for each age (in whole years).
averages = df1.groupby('AGE', as_index=False).mean()
sns.lmplot(x='AGE', y='HEIGHT', data=averages)
A negative linear trend can be seen in the plot, but how significant is it?
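To show what the aggregation step produces, here is a minimal sketch on hypothetical toy data: each distinct age collapses into one row holding the mean height of that age group.

```python
import pandas

# Hypothetical toy data: two people aged 20 and three aged 21
toy = pandas.DataFrame({
    'AGE': [20, 20, 21, 21, 21],
    'HEIGHT': [170.0, 180.0, 160.0, 170.0, 180.0],
})

# as_index=False keeps AGE as a regular column instead of the index
toy_averages = toy.groupby('AGE', as_index=False).mean()
print(toy_averages)
```

Note that after this step the correlation is computed over one point per age (31 points for ages 20 to 50), not over the individual respondents.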
Linear Correlation
I calculate the Pearson correlation between age and average height.
r,p = scipy.stats.pearsonr(averages['AGE'],averages['HEIGHT'])
print('Correlation coefficient (r): {0}'.format(r))
print('p-value: {0}'.format(p))
The correlation is negative and weak, and the p-value is greater than 0.05, so the null hypothesis of no linear relationship cannot be rejected: there is NO statistically significant linear correlation between age and average height. A correlation this weak could plausibly have arisen by chance.
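The decision rule used in this interpretation can be sketched as a small helper. The function name `describe_correlation`, the default threshold of 0.05 and the toy numbers are assumptions made for illustration; only `scipy.stats.pearsonr` comes from the analysis above.

```python
from scipy import stats

def describe_correlation(x, y, alpha=0.05):
    """Summarise a Pearson test: r, r squared, p-value and the decision."""
    r, p = stats.pearsonr(x, y)
    return {
        'r': float(r),
        'r_squared': float(r) ** 2,       # share of variability described
        'p': float(p),
        'significant': bool(p < alpha),   # reject 'no linear relationship'?
    }

# Hypothetical toy averages, not the NESARC results
ages = [20, 25, 30, 35, 40, 45, 50]
heights = [172.0, 171.5, 172.3, 171.1, 171.9, 171.2, 171.6]

summary = describe_correlation(ages, heights)
print(summary)
```

Reporting r squared next to the p-value helps separate two questions: how strong the linear relationship is, and whether it can be distinguished from chance.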