Preprocessing
Now that we have cleaned the data into an organized format, we can proceed with preprocessing, i.e., imputing missing values, encoding categorical features and scaling the data if required.
Imputation
First, look at which features have missing values across the combined train and test sets.
df = pd.concat([train_df, test_df])
df.isna().sum()
# Output
# Survived 418
# Pclass 0
# Sex 0
# Age 263
# SibSp 0
# Parch 0
# Ticket 0
# Fare 1
# Embarked 2
# Honorific 0
Ignore the missing values in the Survived feature. The test set does not have any Survived values so all the missing values come from there.
The only features which have missing values are Age, Fare and Embarked.
Missing Age values can be imputed from the honorifics, since we have seen that honorifics differentiate between age ranges.
Fare and Embarked have only 1 and 2 missing values respectively, so we can impute them manually on a case-by-case basis.
for title in df.Honorific.unique():
    print(f'{title}: ', df[df.Honorific == title].Age.mean())
# Output
# Mr: 32.77566225165563
# Mrs: 37.04
# Miss: 21.8243661971831
# Master: 6.084814814814815
We have obtained the mean age of people based on honorifics so let's impute Age with these values.
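The per-honorific means can also be computed and applied in a single step with groupby and transform, instead of looping over the titles. Here is a minimal sketch on a small made-up frame (the `toy` data and the `Age_filled` column are assumptions for illustration only, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the combined train/test frame.
toy = pd.DataFrame({
    'Honorific': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master'],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 5.0],
})

# Mean age per honorific, broadcast back to every row of that group.
# Missing ages are skipped when the mean is computed.
group_mean = toy.groupby('Honorific')['Age'].transform('mean')

# Fill only the missing ages; observed ages are left untouched.
toy['Age_filled'] = toy['Age'].fillna(group_mean)
```

Because transform returns a Series aligned with the original rows, it can be passed straight to fillna without any manual matching on the title.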
df.Embarked.value_counts()
# Output
# S 914
# C 270
# Q 123
Southampton had the largest number of passengers, so let's impute the two missing Embarked values with 'S'.
df[df.Fare.isna()]
# Output
# Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked Honorific
# NaN 3 male 60.5 0 0 3701 NaN S Mr
Unfortunately, no other passenger has ticket '3701', so we can't easily discover the exact Fare for this ticket. But Fare must depend on Pclass to some degree, because tickets for different passenger classes are priced differently, so let's impute the missing Fare value with the mean fare of Pclass 3.
# Correlate the two numeric columns directly; df.corr() raises on string columns in pandas >= 2.0
df['Fare'].corr(df['Pclass'])
# Output
# -0.5586287323271726
print(df[df.Pclass == 3].Fare.mean())
print(df[df.Pclass == 2].Fare.mean())
print(df[df.Pclass == 1].Fare.mean())
# Output
# 13.302888700564973
# 21.1791963898917
# 87.50899164086688
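The three per-class means can also be obtained with one groupby call and then mapped back onto the rows that need them. This is only a sketch on made-up numbers (the `toy` frame and the `Fare_filled` column are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy fares; the real values come from the combined frame.
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3],
    'Fare': [80.0, 100.0, 20.0, 22.0, 12.0, 14.0, np.nan],
})

# One mean per passenger class (NaN fares are skipped by default).
class_mean = toy.groupby('Pclass')['Fare'].mean()

# Fill each missing fare with the mean of its own class.
toy['Fare_filled'] = toy['Fare'].fillna(toy['Pclass'].map(class_mean))
```

With a single missing Fare this gives the same result as filling in the Pclass 3 mean by hand, but it would keep working if more fares were missing.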
Now make a function to impute missing values.
def impute(df):
    df.Embarked = df.Embarked.fillna('S')
    df.Fare = df.Fare.fillna(13.3)
    df.loc[(df.Honorific == 'Mr'), 'Age'] = df.loc[(df.Honorific == 'Mr'), 'Age'].fillna(32.8)
    df.loc[(df.Honorific == 'Mrs'), 'Age'] = df.loc[(df.Honorific == 'Mrs'), 'Age'].fillna(37)
    df.loc[(df.Honorific == 'Miss'), 'Age'] = df.loc[(df.Honorific == 'Miss'), 'Age'].fillna(21.8)
    df.loc[(df.Honorific == 'Master'), 'Age'] = df.loc[(df.Honorific == 'Master'), 'Age'].fillna(6.1)
    return df
train_df = impute(train_df)
test_df = impute(test_df)
Now that Age missing values have been imputed, Honorific does not provide any information we don't already have. Indeed, the only information it provides is regarding gender and age range, but we already have features Sex and Age for those. Let's drop it.
train_df.drop(['Honorific'], axis=1, inplace=True)
test_df.drop(['Honorific'], axis=1, inplace=True)
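The constants hard-coded in impute could instead be derived from the data, so the function does not need editing if the numbers change. Below is one possible sketch, not the approach used in this walkthrough; the function name `impute_from`, its `reference` and `target` parameters, and the toy frame are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def impute_from(reference, target):
    """Fill gaps in `target` using statistics computed on `reference`."""
    out = target.copy()
    # Most frequent port of embarkation in the reference data.
    out['Embarked'] = out['Embarked'].fillna(reference['Embarked'].mode()[0])
    # Mean fare per passenger class in the reference data.
    fare_by_class = reference.groupby('Pclass')['Fare'].mean()
    out['Fare'] = out['Fare'].fillna(out['Pclass'].map(fare_by_class))
    # Mean age per honorific in the reference data.
    age_by_title = reference.groupby('Honorific')['Age'].mean()
    out['Age'] = out['Age'].fillna(out['Honorific'].map(age_by_title))
    return out

# Hypothetical toy frame used as both reference and target.
toy = pd.DataFrame({
    'Pclass': [3, 3, 1, 1],
    'Honorific': ['Mr', 'Mr', 'Mrs', 'Mrs'],
    'Age': [30.0, np.nan, 40.0, 50.0],
    'Fare': [10.0, np.nan, 80.0, 90.0],
    'Embarked': ['S', 'S', None, 'C'],
})
filled = impute_from(toy, toy)
```

Passing the combined frame as `reference` would reproduce the statistics used above, and the function works on a copy, so the input frame is left unchanged. It has to be called while the Honorific column is still present.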
Handling tickets
The Ticket feature also seems to have many distinct values, so let's do for it what we did for Surname: count the tickets present only in the train set, only in the test set, and in both train and test sets.
print('Only train: ', len(set(train_df.Ticket.unique()).difference(set(test_df.Ticket.unique()))))
print('Only test: ', len(set(test_df.Ticket.unique()).difference(set(train_df.Ticket.unique()))))
print('Both train and test: ', len(set(train_df.Ticket.unique()).intersection(set(test_df.Ticket.unique()))))
# Output
# Only train: 566
# Only test: 248
# Both train and test: 115
Once again, very few tickets appear in both train and test sets, while the test set has many tickets that aren't in the train set. Target encoding will not work, so let's extract whatever meaningful information we can and then drop the feature.
The only meaningful information Ticket can provide is how many passengers share the same ticket, i.e., how many passengers are in the same group. Let's aggregate the tickets, make a new feature indicating the size of the group each passenger is in, and drop the original Ticket feature.
df = pd.concat([train_df, test_df])
df['Ticket_Group'] = df.groupby('Ticket')['Ticket'].transform('count')
df.drop(['Ticket'], axis=1, inplace=True)
train_df = df.loc[train_df.index, :]
test_df = df.loc[test_df.index, :].drop(['Survived'], axis=1)
train_df.info()
# Output
# Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 Survived 891 non-null float64
# 1 Pclass 891 non-null int64
# 2 Sex 891 non-null object
# 3 Age 891 non-null float64
# 4 SibSp 891 non-null int64
# 5 Parch 891 non-null int64
# 6 Fare 891 non-null float64
# 7 Embarked 891 non-null object
# 8 Ticket_Group 891 non-null int64
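To see what transform('count') does when building Ticket_Group, here is a small sketch on invented ticket numbers (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical tickets: 'A1' is shared by three passengers, 'B2' by two.
toy = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'C3', 'A1', 'B2']})

# Count the rows in each ticket group and write that count on every row
# of the group, so the result has one value per passenger.
toy['Ticket_Group'] = toy.groupby('Ticket')['Ticket'].transform('count')
```

Each passenger ends up with the size of their own ticket group: 3 for 'A1', 2 for 'B2' and 1 for 'C3'. This is why the counts are computed on the combined train and test frame, since a group can be split across the two sets.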
Encoding
print(train_df.Sex.unique())
print(train_df.Embarked.unique())
# Output
# ['male' 'female']
# ['S' 'C' 'Q']
The only categorical features left in our dataset are Sex and Embarked. Both have very few categories, so they can be one-hot encoded. Do this with the pandas get_dummies function.
def encode(df):
    temp_1 = pd.get_dummies(df.Sex)
    temp_2 = pd.get_dummies(df.Embarked)
    df = df.join([temp_1, temp_2])
    df.drop(['Sex', 'Embarked'], axis=1, inplace=True)
    return df
df = pd.concat([train_df, test_df])
df = encode(df)
train_df = df.loc[train_df.index, :]
test_df = df.loc[test_df.index, :].drop(['Survived'], axis=1)
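As an alternative sketch, get_dummies can encode several columns in one call through its columns argument, which also prefixes the new column names with the source feature. The `toy` frame below is made up for illustration, and `dtype=int` is an optional choice to get 0/1 integers rather than booleans:

```python
import pandas as pd

# Hypothetical toy rows with the two categorical features and one numeric one.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 8.05],
})

# Encode both columns at once; the originals are dropped automatically
# and the dummy columns are named '<feature>_<category>'.
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=int)
```

The resulting names (Sex_male, Embarked_S and so on) differ from the bare male, female, S, C, Q columns produced by encode above, so any later code would need to refer to the prefixed names.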
Now the data has been fully cleaned and preprocessed. I won't be scaling the features because I am going to use tree-based estimators, which are insensitive to feature scale, and if I do use neural networks, I can rely on BatchNormalization instead.