Preprocessing

Now that we have cleaned the data into an organized format, we can proceed with preprocessing, i.e., imputing missing values, encoding categorical features and scaling the data if required.

Imputation

First, look at which features have missing values across the combined train and test sets.
```
df = pd.concat([train_df, test_df])
df.isna().sum()

# Output

# Survived     418
# Pclass         0
# Sex            0
# Age          263
# SibSp          0
# Parch          0
# Ticket         0
# Fare           1
# Embarked       2
# Honorific      0
```
Ignore the missing values in Survived. The test set has no Survived values, so all 418 of the missing entries come from there.

The only features with missing values are Age, Fare and Embarked.
Missing Age values can be imputed from the honorifics, since we have already seen that honorifics separate passengers into age ranges.

Fare and Embarked have only 1 and 2 missing values respectively, so we can impute them manually, case by case.
```
for title in df.Honorific.unique():
    print(f'{title}: ', df[df.Honorific == title].Age.mean())

# Output

# Mr:  32.77566225165563
# Mrs:  37.04
# Miss:  21.8243661971831
# Master:  6.084814814814815
```
We now have the mean age for each honorific, so let's impute the missing Age values with these means.
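As an alternative to copying the means by hand, the same per-honorific imputation can be sketched with `groupby(...).transform('mean')`. The `toy` frame and its values below are made up for illustration and only stand in for the combined dataframe.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the combined train/test frame (values are made up).
toy = pd.DataFrame({
    "Honorific": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 6.0],
})

# Mean age per honorific, broadcast back onto every row of that group.
# NaN ages are skipped when the mean is computed.
group_mean = toy.groupby("Honorific")["Age"].transform("mean")

# Fill only the missing ages; observed ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

This keeps the imputed values in sync with the data if the honorific groups ever change, at the cost of being less explicit than hardcoded constants.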
```
df.Embarked.value_counts()

# Output

# S    914
# C    270
# Q    123
```
Southampton had the largest number of passengers, so let's impute the two missing Embarked values with 'S'.
```
df[df.Fare.isna()]

# Output

# Survived	Pclass	Sex	    Age	    SibSp	Parch	Ticket	Fare	Embarked	Honorific

# NaN	    3	    male	60.5	0	    0	    3701	NaN	    S	        Mr
```

Unfortunately, no other passenger has ticket '3701', so we can't easily recover the exact Fare for this ticket. However, Fare should depend on Pclass to some degree, because tickets for different passenger classes are priced differently. So let's impute the missing Fare value with the mean fare of Pclass 3.

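```
# Correlate the two numeric columns directly; calling df.corr() on the whole
# frame can fail in recent pandas versions because of the string columns.
df['Fare'].corr(df['Pclass'])

# Output

# -0.5586287323271726
```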

```
print(df[df.Pclass == 3].Fare.mean())
print(df[df.Pclass == 2].Fare.mean())
print(df[df.Pclass == 1].Fare.mean())

# Output

# 13.302888700564973
# 21.1791963898917
# 87.50899164086688
```
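The per-class means can also be computed in one `groupby` call and mapped back onto the rows with a missing Fare. This is a minimal sketch on made-up data; the `toy` frame and the `class_mean` name are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: one third-class passenger has no recorded fare (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 90.0, 20.0, 8.0, 12.0, np.nan],
})

# Mean fare per passenger class, indexed by Pclass.
class_mean = toy.groupby("Pclass")["Fare"].mean()
print(class_mean)

# Look up each row's class mean and use it only where Fare is missing.
toy["Fare"] = toy["Fare"].fillna(toy["Pclass"].map(class_mean))
print(toy)
```

With a single missing value, filling a constant by hand is just as good; the mapping approach mainly helps when several classes have gaps.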

Now make a function to impute missing values.

```
def impute(df):
    df.Embarked = df.Embarked.fillna('S')
    df.Fare = df.Fare.fillna(13.3)

    df.loc[(df.Honorific == 'Mr'), 'Age'] = df.loc[(df.Honorific == 'Mr'), 'Age'].fillna(32.8)
    df.loc[(df.Honorific == 'Mrs'), 'Age'] = df.loc[(df.Honorific == 'Mrs'), 'Age'].fillna(37)
    df.loc[(df.Honorific == 'Miss'), 'Age'] = df.loc[(df.Honorific == 'Miss'), 'Age'].fillna(21.8)
    df.loc[(df.Honorific == 'Master'), 'Age'] = df.loc[(df.Honorific == 'Master'), 'Age'].fillna(6.1)

    return df

train_df = impute(train_df)
test_df = impute(test_df)
```

Now that Age missing values have been imputed, Honorific does not provide any information we don't already have. Indeed, the only information it provides is regarding gender and age range, but we already have features Sex and Age for those. Let's drop it.

```
train_df.drop(['Honorific'], axis=1, inplace=True)
test_df.drop(['Honorific'], axis=1, inplace=True)
```

Handling tickets

The Ticket feature also seems to have many distinct values, so let's do for it what we did for Surname: count the tickets that appear only in the train set, only in the test set, and in both.

```
train_tickets = set(train_df.Ticket)
test_tickets = set(test_df.Ticket)

print('Only train: ', len(train_tickets - test_tickets))
print('Only test: ', len(test_tickets - train_tickets))
print('Both train and test: ', len(train_tickets & test_tickets))

# Output

# Only train:  566
# Only test:  248
# Both train and test:  115
```

Once again, very few tickets appear in both the train and test sets, while the test set has many tickets that are not in the train set. Target encoding will not work here, so let's extract whatever meaningful information we can from Ticket and then drop the feature.

The only meaningful information Ticket can provide is how many passengers share the same ticket, i.e., how many passengers are in the same group. Let's count the occurrences of each ticket across train and test, create a new feature holding the size of the group each passenger belongs to, and drop the original Ticket feature.

```
df = pd.concat([train_df, test_df])

df['Ticket_Group'] = df.groupby('Ticket')['Ticket'].transform('count')
df.drop(['Ticket'], axis=1, inplace=True)

train_df = df.loc[train_df.index, :]
test_df = df.loc[test_df.index, :].drop(['Survived'], axis=1)
```
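To make the group-size step concrete, here is a small sketch on made-up tickets. It also checks one assumption that the split by `df.loc[...]` quietly relies on: the train and test index labels must not overlap (for example, when a unique passenger ID is used as the index). The frames `train`, `test` and `combined` are hypothetical stand-ins.

```python
import pandas as pd

# Toy frames with disjoint index labels (values are made up).
train = pd.DataFrame({"Ticket": ["A1", "A1", "B2"]}, index=[1, 2, 3])
test = pd.DataFrame({"Ticket": ["A1", "C3"]}, index=[4, 5])

# Selecting rows back by label only works if no label occurs in both frames.
assert train.index.intersection(test.index).empty

combined = pd.concat([train, test])

# For every row, count how many rows share its ticket across train and test.
combined["Ticket_Group"] = combined.groupby("Ticket")["Ticket"].transform("count")

# Split back into the original partitions by index label.
train_out = combined.loc[train.index]
test_out = combined.loc[test.index]
print(train_out)
print(test_out)
```

Ticket "A1" appears twice in the toy train set and once in the toy test set, so all three of its rows get a group size of 3.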

```
train_df.info()

# Output

#     Column        Non-Null Count  Dtype
# ---   ------        --------------  -----
# 0   Survived      891 non-null    float64
# 1   Pclass        891 non-null    int64
# 2   Sex           891 non-null    object
# 3   Age           891 non-null    float64
# 4   SibSp         891 non-null    int64
# 5   Parch         891 non-null    int64
# 6   Fare          891 non-null    float64
# 7   Embarked      891 non-null    object
# 8   Ticket_Group  891 non-null    int64
```

Encoding

```
print(train_df.Sex.unique())
print(train_df.Embarked.unique())

# Output

# ['male' 'female']
# ['S' 'C' 'Q']
```

The only categorical features left in our dataset are Sex and Embarked. Both have very few categories, so they can be one-hot encoded. We can do this with the pandas get_dummies function.

```
def encode(df):
    temp_1 = pd.get_dummies(df.Sex)
    temp_2 = pd.get_dummies(df.Embarked)
    df = df.join([temp_1, temp_2])
    df.drop(['Sex', 'Embarked'], axis=1, inplace=True)
    return df
```

```
df = pd.concat([train_df, test_df])
df = encode(df)
train_df = df.loc[train_df.index, :]
test_df = df.loc[test_df.index, :].drop(['Survived'], axis=1)
```
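A possible shortcut, shown here only as a sketch on made-up data, is to pass the `columns` argument to `pd.get_dummies`. It encodes the listed columns, drops the originals, and prefixes each new column with its source feature name. Note that the resulting column names (`Sex_male`, `Embarked_S`, ...) differ from the unprefixed ones produced by `encode` above.

```python
import pandas as pd

# Toy frame with the two categorical features plus one numeric column
# (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 8.05],
})

# One-hot encode both columns in one call; the original columns are replaced
# by indicator columns named "<feature>_<category>".
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())
```

Encoding train and test together, as done above, keeps the indicator columns identical in both sets even if a category is missing from one of them.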

The data is now fully cleaned and preprocessed. I won't scale the features because I am going to use tree-based estimators, which split on thresholds and are therefore insensitive to feature scale. If I do try neural networks, I can use BatchNormalization instead.
