Preprocessing
The data was first cleaned and preprocessed to handle missing values, categorical features, outliers, class imbalance and redundant features.
- Encoding, Imputation, Scaling, Transformation: OneHotEncoder was used to encode the categorical features and IterativeImputer to impute missing values. Because encoding runs before imputation, NaN categorical values become a separate one-hot feature instead of being imputed (a short demonstration follows the pipeline code below). After several experiments, this arrangement was found to be optimal. MinMaxScaler was then used to scale the features to lie between 0 and 1, and QuantileTransformer to map them to an approximately normal distribution.
A pipeline was made for the entire process:
# IterativeImputer is experimental: `from sklearn.experimental import enable_iterative_imputer` must precede its import
encoder = make_column_transformer(
    (OneHotEncoder(drop='first', sparse=False), x_train.select_dtypes(include='object').columns),
    remainder='passthrough', verbose_feature_names_out=False)
pipe = make_pipeline(encoder, IterativeImputer(random_state=42), MinMaxScaler(),
                     QuantileTransformer(output_distribution='normal', random_state=42))
x_train = pd.DataFrame(pipe.fit_transform(x_train), columns=encoder.get_feature_names_out())
x_test = pd.DataFrame(pipe.transform(x_test), columns=encoder.get_feature_names_out())
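As claimed above, scikit-learn (0.24 and later) lets OneHotEncoder treat NaN as just another category, which is why no categorical imputation is needed. A minimal sketch with a made-up column (the 'colour' data is purely illustrative):
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
toy = pd.DataFrame({'colour': ['red', 'blue', np.nan, 'red']})
enc = OneHotEncoder(sparse=False)  # sparse= was renamed sparse_output= in scikit-learn 1.2
print(enc.fit_transform(toy))
print(enc.get_feature_names_out())  # ['colour_blue' 'colour_red' 'colour_nan'] -- NaN gets its own column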
- Outlier Detection and Removal: Outliers were detected with IsolationForest (a tree-ensemble anomaly detector built from randomized isolation trees) and removed from the training set; a sanity-check sketch follows the code below.
from sklearn.ensemble import IsolationForest
data = pd.concat([x_train, y_train.reset_index(drop=True)], axis=1)  # reset y's index to match the fresh RangeIndex of x_train
mask = IsolationForest(random_state=42).fit_predict(data) == 1  # fit_predict returns +1 for inliers, -1 for outliers
data = data[mask]  # keep only the inliers
x_train = data.iloc[:, :-1]
y_train = data.iloc[:, -1].astype('int')
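As a sanity check on how aggressive the filter is, the share of dropped rows can be reported from the inlier mask computed above (a sketch; the printed percentage depends on the data):
import numpy as np
n_out = int((~mask).sum())
print(f'removed {n_out} of {len(mask)} rows ({100 * n_out / len(mask):.1f}%) as outliers')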
- Class Rebalancing: Classes 1 and 0 (visitors who made a purchase and those who did not) were found to be unequally represented in the training set, with class 0 in the majority. The training set was therefore rebalanced by oversampling the minority class with BorderlineSMOTE; a quick balance check follows the code below.
from imblearn.over_sampling import BorderlineSMOTE
resampler = BorderlineSMOTE(random_state=42)
x_train, y_train = resampler.fit_resample(x_train, y_train)  # recent imbalanced-learn returns DataFrames, keeping column names
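A one-line check confirms the classes are now balanced; BorderlineSMOTE synthesizes minority-class samples near the class boundary until both counts match (the numbers below are illustrative, not from the actual run):
print(y_train.value_counts())
# 0    8000
# 1    8000   <- illustrative counts; both classes now equal the majority size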
- Feature Selection: After a number of experiments, it was found that simply removing features with variance below 0.3 was optimal. Very low variance means a feature is nearly constant across samples and therefore carries little information, making it redundant; a threshold of 0.3 proved appropriate here (a variance-inspection sketch follows the code below).
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.3)
x_train = pd.DataFrame(selector.fit_transform(x_train), columns=selector.get_feature_names_out())
x_test = pd.DataFrame(selector.transform(x_test), columns=selector.get_feature_names_out())
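The 0.3 cut-off itself can be sanity-checked by inspecting per-feature variances on x_train before the selector is fitted; a sketch of how such a threshold might be chosen (nothing here is specific to this dataset):
variances = x_train.var().sort_values()
print(variances.head(10))  # the lowest-variance features are the removal candidates
print((variances < 0.3).sum(), 'features fall below the 0.3 threshold')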