Preprocessing

The data was first cleaned and preprocessed to handle missing values, categorical features, outliers, class imbalance and redundant features.

  1. Encoding, Imputation, Scaling, Transformation: OneHotEncoder was used to encode categorical features and IterativeImputer to impute missing numeric values. Because OneHotEncoder treats NaN as a category of its own, missing categorical values were converted into a separate indicator column instead of being imputed. After several experiments, this combination was found to be optimal.

    MinMaxScaler was used to scale the features into the [0, 1] range, and QuantileTransformer was used to map them to an approximately normal distribution.

    A pipeline was made for the entire process:

    from sklearn.compose import make_column_transformer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
    from sklearn.impute import IterativeImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, QuantileTransformer

    # One-hot encode the object (categorical) columns; numeric columns pass through.
    # Note: sparse=False was renamed to sparse_output=False in scikit-learn 1.2.
    encoder = make_column_transformer((OneHotEncoder(drop='first', sparse=False), x_train.select_dtypes(include='object').columns), remainder='passthrough', verbose_feature_names_out=False)
    pipe = make_pipeline(encoder, IterativeImputer(random_state=42), MinMaxScaler(), QuantileTransformer(output_distribution='normal', random_state=42))

    # Fit on the training set only, then apply the fitted pipeline to the test set.
    x_train = pd.DataFrame(pipe.fit_transform(x_train), columns=encoder.get_feature_names_out())
    x_test = pd.DataFrame(pipe.transform(x_test), columns=encoder.get_feature_names_out())
  2. Outlier Detection and Removal: Outliers were detected with IsolationForest (an anomaly detector built from an ensemble of randomized trees) and removed from the training set.

    from sklearn.ensemble import IsolationForest

    # Fit on features and target together, so outliers are judged on the full rows.
    data = pd.concat([x_train, y_train], axis=1)
    # fit_predict returns 1 for inliers and -1 for outliers.
    inliers = IsolationForest(random_state=42).fit_predict(data) == 1
    data = data[inliers].reset_index(drop=True)
    x_train = data.iloc[:, :-1]
    y_train = data.iloc[:, -1].astype('int')
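The -1/1 labelling that the filtering step relies on can be seen on synthetic data (the cluster and the two planted outliers below are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense cluster of inliers
               [[8.0, 8.0], [-9.0, 7.5]]])        # two obvious outliers

# fit_predict returns 1 for inliers and -1 for outliers,
# so a boolean mask keeps only the inlier rows.
labels = IsolationForest(random_state=42).fit_predict(X)
X_clean = X[labels == 1]
print(len(X), '->', len(X_clean))
```

With the default `contamination='auto'`, the model decides the outlier fraction itself; some borderline cluster points may be dropped along with the planted outliers.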
  3. Class Rebalancing: Classes 1 and 0 (visitors who did or did not make a purchase) were not equally represented in the training set, with class 0 dominating. The classes were therefore rebalanced by oversampling the minority class with BorderlineSMOTE.

    from imblearn.over_sampling import BorderlineSMOTE

    # BorderlineSMOTE synthesizes new minority-class samples near the class
    # boundary until both classes have equal counts. Applied to the training set only.
    resampler = BorderlineSMOTE(random_state=42)
    x_train, y_train = resampler.fit_resample(x_train, y_train)
  4. Feature Selection: After a number of experiments, it was found that simply removing features with variance lower than 0.3 was optimal.

    A feature with very low variance is nearly constant across samples and carries little information, making it redundant. In this case, a threshold of 0.3 was found to be appropriate.

    from sklearn.feature_selection import VarianceThreshold

    # Fit the selector on the training set, then drop the same columns from the test set.
    selector = VarianceThreshold(threshold=0.3)
    x_train = pd.DataFrame(selector.fit_transform(x_train), columns=selector.get_feature_names_out())
    x_test = pd.DataFrame(selector.transform(x_test), columns=selector.get_feature_names_out())