Cleaning
There are features like the name of the passenger and the cabin which cannot be used for analysis directly.
The dataset used for the competition consisted of features indicating the amount of time a visitor spent on various pages of the site, personal details of the visitor such as gender, marital status, and education, and the OS and search engine used by the visitor.
The Titanic Survival dataset is simple: for each passenger it contains personal details (name, gender, age, family) and travel details (class, cabin, port of embarkation, ticket fare) as the input features, and whether the passenger survived as the target feature.
Various models were tried for this problem, with the exception of deep neural networks, since TensorFlow and PyTorch were not allowed in the project/competition.
To start with, I will use the RandomForest estimator and see how it does.
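A minimal sketch of such a baseline, assuming scikit-learn's RandomForestClassifier; since the actual cleaned Titanic features are not shown here, synthetic data of a similar size stands in for them.

```python
# Baseline sketch: fit a random forest and estimate accuracy with
# 5-fold cross-validation. The data below is synthetic, standing in
# for the cleaned Titanic features (x) and survival labels (y).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

x, y = make_classification(n_samples=891, n_features=8, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, x, y, cv=5)
print(round(scores.mean(), 3))
```

With real features, the cross-validated score gives a quick sense of how the forest performs before any tuning.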
Let's divide the train data into x and y now that it has been cleaned and preprocessed.
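The split itself is a one-liner with pandas; the small DataFrame below is an illustrative stand-in for the cleaned training data, with `Survived` as the Titanic target column.

```python
# Split the cleaned training DataFrame into features (x) and target (y).
# `train` here is a toy stand-in for the actual cleaned data.
import pandas as pd

train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Survived": [0, 1, 1, 1],
})

x = train.drop(columns=["Survived"])  # all input features
y = train["Survived"]                 # target feature
```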
Since this was my first ever Kaggle competition and machine learning project, I was familiar with and could implement only the basics detailed above. There was a lot more I could have done.
This project was for the course Machine Learning Practice.
This is the beginner, introductory Kaggle competition that every new Kaggle member does. Since I had learned a lot of new techniques at the time, I decided to apply them all to this dataset as practice.
The problem is that every step (cleaning, imputation, encoding, feature engineering, etc.) is done separately, so given a new test sample one cannot directly make a prediction and would have to carry out every step all over again. To solve this, I am going to create a 'preprocessor' class with a transform method that does everything I have done until now, and make a pipeline with this preprocessor as the first step and the trained model clf as the second step.
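The idea can be sketched with scikit-learn's Pipeline and a custom transformer. The Preprocessor class below is illustrative: its fit/transform methods stand in for the actual cleaning, imputation, and encoding steps, and the column names are assumptions based on the Titanic dataset.

```python
# Sketch: bundle the manual preprocessing into one transformer so a raw
# test sample can be fed straight to pipe.predict(). Column names and
# the exact steps are illustrative assumptions.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class Preprocessor(BaseEstimator, TransformerMixin):
    """Performs the cleaning/imputation/encoding done earlier by hand."""

    def fit(self, X, y=None):
        # Learn statistics needed at predict time, e.g. the median age.
        self.age_median_ = X["Age"].median()
        return self

    def transform(self, X):
        X = X.copy()
        X["Age"] = X["Age"].fillna(self.age_median_)        # imputation
        X["Sex"] = X["Sex"].map({"male": 0, "female": 1})   # encoding
        return X[["Pclass", "Sex", "Age", "Fare"]]


# Toy stand-in for the raw training data.
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
y = [0, 1, 1, 0]

pipe = Pipeline([
    ("preprocessor", Preprocessor()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(train, y)
preds = pipe.predict(train)  # raw samples in, predictions out
```

Once fitted, the whole pipeline can be pickled and applied to new raw samples in a single call, which is exactly what the separate-steps workflow could not do.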
The data was first cleaned and preprocessed to handle missing values, categorical features, outliers, class imbalance and redundant features.
Now that we have cleaned the data into an organized format, we can proceed with preprocessing, i.e., imputing missing values, encoding categorical features and scaling the data if required.