Understand Feature Engineering for Machine Learning in 5 minutes

Florian Bouron
Artificial Intelligence in Plain English
4 min read · Jul 23, 2021


In this article, we will see what feature engineering is and how you can apply it to your machine learning algorithms.

[Image: Chemist representing feature engineering]

Introduction

Before we move further, we need to define what a feature is in machine learning.

If you are new to machine learning, a feature is simply an input to a machine learning algorithm.

[Image: How machine learning models work]

What is Feature Engineering?

Feature engineering extracts useful features from raw data using math, statistics, and domain knowledge.

For example, if a ratio of two numeric features is important to classifying an instance, then calculating that ratio and including it as a feature may improve the model quality.

This means that if you have two features, the square meters and the price of flats, you might create a new feature, price per square meter, to improve your model.

[Image: Feature Engineering Example]
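As a quick illustration, here is a minimal pandas sketch of that ratio feature; the DataFrame, column names, and values are made up for the example:

```python
import pandas as pd

# Hypothetical flat listings; column names and values are illustrative.
flats = pd.DataFrame({
    "square_meters": [45, 80, 120],
    "price": [180_000, 320_000, 540_000],
})

# Derive the ratio feature described above: price per square meter.
flats["price_per_sqm"] = flats["price"] / flats["square_meters"]
print(flats)
```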

How to do feature engineering?

Let’s look at different feature engineering strategies. This article won’t cover every method, only the most popular ones.

Adding and dropping features:

Let’s assume we have the following features:

[Image: Price of houses]

If we want to predict the price of a flat, the number of plants might be irrelevant. In that case, we should remove this feature from our machine learning model so that it doesn’t add extra noise.

This problem is known as the curse of dimensionality: as the number of features in the data increases, the number of data points required to build a good model grows exponentially.

We need to choose which features are the most relevant to our model.
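Here is a minimal sketch of dropping such a feature with pandas; the data and the number_of_plants column are hypothetical:

```python
import pandas as pd

# Hypothetical house data; "number_of_plants" is the irrelevant column.
houses = pd.DataFrame({
    "square_meters": [45, 80, 120],
    "number_of_rooms": [2, 3, 5],
    "number_of_plants": [4, 0, 7],  # likely just noise for price prediction
    "price": [180_000, 320_000, 540_000],
})

# Drop the feature we believe adds noise rather than signal.
houses = houses.drop(columns=["number_of_plants"])
```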

Combining multiple features into one feature:

[Image: Price of houses]

In the example above, we can see that square meters and square feet are the same data, just in different units. If we feed both to our algorithm, it will have to learn that square meters and square feet are related and actually represent the same feature.

That’s why we need to decide on which measurement to take and keep only one.

We could also have two features, the number of dogs and the number of cats, and combine them into a single number of animals feature.

[Image: Number of animals]
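A minimal pandas sketch of both ideas, with made-up columns and values: dropping a redundant unit and merging two related counts:

```python
import pandas as pd

houses = pd.DataFrame({
    "square_meters": [45, 80],
    "square_feet": [484, 861],  # same information, different unit
    "number_of_dogs": [1, 0],
    "number_of_cats": [0, 2],
})

# Keep a single unit of measurement: drop the redundant column.
houses = houses.drop(columns=["square_feet"])

# Combine two related counts into one more general feature.
houses["number_of_animals"] = houses["number_of_dogs"] + houses["number_of_cats"]
houses = houses.drop(columns=["number_of_dogs", "number_of_cats"])
```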

However, combining features is not always a good idea. In the case of a date feature, for example, the day of the week probably matters on its own.

You need to remember that quality is better than quantity.

Cleaning existing features:

You need to keep only the features that are relevant, so that your model picks up on the right signal in the data.

To do that, you can (see the sketch after this list):

  • Impute missing values.
  • Remove outliers so that the model doesn’t train on data points that aren’t representative.
  • Get rid of mixed scales: for example, if some features are in centimeters and others in meters, convert all of them to centimeters. This is called normalization.
  • Transform skewed data to give the model an easier, more compact distribution to work with (a log transform is a common choice).
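Here is a small pandas/NumPy sketch of these cleaning steps; the data, thresholds, and imputation strategy are invented for illustration, and real choices depend on your domain:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.0, 175.0],
    "distance_m": [1.2, 0.8, 1.5, 30.0],            # 30.0 looks like an outlier
    "income": [30_000, 35_000, 32_000, 1_000_000],  # heavily skewed
})

# Impute missing values with the column median.
data["height_cm"] = data["height_cm"].fillna(data["height_cm"].median())

# Remove an obvious outlier (a simple threshold here; real cutoffs need domain knowledge).
data = data[data["distance_m"] < 10]

# Put everything in the same unit: convert meters to centimeters.
data["distance_cm"] = data["distance_m"] * 100
data = data.drop(columns=["distance_m"])

# Compress a skewed distribution with a log transform.
data["log_income"] = np.log1p(data["income"])
```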

Binning:

Binning is when you take a numerical measurement and convert it into a category.

Here is an example for home sales:

[Image: Home sales]

In that example, we can assume that the sale price depends on whether there is a swimming pool, not on its exact length.

We can then simplify our model by pre-processing the data and replacing the swimming pool length with a boolean feature.

[Image: Swimming Pool boolean]
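A minimal pandas sketch of this binning step, using a hypothetical pool_length_m column:

```python
import pandas as pd

sales = pd.DataFrame({
    "pool_length_m": [0, 8, 0, 12],
    "sale_price": [250_000, 420_000, 260_000, 510_000],
})

# Replace the exact pool length with a boolean feature: does the home have a pool?
sales["has_pool"] = sales["pool_length_m"] > 0
sales = sales.drop(columns=["pool_length_m"])

# For more than two buckets, pd.cut bins a numeric column into labeled ranges.
sales["price_band"] = pd.cut(
    sales["sale_price"],
    bins=[0, 300_000, 450_000, float("inf")],
    labels=["low", "mid", "high"],
)
```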

One-hot encoding:

One-hot encoding is a way to represent categorical data so that a machine learning algorithm can understand it.

Our model understands numbers but not strings, so we need to convert strings to numbers. However, we cannot assign arbitrary numbers to our strings, because the model might give more importance to large numbers than to small ones. That’s why we use one-hot encoding.

Here is an example with home sales:

[Image: One-hot encoding]
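A minimal sketch with pandas’ built-in get_dummies, on a made-up neighborhood column:

```python
import pandas as pd

homes = pd.DataFrame({
    "neighborhood": ["north", "south", "north", "east"],
    "sale_price": [250_000, 310_000, 265_000, 400_000],
})

# pd.get_dummies creates one 0/1 column per category.
homes = pd.get_dummies(homes, columns=["neighborhood"])
print(homes.columns.tolist())
# ['sale_price', 'neighborhood_east', 'neighborhood_north', 'neighborhood_south']
```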

One-hot encoding is useful for replacing categorical data with simple numeric data that the machine learning model will understand.

Summary

Feature engineering will help you to:

  • Solve the right business problems thanks to the right features.
  • Improve the performance of your machine learning algorithm.

I hope you enjoyed the read and if you want to see my next articles, feel free to follow me on Medium.
