Friday, 2 September 2016

Data Preprocessing

Author: Luv Singh  

"Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
Abraham Lincoln

As a data scientist you will spend most of your time cleaning your data and transforming it into a format suitable for the learning algorithm. It is not as glamorous as we may have led you to believe. Just like cooking, the more time you spend preparing your ingredients, the better a chef you become.

Types of data

Machine Learning is used for all kinds of applications: guessing what an image contains, speech recognition, video tagging, natural language processing and number crunching. This might give the impression that multimedia data is fed into an algorithm as-is.

But most algorithms are implemented as matrix manipulations, and even complex multimedia formats are represented as plain numbers in the end.
For image recognition, for example, one would convert the image into a tuple of the RGB numerical values at each pixel.
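The conversion can be sketched in plain Python (the 2x2 "image" below is a made-up example):

```python
# A tiny 2x2 image represented as nested lists of (R, G, B) tuples,
# flattened into the plain numeric vector an algorithm would consume.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

# Flatten rows -> pixels -> channels into one feature vector.
features = [channel for row in image for pixel in row for channel in pixel]

print(features)  # 12 numbers: 3 channels for each of the 4 pixels
```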

Numerical
When the data is a count or measure of something, it can take values potentially over the entire numerical range.
For example: stock prices, or the number of visitors to a store.

Categorical
Akin to enumerated data in code, where the range of values is small and fixed. The numbers themselves signify nothing, and we cannot apply ordinary mathematical operations to them.
For example: representing gender as 0 for male and 1 for female, where it is meaningless to say male < female or male + female = female.

Ordinal
Ordinal data is a special case of Categorical where the numerical values actually carry an ordering.
For example: an Uber cab rating on a scale of 1 to 5, where 1 is less than 5.
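The two encodings can be sketched in plain Python (the rows and column names below are hypothetical): categorical values are one-hot encoded so no false ordering is implied, while ordinal values can be kept as numbers.

```python
# Hypothetical rows with a categorical column and an ordinal column.
rows = [
    {"gender": "male", "rating": 4},
    {"gender": "female", "rating": 1},
]

genders = ["male", "female"]

def one_hot(value):
    """Categorical: one indicator per possible value, no implied order."""
    return [1 if value == g else 0 for g in genders]

# Ordinal: the integer itself is meaningful, so keep it as-is.
encoded = [one_hot(r["gender"]) + [r["rating"]] for r in rows]
print(encoded)  # [[1, 0, 4], [0, 1, 1]]
```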


Feature Scaling
The most commonly used operation is data normalization, or feature scaling. Imagine we are trying to find some pattern relating the income of an individual to the age of that person. Income can range from thousands to millions or more, while age one would expect to be at most 100 or so. Left as-is, the income would overwhelm the age, whereas ideally you would want both features to affect the result comparably, regardless of their raw ranges.
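A minimal min-max scaling sketch in plain Python (the income and age values below are made up); after scaling, both features live in the same [0, 1] range:

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30_000, 120_000, 1_000_000]
ages = [22, 45, 80]

# Both features now span [0, 1] instead of wildly different ranges.
print(min_max_scale(incomes))
print(min_max_scale(ages))
```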

Scaling for Multimedia formats
For a class of problems such as character recognition or face recognition, the image is scaled down or up, or stretched and sheared, to match the dimensions of the training data.
Similarly, audio data might require the same sample rates as training.
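The idea can be sketched with a toy nearest-neighbour resize (a real pipeline would use an image library such as OpenCV or Pillow):

```python
def resize(image, new_h, new_w):
    """Nearest-neighbour resize of a 2-D list to new dimensions."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

img = [[1, 2],
       [3, 4]]                 # 2x2 "image"
print(resize(img, 4, 4))       # stretched to 4x4 to match training dimensions
```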

For a variety of machine learning algorithms, such as SVMs and Neural Networks, it is recommended that the values of each feature in the data have zero mean.
The general method is to compute the mean and standard deviation of each feature, subtract the mean from each value, and then divide by the standard deviation.
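Those two steps can be sketched with the standard library (the feature values below are made up):

```python
import statistics

def standardize(values):
    """Zero-mean, unit-variance scaling (the z-score)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

feature = [10.0, 20.0, 30.0]
z = standardize(feature)
print(z)  # the scaled values have mean 0 and standard deviation 1
```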

Feature Processing and Selection

In a perfect universe, your data would contain more features than you need for your particular problem, and you would use domain expertise to weed out the unnecessary ones. In most cases, including such features has little effect on the result beyond making the learning algorithm run slower, but discarding them also makes it easier to narrow down the cause of errors, if any, and to visualize the data.
Feature selection brings problems of its own. When your data comes from various sources, how do you decide which column in your database is comparable to which column from another data source? They might not use the same scales or the same terminology.
Then there is the complexity of missing data: whether to attach default values to it or just leave it undefined. All of this requires transforming the data from the different sources into a structured, standard format that the learning algorithm can consume.

As the bulk of data being processed and transformed into a more structured form increases, the conventional Extract, Transform, Load (ETL) approach becomes unsuitable. This is where Extract, Load, Transform (ELT) comes to the fore: ELT takes in the raw data as-is and then uses the power of clusters such as Hadoop to transform the raw data into structure.

Visualization of results on a graph or contour map can really help a data scientist refine the model, but we humans are limited to visualizing only 3 dimensions. Therefore, there are techniques, such as Principal Component Analysis (PCA), that can map features of n dimensions down to something that can be visualized on a graph.
PCA also solves another problem. When the number of features is larger, the model requires more data to train and to avoid the problem of overfitting. As PCA reduces the number of features, the amount of data required drops considerably. This is especially true for Neural Networks.
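A minimal PCA sketch using NumPy's eigendecomposition (the 3-feature data points below are made up; in practice one would typically reach for a library implementation such as scikit-learn's PCA):

```python
import numpy as np

# Made-up data: 4 samples, 3 features (the third is nearly 2x the first).
X = np.array([[2.0, 0.1, 4.1],
              [4.0, 0.2, 8.2],
              [6.0, 0.3, 11.9],
              [8.0, 0.4, 16.1]])

Xc = X - X.mean(axis=0)                 # centre each feature (zero mean)
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
top = eigvecs[:, -1]                    # direction of maximum variance
reduced = Xc @ top                      # project: 3 features -> 1 feature
print(reduced.shape)                    # (4,) - one value per sample
```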
Covariance and Correlation of features
Covariance can show whether 2 features are related to each other. For example, at a fixed temperature, the pressure and volume of a gas show a negative covariance.
A word of caution though: in the above case we know from Boyle's Law that an increase in pressure causes a decrease in volume, i.e. there is a causal relation between the features, but this may not always hold. In the earlier example of age and income, the features may show a covariance, but that does not imply one causes the other.
A higher covariance magnitude signifies that the features are related, but covariance has no upper bound. So how does one decide how high is high enough to declare that 2 features are related?

Enter Correlation.
Correlation is a value derived from covariance that always lies between -1 and 1. The closer its magnitude is to 1, the stronger the relationship between the features.
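With NumPy, both quantities are one call each (x and y below are made-up samples with a clear positive linear trend):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 21.0, 29.0, 42.0])

cov = np.cov(x, y)[0, 1]        # unbounded: magnitude depends on the scales
corr = np.corrcoef(x, y)[0, 1]  # always in [-1, 1], easy to interpret
print(cov, corr)
```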

A note on use of covariance
Suppose you are provided data for the pricing of a house: its length, breadth, area, number of rooms, and the mean income of people living in its locality. You might find that data sufficient to write a predictive model for the expected price of a house.
But you run a covariance test and find a strong relation between the length of the house and its area, and also between the breadth of the house and its area.
Indeed this is a simple example, and you would intuitively know that your model can include either the length and breadth, or the area. Including both would be redundant.

In more complicated problems it might not be so straightforward, and domain expertise is needed to make sense of covariance.

Feature Transformation

Instead of using feature values as-is, sometimes a conversion or decomposition is required: for example, separating a timestamp feature into a date and a time, or converting temperature from Fahrenheit to Celsius because the other features use metric units.
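Both transformations can be sketched with the standard library:

```python
from datetime import datetime

def split_timestamp(ts):
    """Decompose an ISO timestamp string into separate date and time features."""
    dt = datetime.fromisoformat(ts)
    return dt.date().isoformat(), dt.time().isoformat()

def fahrenheit_to_celsius(f):
    """Unit conversion so the feature matches the rest of the data."""
    return (f - 32) * 5 / 9

print(split_timestamp("2016-09-02T14:30:00"))  # ('2016-09-02', '14:30:00')
print(fahrenheit_to_celsius(212))              # 100.0
```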

A case study : Character recognition

Neural networks are inspired by how the human brain works, but one place they differ from humans is the sheer size of the training data needed; the human brain manages without anything of comparable size, especially in the case of pattern recognition.
There are plenty of implementations out there that solve this problem using a huge amount of data and achieve impressive accuracy, but my personal motivation in this problem is to use as little data as possible and still get reasonable accuracy.

Suppose my training data looks like this:

The accuracy of a neural network trained on this data seems to take a heavy beating if the stroke is wider. I am currently experimenting with transforming a stroke of any width into something comparable to my training data, and then watching the effect on the results.
The implementation at the time of this blog is very basic and is just there to illustrate the concept of data preprocessing; in a future post I will improve upon it and also add the recognition network to get a better idea.

There will be another post dealing with the data collection itself, which is more relevant in the context of interpreting results; for now, feel free to play around with it here.

To know more about how we can offer consultation in Machine Learning, write to me at:



