I am in awe of Scikit-learn. I have always used scikit-learn for data modeling and metrics, but I had no idea how diverse and useful this library can be! So, this post is all about admiring and praising Scikit-learn.
What is so special about scikit-learn?
Now, if you are the skeptical kind and have not yet explored its possibilities, you must be thinking, 'What is all this fuss about?' So, let me tell you my reasons for this obsession.
- Scikit-learn covers all the aspects of machine learning. Trust me!
- A very detailed documentation
- Easy to use
Still not convinced? Well then, it's time to demonstrate its versatility.
Scikit-learn & Data:
Data is the essence of any machine learning algorithm, right? Now, real-world data is messy. At times, we may need to scrape websites, collect data through surveys, pull tables from ERP modules, or just build data slowly over time. But sometimes we simply need data to learn, practice, or explore machine learning algorithms. For that purpose, scikit-learn has an easy-to-apply solution.
Built-in Dataset:
The scikit-learn library has many built-in datasets that are simple and popular for learning. So, we just need to import the datasets module, and voila! Here is an example:
import sklearn.datasets as datasets

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, use a dataset such as datasets.load_diabetes() instead.
house_dtls = datasets.load_boston()
print(house_dtls.data.shape)
print(house_dtls.feature_names)
Real World Dataset:
Scikit-learn also provides tools to load larger datasets, downloading them if necessary. For example:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
print(newsgroups_train.data[:1])
Generate a Dataset:
Also, Scikit-learn can help in generating an artificial dataset for classification, clustering, or regression problems. Now, that is amazing, right?
Classification:
X, y = datasets.make_classification(n_features=20,
                                    n_samples=100,
                                    n_redundant=0,
                                    n_informative=5,
                                    n_clusters_per_class=1)
print("The data X shape is {}".format(X.shape))
print("The data y shape is {}".format(y.shape))
The above piece of code creates a classification dataset with 100 records and 20 features, and the output variable is binary. There is also a method to create a multilabel classification dataset. Let's take a quick look.
A, b = datasets.make_multilabel_classification(n_classes=3,
                                               allow_unlabeled=True,
                                               random_state=1)
print("The data X shape is {}".format(A.shape))
print("The data y shape is {}".format(b.shape))
Regression:
We can also generate a regression dataset using the make_regression method, just as we did for classification. Here is an example:
X, y = datasets.make_regression(n_features=1, n_informative=1)
print("The regression data X shape is {}".format(X.shape))
print("The regression data y shape is {}".format(y.shape))
Scikit-learn & Data Preprocessing
Now that we have the data, it is time to clean, scale, or normalize it, a step known as data preprocessing. It is a very important step in machine learning, since data doesn't come in a ready-to-use pack. The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for building a machine learning model. So, let's take a look at how we can use sklearn for data preprocessing.
Working with numerical features
Min-Max Scaler
If numerical features vary widely in range, that can be a problem, since many machine learning algorithms use Euclidean distance to measure how far apart two data points are, and features with larger ranges dominate that distance. So, we need to make sure the features are on the same scale. One solution is MinMaxScaler, which rescales each feature as follows:
x_scaled = (xi - min(x)) / (max(x) - min(x))
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 72], [-0.5, -6], [90, 10], [10, 188]]
scaler = MinMaxScaler()
print("Fit: ", scaler.fit(data))
print("Max: ", scaler.data_max_)
print("Transform: ", scaler.transform(data))
StandardScaler
Now, for StandardScaler, the mean and standard deviation are calculated on the feature we want to scale and then the following scaling function can be applied.
x_scaled = (xi - mean(x)) / stdev(x)
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print("Fit: ", scaler.fit(data))
print("Mean: ", scaler.mean_)
print("Transform: ", scaler.transform(data))
Normalizer
Another important scaling utility is Normalizer. It works on rows, not columns: it rescales each sample independently of the other samples so that its norm (l1, l2, or max) equals 1. With the default l2 norm, this means that if each element of a row were squared and summed, the total would equal 1.
from sklearn.preprocessing import Normalizer

X = [[4, 1, 2, 2], [1, 3, 9, 3], [5, 7, 5, 1]]
transformer = Normalizer().fit(X)
print(transformer.transform(X))
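To see that unit-norm claim in action, here is a quick check of my own (not part of the original snippet): square the transformed values and sum them row by row, and each row should come out to roughly 1.

import numpy as np
from sklearn.preprocessing import Normalizer

X = [[4, 1, 2, 2], [1, 3, 9, 3], [5, 7, 5, 1]]
X_norm = Normalizer().fit_transform(X)  # the default norm is 'l2'
# with the l2 norm, each row's squared elements sum to (approximately) 1
print(np.sum(X_norm ** 2, axis=1))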
Working with categorical features
There are plenty of scenarios where a feature is not numerical: for instance, the gender of a person, a Yes/No flag, or a multi-class category.
Now, to build a model, we convert the categorical features to integers. Scikit-learn gives us a few options to do that.
OrdinalEncoder: This estimator transforms each categorical feature into one new feature of integers (0 to n_categories – 1)
import sklearn.preprocessing as preprocessing

enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
print("The original data")
print(X)
print("The transformed data using OrdinalEncoder")
print(enc.transform([['female', 'from US', 'uses Safari']]))
OneHotEncoder: So, this estimator transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them being 1, and all others 0.
import sklearn.preprocessing as preprocessing

enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
print("The original data")
print(X)
print("The transformed data using OneHotEncoder")
print(enc.transform([['female', 'from US', 'uses Safari'],
                     ['male', 'from Europe', 'uses Safari']]).toarray())
LabelEncoder: Now, this estimator can help encode labels with a value between zero and n_classes-1.
import sklearn.preprocessing as preprocessing
import numpy as np

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"])
labelenc = preprocessing.LabelEncoder()
labelenc.fit(targets)
targets_trans = labelenc.transform(targets)
print("The original data")
print(targets)
print("The transformed data using LabelEncoder")
print(targets_trans)
Sklearn & Feature Selection
Well, feature selection is a very important step in machine learning. When we get a dataset in tabular form, each and every column is a feature, right? But the question is: which features are actually relevant?
What if we just used all the features without any feature engineering? That may seem like the easier option, but feature selection is important for the following reasons:
- To reduce training time
- To make the model simpler and easier to interpret
- To reduce overfitting and mitigate the curse of dimensionality
There are a couple of ways to do feature selection, and scikit-learn covers them all. Let's take a quick look!
Select k-best features
SelectKBest selects the k best features based on some metric. The core idea is to compute a score between the target and each feature, sort the features by that score, and finally keep the k best ones.
The code below selects the 20 best features using the chi-squared (chi2) score.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)
print("Original Features Count: ", X.shape[1])
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
print("Features count after using SelectKBest: ", X_new.shape[1])
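If you also want to know which columns survived, the fitted selector exposes a boolean mask and the per-feature scores. Here is a small follow-up sketch of my own, not from the original code:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)
selector = SelectKBest(chi2, k=20).fit(X, y)
# get_support() returns a boolean mask of the columns that were kept,
# and scores_ holds the chi2 score of every original feature
print("Selected feature indices: ", np.where(selector.get_support())[0])
print("Their chi2 scores: ", selector.scores_[selector.get_support()])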
VarianceThreshold
Scikit-learn provides VarianceThreshold to remove features with low variance. Now, what does that even mean? It means that if a feature shows very little variety in its values, it will not contribute much to predicting the target. In the example below, Feature3 has the same value for every record; in other words, the variance of Feature3 is zero.
| Feature1 | Feature2 | Feature3 |
| --- | --- | --- |
| 1 | 4 | 5 |
| 79 | 2 | 5 |
| 65 | 0 | 5 |
| 210 | 12 | 5 |
| 32 | 6 | 5 |
Now, let’s see an example code:
import sklearn.feature_selection as fs
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)
print("The original data")
print(X)
print("The processed data by variance threshold")
print(X_trans)
Scikit-learn & Feature Extraction:
Feature extraction focuses on how to extract numerical features from text and images. Raw data is a sequence of symbols and cannot be fed directly to the algorithms, since most of them expect numerical feature vectors.
Analyzing Text:
As expected, Scikit-learn provides utilities for the most common ways to extract numerical features from text content. They are:
- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents (the TF-IDF idea; see the sketch after the CountVectorizer example below).
Now, this process of converting the raw text documents into numerical feature vectors is called vectorization.
from sklearn.feature_extraction.text import CountVectorizer

counterVec = CountVectorizer()
# corpus is a list of strings in this example
corpus = [
    "I have an apple.",
    "The apple is red",
    "I like the apple",
    "Apple is nutritious",
]
counterVec.fit(corpus)
print("Get all the feature names of this corpus")
# note: on scikit-learn 1.2+ use get_feature_names_out() instead of get_feature_names()
print(counterVec.get_feature_names())
print("The number of features is {}".format(len(counterVec.get_feature_names())))
# corpus_data is a sparse matrix of token counts
corpus_data = counterVec.transform(corpus)
print("The transformed data's shape is {}".format(corpus_data.toarray().shape))
print(corpus_data.toarray())
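CountVectorizer covers the first two steps from the list above. For the third step, down-weighting tokens that show up in most documents, scikit-learn offers TfidfVectorizer. Here is a minimal sketch of my own on the same toy corpus (get_feature_names_out requires a reasonably recent scikit-learn release):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I have an apple.",
    "The apple is red",
    "I like the apple",
    "Apple is nutritious",
]
tfidfVec = TfidfVectorizer()
corpus_tfidf = tfidfVec.fit_transform(corpus)
# 'apple' occurs in every document, so it gets the lowest idf of all tokens
print(tfidfVec.get_feature_names_out())
print(corpus_tfidf.toarray().round(2))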
Analyzing Image:
There is no exact definition of the features of an image, but things like shape, size, and orientation constitute the features of an image, and extracting them can be done using scikit-learn.
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis.
import numpy as np
from sklearn.feature_extraction import image

one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
one_image[:, :, 0]  # R channel of a fake RGB picture
patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2, random_state=0)
print(patches.shape)
print(patches[:, :, :, 0])
patches = image.extract_patches_2d(one_image, (2, 2))
print(patches.shape)
print(patches[4, :, :, 0])
Conclusion:
So far, I have been talking about the scikit-learn features that I find important and useful. Apart from these, scikit-learn is widely used to train different machine learning models and to evaluate them. Well, that can be a talk for another day.
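Still, just to give a flavor of what that looks like, here is a minimal, purely illustrative train-and-evaluate sketch of my own using the built-in iris dataset; none of the choices below come from the sections above.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy: ", accuracy_score(y_test, model.predict(X_test)))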
In a nutshell, it is good to know that if you are struggling with a dataset and are not sure where to start, just take a look at the scikit-learn documentation. I am sure it will have something useful to offer.
Also, I have uploaded the code that I ran above to my GitHub. You can take a look if any of the above code doesn't work for you.
So, are you now convinced that Scikit-learn is a one-stop solution for machine learning in Python? Not yet? Do I need to cite some more evidence? Okay, I will be back soon with some projects where sklearn plays a major role.
For today, thank you for reading!