Of Pipelines, Cross Validation and Hyperparameter Tuning - A tutorial using Scikit-Learn Part 1

Divine Saungweme explores pipelines, cross validation and hyperparameter tuning in SKLearn.

Hello, Data Scientists

In this episode, I’ll be demonstrating how powerful three Scikit-Learn features are and how we can use them to prepare data, choose models and fine-tune the chosen model without breaking a sweat, thanks to their simplicity and automation.

The features are:

- Transformation Pipelines
- Cross Validation
- And last but not least, Hyperparameter Tuning with Grid Search and Randomized Search

We will discover how these features can help us and why they are really worth putting in your data science toolkit.

We will be using the Titanic dataset from Kaggle Competitions (hope this episode won’t be a disaster like the Titanic :)

Before we begin: if you are using an older version of Scikit-Learn you may have problems importing some packages. In this tutorial, I am using version ‘0.21.3’.

You can use the following piece of code to see the version

import sklearn
sklearn.__version__
'0.21.3'

Let’s start with the common imports

import numpy as np
import pandas as pd
np.random.seed(42)  # To get a stable output across all runs

With the Titanic dataset, we have to create a model that predicts which passengers survived the Titanic shipwreck. In other words, we have to predict what sort of people were likely to survive using the passenger information, e.g. name, gender, passenger class, etc. So, now let’s load the data…

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

The data has already been split into Training and Testing data

Since this is a Kaggle Competition dataset, there are no labels in the test data (no ‘Survived’ attribute). We will just compile our predictions into a csv file (in line with Kaggle’s formatting), upload the csv file to Kaggle and see our final score.

train_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Some insight about the data

The attributes have the following meaning:

* Survived: that’s the target, 0 means the passenger did not survive, while 1 means he/she survived.
* Pclass: passenger class.
* Name, Sex, Age: self-explanatory
* SibSp: how many siblings & spouses of the passenger were aboard the Titanic.
* Parch: how many children & parents of the passenger were aboard the Titanic.
* Ticket: ticket id
* Fare: price paid (in pounds)
* Cabin: passenger’s cabin number
* Embarked: where the passenger boarded the Titanic

Let’s check for any missing data

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

As we can see, the Age, Cabin and Embarked attributes have some null values, with Cabin having the most.

We will not be using Cabin in this tutorial and we will also not use the Name and Ticket attributes.

We can easily use the Age and the Embarked attributes so we will transform them later.
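
If you prefer explicit counts, the following one-liner (a small sketch, not part of the original walkthrough) complements the info() output above:

# Count the missing values per column, with the most affected columns first
train_data.isnull().sum().sort_values(ascending=False)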

Let’s take a sneak peek at the numerical attributes

train_data.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Let’s also take a sneak peek at the categorical attributes

train_data['Survived'].value_counts()
0    549
1    342
Name: Survived, dtype: int64
train_data['Pclass'].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
train_data['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64
train_data['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.

train_data_copy = train_data.copy()
# Use the "copy()" function call to avoid changing the original data (which in this case is "train_data")
# We will poke around with transformations on the copy of train_data and see what we can achieve

train_data_copy.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Let’s start using Transformations.

Using the “so-called” regular way we would need to deal with the NaN values first

from sklearn.base import BaseEstimator, TransformerMixin

# Inspired from stackoverflow.com/questions/25239958
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)

The class transforms both numerical and categorical (object) data. It replaces the NaN values in the data with the most frequent value of that attribute. For example, if we had NaN values in the Sex attribute, we would replace them with the most frequent value in the Sex attribute (which in this case is male).
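
As a quick sanity check, here is a small sketch (the demo_imputer name is just for illustration) of what the fitted imputer learns, using the Sex and Embarked value counts we saw above:

# A small sketch: the fitted most_frequent_ Series maps each column to its most frequent value
demo_imputer = MostFrequentImputer().fit(train_data[["Sex", "Embarked"]])
demo_imputer.most_frequent_  # Sex -> 'male', Embarked -> 'S'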

We will use our transformer class (MostFrequentImputer) on the categorical attributes and SimpleImputer on the numerical attributes. Let’s start with the numerical attributes

from sklearn.impute import SimpleImputer
impute = SimpleImputer(strategy='median')
# Setting strategy to median indicates that we want to convert the NaN values to the median of the numerical attribute
impute.fit(train_data_copy[["Age"]])
train_data_copy[["Age"]] = impute.transform(train_data_copy[["Age"]])
train_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

As we can see, our imputed attribute (Age) now has 891 non-null values; the NaN values in the attribute have been wiped out of existence.

Now, let’s impute the categorical attributes with our transformer class

cat_impute = MostFrequentImputer()
cat_impute.fit(train_data_copy[["Embarked"]])
train_data_copy[["Embarked"]] = cat_impute.transform(train_data_copy[["Embarked"]])
train_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Well, what do you know? Now, only the Cabin attribute is left (because we didn’t include it in our transformation). We are not going to use it, along with Name and Ticket, so let’s just drop these attributes

train_data_copy.drop(['Name', 'Cabin', 'Ticket'], axis=1, inplace=True)

Phew!! We have finally got rid of the NaN values, so what’s next? We want to scale the numerical attributes (some algorithms work better with scaled data). For the categorical attributes, we have to convert the categories from strings into usable numbers.

from sklearn.preprocessing import StandardScaler

num_attribs = ["Age", "SibSp", "Parch", "Fare", "Pclass"]
# Pclass is a category but it's already numerical, so scaling it is a better idea than label-encoding it in this case
scaler = StandardScaler()
train_data_copy[num_attribs] = scaler.fit_transform(train_data_copy[num_attribs])
train_data_copy.head()
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 0.827377 male -0.565736 0.432793 -0.473674 -0.502445 S
1 2 1 -1.566107 female 0.663861 0.432793 -0.473674 0.786845 C
2 3 1 0.827377 female -0.258337 -0.474545 -0.473674 -0.488854 S
3 4 1 -1.566107 female 0.433312 0.432793 -0.473674 0.420730 S
4 5 0 0.827377 male 0.433312 -0.474545 -0.473674 -0.486337 S

The selected attributes have been scaled, so our models can now predict much better. What’s left is attaching numerical labels to our categorical attributes

from sklearn.preprocessing import LabelEncoder

labeler = LabelEncoder()
for cats in ["Sex", "Embarked"]: # cats is just short for categories, not actual cats ;)
    train_data_copy[cats] = labeler.fit_transform(train_data_copy[cats])
train_data_copy.head()
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 0.827377 1 -0.565736 0.432793 -0.473674 -0.502445 2
1 2 1 -1.566107 0 0.663861 0.432793 -0.473674 0.786845 0
2 3 1 0.827377 0 -0.258337 -0.474545 -0.473674 -0.488854 2
3 4 1 -1.566107 0 0.433312 0.432793 -0.473674 0.420730 2
4 5 0 0.827377 1 0.433312 -0.474545 -0.473674 -0.486337 2

Finally, we have finished preparing our data. What a lot of tiring work that was! Good news: you don’t have to tire yourself with this tedious technique.

Pipelines coming to the rescue…

Why should I care about Pipelines?

As we have seen (and coded as well), we have many transformations that need to be executed in the right order. For example, we cannot scale data whilst we still have NaN values in it (you end up getting many frustrating errors).

Pipelines take all the transformations and bind them together so that the data is prepared/transformed in the right order. All we have to do is specify the order of execution by putting the transformation steps in the right order.

So how do we go about setting up these so-called Pipelines? Let’s dive right into the Pipeline…

from sklearn.pipeline import Pipeline # how ironic, importing Pipeline from pipeline ;)
from sklearn.preprocessing import OneHotEncoder
# In this one we will use OneHotEncoder instead of LabelEncoder as OneHotEncoder tends to do a better job than LabelEncoder

# We will make 2 Pipelines, one for Numerical Attributes and the other for Categorical Attributes

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    # The name "imputer" can be set to any string e.g "impute" or "whatever"
    ('scaler', StandardScaler())
])

# I had already imported these packages before so there's no need for repetition

cat_pipeline = Pipeline([
    ('imputer', MostFrequentImputer()), # from the class we had made earlier
    ('encoding', OneHotEncoder(sparse=False))
    # Setting sparse to False prevents the OneHotEncoder from returning a Scipy sparse matrix
])

There is an even faster way of setting up a Pipeline. Sometimes we have no need to name each preprocessing step. Using “make_pipeline”, we can save time, although we sacrifice some features a general Pipeline offers. You can use the code below to set up this quicker Pipeline (although we are not going to use it, we will just stick to the general Pipeline).

from sklearn.pipeline import make_pipeline

num_small_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_small_pipeline = make_pipeline(MostFrequentImputer(), OneHotEncoder(sparse=False))

# It's easy as that

More about OneHotEncoder

A OneHotEncoder creates binary columns (attributes) from a category

What does that mean??

For example, take the Embarked attribute. Its categories are S, C and Q.

We get 3 columns for this attribute (because it has the 3 categories stated earlier)

If the passenger embarked from S, the S column will have 1 (making it hot and also representing True) and the other columns (C and Q) will have 0 (making them cold and also representing False)

If the passenger embarked from C, the C column will have 1 (making it hot and also representing True) and the other columns (S and Q) will have 0 (making them cold and also representing False)

…..

So we will have extra columns..
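
To make this concrete, here is a small sketch (the variable names are just for illustration) that one-hot encodes the Embarked column on its own:

# Sketch: impute Embarked first (it has two NaN values), then one-hot encode it
embarked_imputed = MostFrequentImputer().fit_transform(train_data[["Embarked"]])
embarked_encoder = OneHotEncoder(sparse=False)
embarked_1hot = embarked_encoder.fit_transform(embarked_imputed)
embarked_encoder.categories_  # the learned categories: 'C', 'Q' and 'S'
embarked_1hot[:3]             # one binary column per port, with a single 1 per row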

# So we must separate the train_data into 'Data' and 'Target'
# Only the 'Data' needs to be transformed with our Pipeline

train_set = train_data.drop('Survived', axis=1)
y_train = train_data['Survived']

Now that we have made our Pipelines, we will combine them with ColumnTransformer

from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
    ('num_pipeline', num_pipeline, ["Age", "SibSp", "Parch", "Fare"]),
    # We just choose the numerical attributes we want
    ('cat_pipeline', cat_pipeline, ["Pclass", "Sex", "Embarked"])
    # Here, we just choose the categorical attributes we want, and here Pclass works better as a category than as a number
])

X_train = full_pipeline.fit_transform(train_set)

# The transformation returns our data as a numpy array
# Only the attributes of numbers and categories that we have specified in the full_pipeline (which are just 7) will
# be present in the data so there is really no need of dropping attributes like we did before in our "Regular
# Transformation" detour because they have been automatically dropped.

X_train[:5] # Our numpy array's head, similar to Pandas' .head() function call :)
# Voila...
array([[-0.56573646,  0.43279337, -0.47367361, -0.50244517,  0.        ,
         0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  1.        ],
       [ 0.66386103,  0.43279337, -0.47367361,  0.78684529,  1.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
         0.        ,  0.        ],
       [-0.25833709, -0.4745452 , -0.47367361, -0.48885426,  0.        ,
         0.        ,  1.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ],
       [ 0.4333115 ,  0.43279337, -0.47367361,  0.42073024,  1.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ],
       [ 0.4333115 , -0.4745452 , -0.47367361, -0.48633742,  0.        ,
         0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  1.        ]])
# As you saw, Pipelines are much more manageable than the whole transformation process we had earlier, and they
# also take less time to set up

# Let's see how our data looks like as a DataFrame

# Pclass: 1, 2, 3
# Sex: Female, Male
# Embarked: S,C, Q

columns = ["Age", "SibSp", "Parch", "Fare", 'Pclass-1', 'Pclass-2', 'Pclass-3', 
           'Sex-Female', 'Sex-Male', 'Embarked-C', 'Embarked-Q', 'Embarked-S']

pd.DataFrame(X_train, columns=columns).head()
Age SibSp Parch Fare Pclass-1 Pclass-2 Pclass-3 Sex-Female Sex-Male Embarked-C Embarked-Q Embarked-S
0 -0.565736 0.432793 -0.473674 -0.502445 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0
1 0.663861 0.432793 -0.473674 0.786845 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
2 -0.258337 -0.474545 -0.473674 -0.488854 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0
3 0.433312 0.432793 -0.473674 0.420730 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 0.433312 -0.474545 -0.473674 -0.486337 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0
# We will start by using the Stochastic Gradient Descent Classifier
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
X_test = full_pipeline.transform(test_data)
predictions = sgd_clf.predict(X_test)
# We compile our predictions into a csv file in Kaggle's expected format (a PassengerId column and a Survived column)
predictions_df = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': predictions})
predictions_df.to_csv('predictions.csv', index=False)

# Then you can submit the csv file

But, Wait a minute!!……

How can we get an idea of how our model performs? What if it turns out to be awful? Luckily, we don’t have to rely on guesswork.

Let’s try predicting the training data

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt_clf = DecisionTreeClassifier().fit(X_train, y_train)
dt_pred = dt_clf.predict(X_train)
accuracy_score(y_train, dt_pred)
0.9797979797979798

Before you smile in satisfaction, know that those numbers are biased. The model is simply overfitting.

Predicting on the training data mostly results in models overfitting it (as we have seen). So how can we get a reliable score without worrying about overfitting?

We can use Cross Validation

def scores(score):
    print(score)
    print(f'\nMean: {score.mean()}')
    print(f'\nStandard Deviation: {score.std()}')
    print(f'\nMaximum: {score.max()}')
    print(f'Minimum: {score.min()}')
from sklearn.model_selection import cross_val_score

cvs = cross_val_score(sgd_clf, X_train, y_train, scoring='accuracy', cv=10)

scores(cvs)
[0.76666667 0.72222222 0.7752809  0.86516854 0.7752809  0.73033708
 0.75280899 0.78651685 0.80898876 0.78409091]

Mean: 0.7767361820451708

Standard Deviation: 0.03848773108423685

Maximum: 0.8651685393258427
Minimum: 0.7222222222222222

So what just happened? You may be wondering how we got these scores.

The cross_val_score randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.

The “cv” parameter determines how many folds we want our cross_val_score function to have (which in this case is 10).

NB. The cv should be greater than 1. cv > 1

The scoring parameter determines what kind of score we want to get. Setting it to:

- “accuracy” - gives us the accuracy score
- “precision” - gives us the precision score
- “recall” - gives us the recall score and
- “f1” - gives us the f1 score… (a small sketch follows)
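
For instance, swapping the metric is just a matter of changing the scoring string. Here is a small sketch using the classifier and data from above:

# Same classifier, same 10 folds, different metric (here the F1 score)
f1_scores = cross_val_score(sgd_clf, X_train, y_train, scoring='f1', cv=10)
f1_scores.mean()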

Under the hood, cross_val_score uses StratifiedKFold for classifiers. This is how it looks when we use StratifiedKFold directly.

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=10, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train[test_index])

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
0.7666666666666667
0.7222222222222222
0.7752808988764045
0.8651685393258427
0.7752808988764045
0.7303370786516854
0.7528089887640449
0.7865168539325843
0.8089887640449438
0.7840909090909091

If we compare the results, they are pretty much the same, but using StratifiedKFold is a lot more work than using cross_val_score.

Cross validation can be used to see how well our model is able to generalize. If you do not have enough data to populate both the train and test sets, you can definitely use Cross Validation.

# And if you would like to get some predictions so that you can compare them with the y_train set you can simply:

from sklearn.model_selection import cross_val_predict

predictions = cross_val_predict(sgd_clf, X_train, y_train, cv=10)
accuracy_score(y_train, predictions)
0.77665544332211
# If you would like to get some Decision Functions you can simply:

predictions = cross_val_predict(sgd_clf, X_train, y_train, cv=10, method='decision_function')
predictions[:5]
array([-2.80887674,  2.61659364, -0.13381379,  1.82152503, -2.71375558])

If you are using a model that supports “prediction probabilities”, you can simply set the method parameter to “predict_proba”.
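
Here is a small sketch, assuming a probabilistic model such as a RandomForestClassifier (the proba_clf name is just for illustration):

# cross_val_predict with method='predict_proba' returns one probability per class
from sklearn.ensemble import RandomForestClassifier

proba_clf = RandomForestClassifier(n_estimators=10, random_state=42)
probas = cross_val_predict(proba_clf, X_train, y_train, cv=10, method='predict_proba')
probas[:5]  # each row holds [P(not survived), P(survived)]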

# Let's try another model
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
cross_val_score(knn_clf, X_train, y_train, scoring='accuracy', cv=10).mean()
0.8069611848825332

As it turns out, the KNeighborsClassifier did a better job than the SGDClassifier, so it’s promising

Let’s just try one last model

# The other model
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
cross_val_score(rnd_clf, X_train, y_train, scoring='accuracy', cv=10).mean()
0.8115690614005221

The Random Forest Classifier did better than the other two, but how can we optimise it whilst also preventing overfitting?

We can tweak the hyperparameters and see which ones get us somewhere…

# A very simple way of tweaking the n_neighbors parameter in KNeighborsClassifier is this way:

for number in range(1, 12):
    knn_looped = KNeighborsClassifier(n_neighbors=number).fit(X_train, y_train)
    score = cross_val_score(knn_looped, X_train, y_train, scoring='accuracy', cv=10).mean()
    print(number, score)
1 0.7554752581999773
2 0.7958001361933945
3 0.7980603790716151
4 0.8002698331630915
5 0.80697366927704
6 0.809170355237771
7 0.8058631256384066
8 0.7991212688684599
9 0.7946646237657474
10 0.8069611848825332
11 0.7980101577573487

So, as we can see, setting n_neighbors to 6 tends to give us a higher accuracy score

But what can we do if we have a lot of hyperparameters in the model that we need to test out and we need a more organised way of doing it?

We can use GridSearchCV

from sklearn.model_selection import GridSearchCV

params = [
    {'n_estimators': [100, 200, 300], 'max_features': [8, 9, 10, 11]}
]
# These are the parameters that we will plug into the Random Forest model; Grid Search tests each and every combination
# So we have 12 combinations...

rnd_clf = RandomForestClassifier(random_state=42)
grid_rnd_clf = GridSearchCV(rnd_clf, params, cv=3, return_train_score=True, scoring='accuracy', verbose=2)

# The grid search takes the algorithm, parameters, folds/cv (number of trainings)

# The grid search undergoes cross validation, similar with the cross_val_score that we talked about earlier, it goes on...
# ... cross validating each and every combination we assigned it to

# So with the number of folds as 3, we can conclude that we will have 36 runs
# {(3 n_estimators) * (4 max_features) * (3 cv)} => 3 * 4 * 3 => 36

# 'verbose' gives us details of the runs, such as the time taken, etc, increasing the value increases the details...

grid_rnd_clf.fit(X_train, y_train)

# Fitting our Grid Search model may take some time, maybe a few seconds
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] max_features=8, n_estimators=100 ................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ................. max_features=8, n_estimators=100, total=   0.2s
[CV] max_features=8, n_estimators=100 ................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[CV] ................. max_features=8, n_estimators=100, total=   0.2s
[CV] max_features=8, n_estimators=100 ................................
[CV] ................. max_features=8, n_estimators=100, total=   0.2s
[CV] max_features=8, n_estimators=200 ................................
[CV] ................. max_features=8, n_estimators=200, total=   0.4s
[CV] max_features=8, n_estimators=200 ................................
[CV] ................. max_features=8, n_estimators=200, total=   0.4s
[CV] max_features=8, n_estimators=200 ................................
[CV] ................. max_features=8, n_estimators=200, total=   0.5s
[CV] max_features=8, n_estimators=300 ................................
[CV] ................. max_features=8, n_estimators=300, total=   0.7s
[CV] max_features=8, n_estimators=300 ................................
[CV] ................. max_features=8, n_estimators=300, total=   0.7s
[CV] max_features=8, n_estimators=300 ................................
[CV] ................. max_features=8, n_estimators=300, total=   0.7s
[CV] max_features=9, n_estimators=100 ................................
[CV] ................. max_features=9, n_estimators=100, total=   0.2s
[CV] max_features=9, n_estimators=100 ................................
[CV] ................. max_features=9, n_estimators=100, total=   0.2s
[CV] max_features=9, n_estimators=100 ................................
[CV] ................. max_features=9, n_estimators=100, total=   0.2s
[CV] max_features=9, n_estimators=200 ................................
[CV] ................. max_features=9, n_estimators=200, total=   0.5s
[CV] max_features=9, n_estimators=200 ................................
[CV] ................. max_features=9, n_estimators=200, total=   0.6s
[CV] max_features=9, n_estimators=200 ................................
[CV] ................. max_features=9, n_estimators=200, total=   0.5s
[CV] max_features=9, n_estimators=300 ................................
[CV] ................. max_features=9, n_estimators=300, total=   0.7s
[CV] max_features=9, n_estimators=300 ................................
[CV] ................. max_features=9, n_estimators=300, total=   0.7s
[CV] max_features=9, n_estimators=300 ................................
[CV] ................. max_features=9, n_estimators=300, total=   0.7s
[CV] max_features=10, n_estimators=100 ...............................
[CV] ................ max_features=10, n_estimators=100, total=   0.3s
[CV] max_features=10, n_estimators=100 ...............................
[CV] ................ max_features=10, n_estimators=100, total=   0.2s
[CV] max_features=10, n_estimators=100 ...............................
[CV] ................ max_features=10, n_estimators=100, total=   0.3s
[CV] max_features=10, n_estimators=200 ...............................
[CV] ................ max_features=10, n_estimators=200, total=   0.5s
[CV] max_features=10, n_estimators=200 ...............................
[CV] ................ max_features=10, n_estimators=200, total=   0.6s
[CV] max_features=10, n_estimators=200 ...............................
[CV] ................ max_features=10, n_estimators=200, total=   0.5s
[CV] max_features=10, n_estimators=300 ...............................
[CV] ................ max_features=10, n_estimators=300, total=   0.9s
[CV] max_features=10, n_estimators=300 ...............................
[CV] ................ max_features=10, n_estimators=300, total=   0.8s
[CV] max_features=10, n_estimators=300 ...............................
[CV] ................ max_features=10, n_estimators=300, total=   0.7s
[CV] max_features=11, n_estimators=100 ...............................
[CV] ................ max_features=11, n_estimators=100, total=   0.2s
[CV] max_features=11, n_estimators=100 ...............................
[CV] ................ max_features=11, n_estimators=100, total=   0.2s
[CV] max_features=11, n_estimators=100 ...............................
[CV] ................ max_features=11, n_estimators=100, total=   0.3s
[CV] max_features=11, n_estimators=200 ...............................
[CV] ................ max_features=11, n_estimators=200, total=   0.5s
[CV] max_features=11, n_estimators=200 ...............................
[CV] ................ max_features=11, n_estimators=200, total=   0.6s
[CV] max_features=11, n_estimators=200 ...............................
[CV] ................ max_features=11, n_estimators=200, total=   0.5s
[CV] max_features=11, n_estimators=300 ...............................
[CV] ................ max_features=11, n_estimators=300, total=   0.9s
[CV] max_features=11, n_estimators=300 ...............................
[CV] ................ max_features=11, n_estimators=300, total=   1.0s
[CV] max_features=11, n_estimators=300 ...............................
[CV] ................ max_features=11, n_estimators=300, total=   0.8s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:   20.3s finished
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid=[{'max_features': [8, 9, 10, 11],
                          'n_estimators': [100, 200, 300]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='accuracy', verbose=2)
grid_rnd_clf.best_score_
0.8092031425364759
# Let's see the best combinations/parameters
grid_rnd_clf.best_params_
{'max_features': 10, 'n_estimators': 200}
# Let's see the whole parameters and their scores
data = grid_rnd_clf.cv_results_
for a, b in zip(data['mean_test_score'], data['params']):
    print(a, b)
0.8024691358024691 {'max_features': 8, 'n_estimators': 100}
0.8035914702581369 {'max_features': 8, 'n_estimators': 200}
0.8013468013468014 {'max_features': 8, 'n_estimators': 300}
0.8013468013468014 {'max_features': 9, 'n_estimators': 100}
0.8047138047138047 {'max_features': 9, 'n_estimators': 200}
0.8047138047138047 {'max_features': 9, 'n_estimators': 300}
0.8058361391694725 {'max_features': 10, 'n_estimators': 100}
0.8092031425364759 {'max_features': 10, 'n_estimators': 200}
0.8069584736251403 {'max_features': 10, 'n_estimators': 300}
0.8058361391694725 {'max_features': 11, 'n_estimators': 100}
0.8024691358024691 {'max_features': 11, 'n_estimators': 200}
0.8058361391694725 {'max_features': 11, 'n_estimators': 300}

It seems as if our best Grid Search model is worse than the regular Random Forest Classifier that we made the first time. This is because we have used a cv of 3 on the Grid Search Random Forest Classifier and a cv of 10 on the Regular Random Forest Classifier. Let’s cross_val_score our best model and see how it does with a cv of 10…

rnd_clf = grid_rnd_clf.best_estimator_ # grid_rnd_clf.best_estimator_ is our model
cross_val_score(rnd_clf, X_train, y_train, cv=10, scoring='accuracy').mean()
0.8227681307456589

As it turns out, the Grid Search Random Forest Classifier is actually better than the Regular Random Forest Classifier

And if you are curious:

RandomForestClassifier(random_state=42, n_estimators=200, max_features=10)

and

grid_rnd_clf.best_estimator_

Produce very slightly different results…

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {'n_estimators': randint(low=100, high=200), 'max_features': randint(low=6, high=11)}


rnd_clf2 = RandomForestClassifier(random_state=42)
rnd_search = RandomizedSearchCV(rnd_clf2, param_distributions = param_distribs, n_iter=10, cv=3, scoring='accuracy',
                               random_state=42, verbose=2)
rnd_search.fit(X_train, y_train)

# Too much randomness on this one, Ehh :)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] max_features=9, n_estimators=192 ................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ................. max_features=9, n_estimators=192, total=   0.5s
[CV] max_features=9, n_estimators=192 ................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s
[CV] ................. max_features=9, n_estimators=192, total=   0.5s
[CV] max_features=9, n_estimators=192 ................................
[CV] ................. max_features=9, n_estimators=192, total=   0.5s
[CV] max_features=8, n_estimators=171 ................................
[CV] ................. max_features=8, n_estimators=171, total=   0.4s
[CV] max_features=8, n_estimators=171 ................................
[CV] ................. max_features=8, n_estimators=171, total=   0.4s
[CV] max_features=8, n_estimators=171 ................................
[CV] ................. max_features=8, n_estimators=171, total=   0.4s
[CV] max_features=10, n_estimators=120 ...............................
[CV] ................ max_features=10, n_estimators=120, total=   0.3s
[CV] max_features=10, n_estimators=120 ...............................
[CV] ................ max_features=10, n_estimators=120, total=   0.3s
[CV] max_features=10, n_estimators=120 ...............................
[CV] ................ max_features=10, n_estimators=120, total=   0.3s
[CV] max_features=7, n_estimators=182 ................................
[CV] ................. max_features=7, n_estimators=182, total=   0.4s
[CV] max_features=7, n_estimators=182 ................................
[CV] ................. max_features=7, n_estimators=182, total=   0.4s
[CV] max_features=7, n_estimators=182 ................................
[CV] ................. max_features=7, n_estimators=182, total=   0.4s
[CV] max_features=8, n_estimators=174 ................................
[CV] ................. max_features=8, n_estimators=174, total=   0.4s
[CV] max_features=8, n_estimators=174 ................................
[CV] ................. max_features=8, n_estimators=174, total=   0.4s
[CV] max_features=8, n_estimators=174 ................................
[CV] ................. max_features=8, n_estimators=174, total=   0.4s
[CV] max_features=10, n_estimators=199 ...............................
[CV] ................ max_features=10, n_estimators=199, total=   0.5s
[CV] max_features=10, n_estimators=199 ...............................
[CV] ................ max_features=10, n_estimators=199, total=   0.5s
[CV] max_features=10, n_estimators=199 ...............................
[CV] ................ max_features=10, n_estimators=199, total=   0.5s
[CV] max_features=8, n_estimators=121 ................................
[CV] ................. max_features=8, n_estimators=121, total=   0.3s
[CV] max_features=8, n_estimators=121 ................................
[CV] ................. max_features=8, n_estimators=121, total=   0.3s
[CV] max_features=8, n_estimators=121 ................................
[CV] ................. max_features=8, n_estimators=121, total=   0.3s
[CV] max_features=10, n_estimators=101 ...............................
[CV] ................ max_features=10, n_estimators=101, total=   0.4s
[CV] max_features=10, n_estimators=101 ...............................
[CV] ................ max_features=10, n_estimators=101, total=   0.2s
[CV] max_features=10, n_estimators=101 ...............................
[CV] ................ max_features=10, n_estimators=101, total=   0.4s
[CV] max_features=9, n_estimators=129 ................................
[CV] ................. max_features=9, n_estimators=129, total=   0.5s
[CV] max_features=9, n_estimators=129 ................................
[CV] ................. max_features=9, n_estimators=129, total=   0.3s
[CV] max_features=9, n_estimators=129 ................................
[CV] ................. max_features=9, n_estimators=129, total=   0.3s
[CV] max_features=7, n_estimators=163 ................................
[CV] ................. max_features=7, n_estimators=163, total=   0.4s
[CV] max_features=7, n_estimators=163 ................................
[CV] ................. max_features=7, n_estimators=163, total=   0.4s
[CV] max_features=7, n_estimators=163 ................................
[CV] ................. max_features=7, n_estimators=163, total=   0.4s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   11.3s finished
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None,
                                                    oob_sc...
                                                    warm_start=False),
                   iid='warn', n_iter=10, n_jobs=None,
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BB2F0C8>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BB2FFC8>},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring='accuracy', verbose=2)

RandomizedSearchCV takes a dictionary of parameter distributions; it draws random values between the low and high bounds and randomly picks the combinations to try.

So the difference between GridSearchCV and RandomizedSearchCV is:

With GridSearchCV you specify the exact parameter values that you want to combine, but with RandomizedSearchCV the search automatically samples random parameter values between the specified limits (low and high) for a fixed number of candidates (n_iter).
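
To see what those distributions actually produce, here is a small sketch (the exact values drawn depend on the random state):

# randint(low, high) is a frozen scipy distribution; RandomizedSearchCV draws
# one value from each distribution per candidate via its .rvs() method
randint(low=100, high=200).rvs(5, random_state=42)  # five random draws in [100, 200)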

rnd_search.best_score_, rnd_search.best_params_
(0.8103254769921436, {'max_features': 10, 'n_estimators': 120})
# Let's see the whole parameters and their scores
data = rnd_search.cv_results_
for a, b in zip(data['mean_test_score'], data['params']):
    print(a, b)
0.8058361391694725 {'max_features': 9, 'n_estimators': 192}
0.8002244668911336 {'max_features': 8, 'n_estimators': 171}
0.8103254769921436 {'max_features': 10, 'n_estimators': 120}
0.8002244668911336 {'max_features': 7, 'n_estimators': 182}
0.8013468013468014 {'max_features': 8, 'n_estimators': 174}
0.8092031425364759 {'max_features': 10, 'n_estimators': 199}
0.8002244668911336 {'max_features': 8, 'n_estimators': 121}
0.8058361391694725 {'max_features': 10, 'n_estimators': 101}
0.8058361391694725 {'max_features': 9, 'n_estimators': 129}
0.8024691358024691 {'max_features': 7, 'n_estimators': 163}
rnd_clf = rnd_search.best_estimator_ # rnd_search.best_estimator_ is our model
cross_val_score(rnd_clf, X_train, y_train, cv=10, scoring='accuracy').mean()
0.824990352967881

The score from the cross_val_score is actually higher than that of GridSearch.

The RandomizedSearchCV and the GridSearchCV are great, each one in its own way.

You can try many more comprehensive data manipulation techniques (e.g. feature engineering) that can increase the accuracy of your model so that you can upload your predictions with a smile.
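
As a parting sketch (the RelativesOnboard name and the idea of summing SibSp and Parch are illustrative assumptions, not part of this tutorial’s pipeline), one classic Titanic trick is to combine SibSp and Parch into a single attribute before the pipeline runs:

# Hypothetical feature engineering: how many relatives each passenger had aboard
train_fe = train_data.copy()
test_fe = test_data.copy()
for df in (train_fe, test_fe):
    df["RelativesOnboard"] = df["SibSp"] + df["Parch"]

# You could then add "RelativesOnboard" to the numerical attributes handled by
# num_pipeline inside the ColumnTransformer and re-run the searches above.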

For now, Goodbye!

Coding Is Fun But I’ve Gotta Run :)