import sklearn
sklearn.__version__
'0.21.3'
Divine Saungweme explores pipelines, cv and hyperparameter tuning in SKLearn.
Hello, Data Scientists
In this episode, I’ll demonstrate three powerful scikit-learn features and show how they help us prepare data, choose a model and fine-tune it without breaking a sweat, thanks to their simplicity and automation.
The features are:
- Transformation Pipelines
- Cross Validation
- And last but not least, Hyperparameter Tuning with Grid Search and Randomized Search
We will discover how these features can help us and why they are really worth putting in your Data science tool-kit.
We will be using the Titanic dataset from Kaggle Competitions (hope this episode won’t be a disaster like the Titanic :)
Before we begin: if you are using an older version of scikit-learn, you may have problems importing some packages. In this tutorial, I am using version ‘0.21.3’.
You can check your own version by importing sklearn and reading sklearn.__version__.
Let’s start with the common imports
With the Titanic dataset, we have to create a model that predicts which passengers survived the Titanic shipwreck. We have to predict what sort of people were likely to survive, using passenger information such as name, gender and passenger class. So, now let’s load the data…
The data has already been split into training and testing data
Since this is a Kaggle Competition dataset, the test data has no labels (there is no ‘Survived’ attribute). We will just compile our predictions into a csv file (following Kaggle’s formatting), upload it to Kaggle and see our final score.
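A minimal sketch of the loading step, assuming the two Kaggle files are saved next to the notebook as train.csv and test.csv. Both file names and the load_titanic helper are assumptions for illustration, so point them at wherever you saved the data.

```python
import pandas as pd


def load_titanic(train_path="train.csv", test_path="test.csv"):
    """Read the two Kaggle CSV files into DataFrames (default paths are assumptions)."""
    train_data = pd.read_csv(train_path)  # contains the 'Survived' labels
    test_data = pd.read_csv(test_path)    # has no 'Survived' column
    return train_data, test_data
```

Calling `train_data, test_data = load_titanic()` and then `train_data.head()` would give a preview of the first five passengers.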
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Some insight about the data
The attributes have the following meaning:
* Survived: that’s the target, 0 means the passenger did not survive, while 1 means he/she survived.
* Pclass: passenger class.
* Name, Sex, Age: self-explanatory
* SibSp: how many siblings & spouses of the passenger aboard the Titanic.
* Parch: how many children & parents of the passenger aboard the Titanic.
* Ticket: ticket id
* Fare: price paid (in pounds)
* Cabin: passenger’s cabin number
* Embarked: where the passenger embarked the Titanic
Let’s check for any missing data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
As we can see, the attributes Age, Cabin and Embarked have some null values: Age is missing 177 of its 891 values, Embarked is missing 2, and Cabin has the most with 687 missing.
We will not be using Cabin in this tutorial and we will also not use the Name and Ticket attributes.
Age and Embarked are easy to repair, so we will keep them and fill in their missing values later.
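If you prefer plain counts of the missing values over reading them off the info() listing, pandas can add them up per column. A small sketch on a made-up three-row frame (the toy values are purely illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic columns that contain gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})

# isnull() marks every NaN cell as True; sum() counts the True values per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

On the real data the same call would be `train_data.isnull().sum()`.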
Let’s take a sneak peek at the numerical attributes
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Let’s also take a sneak peek at the categorical attributes
The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.
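One common way to peek at categorical attributes is value_counts(), which lists every category with the number of rows that carry it. A sketch on a made-up frame (the rows are hypothetical, and this is an assumption about how such a peek could be done):

```python
import pandas as pd

# Hypothetical rows standing in for the real training data
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "S"],
})

# value_counts() sorts the categories from most to least frequent
sex_counts = toy["Sex"].value_counts()
embarked_counts = toy["Embarked"].value_counts()
print(sex_counts)
print(embarked_counts)
```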
train_data_copy = train_data.copy()
# Use the "copy()" function call to avoid changing the original data (which in this case is "train_data")
# We will experiment with transformations on the copy of train_data and see what we can achieve
train_data_copy.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Let’s start using Transformations.
Using the “so-called” regular way we would need to deal with the NaN values first
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Inspired from stackoverflow.com/questions/25239958
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the most frequent value of every column
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self

    def transform(self, X, y=None):
        # Replace the NaN values with the value learned for that column
        return X.fillna(self.most_frequent_)
The class works on both numerical and categorical (object) data. It replaces the NaN values in each attribute with that attribute’s most frequent value. For example, if we had NaN values in the Sex attribute, we would replace them with the most frequent value in the Sex attribute (which in this case is male).
We will use our transformer class (MostFrequentImputer) on categorical attributes and SimpleImputer on numerical attributes. Let’s start with the numerical attributes
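from sklearn.impute import SimpleImputer
# Median strategy, the same one the numerical pipeline uses later in this post
impute = SimpleImputer(strategy="median")
impute.fit(train_data_copy[["Age"]])
train_data_copy[["Age"]] = impute.transform(train_data_copy[["Age"]])
train_data_copy.info()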
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
As we can see, the Age attribute now has 891 non-null values: its NaN values have been wiped out of existence.
Now, let’s impute the categorical attributes with our transformer class
cat_impute = MostFrequentImputer()
cat_impute.fit(train_data_copy[["Embarked"]])
train_data_copy[["Embarked"]] = cat_impute.transform(train_data_copy[["Embarked"]])
train_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Well, what do you know? Now only the Cabin attribute still has missing values (because we didn’t include it in our transformation). We are not going to use it, nor Name and Ticket, so let’s just drop these attributes
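Dropping attributes is a one-liner in pandas. A minimal sketch on a made-up two-row frame (toy is a hypothetical stand-in for train_data_copy):

```python
import pandas as pd

# Hypothetical stand-in for train_data_copy with only a few columns
toy = pd.DataFrame({
    "PassengerId": [1, 3],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282"],
    "Cabin": [None, None],
    "Fare": [7.25, 7.925],
})

# drop() returns a new DataFrame without the listed columns
toy = toy.drop(columns=["Name", "Ticket", "Cabin"])
print(toy.columns.tolist())
```

On the real copy this would read `train_data_copy = train_data_copy.drop(columns=["Name", "Ticket", "Cabin"])`.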
Phew! We have finally got rid of the NaN values, so what’s next? We want to scale the numerical attributes (some algorithms work better with scaled data). For the categorical attributes, we have to convert the categories from strings to usable numbers.
from sklearn.preprocessing import StandardScaler
num_attribs = ["Age", "SibSp", "Parch", "Fare", "Pclass"]
# Pclass is a category, but it is already numerical, so in this case we scale it instead of label-encoding it
scaler = StandardScaler()
train_data_copy[num_attribs] = scaler.fit_transform(train_data_copy[num_attribs])
train_data_copy.head()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0.827377 | male | -0.565736 | 0.432793 | -0.473674 | -0.502445 | S |
1 | 2 | 1 | -1.566107 | female | 0.663861 | 0.432793 | -0.473674 | 0.786845 | C |
2 | 3 | 1 | 0.827377 | female | -0.258337 | -0.474545 | -0.473674 | -0.488854 | S |
3 | 4 | 1 | -1.566107 | female | 0.433312 | 0.432793 | -0.473674 | 0.420730 | S |
4 | 5 | 0 | 0.827377 | male | 0.433312 | -0.474545 | -0.473674 | -0.486337 | S |
The selected attributes have been scaled, which helps the algorithms that are sensitive to the scale of their inputs. What’s left is assigning numerical labels to our categorical attributes
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0.827377 | 1 | -0.565736 | 0.432793 | -0.473674 | -0.502445 | 2 |
1 | 2 | 1 | -1.566107 | 0 | 0.663861 | 0.432793 | -0.473674 | 0.786845 | 0 |
2 | 3 | 1 | 0.827377 | 0 | -0.258337 | -0.474545 | -0.473674 | -0.488854 | 2 |
3 | 4 | 1 | -1.566107 | 0 | 0.433312 | 0.432793 | -0.473674 | 0.420730 | 2 |
4 | 5 | 0 | 0.827377 | 1 | 0.433312 | -0.474545 | -0.473674 | -0.486337 | 2 |
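Labels like the ones in the table above (female=0, male=1 and C=0, Q=1, S=2) are what you get from alphabetical label encoding. A minimal sketch, assuming LabelEncoder was the tool used, on a made-up frame (toy and encoded are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for the categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

encoded = toy.copy()
for column in ["Sex", "Embarked"]:
    # A fresh encoder per column: classes are sorted, then numbered from 0
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(encoded[column])

print(encoded)
```

Keep in mind that these integers suggest an order (S > Q > C) that does not really exist, which is one reason to prefer one-hot encoding for such attributes.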
Finally, we have finished preparing our data. That was a lot of tiring work. The good news: you don’t have to tire yourself with this tedious technique.
Pipelines coming to the rescue…
Why should I care about Pipelines?
As we have seen (and coded as well), we have many transformations that need to be executed in the right order. For example, missing values must be imputed before the data reaches a step or a model that rejects NaN values (otherwise you end up getting many frustrating errors).
Pipelines take all the transformations and bind them together in order to prepare/transform the data in the right order. All we have to do is list the transformers in the order in which they should run.
So how do we go about setting up these so-called Pipelines? Let’s dive right into the Pipeline…
from sklearn.pipeline import Pipeline # how ironic, importing Pipeline from pipeline ;)
from sklearn.preprocessing import OneHotEncoder
# In this one we will use OneHotEncoder instead of LabelEncoder: one-hot columns do not impose an artificial order on the categories
# We will make 2 Pipelines, one for Numerical Attributes and the other for Categorical Attributes
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    # The name "imputer" can be set to any string e.g "impute" or "whatever"
    ('scaler', StandardScaler())
])
# I had already imported these packages before so there's no need for repetition
cat_pipeline = Pipeline([
    ('imputer', MostFrequentImputer()), # from the class we had made earlier
    ('encoding', OneHotEncoder(sparse=False))
    # Setting sparse to False prevents the OneHotEncoder from returning a Scipy sparse matrix
    # (scikit-learn 1.2 and later call this parameter sparse_output)
])
There is an even faster way of setting up a Pipeline. Sometimes we have no need to name each preprocessing step. With “make_pipeline” the steps are named automatically, which saves time, although we give up choosing our own step names as a general Pipeline allows. We are not going to use it here; we will just stick to the general Pipeline.
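A minimal sketch of the make_pipeline shortcut, reusing the same two numerical steps. The names quick_num_pipeline and toy_ages are made up for this illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No names are passed: make_pipeline derives them from the lowercased class names
quick_num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
step_names = [name for name, _ in quick_num_pipeline.steps]
print(step_names)  # ['simpleimputer', 'standardscaler']

# Hypothetical ages with one missing value: imputed first, then scaled
toy_ages = np.array([[22.0], [np.nan], [26.0]])
scaled = quick_num_pipeline.fit_transform(toy_ages)
print(scaled.ravel())
```

The automatic names are what you would later type in a parameter grid (for example `simpleimputer__strategy`), which is where hand-picked names in a general Pipeline can be more readable.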
More about OneHotEncoder
A OneHotEncoder creates one binary column (attribute) for each value of a category
What does that mean?
For example, take the Embarked category. Its possible values are S, C and Q.
We get 3 columns for this category (one for each of the values stated above)
If the passenger embarked from S, the S column will have 1 (making it hot and also representing True) and the other columns (C and Q) will have 0 (making them cold and also representing False)
If the passenger embarked from C, the C column will have 1 (making it hot and also representing True) and the other columns (S and Q) will have 0 (making them cold and also representing False)
…and likewise for Q.
So we end up with extra columns.
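The hot and cold columns are easy to see on a tiny example. A sketch with four made-up Embarked values (the array is hypothetical; calling .toarray() on the default sparse result is used here so the snippet does not depend on how your scikit-learn version names the dense-output parameter):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column, four passengers (hypothetical values)
embarked = np.array([["S"], ["C"], ["Q"], ["S"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(embarked).toarray()

# The learned categories are sorted, so the columns come out as C, Q, S
categories = list(encoder.categories_[0])
print(categories)
print(one_hot)
```

Each row has exactly one 1: the first passenger (S) lights up the last column, the second (C) the first column, and so on.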
Now that we have made our Pipelines, we will combine them with ColumnTransformer
from sklearn.compose import ColumnTransformer
full_pipeline = ColumnTransformer([
    ('num_pipeline', num_pipeline, ["Age", "SibSp", "Parch", "Fare"]),
    # We just choose the numerical attributes we want
    ('cat_pipeline', cat_pipeline, ["Pclass", "Sex", "Embarked"])
    # Here, we just choose the categorical attributes we want and here Pclass works better as a category than a number
])
X_train = full_pipeline.fit_transform(train_data)
y_train = train_data["Survived"] # the labels: Survived is our target attribute
# The transformation returns our data as a numpy array
# Only the attributes of numbers and categories that we have specified in the full_pipeline (which are just 7) will
# be present in the data so there is really no need of dropping attributes like we did before in our "Regular
# Transformation" detour because they have been automatically dropped.
X_train[:5] # Our numpy array's head, Similar with Pandas .head() function call :)
# Voila...
array([[-0.56573646, 0.43279337, -0.47367361, -0.50244517, 0. ,
0. , 1. , 0. , 1. , 0. ,
0. , 1. ],
[ 0.66386103, 0.43279337, -0.47367361, 0.78684529, 1. ,
0. , 0. , 1. , 0. , 1. ,
0. , 0. ],
[-0.25833709, -0.4745452 , -0.47367361, -0.48885426, 0. ,
0. , 1. , 1. , 0. , 0. ,
0. , 1. ],
[ 0.4333115 , 0.43279337, -0.47367361, 0.42073024, 1. ,
0. , 0. , 1. , 0. , 0. ,
0. , 1. ],
[ 0.4333115 , -0.4745452 , -0.47367361, -0.48633742, 0. ,
0. , 1. , 0. , 1. , 0. ,
0. , 1. ]])
# As you saw, Pipelines are easier to manage than the whole transformation process we had earlier and Pipelines also
# take less time to set up
# Let's see how our data looks as a DataFrame
# Pclass: 1, 2, 3
# Sex: Female, Male
# Embarked: C, Q, S
columns = ["Age", "SibSp", "Parch", "Fare", 'Pclass-1', 'Pclass-2', 'Pclass-3',
           'Sex-Female', 'Sex-Male', 'Embarked-C', 'Embarked-Q', 'Embarked-S']
pd.DataFrame(X_train, columns=columns).head()
| | Age | SibSp | Parch | Fare | Pclass-1 | Pclass-2 | Pclass-3 | Sex-Female | Sex-Male | Embarked-C | Embarked-Q | Embarked-S |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.565736 | 0.432793 | -0.473674 | -0.502445 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
1 | 0.663861 | 0.432793 | -0.473674 | 0.786845 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | -0.258337 | -0.474545 | -0.473674 | -0.488854 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 0.433312 | 0.432793 | -0.473674 | 0.420730 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 0.433312 | -0.474545 | -0.473674 | -0.486337 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
# We will start by using the Stochastic Gradient Descent Classifier
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
l1_ratio=0.15, learning_rate='optimal', loss='hinge',
max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
power_t=0.5, random_state=42, shuffle=True, tol=0.001,
validation_fraction=0.1, verbose=0, warm_start=False)
How can we get an idea of how our model performs? What if it turns out to be very awful? Luckily we don’t have to rely on guess-work.
Let’s try predicting the training data, this time with a Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
dt_clf = DecisionTreeClassifier().fit(X_train, y_train)
dt_pred = dt_clf.predict(X_train)
accuracy_score(y_train, dt_pred)
0.9797979797979798
Before you smile in satisfaction, that number is biased: the model is scored on the very data it was trained on, so it mostly shows how well the tree has memorised the training set. The model is simply overfitting
Predicting the training data mostly rewards models for overfitting it (as we have seen). So how can we get a reliable score without worrying about overfitting…
We can use Cross Validation
from sklearn.model_selection import cross_val_score
cvs = cross_val_score(sgd_clf, X_train, y_train, scoring='accuracy', cv=10)
scores(cvs)
[0.76666667 0.72222222 0.7752809 0.86516854 0.7752809 0.73033708
0.75280899 0.78651685 0.80898876 0.78409091]
Mean: 0.7767361820451708
Standard Deviation: 0.03848773108423685
Maximum: 0.8651685393258427
Minimum: 0.7222222222222222
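The scores helper called above is a small convenience function whose definition is not shown in this post. A minimal sketch of what it might look like, assuming it simply prints the array followed by NumPy summary statistics (only the name and the printed labels come from the output above, the body is a guess):

```python
import numpy as np


def scores(cv_scores):
    """Print the fold scores and a few summary statistics (hypothetical helper)."""
    cv_scores = np.asarray(cv_scores)
    print(cv_scores)
    print("Mean:", cv_scores.mean())
    # np.std defaults to the population standard deviation (ddof=0)
    print("Standard Deviation:", cv_scores.std())
    print("Maximum:", cv_scores.max())
    print("Minimum:", cv_scores.min())
```

The standard deviation tells you how much the accuracy moves from fold to fold, which a single train/test split cannot show.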
So what just happened? You may be wondering how we got these scores
The cross_val_score function splits the training set into 10 distinct subsets called folds (for a classifier and an integer cv, the folds are stratified by class and, by default, not shuffled). It then trains and evaluates the model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.
The “cv” parameter determines how many folds we want our cross_val_score function to use (which in this case is 10).
NB. cv must be at least 2.
The scoring parameter determines what kind of score we want to get. Setting it to:
- “accuracy” gives us the accuracy score
- “precision” gives us the precision score
- “recall” gives us the recall score
- “f1” gives us the f1 score…
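To see the scoring parameter in action without the Titanic files, here is a small sketch on synthetic data. X_demo, y_demo and results are made-up names, and make_classification only provides a stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for the real data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
demo_clf = SGDClassifier(random_state=42)

results = {}
for metric in ["accuracy", "precision", "recall", "f1"]:
    # Same model, same folds, only the scoring function changes
    fold_scores = cross_val_score(demo_clf, X_demo, y_demo, scoring=metric, cv=5)
    results[metric] = fold_scores.mean()

for metric, value in results.items():
    print(metric, value)
```

Accuracy alone can look fine while recall is poor, so comparing a few metrics side by side is a cheap sanity check.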
For a classifier, cross_val_score uses StratifiedKFold under the hood. This is how the same evaluation looks when written out with StratifiedKFold.
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
# No shuffling here, so random_state is not needed (recent scikit-learn releases reject it unless shuffle=True)
skfolds = StratifiedKFold(n_splits=10)
for train_index, test_index in skfolds.split(X_train, y_train):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train[test_index])
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
0.7666666666666667
0.7222222222222222
0.7752808988764045
0.8651685393258427
0.7752808988764045
0.7303370786516854
0.7528089887640449
0.7865168539325843
0.8089887640449438
0.7840909090909091
If we compare the results, they are the same, but using StratifiedKFold by hand is a lot more work than calling cross_val_score.
Cross validation lets us see how well our model generalizes to data it was not trained on. If you do not have enough data to populate both a train and a test set, you can definitely use Cross Validation.
# And if you would like to get some predictions so that you can compare them with the y_train set you can simply:
from sklearn.model_selection import cross_val_predict
predictions = cross_val_predict(sgd_clf, X_train, y_train, cv=10)
accuracy_score(y_train, predictions)
0.77665544332211
# If you would like to get some Decision Functions you can simply:
predictions = cross_val_predict(sgd_clf, X_train, y_train, cv=10, method='decision_function')
predictions[:5]
array([-2.80887674, 2.61659364, -0.13381379, 1.82152503, -2.71375558])
If you are using a model that supports “Prediction Probabilities”, you can simply set the method parameter to “predict_proba”
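A small sketch of the predict_proba variant, using a Random Forest (which supports probabilities) on synthetic data. X_demo, y_demo and forest are made-up names for this illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification problem standing in for the real data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42)

# One row per sample, one column per class; every row sums to 1
probabilities = cross_val_predict(forest, X_demo, y_demo, cv=5, method="predict_proba")
print(probabilities[:5])
```

Each sample’s probabilities come from a model that never saw that sample during training, just like the predictions above.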
# Let's try another model
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
cross_val_score(knn_clf, X_train, y_train, scoring='accuracy', cv=10).mean()
0.8069611848825332
As it turns out, the KNeighborsClassifier did a better job than the SGDClassifier, so it’s promising
Let’s just try one last model
# The other model
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
cross_val_score(rnd_clf, X_train, y_train, scoring='accuracy', cv=10).mean()
0.8115690614005221
The Random Forest Classifier did better than the other two, but how can we optimise it whilst also preventing overfitting?
We can tweak the hyperparameters and see which ones get us somewhere…
# A very simple way of tweaking the n_neighbors parameter in KNeighborsClassifier is this way:
for number in range(1, 12):
    knn_looped = KNeighborsClassifier(n_neighbors=number).fit(X_train, y_train)
    score = cross_val_score(knn_looped, X_train, y_train, scoring='accuracy', cv=10).mean()
    print(number, score)
1 0.7554752581999773
2 0.7958001361933945
3 0.7980603790716151
4 0.8002698331630915
5 0.80697366927704
6 0.809170355237771
7 0.8058631256384066
8 0.7991212688684599
9 0.7946646237657474
10 0.8069611848825332
11 0.7980101577573487
So, as we can see, setting n_neighbors to 6 gives us the highest mean accuracy in this run (about 0.809)
But what can we do if we have a lot of hyperparameters to test out and we need a much more organised way of doing it?
We can use GridSearchCV
from sklearn.model_selection import GridSearchCV
params = [
{'n_estimators': [100, 200, 300], 'max_features': [8, 9, 10, 11]}
]
# These are the parameters that we put in Random Forest model and test each and every combination
# So we have 12 combinations...
rnd_clf = RandomForestClassifier(random_state=42)
grid_rnd_clf = GridSearchCV(rnd_clf, params, cv=3, return_train_score=True, scoring='accuracy', verbose=2)
# The grid search takes the algorithm, parameters, folds/cv (number of trainings)
# The grid search undergoes cross validation, similar with the cross_val_score that we talked about earlier, it goes on...
# ... cross validating each and every combination we assigned it to
# So with the number of folds as 3, we can conclude that we will have 36 runs
# {(3 n_estimators) * (4 max_features) * (3 cv)} => 3 * 4 * 3 => 36
# 'verbose' gives us details of the runs, such as the time taken, etc, increasing the value increases the details...
grid_rnd_clf.fit(X_train, y_train)
# Fitting our Grid Search Model make take some time, maybe a few seconds
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] max_features=8, n_estimators=100 ................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ................. max_features=8, n_estimators=100, total= 0.2s
[CV] max_features=8, n_estimators=100 ................................
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[CV] ................. max_features=8, n_estimators=100, total= 0.2s
[CV] max_features=8, n_estimators=100 ................................
[CV] ................. max_features=8, n_estimators=100, total= 0.2s
[CV] max_features=8, n_estimators=200 ................................
[CV] ................. max_features=8, n_estimators=200, total= 0.4s
[CV] max_features=8, n_estimators=200 ................................
[CV] ................. max_features=8, n_estimators=200, total= 0.4s
[CV] max_features=8, n_estimators=200 ................................
[CV] ................. max_features=8, n_estimators=200, total= 0.5s
[CV] max_features=8, n_estimators=300 ................................
[CV] ................. max_features=8, n_estimators=300, total= 0.7s
[CV] max_features=8, n_estimators=300 ................................
[CV] ................. max_features=8, n_estimators=300, total= 0.7s
[CV] max_features=8, n_estimators=300 ................................
[CV] ................. max_features=8, n_estimators=300, total= 0.7s
[CV] max_features=9, n_estimators=100 ................................
[CV] ................. max_features=9, n_estimators=100, total= 0.2s
[CV] max_features=9, n_estimators=100 ................................
[CV] ................. max_features=9, n_estimators=100, total= 0.2s
[CV] max_features=9, n_estimators=100 ................................
[CV] ................. max_features=9, n_estimators=100, total= 0.2s
[CV] max_features=9, n_estimators=200 ................................
[CV] ................. max_features=9, n_estimators=200, total= 0.5s
[CV] max_features=9, n_estimators=200 ................................
[CV] ................. max_features=9, n_estimators=200, total= 0.6s
[CV] max_features=9, n_estimators=200 ................................
[CV] ................. max_features=9, n_estimators=200, total= 0.5s
[CV] max_features=9, n_estimators=300 ................................
[CV] ................. max_features=9, n_estimators=300, total= 0.7s
[CV] max_features=9, n_estimators=300 ................................
[CV] ................. max_features=9, n_estimators=300, total= 0.7s
[CV] max_features=9, n_estimators=300 ................................
[CV] ................. max_features=9, n_estimators=300, total= 0.7s
[CV] max_features=10, n_estimators=100 ...............................
[CV] ................ max_features=10, n_estimators=100, total= 0.3s
[CV] max_features=10, n_estimators=100 ...............................
[CV] ................ max_features=10, n_estimators=100, total= 0.2s
[CV] max_features=10, n_estimators=100 ...............................
[CV] ................ max_features=10, n_estimators=100, total= 0.3s
[CV] max_features=10, n_estimators=200 ...............................
[CV] ................ max_features=10, n_estimators=200, total= 0.5s
[CV] max_features=10, n_estimators=200 ...............................
[CV] ................ max_features=10, n_estimators=200, total= 0.6s
[CV] max_features=10, n_estimators=200 ...............................
[CV] ................ max_features=10, n_estimators=200, total= 0.5s
[CV] max_features=10, n_estimators=300 ...............................
[CV] ................ max_features=10, n_estimators=300, total= 0.9s
[CV] max_features=10, n_estimators=300 ...............................
[CV] ................ max_features=10, n_estimators=300, total= 0.8s
[CV] max_features=10, n_estimators=300 ...............................
[CV] ................ max_features=10, n_estimators=300, total= 0.7s
[CV] max_features=11, n_estimators=100 ...............................
[CV] ................ max_features=11, n_estimators=100, total= 0.2s
[CV] max_features=11, n_estimators=100 ...............................
[CV] ................ max_features=11, n_estimators=100, total= 0.2s
[CV] max_features=11, n_estimators=100 ...............................
[CV] ................ max_features=11, n_estimators=100, total= 0.3s
[CV] max_features=11, n_estimators=200 ...............................
[CV] ................ max_features=11, n_estimators=200, total= 0.5s
[CV] max_features=11, n_estimators=200 ...............................
[CV] ................ max_features=11, n_estimators=200, total= 0.6s
[CV] max_features=11, n_estimators=200 ...............................
[CV] ................ max_features=11, n_estimators=200, total= 0.5s
[CV] max_features=11, n_estimators=300 ...............................
[CV] ................ max_features=11, n_estimators=300, total= 0.9s
[CV] max_features=11, n_estimators=300 ...............................
[CV] ................ max_features=11, n_estimators=300, total= 1.0s
[CV] max_features=11, n_estimators=300 ...............................
[CV] ................ max_features=11, n_estimators=300, total= 0.8s
[Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 20.3s finished
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators='warn', n_jobs=None,
oob_score=False, random_state=42,
verbose=0, warm_start=False),
iid='warn', n_jobs=None,
param_grid=[{'max_features': [8, 9, 10, 11],
'n_estimators': [100, 200, 300]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring='accuracy', verbose=2)
grid_rnd_clf.best_params_
{'max_features': 10, 'n_estimators': 200}
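The 36 fits reported in the log above can be double-checked before launching a search, which is handy when the grid gets bigger. A small sketch using ParameterGrid from sklearn.model_selection (demo_grid mirrors the grid from this section, and the variable names are just for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid used above: 3 values times 4 values
demo_grid = [
    {"n_estimators": [100, 200, 300], "max_features": [8, 9, 10, 11]}
]

n_candidates = len(ParameterGrid(demo_grid))  # number of parameter combinations
n_folds = 3                                   # the cv value passed to the search
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)  # 12 36
```

With refit=True (the default), GridSearchCV then trains the best combination one more time on the whole training set, which is the model you get from best_estimator_.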
# Let's see the whole parameters and their scores
data = grid_rnd_clf.cv_results_
for a, b in zip(data['mean_test_score'], data['params']):
    print(a, b)
0.8024691358024691 {'max_features': 8, 'n_estimators': 100}
0.8035914702581369 {'max_features': 8, 'n_estimators': 200}
0.8013468013468014 {'max_features': 8, 'n_estimators': 300}
0.8013468013468014 {'max_features': 9, 'n_estimators': 100}
0.8047138047138047 {'max_features': 9, 'n_estimators': 200}
0.8047138047138047 {'max_features': 9, 'n_estimators': 300}
0.8058361391694725 {'max_features': 10, 'n_estimators': 100}
0.8092031425364759 {'max_features': 10, 'n_estimators': 200}
0.8069584736251403 {'max_features': 10, 'n_estimators': 300}
0.8058361391694725 {'max_features': 11, 'n_estimators': 100}
0.8024691358024691 {'max_features': 11, 'n_estimators': 200}
0.8058361391694725 {'max_features': 11, 'n_estimators': 300}
It seems as if our best Grid Search model (about 0.809) is worse than the regular Random Forest Classifier that we made the first time (about 0.812). The two scores are not directly comparable, though: we used a cv of 3 in the Grid Search and a cv of 10 on the regular Random Forest Classifier. Let’s cross_val_score our best model and see how it does with a cv of 10…
rnd_clf = grid_rnd_clf.best_estimator_ # grid_rnd_clf.best_estimator_ is our model
cross_val_score(rnd_clf, X_train, y_train, cv=10, scoring='accuracy').mean()
0.8227681307456589
As it turns out, the Grid Search Random Forest Classifier is actually better than the Regular Random Forest Classifier
And if you are curious:
RandomForestClassifier(random_state=42, n_estimators=200, max_features=10)
and
grid_rnd_clf.best_estimator_
Produce very slightly different results…
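Randomized Search, which comes next, draws its candidates from distributions instead of walking through a fixed grid. To get a feel for what gets drawn, here is a small sketch with ParameterSampler from sklearn.model_selection (demo_distribs and candidates are made-up names, and the exact values drawn can differ between library versions):

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# randint excludes the upper bound: n_estimators in 100..199, max_features in 6..10
demo_distribs = {
    "n_estimators": randint(low=100, high=200),
    "max_features": randint(low=6, high=11),
}

# Draw 10 random parameter combinations, reproducibly
candidates = list(ParameterSampler(demo_distribs, n_iter=10, random_state=42))
for candidate in candidates:
    print(candidate)
```

Unlike a grid, the number of candidates is set directly by n_iter, so the cost of the search stays the same no matter how wide the ranges are.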
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {'n_estimators': randint(low=100, high=200), 'max_features': randint(low=6, high=11)}
rnd_clf2 = RandomForestClassifier(random_state=42)
rnd_search = RandomizedSearchCV(rnd_clf2, param_distributions = param_distribs, n_iter=10, cv=3, scoring='accuracy',
random_state=42, verbose=2)
rnd_search.fit(X_train, y_train)
# Too much randomness on this one, Ehh :)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] max_features=9, n_estimators=192 ................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ................. max_features=9, n_estimators=192, total= 0.5s
[CV] max_features=9, n_estimators=192 ................................
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.4s remaining: 0.0s
[CV] ................. max_features=9, n_estimators=192, total= 0.5s
[CV] max_features=9, n_estimators=192 ................................
[CV] ................. max_features=9, n_estimators=192, total= 0.5s
[CV] max_features=8, n_estimators=171 ................................
[CV] ................. max_features=8, n_estimators=171, total= 0.4s
[CV] max_features=8, n_estimators=171 ................................
[CV] ................. max_features=8, n_estimators=171, total= 0.4s
[CV] max_features=8, n_estimators=171 ................................
[CV] ................. max_features=8, n_estimators=171, total= 0.4s
[CV] max_features=10, n_estimators=120 ...............................
[CV] ................ max_features=10, n_estimators=120, total= 0.3s
[CV] max_features=10, n_estimators=120 ...............................
[CV] ................ max_features=10, n_estimators=120, total= 0.3s
[CV] max_features=10, n_estimators=120 ...............................
[CV] ................ max_features=10, n_estimators=120, total= 0.3s
[CV] max_features=7, n_estimators=182 ................................
[CV] ................. max_features=7, n_estimators=182, total= 0.4s
[CV] max_features=7, n_estimators=182 ................................
[CV] ................. max_features=7, n_estimators=182, total= 0.4s
[CV] max_features=7, n_estimators=182 ................................
[CV] ................. max_features=7, n_estimators=182, total= 0.4s
[CV] max_features=8, n_estimators=174 ................................
[CV] ................. max_features=8, n_estimators=174, total= 0.4s
[CV] max_features=8, n_estimators=174 ................................
[CV] ................. max_features=8, n_estimators=174, total= 0.4s
[CV] max_features=8, n_estimators=174 ................................
[CV] ................. max_features=8, n_estimators=174, total= 0.4s
[CV] max_features=10, n_estimators=199 ...............................
[CV] ................ max_features=10, n_estimators=199, total= 0.5s
[CV] max_features=10, n_estimators=199 ...............................
[CV] ................ max_features=10, n_estimators=199, total= 0.5s
[CV] max_features=10, n_estimators=199 ...............................
[CV] ................ max_features=10, n_estimators=199, total= 0.5s
[CV] max_features=8, n_estimators=121 ................................
[CV] ................. max_features=8, n_estimators=121, total= 0.3s
[CV] max_features=8, n_estimators=121 ................................
[CV] ................. max_features=8, n_estimators=121, total= 0.3s
[CV] max_features=8, n_estimators=121 ................................
[CV] ................. max_features=8, n_estimators=121, total= 0.3s
[CV] max_features=10, n_estimators=101 ...............................
[CV] ................ max_features=10, n_estimators=101, total= 0.4s
[CV] max_features=10, n_estimators=101 ...............................
[CV] ................ max_features=10, n_estimators=101, total= 0.2s
[CV] max_features=10, n_estimators=101 ...............................
[CV] ................ max_features=10, n_estimators=101, total= 0.4s
[CV] max_features=9, n_estimators=129 ................................
[CV] ................. max_features=9, n_estimators=129, total= 0.5s
[CV] max_features=9, n_estimators=129 ................................
[CV] ................. max_features=9, n_estimators=129, total= 0.3s
[CV] max_features=9, n_estimators=129 ................................
[CV] ................. max_features=9, n_estimators=129, total= 0.3s
[CV] max_features=7, n_estimators=163 ................................
[CV] ................. max_features=7, n_estimators=163, total= 0.4s
[CV] max_features=7, n_estimators=163 ................................
[CV] ................. max_features=7, n_estimators=163, total= 0.4s
[CV] max_features=7, n_estimators=163 ................................
[CV] ................. max_features=7, n_estimators=163, total= 0.4s
[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 11.3s finished
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators='warn',
n_jobs=None,
oob_sc...
warm_start=False),
iid='warn', n_iter=10, n_jobs=None,
param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BB2F0C8>,
'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BB2FFC8>},
pre_dispatch='2*n_jobs', random_state=42, refit=True,
return_train_score=False, scoring='accuracy', verbose=2)
RandomizedSearchCV takes a dictionary that maps each parameter to a distribution (or a list) to sample from. Here, randint draws integers from low up to, but not including, high, and the search evaluates n_iter randomly drawn combinations (10 in our case).
So the difference between GridSearchCV and RandomizedSearchCV is:
With GridSearchCV you list the exact values you want to combine and every combination is tried, but with RandomizedSearchCV the search itself picks random values between the specified limits (low and high) and tries only n_iter combinations.
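To make the contrast concrete, here is a minimal sketch that only generates the candidate combinations, without fitting any model. It uses scikit-learn's ParameterGrid and ParameterSampler helpers. The param_grid values are made up for illustration and are not necessarily the grid used earlier in this episode.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical grid (illustration only): every combination gets listed
param_grid = {'n_estimators': [100, 150], 'max_features': [6, 8, 10]}
grid_candidates = list(ParameterGrid(param_grid))

# Same distributions as in the search above: n_iter random draws, high is exclusive
param_distribs = {'n_estimators': randint(low=100, high=200),
                  'max_features': randint(low=6, high=11)}
random_candidates = list(ParameterSampler(param_distribs, n_iter=10, random_state=42))

print(len(grid_candidates), 'grid candidates')      # 2 x 3 combinations
print(len(random_candidates), 'random candidates')  # exactly n_iter draws
for candidate in random_candidates:
    print(candidate)
```

The grid grows with every value you add, while the random sampler always returns n_iter candidates no matter how wide the limits are. The exact draws you see may differ between library versions.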
rnd_search.best_score_, rnd_search.best_params_
(0.8103254769921436, {'max_features': 10, 'n_estimators': 120})
# Let's see all the sampled parameter combinations and their mean scores
data = rnd_search.cv_results_
for a, b in zip(data['mean_test_score'], data['params']):
    print(a, b)
0.8058361391694725 {'max_features': 9, 'n_estimators': 192}
0.8002244668911336 {'max_features': 8, 'n_estimators': 171}
0.8103254769921436 {'max_features': 10, 'n_estimators': 120}
0.8002244668911336 {'max_features': 7, 'n_estimators': 182}
0.8013468013468014 {'max_features': 8, 'n_estimators': 174}
0.8092031425364759 {'max_features': 10, 'n_estimators': 199}
0.8002244668911336 {'max_features': 8, 'n_estimators': 121}
0.8058361391694725 {'max_features': 10, 'n_estimators': 101}
0.8058361391694725 {'max_features': 9, 'n_estimators': 129}
0.8024691358024691 {'max_features': 7, 'n_estimators': 163}
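If the printed list gets hard to read, one option is to load cv_results_ into a pandas DataFrame and sort it by rank. The sketch below is self-contained, so it runs on a small synthetic dataset from make_classification rather than the Titanic data. The names X_demo, y_demo and demo_search and the small parameter limits are assumptions chosen only to keep it quick.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=42)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': randint(low=10, high=30),
                         'max_features': randint(low=2, high=6)},
    n_iter=5, cv=3, scoring='accuracy', random_state=42)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame
results = pd.DataFrame(demo_search.cv_results_)
columns = ['param_max_features', 'param_n_estimators',
           'mean_test_score', 'std_test_score', 'rank_test_score']
ranked = results[columns].sort_values('rank_test_score')
print(ranked)
```

The first row of ranked is the combination the search reports through best_params_, and std_test_score gives a feel for how much the score moves between folds.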
rnd_clf = rnd_search.best_estimator_ # rnd_search.best_estimator_ is our model
cross_val_score(rnd_clf, X_train, y_train, cv=10, scoring='accuracy').mean()
0.824990352967881
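The score from cross_val_score (about 0.825 with 10 folds) is actually higher than that of GridSearch. It is also higher than the 0.810 reported by the randomized search itself, but that one was measured with only 3 folds, so the two numbers are not directly comparable.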
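To see how the number of folds alone can move the score, here is a small sketch on synthetic data (again not the Titanic data; X_demo, y_demo, demo_search and the parameter limits are assumptions for illustration). It compares the search's own 3-fold best_score_ with a separate 10-fold cross_val_score on the same best_estimator_.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=42)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': randint(low=10, high=30),
                         'max_features': randint(low=2, high=6)},
    n_iter=3, cv=3, scoring='accuracy', random_state=42)
demo_search.fit(X_demo, y_demo)

# Score measured inside the search (3 folds)
score_3_folds = demo_search.best_score_

# Same hyperparameters, re-evaluated with 10 folds
score_10_folds = cross_val_score(demo_search.best_estimator_, X_demo, y_demo,
                                 cv=10, scoring='accuracy').mean()

print('3-fold score from the search:', score_3_folds)
print('10-fold cross_val_score:', score_10_folds)
```

With 10 folds each model trains on 90% of the data instead of about 67%, so the two scores usually differ a little even though the hyperparameters are identical.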
RandomizedSearchCV and GridSearchCV are both great, each in its own way.
You can also try other data manipulation techniques (e.g. feature engineering) that may increase the accuracy of your model, so that you can upload your predictions with a smile.
For now, Goodbye!
Coding Is Fun But I’ve Gotta Run :)