
Random Search with MLflow¶


Presented by Robin Allesiardo from PJD/CT_Foxtrot (rallesiardo_at_solocal.com)

during the Solocal AI Community of Practice (Sep 10, 2019)

What we will see¶

Mathematics around model selection
Why random-search is better than grid-search
How to implement a random-search
How to use MLflow to log results of a random-search

Risk Minimization¶

Setting:

Two spaces of objects $X$ and $Y$
A model $\pi \in \Pi$ such that $\pi : X \rightarrow Y$
A non-negative real-valued loss function $L(\hat{y},y)$

Goals:

Minimizing the risk $R(\pi) = {\bf E}\left[ L(\pi(x),y) \right]$
Finding $\pi^* = \arg\min_{\pi \in \Pi} R(\pi)$

Bias-Variance Tradeoff¶

$\Pi_m$ is the space of all models built with hyperparameters $m$
$\pi_{m,t}$ is the model obtained after $t$ training steps, starting from the initial model $\pi_{m,0}$ with hyperparameters $m$
$\pi_m^* = \arg \min_{\pi \in \Pi_m} R(\pi)$

[Figure: bias-variance tradeoff]

Empirical Risk Minimization¶

Empirical Risk : $$R_\text{emp}(\pi) = \frac{\sum^n_{i = 1} L(\pi(x_i),y_i) }{n} \text{.}$$
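As a minimal sketch of the formula above, the empirical risk is just the average loss over the $n$ samples. The 0-1 loss, the toy `threshold_model` and the five data points below are assumptions made for illustration; they are not part of the presentation. With the 0-1 loss, the empirical risk equals one minus the accuracy.

```python
import numpy as np


def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1.0 for a wrong prediction, 0.0 for a correct one."""
    return (np.asarray(y_pred) != np.asarray(y_true)).astype(float)


def empirical_risk(model, x, y, loss=zero_one_loss):
    """Average loss of `model` over the samples (x_i, y_i)."""
    return float(np.mean(loss(model(x), y)))


def threshold_model(x):
    """Hypothetical toy model: predicts class 1 when the input is positive."""
    return (np.asarray(x) > 0).astype(int)


# Toy data, invented for the example
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([0, 1, 1, 1, 0])

risk = empirical_risk(threshold_model, x, y)
print(f"empirical risk: {risk:.2f}, accuracy: {1 - risk:.2f}")
```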

McDiarmid's inequality: for any $\epsilon > 0$, $$ P\left( \left| \sum^n_{i = 1} L(\pi(x_i),y_i) - nR(\pi) \right| \geq \epsilon \right) \leq 2\exp\left( -\frac{2\epsilon^2}{\sum^n_{i=1} c^2} \right)\text{,}$$ where $c \geq | L(\pi(x_i),y_i) - L(\pi(x_j),y_j) |$ for all $i, j$.
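To get a feel for the numbers, the concentration bound can be evaluated directly. The sketch below assumes a fixed model $\pi$ (chosen without looking at the $n$ samples), i.i.d. samples, and a loss with range at most $c$. Dividing the deviation by $n$, the two-sided bound reads $P(|R_\text{emp}(\pi) - R(\pi)| \geq \epsilon) \leq 2\exp(-2n\epsilon^2/c^2)$, where $\epsilon$ is now measured on the scale of the risk rather than of the sum. The function names are illustrative.

```python
import math


def deviation_bound(n, eps, c=1.0):
    """Upper bound on P(|R_emp - R| >= eps) for n i.i.d. samples.

    Assumes a fixed model and a loss whose range is at most c
    (c = 1 for the 0-1 loss). Capped at 1 since it bounds a probability.
    """
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / c ** 2))


def samples_needed(eps, delta, c=1.0):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(c ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


# Example: about 900 held-out samples and the 0-1 loss
print(deviation_bound(n=900, eps=0.05))
print(samples_needed(eps=0.05, delta=0.05))
```

The bound only holds for a model that was not selected using the same samples, which is one way to see why a held-out set matters.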

Parameters Search¶

[Figure: random search]

From "Random Search for Hyper-Parameter Optimization" by James Bergstra and Yoshua Bengio, JMLR vol. 13, 2012
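One argument from that paper can be illustrated in a few lines: for the same budget of trials, a grid only ever tests a handful of distinct values per hyperparameter, while random search tests a new value of every hyperparameter at each trial. This matters when only some hyperparameters actually influence the result. The sketch below is an illustration under assumed ranges for `alpha` and `tol`, with a budget of 9 trials; no model is trained.

```python
import itertools

import numpy as np

n_trials = 9
rng = np.random.default_rng(0)

# Grid search: 3 values per hyperparameter -> 3 x 3 = 9 configurations
alpha_grid = np.linspace(1e-5, 1e-3, 3)
tol_grid = np.linspace(0.0, 1e-3, 3)
grid_trials = list(itertools.product(alpha_grid, tol_grid))

# Random search: 9 configurations drawn uniformly from the same ranges
random_trials = [(rng.uniform(1e-5, 1e-3), rng.uniform(0.0, 1e-3))
                 for _ in range(n_trials)]

# Count how many distinct values of alpha each strategy actually tests
distinct_alpha_grid = len({alpha for alpha, _ in grid_trials})
distinct_alpha_random = len({alpha for alpha, _ in random_trials})

print(f"grid search:   {distinct_alpha_grid} distinct alpha values")
print(f"random search: {distinct_alpha_random} distinct alpha values")
```

If `tol` turned out to have no effect on accuracy, the grid would have spent 9 trainings to learn about only 3 values of `alpha`.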

A Classification Problem¶

In [2]:

import sklearn.metrics
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the handwritten digits dataset
digits = datasets.load_digits()

# Hold out half of the data as a test set
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=True, random_state=0)

In [3]:

from sklearn import linear_model

param = {
"max_iter" : 1000,
"tol" : 1e-3,
}
# Define the model
sgd = linear_model.SGDClassifier(**param)

# Fit the model parameters
sgd = sgd.fit(x_train, y_train)

sklearn.metrics.accuracy_score(y_test, sgd.predict(x_test))

Out[3]:

0.9410456062291435

Random Search¶

In [4]:

import numpy as np
from tqdm import tqdm_notebook as tqdm

best_model = None
max_accuracy = 0

for i in tqdm(range(100)):
    
    # Sample one random configuration
    param = {"max_iter": np.random.randint(2000, 100000),
             "penalty": np.random.choice(["none", "l2", "l1", "elasticnet"]),
             "alpha": np.random.uniform(0.00001, 0.001),
             "tol": np.random.uniform(0, 1e-3)}
    
    sgd = linear_model.SGDClassifier(**param)
    sgd = sgd.fit(x_train, y_train)
    
    accuracy_test = sklearn.metrics.accuracy_score(y_test, sgd.predict(x_test))
    
    if accuracy_test > max_accuracy:
        max_accuracy = accuracy_test
        best_model = sgd
        
max_accuracy

Out[4]:

0.9555061179087876

Minimal MLflow logging¶

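import mlflow

mlflow.set_experiment("Experiment")

# param_value, n_steps and metric_value are placeholders
with mlflow.start_run():

    ...

    # Log a hyperparameter of the run
    mlflow.log_param("param", param_value)

    for i in range(n_steps):

        # Do stuff
        ...

        # Log the metric value at step i
        mlflow.log_metric("metric", metric_value, step=i)

    # Location where the files attached to this run are stored
    artifact_path = mlflow.get_artifact_uri()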

Then start the tracking UI with `mlflow ui -p $port`, where `$port` is the port to serve on.

[Figures: mlflow1, mlflow2]

In [5]:

import json, os
from urllib.parse import urlparse
from urllib.request import url2pathname

import mlflow
from joblib import dump

mlflow.set_experiment("Digits")

for i in tqdm(range(100)):
    
    with mlflow.start_run():

        # Sample one random configuration
        param = {"max_iter": np.random.randint(2000, 100000),
                 "penalty": np.random.choice(["none", "l2", "l1", "elasticnet"]),
                 "alpha": np.random.uniform(0.00001, 0.001),
                 "tol": np.random.uniform(0, 1e-3)}

        # Log the sampled hyperparameters
        for i_key, i_value in param.items():
            mlflow.log_param(i_key, i_value)

        sgd = linear_model.SGDClassifier(**param)
        sgd = sgd.fit(x_train, y_train)

        # Log the train and test accuracy of this run
        mlflow.log_metric("acc_train",
                          sklearn.metrics.accuracy_score(y_train, sgd.predict(x_train)))
        mlflow.log_metric("acc_test",
                          sklearn.metrics.accuracy_score(y_test, sgd.predict(x_test)))

        # Convert the local file:// artifact URI into a filesystem path
        # (decodes percent-escapes such as %20)
        artifact_path = url2pathname(urlparse(mlflow.get_artifact_uri()).path)
        
        # Save the fitted model and its parameters with the run
        dump(sgd, os.path.join(artifact_path, "linear.w"))
        with open(os.path.join(artifact_path, "param.json"), 'w') as f:
            json.dump(param, f)

In [6]:

df = mlflow.search_runs()
df.head()

Out[6]:

	run_id	experiment_id	status	artifact_uri	metrics.acc_train	metrics.acc_test	params.tol	params.alpha	params.penalty	params.max_iter	tags.mlflow.user	tags.mlflow.source.type	tags.mlflow.source.name
0	a55959a22dc849d38df5a39af67ed116	1	FINISHED	file:///Users/rallesiardo/OneDrive%20-%20Pages...	0.985523	0.946607	0.0007339435714911382	0.0008538356754106561	none	21146	rallesiardo	LOCAL	/Library/Frameworks/Python.framework/Versions/...
1	5737a49311c34b6f982af5c08684a451	1	FINISHED	file:///Users/rallesiardo/OneDrive%20-%20Pages...	0.962138	0.923248	0.0006091422884751423	0.0001344700195589645	l2	93658	rallesiardo	LOCAL	/Library/Frameworks/Python.framework/Versions/...
2	bebea13f4d204a32ac410da89fe84144	1	FINISHED	file:///Users/rallesiardo/OneDrive%20-%20Pages...	0.991091	0.938821	0.0006657589132622829	0.000892890427019747	elasticnet	93896	rallesiardo	LOCAL	/Library/Frameworks/Python.framework/Versions/...
3	cd8957bd31e648b38721c5c617a79998	1	FINISHED	file:///Users/rallesiardo/OneDrive%20-%20Pages...	0.988864	0.947720	0.0008679598060894229	0.0005883256391608072	none	83843	rallesiardo	LOCAL	/Library/Frameworks/Python.framework/Versions/...
4	615b89bdf8624d8386d557a0630a0662	1	FINISHED	file:///Users/rallesiardo/OneDrive%20-%20Pages...	0.992205	0.953281	0.0009000008393179844	0.0002502397717589587	none	48340	rallesiardo	LOCAL	/Library/Frameworks/Python.framework/Versions/...

In [9]:

ax = df["metrics.acc_test"].hist()

In [8]:

best_idx = df["metrics.acc_test"].idxmax()

df.loc[best_idx]

Out[8]:

run_id                                      25e4a00c840c4bad86d43eb91dc2dee3
experiment_id                                                              1
status                                                              FINISHED
artifact_uri               file:///Users/rallesiardo/OneDrive%20-%20Pages...
metrics.acc_train                                                   0.991091
metrics.acc_test                                                    0.957731
params.tol                                             0.0007230895043225555
params.alpha                                           0.0004112291559069391
params.penalty                                                            l1
params.max_iter                                                        83810
tags.mlflow.user                                                 rallesiardo
tags.mlflow.source.type                                                LOCAL
tags.mlflow.source.name    /Library/Frameworks/Python.framework/Versions/...
Name: 40, dtype: object

Don't train on your test!¶

Random search is itself part of training: picking the hyperparameters with the best test accuracy fits them to the test set
Evaluate your final model on data that was not used during the random search
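One common way to follow this advice is to add a validation set: the random search selects the model on validation accuracy, and the test set is used once at the end. The sketch below is an illustration, not the code shown during the talk. The 60/20/20 split, the 20 trials, the fixed seeds and the omission of the "none" penalty (recent scikit-learn versions expect `None` instead of that string) are all assumptions made for the example.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
digits = datasets.load_digits()

# 60% train / 20% validation / 20% test
x_rest, x_test, y_rest, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0)

best_model, best_val_accuracy = None, -1.0

for _ in range(20):
    # Sample one random configuration (plain Python types)
    param = {"max_iter": int(rng.integers(2000, 100000)),
             "penalty": str(rng.choice(["l2", "l1", "elasticnet"])),
             "alpha": float(rng.uniform(0.00001, 0.001)),
             "tol": float(rng.uniform(0, 1e-3))}

    sgd = linear_model.SGDClassifier(random_state=0, **param)
    sgd = sgd.fit(x_train, y_train)

    # Model selection only looks at the validation set
    val_accuracy = accuracy_score(y_val, sgd.predict(x_val))
    if val_accuracy > best_val_accuracy:
        best_val_accuracy, best_model = val_accuracy, sgd

# The test set is touched once, after the search is over
test_accuracy = accuracy_score(y_test, best_model.predict(x_test))
print(f"validation accuracy: {best_val_accuracy:.3f}, "
      f"test accuracy: {test_accuracy:.3f}")
```

The validation accuracy of the selected model is optimistically biased because it was maximized over the trials; the test accuracy is not, since the test set played no role in the selection.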