max_depth=None,  # we know the best for this is 3 but what about the other?
min_impurity_decrease=0.0

max_depth=None,
min_impurity_decrease=0.0,
min_samples_leaf=1,
max_leaf_nodes=None,

# Set the parameters by cross-validation
tuned_parameters = [
    {"max_depth": [1, 2, 3, None],
     "min_impurity_decrease": [0.001, 0.01, 0.1, ],
     "criterion": ["gini", "entropy"],
    }
]
....
clf = GridSearchCV(DTC(), tuned_parameters, scoring="%s" % score)


from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier as DTC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier to this data, we need to flatten each image,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [
    {"max_depth": [1, 2, 3, None],
     "min_impurity_decrease": [0.001, 0.01, 0.1, ],
     "criterion": ["gini", "entropy"],
    }
]

scores = ["accuracy"]

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(DTC(), tuned_parameters, scoring="%s" % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_["mean_test_score"]
    stds = clf.cv_results_["std_test_score"]
    for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

# Note the problem is too easy: the hyperparameter plateau is too flat and the
# output model is the same for precision and recall with ties in quality.

# Tuning hyper-parameters for accuracy

Best parameters set found on development set:

{'criterion': 'entropy', 'max_depth': None, 'min_impurity_decrease': 0.001}

Grid scores on development set:

0.209 (+/-0.007) for {'criterion': 'gini', 'max_depth': 1, 'min_impurity_decrease': 0.001}
0.209 (+/-0.007) for {'criterion': 'gini', 'max_depth': 1, 'min_impurity_decrease': 0.01}
0.107 (+/-0.003) for {'criterion': 'gini', 'max_depth': 1, 'min_impurity_decrease': 0.1}
0.328 (+/-0.007) for {'criterion': 'gini', 'max_depth': 2, 'min_impurity_decrease': 0.001}
0.328 (+/-0.007) for {'criterion': 'gini', 'max_depth': 2, 'min_impurity_decrease': 0.01}
0.107 (+/-0.003) for {'criterion': 'gini', 'max_depth': 2, 'min_impurity_decrease': 0.1}
0.479 (+/-0.026) for {'criterion': 'gini', 'max_depth': 3, 'min_impurity_decrease': 0.001}
0.478 (+/-0.024) for {'criterion': 'gini', 'max_depth': 3, 'min_impurity_decrease': 0.01}
0.107 (+/-0.003) for {'criterion': 'gini', 'max_depth': 3, 'min_impurity_decrease': 0.1}
0.842 (+/-0.045) for {'criterion': 'gini', 'max_depth': None, 'min_impurity_decrease': 0.001}
0.786 (+/-0.071) for {'criterion': 'gini', 'max_depth': None, 'min_impurity_decrease': 0.01}
0.107 (+/-0.003) for {'criterion': 'gini', 'max_depth': None, 'min_impurity_decrease': 0.1}
0.200 (+/-0.013) for {'criterion': 'entropy', 'max_depth': 1, 'min_impurity_decrease': 0.001}
0.200 (+/-0.013) for {'criterion': 'entropy', 'max_depth': 1, 'min_impurity_decrease': 0.01}
0.200 (+/-0.013) for {'criterion': 'entropy', 'max_depth': 1, 'min_impurity_decrease': 0.1}
0.351 (+/-0.025) for {'criterion': 'entropy', 'max_depth': 2, 'min_impurity_decrease': 0.001}
0.351 (+/-0.025) for {'criterion': 'entropy', 'max_depth': 2, 'min_impurity_decrease': 0.01}
0.351 (+/-0.025) for {'criterion': 'entropy', 'max_depth': 2, 'min_impurity_decrease': 0.1}
0.527 (+/-0.039) for {'criterion': 'entropy', 'max_depth': 3, 'min_impurity_decrease': 0.001}
0.527 (+/-0.039) for {'criterion': 'entropy', 'max_depth': 3, 'min_impurity_decrease': 0.01}
0.519 (+/-0.050) for {'criterion': 'entropy', 'max_depth': 3, 'min_impurity_decrease': 0.1}
0.845 (+/-0.025) for {'criterion': 'entropy', 'max_depth': None, 'min_impurity_decrease': 0.001}
0.827 (+/-0.041) for {'criterion': 'entropy', 'max_depth': None, 'min_impurity_decrease': 0.01}
0.614 (+/-0.057) for {'criterion': 'entropy', 'max_depth': None, 'min_impurity_decrease': 0.1}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

           0       0.92      0.89      0.91        27
           1       0.88      0.86      0.87        35
           2       0.93      0.78      0.85        36
           3       0.69      0.93      0.79        29
           4       0.86      0.83      0.85        30
           5       0.87      0.82      0.85        40
           6       0.97      0.86      0.92        44
           7       0.90      0.97      0.94        39
           8       0.78      0.90      0.83        39
           9       0.82      0.76      0.78        41

    accuracy                           0.86       360
   macro avg       0.86      0.86      0.86       360
weighted avg       0.87      0.86      0.86       360

{'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
  'kernel': ['rbf'], 'class_weight':['balanced', None]}
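
The dictionary above has the shape of a `param_distributions` argument for scikit-learn's `RandomizedSearchCV`: continuous hyper-parameters are given as distributions to sample from, discrete ones as lists. Below is a minimal sketch of how it might be used. The `SVC` estimator, the digits data, the pixel rescaling, `n_iter=5` and `cv=3` are assumptions made here for illustration and are not taken from the lecture.

```python
import scipy.stats
from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Digits data, as in the grid-search example; pixels are rescaled from 0..16 to 0..1
X, y = datasets.load_digits(return_X_y=True)
X = X / 16.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Distributions are sampled, lists are drawn from uniformly
param_distributions = {
    "C": scipy.stats.expon(scale=100),
    "gamma": scipy.stats.expon(scale=0.1),
    "kernel": ["rbf"],
    "class_weight": ["balanced", None],
}

# n_iter fixes the budget: only 5 sampled configurations are evaluated
search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=5, cv=3, random_state=0)
search.fit(X_train, y_train)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # number of sampled configurations
```

Unlike grid search, the cost is set by `n_iter` and does not grow with the number of hyper-parameters being searched.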

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
> array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
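
Per-class precision and recall can be read directly off this matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes. A class that is never predicted is given precision 0 here, which is a convention chosen for this example.

```python
import numpy as np

# Confusion matrix from the example above (rows: true class, columns: predicted class)
cm = np.array([[2, 0, 0],
               [0, 0, 1],
               [1, 0, 2]])

tp = np.diag(cm).astype(float)        # correct predictions per class
predicted_per_class = cm.sum(axis=0)  # column sums: how often each class was predicted
actual_per_class = cm.sum(axis=1)     # row sums: how often each class really occurs

# Guard against division by zero for classes that are never predicted / never occur
precision = np.divide(tp, predicted_per_class,
                      out=np.zeros_like(tp), where=predicted_per_class > 0)
recall = np.divide(tp, actual_per_class,
                   out=np.zeros_like(tp), where=actual_per_class > 0)
accuracy = tp.sum() / cm.sum()

print("precision:", precision)
print("recall:   ", recall)
print("accuracy: ", accuracy)
```

Class 1 is never predicted, so its column is all zeros and its recall is 0 as well.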


import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.01, 0.2])

fpr, tpr, thresholds = roc_curve(y, scores, drop_intermediate=False)
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % auc(fpr, tpr),
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()


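def thrs_count(scores, index, denom, thrs, normalize=True):
    """Count the samples selected by `index` whose score is >= `thrs`."""
    count = sum(scores[index] >= thrs)
    # With normalize=True the count becomes a rate (TPR or FPR)
    return count / denom if normalize else count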


pos = y == 1  # boolean mask of the positives
neg = y == 0  # boolean mask of the negatives
n_pos = sum(pos)  # number of ground-truth positives
n_neg = sum(neg)  # number of ground-truth negatives
sort_scores = sorted(scores, reverse=True)  # sort highest first
sort_scores = [sort_scores[0] + 1] + sort_scores  # add extra point
ROC = [(thrs_count(scores, pos, n_pos, thrs), thrs_count(scores, neg, n_neg, thrs))
       for thrs in sort_scores]  # for each threshold, count TP and FP and normalize them
my_TPR, my_FPR = zip(*ROC)  # unzip the (TPR, FPR) pairs into two tuples


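# Compare thresholds from index 1 on: the first (extra) threshold is
# max(score) + 1 in older scikit-learn releases and np.inf in newer ones.
assert all([np.allclose(tpr, my_TPR),
            np.allclose(fpr, my_FPR),
            np.allclose(thresholds[1:], sort_scores[1:])]), 'Your ROC is wrong'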


plt.figure(figsize=(7,7))
plt.plot(my_FPR,my_TPR);
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.show()


AUC = np.dot(np.diff(fpr), tpr[1:])
print(AUC)
assert np.allclose(AUC, auc(fpr, tpr)), 'my AUC is WRONG'

0.5555555555555556
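
The same value can be obtained without building the curve at all: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below uses a hypothetical helper, `pairwise_auc`, and counts ties as half a win, which is the usual convention.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.01, 0.2])


def pairwise_auc(y, scores, pos_label=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos_scores = scores[y == pos_label]
    neg_scores = scores[y != pos_label]
    # All pairwise differences: entry (i, j) is pos_scores[i] - neg_scores[j]
    diff = pos_scores[:, None] - neg_scores[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


print(pairwise_auc(y, scores))
```

Here 5 of the 9 pairs are ranked correctly, which matches the 5/9 ≈ 0.556 printed above.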


y = np.array([-1, 1, -1, 1, -1, 1])
scores = np.array([1000, 0.5, -0.9, 0.8, -0.1, 0.1])


import numpy as np
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y, scores, drop_intermediate=False, pos_label=1)
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % auc(fpr, tpr),
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()


pos = y == 1  # boolean mask of the positives
neg = y == -1  # boolean mask of the negatives
n_pos = sum(pos)  # number of ground-truth positives
n_neg = sum(neg)  # number of ground-truth negatives
sort_scores = sorted(scores, reverse=True)  # sort highest first
sort_scores = [sort_scores[0] + 1] + sort_scores  # add extra point
ROC = [(thrs_count(scores, pos, n_pos, thrs, normalize=False),
        thrs_count(scores, neg, n_neg, thrs, normalize=False))
       for thrs in sort_scores]  # for each threshold, count TP and FP (raw counts, not normalized)
my_TPR, my_FPR = zip(*ROC)  # unzip into (TP counts, FP counts)
table = '|thrs 	|tpr 	|fpr 	|diff_fpr 	|\n|--- 	|--- 	|--- 	|--- 	\n'
for count, (mtpr, mfpr, thrs) in enumerate(zip(my_TPR, my_FPR, sort_scores)):
    diff_fpr = mfpr/n_neg-my_FPR[count-1]/n_neg
    table += f'|{thrs} 	|{mtpr}/{n_pos} 	|{mfpr}/{n_neg} 	| { str(diff_fpr)[:5] if count > 0 else 0}	\n'
print(table)

|thrs 	|tpr 	|fpr 	|diff_fpr 	|
|--- 	|--- 	|--- 	|--- 	
|1001.0 	|0/3 	|0/3 	| 0	
|1000.0 	|0/3 	|1/3 	| 0.333	
|0.8 	|1/3 	|1/3 	| 0.0	
|0.5 	|2/3 	|1/3 	| 0.0	
|0.1 	|3/3 	|1/3 	| 0.0	
|-0.1 	|3/3 	|2/3 	| 0.333	
|-0.9 	|3/3 	|3/3 	| 0.333
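
The area under this curve follows from the table by summing `diff_fpr * tpr` over the rows. The sketch below hard-codes the counts printed above and uses exact fractions. The variable names are chosen here for illustration.

```python
from fractions import Fraction

# (threshold, TP count, FP count) rows copied from the table above
rows = [(1001.0, 0, 0), (1000.0, 0, 1), (0.8, 1, 1), (0.5, 2, 1),
        (0.1, 3, 1), (-0.1, 3, 2), (-0.9, 3, 3)]
n_pos, n_neg = 3, 3

area = Fraction(0)
for (_, _, fp_prev), (_, tp, fp) in zip(rows, rows[1:]):
    diff_fpr = Fraction(fp - fp_prev, n_neg)  # width of the step along the FPR axis
    area += diff_fpr * Fraction(tp, n_pos)    # height of the step is the current TPR

print(area)
```

The negative sample scored 1000 costs exactly one step of width 1/3 at TPR 0, however large its score is: AUC depends only on the ranking of the scores, not on their magnitude.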


y = np.array([-1, 1, -1, 1, -1, 1])
alpha = 1
scores = np.array([-alpha, alpha, -alpha, alpha, -alpha, alpha])


import numpy as np
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y, scores, drop_intermediate=False, pos_label=1)
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % auc(fpr, tpr),
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()


y = np.array([-1, 1, -1, 1, -1, 1])
alpha = -1
scores = np.array([-alpha, alpha, -alpha, alpha, -alpha, alpha])


import numpy as np
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y, scores, drop_intermediate=False, pos_label=1)
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % auc(fpr, tpr),
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()

Split	Option A	Option B	Option C
Train	70%	60%	80%
Validation (dev)	20%	20%	10%
Test	10%	20%	10%
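
As an illustration of the first column (70% / 20% / 10%), the sketch below partitions sample indices into three disjoint sets after a random shuffle. The helper `split_indices` and its parameters are hypothetical names introduced here. It assumes the samples are i.i.d., so a plain shuffle is appropriate.

```python
import numpy as np


def split_indices(n_samples, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/validation/test."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]  # whatever is left goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation indices are used for model selection only. The test indices should be touched once, at the very end.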

Person ID (training example)	Overcooked pasta?	Waiting Time	Rude Waiter?	Satisfied $y$
$\mathbf{x}_1$	Yes	Long	No	1 (yes)
$\mathbf{x}_2$	No	Short	Yes	1 (yes)
$\mathbf{x}_3$	Yes	Long	Yes	0 (no)
$\mathbf{x}_4$	No	Long	Yes	1 (yes)
$\mathbf{x}_5$	Yes	Short	Yes	0 (no)

Person ID (training example)	[Feat. 1] Overcooked pasta?	[Feat. 2] Waiting Time	[Feat. 3] Rude Waiter?	Satisfied $y$
$\mathbf{x}_1$	Yes	Long	No	1 (yes)
$\mathbf{x}_2$	No	Short	Yes	1 (yes)
$\mathbf{x}_3$	Yes	Long	Yes	0 (no)
$\mathbf{x}_4$	No	Long	Yes	1 (yes)
$\mathbf{x}_5$	Yes	Short	Yes	0 (no)

Split Feature	IG
Overcooked Pasta	0.420
Waiting Time	0.020
Rude Waiter	0.171
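
These information-gain values can be reproduced from the five training examples. The sketch below assumes entropy in bits (log base 2) as the impurity measure. The function names are chosen here for illustration.

```python
from collections import Counter
from math import log2

# (overcooked pasta, waiting time, rude waiter, satisfied) rows from the table
data = [
    ("Yes", "Long", "No", 1),
    ("No", "Short", "Yes", 1),
    ("Yes", "Long", "Yes", 0),
    ("No", "Long", "Yes", 1),
    ("Yes", "Short", "Yes", 0),
]
features = ["Overcooked pasta", "Waiting time", "Rude waiter"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature_idx):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[feature_idx] for r in rows):
        subset = [r[-1] for r in rows if r[feature_idx] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


for i, name in enumerate(features):
    print(f"{name}: {information_gain(data, i):.3f}")
```

The split on "Overcooked pasta" has the largest gain, so it becomes the root. The same function can then be applied to the rows of each child node.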

Person ID (training example)	[Feat. 1] Overcooked pasta?	[Feat. 2] Waiting Time	[Feat. 3] Rude Waiter?	Satisfied $y$
$\mathbf{x}_1$	~~Yes~~	Long	No	1 (yes)
$\mathbf{x}_3$	~~Yes~~	Long	Yes	0 (no)
$\mathbf{x}_5$	~~Yes~~	Short	Yes	0 (no)

Machine Learning

9. Model Selection and Evaluation Metrics

Recap of the previous lecture

Today's lecture

It is about how we evaluate models

1) Model selection and Cross-Validation

2) Hyper-parameter tuning

3) Metrics for Evaluation (mainly for classification)

This lecture material is taken from

ERM and Its Limits: Bias-Variance Trade-off

Bias-Variance Trade-off

Bias-Variance Trade-off as Dartboard

BIAS-Variance Trade-off

Bias-VARIANCE Trade-off

Over or Under Fitting

Error as a function of model complexity

Bias-Variance Error "Proof Sketch"

Bias-Variance Trade-off as Dartboard

Sampling Distribution

Bias-Variance Trade-off as Dartboard

Introduction to Supervised Learning

Bayes optimal classifier and Bias-Variance Trade-off

Error as a function of model complexity

1) Model selection and Cross-Validation

Learning = a) Lower ⬇︎ the cost $\mathcal{L}$ in training AND b) ⬇︎ also in test

The goal of learning is to do well on unseen samples (low generalization error)

1) Held-Out Validation Set

A single validation set (or development set)

Set size and partitioning: no single agreed-upon rule

Motivation for each split

Problem of a single held-out split

Nested Cross-Validation

K-fold Cross-Validation

Which value for K? (10)

What if we have so little data that even K-fold won't work?

Leave-One-Out (LOO) Cross-Validation

Cross-Validation Pseudo-Code
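
A minimal sketch of K-fold cross-validation in NumPy follows. It is not the lecture's own pseudo-code. The helpers `k_fold_indices` and `cross_val_score_sketch`, the toy nearest-centroid model, and the synthetic two-blob data are all assumptions introduced here to make the loop concrete.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]


def cross_val_score_sketch(fit, score, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, then average over folds."""
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit(X[train_idx], y[train_idx])
        fold_scores.append(score(model, X[val_idx], y[val_idx]))
    return np.mean(fold_scores), np.std(fold_scores)


# Toy model: classify each point by its nearest class centroid
def fit_centroids(X, y):
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])


def accuracy(model, X, y):
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.mean(classes[dists.argmin(axis=1)] == y)


# Synthetic data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

mean_acc, std_acc = cross_val_score_sketch(fit_centroids, accuracy, X, y, k=5)
print(f"{mean_acc:.3f} +/- {std_acc:.3f}")
```

Setting `k` equal to the number of samples turns the same loop into leave-one-out cross-validation.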

CV works if the data is sampled i.i.d. from the same distribution

What to do after CV?

Practical Example

CV Application Caveat

Truly i.i.d. $\rightarrow$ K-fold

Strong Class Imbalance $\rightarrow$ Stratified K-fold

Unseen Group in test $\rightarrow$ Group K-fold
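
The last two cases map onto scikit-learn's `StratifiedKFold` and `GroupKFold` splitters. The sketch below uses a small synthetic label vector with an 80/20 class imbalance and five hypothetical groups of four samples each. All data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.zeros((20, 1))                # features are irrelevant for the split itself
y = np.array([0] * 16 + [1] * 4)     # strong class imbalance: 16 negatives, 4 positives
groups = np.repeat(np.arange(5), 4)  # 5 groups of 4 samples each

# Stratified K-fold keeps the class proportions (roughly) equal in every fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print("positives per test fold:", positives_per_fold)

# Group K-fold never puts samples of the same group in both train and test
gkf = GroupKFold(n_splits=5)
leaks = [bool(set(groups[train_idx]) & set(groups[test_idx]))
         for train_idx, test_idx in gkf.split(X, y, groups)]
print("any group shared between train and test:", any(leaks))
```

With a plain K-fold on the unshuffled `y`, some test folds would contain no positive sample at all, which makes per-fold metrics such as recall undefined.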

2) Hyper-parameter tuning of an estimator

Hyper-parameter: definition

Hyper-parameters are parameters that are not directly learnt within estimators.

Recipe

Hyper-parameter tuning

What happens if we have to choose/validate 2 hyper-params?

Grid Search on a "plane" of hyper-params

How many models do we train with k=10-fold cross-validation and grid search over depth $\in \{1,2,3\}$ and min impurity decrease $\in \{0.01,0.1\}$?
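
The count can be checked by enumerating the grid: every combination of hyper-parameters is trained once per fold. This is a small sketch using only the standard library. The dictionary keys follow the scikit-learn parameter names used earlier.

```python
from itertools import product

grid = {"max_depth": [1, 2, 3], "min_impurity_decrease": [0.01, 0.1]}
k = 10  # number of cross-validation folds

# Cartesian product of all hyper-parameter values
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_fits = len(combinations) * k

print(len(combinations), "combinations x", k, "folds =", n_fits, "fits")
```

That is 3 x 2 = 6 combinations and 60 trained models. If the best configuration is then refitted on the whole development set, as `GridSearchCV` does by default with `refit=True`, one more fit is added.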

😱 Grid Search on a "4D-cube" of hyper-params

Method for searching

Grid Search

Randomized Parameter Optimization

Searching for optimal parameters with successive halving

Artificial Intelligence and Machine Learning

Unit II

Evaluation Metrics

OPIS for Course Evaluation

New OPIS Code: 2339RD05

Before moving to metrics, let's look at a hypothetical exam question about decision trees

Decision Tree Sample Question in the Exam

Sketch of Solution

We are done with only 1 of the 3 features.

Now we have to repeat this for the other 2 features and then split on the feature with the maximum IG

Fast Forward

So now the problem is:

Split Overcooked pasta == Yes