At work, I’ve been building predictive models. As I began to iterate on my models (Linear, DNN, Wide and Deep), I discovered that I needed a framework for comparing them. After doing some reading, I landed on running k-fold cross-validation for each model and then comparing the distributions of test-set mean squared errors from the two models using a paired t-test for statistical significance.

Here are some useful snippets:

Calculating mean squared error with NumPy

`mean_squared_error = ((A - B) ** 2).mean(axis=0)`
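For example, with two small made-up arrays (the values here are invented purely for illustration), `axis=0` averages over the rows, giving one MSE per column:

```python
import numpy as np

# Hypothetical predictions (A) and targets (B), two columns each.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.5, 2.0], [2.5, 4.5]])

# Squared differences averaged down each column.
mean_squared_error = ((A - B) ** 2).mean(axis=0)
# First column: (0.25 + 0.25) / 2 = 0.25; second: (0.0 + 0.25) / 2 = 0.125
```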

A convenience wrapper around Scikit-learn’s KFold.split

```python
from sklearn.model_selection import KFold


def split(pandas_dataframe, n_splits=10):
    k_fold = KFold(n_splits=n_splits)
    for train_indices, test_indices in k_fold.split(pandas_dataframe):
        print("train sz {}, test sz {}".format(len(train_indices), len(test_indices)))
        yield pandas_dataframe.iloc[train_indices], pandas_dataframe.iloc[test_indices]
```
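To show the wrapper in action, here is a quick sketch on a made-up 10-row DataFrame (the frame and the 5-fold split are assumptions for illustration; the size printout is dropped to keep it quiet):

```python
import pandas as pd
from sklearn.model_selection import KFold


def split(pandas_dataframe, n_splits=10):
    k_fold = KFold(n_splits=n_splits)
    for train_indices, test_indices in k_fold.split(pandas_dataframe):
        yield pandas_dataframe.iloc[train_indices], pandas_dataframe.iloc[test_indices]


# Hypothetical 10-row frame: with 5 folds, each split has 8 train rows and 2 test rows.
df = pd.DataFrame({"feature": range(10), "label": range(10)})
fold_sizes = [(len(train), len(test)) for train, test in split(df, n_splits=5)]
```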

The paired t-test for comparing the two sets of mean squared errors you get from the two models’ k-fold runs

```python
import math

import numpy as np
from scipy import stats


def print_stat_sig(old_mses, new_mses, old_total_mse=None, new_total_mse=None,
                   label_mean=None):
    statistic, pvalue = stats.ttest_rel(old_mses, new_mses)
    print(statistic, pvalue)
    if pvalue < 0.01:
        # Small p-values are associated with large t-statistics.
        print('Significant: reject null hypothesis, i.e. there is a '
              'statistically significant difference')
        print('old mse mean {:.3E}, new mse mean {:.3E}'.format(
            np.mean(old_mses), np.mean(new_mses)))
        if label_mean:
            # print_diff_and_percent_diff is a helper defined elsewhere.
            print_diff_and_percent_diff("{:d}-fold ".format(len(old_mses)), label_mean,
                                        math.sqrt(np.mean(old_mses)),
                                        math.sqrt(np.mean(new_mses)))
            if old_total_mse:
                print_diff_and_percent_diff("total", label_mean,
                                            math.sqrt(old_total_mse),
                                            math.sqrt(new_total_mse))
    else:
        print('Not Significant: null hypothesis cannot be rejected, i.e. these 2 sets '
              'of values may have come from the same place')
```
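To make the comparison concrete, here is the core `ttest_rel` call on two made-up sets of per-fold MSEs (the numbers are invented for illustration). The test is paired: fold i of the old model’s run is matched against fold i of the new model’s run, so both models should be evaluated on the same folds:

```python
from scipy import stats

# Hypothetical per-fold test MSEs from 10-fold CV runs of an old and a new model.
old_mses = [0.52, 0.47, 0.51, 0.49, 0.50, 0.53, 0.48, 0.50, 0.52, 0.49]
new_mses = [0.41, 0.40, 0.42, 0.39, 0.43, 0.44, 0.38, 0.41, 0.40, 0.42]

# Paired t-test on the fold-by-fold differences.
statistic, pvalue = stats.ttest_rel(old_mses, new_mses)
```

Because the new model’s MSE is consistently lower, the differences are large relative to their spread, and the test reports a positive statistic with a p-value well below 0.01.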