At work, I’ve been building predictive models. As I began to iterate on my models (Linear, DNN, Wide and Deep), I discovered that I needed a framework for comparing them. After doing some reading, I landed on running k-fold Cross Validation for each model and then comparing the two models' distributions of test-set Mean Squared Errors using a paired t-test for statistical significance.
Here are some useful snippets:
Calculating mean squared error with NumPy

# A and B are NumPy arrays, e.g. predictions and labels
mean_squared_error = ((A - B) ** 2).mean(axis=0)
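For example, a quick sanity check of the one-liner with made-up arrays (the values here are arbitrary):

```python
import numpy as np

# Hypothetical predictions and labels
A = np.array([1.0, 2.0, 3.0])
B = np.array([1.5, 2.0, 2.0])

mean_squared_error = ((A - B) ** 2).mean(axis=0)
# (0.25 + 0.0 + 1.0) / 3
```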
A convenience wrapper around scikit-learn’s KFold.split

from sklearn.model_selection import KFold

def split(pandas_dataframe, n_splits=10):
    k_fold = KFold(n_splits=n_splits)
    for train_indices, test_indices in k_fold.split(pandas_dataframe):
        print("train sz {}, test sz {}".format(len(train_indices), len(test_indices)))
        yield pandas_dataframe.iloc[train_indices], pandas_dataframe.iloc[test_indices]
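To sketch how the generator behaves, here's a toy run on a made-up DataFrame (the column names and sizes are arbitrary):

```python
import pandas as pd
from sklearn.model_selection import KFold

def split(pandas_dataframe, n_splits=10):
    k_fold = KFold(n_splits=n_splits)
    for train_indices, test_indices in k_fold.split(pandas_dataframe):
        yield pandas_dataframe.iloc[train_indices], pandas_dataframe.iloc[test_indices]

df = pd.DataFrame({"x": range(10), "y": range(10)})
folds = list(split(df, n_splits=5))
# 5 folds; each fold holds 8 training rows and 2 test rows
```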
A paired t-test for comparing the two sets of per-fold mean squared errors you get from your two k-fold runs
import math

import numpy as np
from scipy import stats

def print_stat_sig(old_mses, new_mses, old_total_mse=None, new_total_mse=None, label_mean=None):
    statistic, pvalue = stats.ttest_rel(old_mses, new_mses)
    print(statistic, pvalue)
    if pvalue < 0.01:
        # Small p-values are associated with large t-statistics.
        print('Significant: reject null hypothesis, i.e. there is a statistically significant difference')
        print('old mse mean {:.3E}, new mse mean {:.3E}'.format(np.mean(old_mses), np.mean(new_mses)))
        if label_mean:
            # print_diff_and_percent_diff is a reporting helper defined elsewhere
            print_diff_and_percent_diff("{:d}-fold ".format(len(old_mses)), label_mean, math.sqrt(np.mean(old_mses)), math.sqrt(np.mean(new_mses)))
        if old_total_mse:
            print_diff_and_percent_diff("total", label_mean, math.sqrt(old_total_mse), math.sqrt(new_total_mse))
    else:
        print('Not Significant: null hypothesis cannot be rejected, i.e. these 2 sets of values may have come from the same distribution')
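To see the comparison end to end, here's a sketch with synthetic per-fold MSEs (the numbers below are made up; in practice they would come from the k-fold runs above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-fold test-set MSEs for the old and new models
old_mses = rng.normal(2.0, 0.1, size=10)
new_mses = rng.normal(1.5, 0.1, size=10)

statistic, pvalue = stats.ttest_rel(old_mses, new_mses)
# With a consistent gap between the models across folds,
# the paired test yields a p-value well below 0.01
```

Because both models are evaluated on the same folds, the per-fold errors are paired, which is why `ttest_rel` (rather than an independent-samples t-test) is the right choice.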