Comparing Models in Python

At work, I’ve been building predictive models. As I iterated on them (Linear, DNN, Wide and Deep), I found that I needed a framework for comparing them. After some reading, I settled on k-fold cross-validation: compute the mean squared error on each test fold for both models, then compare the 2 sets of per-fold errors with a paired t-test for statistical significance.

Here are some useful snippets:

Calculating mean squared error with numpy

mean_squared_error = ((A - B) ** 2).mean(axis=0)
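As a small worked example of what that one-liner computes for 1-D inputs, here is a plain-Python sketch. The function name and the sample numbers are made up for illustration; with numpy arrays the expression above does the same thing in one step.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only: squared errors are 1, 0 and 4.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
mse = mean_squared_error(actual, predicted)  # (1 + 0 + 4) / 3
print(mse)
```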

A convenience wrapper around Scikit-learn’s KFold.split

from sklearn.model_selection import KFold

def split(pandas_dataframe, n_splits=10):
    k_fold = KFold(n_splits=n_splits)
    for train_indices, test_indices in k_fold.split(pandas_dataframe):
        print("train sz {}, test_sz {}".format(len(train_indices), len(test_indices)))
        yield pandas_dataframe.iloc[train_indices], pandas_dataframe.iloc[test_indices]
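To see what the wrapper hands back, it can help to look at the indices alone. The sketch below is a plain-Python imitation of an unshuffled k-fold split (the function name contiguous_folds is hypothetical, not part of scikit-learn): each test fold is a contiguous block of rows, and the first few folds get one extra row when the data does not divide evenly.

```python
def contiguous_folds(n_samples, n_splits=10):
    """Yield (train_indices, test_indices) as contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample.
        size = base + 1 if fold < extra else base
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(contiguous_folds(10, n_splits=3))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Because the folds are contiguous, a dataframe that is sorted (by date or by label, say) will produce folds that differ systematically. In that case passing shuffle=True and a random_state to KFold is probably worth considering.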

The t-test for comparing the 2 sets of mean squared errors you get from your 2 k-fold comparisons

import math

import numpy as np
from scipy import stats

def print_stat_sig(old_mses, new_mses, old_total_mse=None, new_total_mse=None, label_mean=None):
    # Paired t-test: fold i of the old model is compared with fold i of the new model.
    statistic, pvalue = stats.ttest_rel(old_mses, new_mses)
    print(statistic, pvalue)
    if pvalue < 0.01:
        # Small p-values are associated with large t-statistics.
        print('Significant: reject null hypothesis, i.e. there is a statistically significant difference')
        print('old mse mean {:.3E}, new mse mean {:.3E}'.format(np.mean(old_mses), np.mean(new_mses)))
        if label_mean:
            # print_diff_and_percent_diff is a helper that is not shown in this post.
            print_diff_and_percent_diff("{:d}-fold ".format(len(old_mses)), label_mean, math.sqrt(np.mean(old_mses)), math.sqrt(np.mean(new_mses)))
            if old_total_mse:
                print_diff_and_percent_diff("total", label_mean, math.sqrt(old_total_mse), math.sqrt(new_total_mse))
    else:
        print('Not Significant: null hypothesis cannot be rejected, i.e. these 2 sets of values may have come from the same place')
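For intuition about what the paired test measures, here is a minimal sketch that computes the paired t-statistic by hand with only the standard library. The function name and the MSE values are invented for illustration, and it does not compute the p-value (that needs the t distribution with n - 1 degrees of freedom, which is what scipy supplies).

```python
import math
import statistics


def paired_t_statistic(old_values, new_values):
    """t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-fold differences."""
    diffs = [old - new for old, new in zip(old_values, new_values)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n))


# Made-up per-fold MSEs for two models evaluated on the same 4 folds.
old_mses = [4.0, 5.0, 6.0, 5.0]
new_mses = [3.0, 3.0, 5.0, 4.0]
t_stat = paired_t_statistic(old_mses, new_mses)
print(t_stat)
```

A positive statistic here means the old model's errors were larger on average. The pairing only makes sense if both models were scored on exactly the same folds, in the same order.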


Hadoop finally hits home

Big Data has finally produced results for my team at work. Other parts of the business have been working on transforming our company-wide data lake into a structured cache in Hadoop with the business rules already applied. Our team is now able to do terabyte-scale transformations on this dataset in hours instead of weeks. We’re also able to delete code and reimagine processes. It’s been a long time coming, but we’re finally there. Thanks to the open source community for delivering the goods: Hadoop, Cassandra, and MongoDB.


Deploying Multiple Grails Apps to Tomcat7

I recently pushed a second Grails app to my production Tomcat server, and it failed to start up. /var/lib/tomcat7/logs/catalina.out gave a long list of errors, the first of which was…

 Database may be already in use: “Locked by another process”. Possible solutions: close all other connection(s); use the server mode [90020-173].

It turns out that the databases were named the same in the grails-app/conf/DataSource.groovy file.

The fix is to give each production database a meaningful, and therefore different, name. For instance, change prodDb in the url below to something specific to each app.


environments {
  production {
    dataSource {
      dbCreate = "update"
      url = "jdbc:h2:prodDb;MVCC=TRUE;LOCK_TIMEOUT=10000;DB_CLOSE_ON_EXIT=FALSE"
    }
  }
}

After this, you’ll need to rebuild and redeploy both apps (replace *.war with the specific version you want to deploy).

path/to/your/grails/project $ grails war
$ cp target/*.war /var/lib/tomcat7/webapps
path/to/OTHER/grails/project $ grails war
$ cp target/*.war /var/lib/tomcat7/webapps
$ sudo /etc/init.d/tomcat7 restart

Apache2 proxy for Tomcat7


Tomcat runs on port 8080 by default, so Apache needs to forward requests to it.

Put the following 2 lines in your site’s file under /etc/apache2/sites-available/:

ProxyPass /logger http://localhost:8080/logger
ProxyPassReverse /logger http://localhost:8080/logger

Make sure that the proxied path (/logger here) and the Tomcat deployment have the same name, or static content won’t be served correctly.



MaxClients in Apache2 is a canary for other vhost logging misconfiguration

I first noticed a problem when the browser started timing out with the message “Establishing Secure Connection”.

I checked Apache’s log at /var/log/apache2/error.log and found the following

[error] server reached MaxClients setting, consider raising the MaxClients setting

I knew that at 12:30 am I wasn’t getting that many legitimate users. I noticed that my other_vhost_access.log was getting huge and showed the following spam:

 - - [29/Mar/2014:00:16:50 -0500] "GET HTTP/1.0" 200 0 "" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv: Gecko/20100824 Firefox/3.6.9"
 - - [29/Mar/2014:00:16:50 -0500] "GET[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG] HTTP/1.0" 302 0 "" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; en) Opera 11.00"

I disabled logging by commenting out the line in /etc/apache2/conf.d/other_vhost_access_log.




A meme is a picture with words on it?

ES: Our quality of life depends on recognizing we are agents of cultural adaptation. What we say and do matters. Looks can be deceiving, we are stronger than we think.

I recently overheard a coworker say that “A meme is a picture with words on it.” That is an understatement. Every idea we have, word we speak, and action we perform has the potential to go viral just as this media format has. A meme loosely refers to concepts that live in the mind, or behaviors, that spread for whatever reason.

The word meme was patterned after gene, the basic unit of inheritance on the biological level. I guess changing the ‘g’ to an ‘m’ signaled that this was about the mind. Ask Richard Dawkins, who coined the term. Genes and memes are similar, but there is one important difference that is at the heart of my mission here.

Genes and memes spread. Some ideas spread with wild success. Malcolm Gladwell documents this in his book The Tipping Point. Chip and Dan Heath offer a framework for summarizing what makes ideas memorable in their book Made to Stick. As for genes, we are members of a human population 7 billion strong, so our genes evidently match our environment.

Genes and memes affect our behavior. Genes are our nature. Memes are our nurture. There is no sense arguing over which one has more of an effect on our behavior. They both do. My point is that we have conscious control over one: Memes.

Self-reflection and self-direction, consciousness and conscientiousness, mindfulness and right action: these are the stuff of memes. More than just funny pictures, they make this world a prison or paradise.


How to Learn How to Dev

Learning software development takes lots of time and grit. In other words, it takes passion. The best motivation is to see your personal projects working. However, I have learned that diving straight into building your project is not the fastest road to success. It is better to first learn how to use existing software that does similar things and then bridge the gap to your own idea.

Of course, it is best to have a mentor to help you find software similar to what you imagine, but I’ll assume that if you’re reading this post it is because you want to learn to be a dev but don’t have a mentor and don’t have any idea where to find one. That’s the position I was in.

  • Google around for popular technologies that you can be gainfully employed with. If lots of people like to use it and make money doing so, chances are you can too.
  • Watch tutorial videos to get going quickly.
    • Follow along with the tutorial on your own machine. Reproduce what they did. Only make modifications once you have the replication working exactly as expected.
    • starts at very basic level, interactive coding assignments give excellent immediate feedback
    • best if you are already a technical user trying to learn a different tech stack.
  • Google errors. Look for Stack Overflow answers that are highly rated or check-marked (as the accepted answer).
  • Have interesting projects that push you to learn new technologies and get them working.
  • If you have been beating your head against a frustration for several hours, change your context. Do something else. If you are tired or it is late, go to sleep. Live to fight another day. Chances are that you’ll solve it in 15 minutes when you wake up tomorrow.
  • Listen to developer podcasts to hear what’s popular, important or perilous.
  • Find a mentor or dev community. Go to local user groups. Bounce ideas off other people.

Why devs blog IMHO

Here are some reasons why I think devs blog, or at least ought to. The last is the most important.

  1. Share ideas with a much broader audience than in an email list or chat room
  2. Establish themselves as knowledgeable/skillful in a particular area
  3. Catalog problems/solutions for personal quick reference

The faintest ink is better than the best memory. As a full stack dev using multiple languages and frameworks, I often find myself needing to re-figure out what I have already discovered. If it is not possible to automate the problem away (viz. make executable documentation), then a blog post documenting the problem and solution is the best product you can give yourself and the world.

Apache virtual host setup needs proper DNS A records

When I was first setting up SSL with Apache, I discovered that Apache was sending me to a different virtual host depending on whether I typed the www. prefix or left it off.

<VirtualHost *:443>

I eventually learned that I needed to change my configuration to:

ServerAlias *

But the most important part was making sure I had both a www and a blank/@ DNS A record or CNAME record on my domain registration site.
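A quick way to check this from a script is to resolve both names and compare the results. The sketch below is only an illustration: the function name, the example.test domain, and the fake resolver are assumptions made so that the example runs without network access. Leaving out the resolve argument falls back to socket.gethostbyname, which performs a real lookup.

```python
import socket


def check_dns_records(domain, resolve=socket.gethostbyname):
    """Return {hostname: address or None} for the bare domain and its www. name."""
    results = {}
    for host in (domain, "www." + domain):
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


def fake_resolve(host):
    """Stand-in resolver: only the bare domain has a record (hypothetical data)."""
    table = {"example.test": "203.0.113.10"}
    if host not in table:
        raise OSError("no record for " + host)
    return table[host]


records = check_dns_records("example.test", resolve=fake_resolve)
missing = [host for host, address in records.items() if address is None]
print(missing)
```

Any name that shows up in missing is one the browser cannot reach at all, regardless of how the virtual hosts are configured.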

I won’t make that mistake again!

Why I use emacs

To be honest, I first started learning it as a way to get used to moving my cursor around the shell to edit and search for shell commands. A Linode tutorial mentioned that the shell and Emacs largely share the same commands, so doubling up on them seemed to me to be the fastest way to master them. Macs also use some of these basic keybindings almost universally (like Control-a for home, C-e for end, C-b for back one space and C-f for forward one space). Finally, most well-developed editors support Emacs keybindings to some extent, so you can be productive in those environments immediately.

Emacs was initially frustrating because I had the expectation that software should be usable without having to read tutorials or documentation. When I was unable to exit out of emacs after trying numerous key combinations, I finally broke down and started the tutorial. It turns out that once you know how to save (Control-x C-s) and exit (C-x C-c), it pretty much acts like a normal editor with arrow keys and enter all doing what you expect.

The barrier to entry is even lower on the GUI version which has menus you can click around. The menus even explain what the shortcuts are so you can eventually give up the mouse entirely.

One particular feature I love about emacs is its macro recording system. I can’t tell you how many times keyboard macros have saved me lots of time. They are a lot easier than hacking together a text processing script for a one off edit. Pressing F3 starts recording, F4 stops recording and F4 also replays the recorded key sequence. Meta(alt)-[number] F4 repeats the command a number of times.

Despite all these benefits, I continued to sit on the fence between vim, emacs, and GUI editors like Eclipse until I heard about org-mode. I had been looking for a way to quickly organize and maintain large amounts of information for a long time. Org-mode is that and more. Expanding and collapsing information, organizing it hierarchically, instantly searching and navigating a huge corpus of notes, moving nodes from one parent to another, and much more can all be done in one or two keystrokes. Once you are done organizing the information, you can export it to many different formats (especially using pandoc). Org-mode, great as it is, is only one example of the killer features that made me switch whole hog: Emacs’ extensibility and community.

But these features are double-edged swords. Its extensibility makes learning how to make changes a difficult process for the uninitiated. Its community is so focused on customizability that the shipped defaults are not sufficient to enable productivity. However, this is nothing a good video tutorial can’t fix. Pair that with some googling skills and you’re off to the races.

Here are some video tutorials to get you started: