Dave Fernig

the virtue of sloth

Progress isn’t made by early risers. It’s made by lazy men trying to find easier ways to do something.

— Robert Heinlein

Yesterday a friend came to me with a problem. He and his colleagues regularly make use of articles posted on a website. It’s a useful site, but it has two shortcomings. continue reading

experimental controls as opportunity cost

A recent press release from NYU celebrates the release of some new video games that “train your brain.” The games were developed by “developmental psychologists, neuroscience researchers, learning scientists, and game designers,” so you can be sure they’re thrilling. There are two issues I want to discuss. The first is their choice of control. The article is paywalled so I don’t know all the details of the experimental setup, but the abstract gives you a fairly good idea: continue reading

lessons from the cuckoo’s egg

Cliff Stoll
Doubleday, 1989
Amazon

There was a time before we defaulted to locking our doors. In the 1980s, the nascent internet was mostly used by research scientists and the military. The community was small, and—as tends to be the case in small communities—the level of trust was high. Administrators didn’t invest much effort in locking down their systems, because the possibility of bad actors hadn’t been seriously considered. continue reading

literal probability neglect

I came across an interesting problem on the website of UofT prof David Liu (for whom I had the privilege of TAing a few years back). Part of what makes the problem amusing, I think, is how we respond to it. continue reading

the theoretical urn

In physics and chemistry, better instruments and larger sample sizes raise the bar for making accurate predictions. But in psychology, software usability testing, and biomedical science—any discipline that relies heavily on significance testing—they have the opposite effect. continue reading

construct validity

Suppose we hypothesize that louder background noises make people more irritable (shout-out to the kid who sat next to me on the flight last week: you’re driving scientific progress). To test our hypothesis, let’s conduct an experiment. continue reading

book review: the undoing project

Michael Lewis
W. W. Norton & Company, 2016
Amazon

Which is more probable:

This post will get a few hundred pageviews.
My friend Franklyn will tweet this post, resulting in a few hundred pageviews.

If you chose the second option I have a book recommendation for you. continue reading

when jakob met eliza

Provocative tweet of the day: continue reading

book review: search patterns

Peter Morville and Jeffery Callender
O’Reilly, 2010
Amazon

Data science projects can (very roughly) be divided into two types. The first is a study, aimed at providing quantitative insights to other business units. These typically involve building reports, calculating p-values, and answering product managers’ questions. Fifteen years ago the people doing this were called “statisticians” or “data analysts.” The second type of project feeds directly into customer-facing products. Examples include recommender systems, tagging/classification systems, and search engines. continue reading

how to highlight autocomplete

Apologies to designers and UX researchers, to whom this is probably pretty boring. Like most content on this blog, this post is aimed at data scientists.

The following question caught a couple of my colleagues off guard: If you type “lisb” into Google, how are the predicted searches rendered? continue reading

elasticsearch shards and small data

In Elasticsearch, IDF values are calculated per shard (the article is glibly titled “Relevance is Broken”). The docs stress that this isn’t a cause for concern: continue reading
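To make the per-shard effect concrete, here is a minimal sketch in plain Python. It does not call Elasticsearch: the two "shards" are just lists of made-up documents, and `idf` uses a smoothed textbook formula chosen for illustration, which is an assumption and not necessarily the exact formula Elasticsearch applies.

```python
import math


def idf(term, docs):
    """Smoothed textbook IDF: log((1 + N) / (1 + df)). Illustrative only."""
    doc_freq = sum(term in doc.split() for doc in docs)
    return math.log((1 + len(docs)) / (1 + doc_freq))


# Hypothetical documents, split unevenly across two hypothetical shards.
shard_a = ["lisbon travel guide", "lisbon hotels", "lisbon weather"]
shard_b = ["porto travel guide", "madrid hotels", "lisbon flights"]

idf_a = idf("lisbon", shard_a)                 # term is common on this shard
idf_b = idf("lisbon", shard_b)                 # term is rare on this shard
idf_global = idf("lisbon", shard_a + shard_b)  # what a single index would see

print(f"shard A: {idf_a:.3f}, shard B: {idf_b:.3f}, global: {idf_global:.3f}")
```

With only a handful of documents per shard, the same term gets a noticeably different weight depending on where a document happens to live. As shards grow and documents are spread evenly, the per-shard values tend to move toward the global one.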

solving SAT in python

SAT is hard, but there are algorithms that tend to do okay empirically. I recently learned about the Davis-Putnam-Logemann-Loveland (DPLL) procedure and rolled up a short Python implementation. continue reading
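For readers who want a feel for the procedure before clicking through, here is a minimal DPLL sketch. It is not the implementation from the post: the clause encoding (lists of nonzero integers, where `3` means x3 and `-3` means not x3) and the function names are assumptions made for this example, and it omits pure-literal elimination and any clever branching heuristic.

```python
def simplify(clauses, literal):
    """Assume `literal` is true: drop satisfied clauses, strip its negation."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue  # clause already satisfied
        result.append([lit for lit in clause if lit != -literal])
    return result


def dpll(clauses, assignment=None):
    """Return a satisfying {variable: bool} dict, or None if unsatisfiable."""
    assignment = dict(assignment or {})

    # Unit propagation: a one-literal clause forces that literal to be true.
    while True:
        if any(len(clause) == 0 for clause in clauses):
            return None  # an empty clause means a conflict
        units = [clause[0] for clause in clauses if len(clause) == 1]
        if not units:
            break
        literal = units[0]
        assignment[abs(literal)] = literal > 0
        clauses = simplify(clauses, literal)

    if not clauses:
        return assignment  # every clause satisfied

    # Branch on the first literal of the first clause, then on its negation.
    literal = clauses[0][0]
    for choice in (literal, -literal):
        extended = {**assignment, abs(choice): choice > 0}
        result = dpll(simplify(clauses, choice), extended)
        if result is not None:
            return result
    return None


formula = [[1, 2], [-1, 3], [-2, -3]]
model = dpll(formula)
print(model)
```

Variables missing from the returned dict are unconstrained, so either value works for them. An unsatisfiable input such as `[[1], [-1]]` returns `None`.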

active learning

Active learning is a subfield of machine learning that probably doesn’t receive as much attention as it should. The fundamental idea behind active learning is that some instances are more informative than others, and if a learner can choose the instances it trains on, it can learn faster than it would on an unbiased random sample. continue reading
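A toy illustration of that idea, not taken from the post: learning a one-dimensional threshold from a pool of unlabeled points. The oracle, the threshold value `0.62`, and all names below are hypothetical. The active learner always queries the point in the middle of the region it is still unsure about, while the passive learner queries uniformly at random from the whole pool.

```python
import random


def learn_threshold(pool, oracle, budget, choose):
    """Query up to `budget` labels and return the still-ambiguous index range.

    Invariant: points[:lo] are known negative, points[hi:] known positive.
    """
    points = sorted(pool)
    lo, hi = 0, len(points)
    for _ in range(budget):
        if lo >= hi:
            break  # nothing left to learn
        i = choose(lo, hi, len(points))
        if oracle(points[i]):
            hi = min(hi, i)
        else:
            lo = max(lo, i + 1)
    return lo, hi


def oracle(x):
    return x >= 0.62  # hypothetical ground truth, hidden from the learner


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]

# Active: query the midpoint of the uncertain region (binary search).
active_lo, active_hi = learn_threshold(
    pool, oracle, budget=10, choose=lambda lo, hi, n: (lo + hi) // 2
)
# Passive: query a uniformly random point from the whole pool.
random_lo, random_hi = learn_threshold(
    pool, oracle, budget=10, choose=lambda lo, hi, n: rng.randrange(n)
)

print("unresolved points, active: ", active_hi - active_lo)
print("unresolved points, passive:", random_hi - random_lo)
```

With the same budget of ten labels, the active learner pins down the boundary among all 1000 points, because each query halves the uncertain region. The passive learner wastes most of its labels on points whose label it could already infer.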

building a term-document matrix in spark

A year-old Stack Overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a Spark context. Let’s start by parallelizing some familiar sentences. continue reading

visualizing piero scaruffi’s music database

Since the mid-1980s, Piero Scaruffi has written essays on countless topics, and published them all for free on the internet – which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website. continue reading

the lowest form of wit: modelling sarcasm on reddit

A while back Kaggle introduced a database containing all the comments that were posted to Reddit in May 2015. (The data is 30 GB in SQLite format and you can still download it here.) Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and displayed comments containing the /s flag, which indicates sarcasm. continue reading