literal probability neglect
I came across an interesting problem on the website of UofT prof David Liu (who I had the privilege of TAing for a few years back). Part of what makes the problem amusing, I think, is how we respond to it. continue reading
the theoretical urn
In physics and chemistry, better instruments and larger sample sizes raise the bar for making accurate predictions. But in psychology, software usability testing, and biomedical science—any discipline that relies heavily on significance testing—they have the opposite effect. continue reading
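To make that "opposite effect" concrete, here is a quick simulation of my own (not from the post): hold a negligible true effect fixed and watch a t-test's p-value shrink as the sample size grows.

```python
# A rough illustration (mine, not the post's): with a fixed, practically
# negligible true effect, a two-sample t-test drifts toward "significance"
# as the sample size grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.05  # tiny difference in means, in units of one standard deviation

for n in [100, 1_000, 10_000, 100_000]:
    control = rng.normal(0.0, 1.0, size=n)
    treatment = rng.normal(true_effect, 1.0, size=n)
    result = stats.ttest_ind(control, treatment)
    print(f"n = {n:>7}: p = {result.pvalue:.4f}")
```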
construct validity
Suppose we hypothesize that louder background noises make people more irritable (shout out to the kid who sat next to me on the flight last week — you’re driving scientific progress). To test our hypothesis, let’s conduct an experiment. continue reading
book review: the undoing project
Michael Lewis
W. W. Norton & Company, 2016
Amazon
Which is more probable:
- This post will get a few hundred pageviews.
- My friend Franklyn will tweet this post, resulting in a few hundred pageviews.
If you chose the second option, I have a book recommendation for you. continue reading
when jakob met eliza
Provocative tweet of the day: continue reading
book review: search patterns
Peter Morville and Jeffery Callender
O’Reilly, 2010
Amazon
Data science projects can (very roughly) be divided into two types. The first is a study, aimed at providing quantitative insights to other business units. These typically involve building reports, calculating p-values, and answering product managers’ questions. Fifteen years ago the people doing this were called “statisticians” or “data analysts.” The second type of project feeds directly into customer-facing products. Examples include recommender systems, tagging/classification systems, and search engines. continue reading
how to highlight autocomplete
Apologies to designers and UX researchers, to whom this is probably pretty boring. Like most content on this blog, this post is aimed at data scientists.
The following question caught a couple of my colleagues off guard: If you type “lisb” into Google, how are the predicted searches rendered? continue reading
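Without giving away the post's answer, the question comes down to which half of the suggestion gets emphasized. Here is a toy sketch of my own showing the two conventions side by side:

```python
# A minimal sketch (mine, not from the post) of two highlighting conventions
# for an autocomplete suggestion, given the user's typed prefix.
def bold_match(prefix: str, suggestion: str) -> str:
    """Bold the part the user already typed (typical search-results style)."""
    if suggestion.lower().startswith(prefix.lower()):
        return f"<b>{suggestion[:len(prefix)]}</b>{suggestion[len(prefix):]}"
    return suggestion

def bold_completion(prefix: str, suggestion: str) -> str:
    """Bold the predicted remainder instead of the typed prefix."""
    if suggestion.lower().startswith(prefix.lower()):
        return f"{suggestion[:len(prefix)]}<b>{suggestion[len(prefix):]}</b>"
    return suggestion

print(bold_match("lisb", "lisbon"))       # <b>lisb</b>on
print(bold_completion("lisb", "lisbon"))  # lisb<b>on</b>
```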
elasticsearch shards and small data
In Elasticsearch, IDF values are calculated per shard (the Elasticsearch guide's article on this is glibly titled “Relevance is Broken”). The docs stress that this isn’t a cause for concern: continue reading
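As background (not from the post): for a small index, two standard ways to sidestep per-shard IDF skew are to use a single primary shard, or to request global term statistics at query time with dfs_query_then_fetch. A rough sketch, with made-up index and field names:

```python
# Hedged sketch: two ways to avoid per-shard IDF skew on a small index.
# The index name "tiny_index" and field name "body" are invented for illustration.
import requests

ES = "http://localhost:9200"

# Option 1: put everything in a single primary shard, so there is only one
# set of term statistics to begin with.
requests.put(f"{ES}/tiny_index", json={
    "settings": {"number_of_shards": 1}
})

# Option 2: keep multiple shards but have Elasticsearch gather global
# term/document frequencies before scoring.
requests.post(
    f"{ES}/tiny_index/_search",
    params={"search_type": "dfs_query_then_fetch"},
    json={"query": {"match": {"body": "relevance"}}},
)
```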
solving SAT in python
SAT is hard, but there are algorithms that tend to do okay empirically. I recently learned about the Davis-Putnam-Logemann-Loveland (DPLL) procedure and rolled up a short Python implementation. continue reading
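The post walks through its own implementation; as a rough stand-in, here is a compact sketch of the same idea (unit propagation plus branching), with clauses encoded DIMACS-style as lists of signed integers.

```python
# A compact DPLL sketch (mine, not the post's code). A formula is a list of
# clauses; each clause is a list of non-zero ints, where -3 means "not x3".
def dpll(clauses, assignment=None):
    assignment = dict(assignment or {})

    # Unit propagation: repeatedly satisfy clauses with only one literal left.
    changed = True
    while changed:
        changed = False
        simplified = []
        for clause in clauses:
            remaining = []
            satisfied = False
            for lit in clause:
                var, val = abs(lit), lit > 0
                if var in assignment:
                    if assignment[var] == val:
                        satisfied = True
                        break
                else:
                    remaining.append(lit)
            if satisfied:
                continue
            if not remaining:          # clause falsified under current assignment
                return None
            if len(remaining) == 1:    # unit clause: forced assignment
                lit = remaining[0]
                assignment[abs(lit)] = lit > 0
                changed = True
            else:
                simplified.append(remaining)
        clauses = simplified

    if not clauses:                    # every clause satisfied
        return assignment

    # Branch on the first unassigned literal appearing in the formula.
    lit = clauses[0][0]
    for guess in (lit > 0, lit <= 0):
        result = dpll(clauses, {**assignment, abs(lit): guess})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x3)
print(dpll([[1, 2], [-1, 3], [-3]]))   # a satisfying assignment, e.g. {3: False, 1: False, 2: True}
```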
active learning
Active learning is a subfield of machine learning that probably doesn’t receive as much attention as it should. The fundamental idea behind active learning is that some instances are more informative than others, and if a learner can choose the instances it trains on, it can learn faster than it would on an unbiased random sample. continue reading
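To make "the learner chooses its own training instances" concrete, here is a small uncertainty-sampling sketch of my own; the dataset, model, and query budget are arbitrary choices, not the post's.

```python
# Minimal uncertainty-sampling sketch (my illustration): the learner asks for
# labels of the points it is least certain about, instead of random points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

labelled = list(rng.choice(len(X), size=20, replace=False))  # small seed set
pool = [i for i in range(len(X)) if i not in labelled]

model = LogisticRegression(max_iter=1000)
for _ in range(50):  # 50 queries to the "oracle" (here, just y)
    model.fit(X[labelled], y[labelled])
    probs = model.predict_proba(X[pool])[:, 1]
    uncertainty = -np.abs(probs - 0.5)           # closest to 0.5 = most uncertain
    query = pool[int(np.argmax(uncertainty))]
    labelled.append(query)
    pool.remove(query)

model.fit(X[labelled], y[labelled])
print("accuracy with 70 labelled points:", model.score(X, y))
```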
building a term-document matrix in spark
A year-old Stack Overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a Spark context. Let’s start by parallelizing some familiar sentences. continue reading
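Without reproducing the post's answer, the core moves might look roughly like this; the sentences and variable names are my own, and the only requirement is a SparkContext.

```python
# Rough PySpark sketch (mine, not the post's exact code): build a dense
# term-document matrix from a handful of sentences using only a SparkContext.
from pyspark import SparkContext

sc = SparkContext("local", "term-doc-demo")  # the post assumes sc already exists

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks",
    "a quick brown dog",
]
docs = sc.parallelize(sentences).map(lambda s: s.split())

# Assign every distinct term a column index and ship the mapping to the workers.
vocab = docs.flatMap(lambda words: words).distinct().zipWithIndex().collectAsMap()
n_terms = len(vocab)

def to_counts(words):
    row = [0] * n_terms
    for w in words:
        row[vocab[w]] += 1
    return row

term_doc_matrix = docs.map(to_counts).collect()  # one count row per document
for row in term_doc_matrix:
    print(row)
```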
visualizing piero scaruffi’s music database

Since the mid-1980s, Piero Scaruffi has written essays on countless topics, and published them all for free on the internet – which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website. continue reading
the lowest form of wit: modelling sarcasm on reddit
A while back, Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30 GB in SQLite format and you can still download it here.) Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and displayed comments containing the /s flag, which indicates sarcasm. continue reading
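For flavour, a query like the one described might look roughly like this; the file and table names below are my assumptions about the Kaggle dump, so check the actual schema first.

```python
# Hedged sketch: pull comments that end with the sarcasm marker "/s" from the
# SQLite dump. The file name "database.sqlite" and table name "May2015" are
# assumptions; inspect the schema before relying on them.
import sqlite3

conn = sqlite3.connect("database.sqlite")
rows = conn.execute(
    "SELECT subreddit, body FROM May2015 "
    "WHERE body LIKE '%/s' LIMIT 20"
)
for subreddit, body in rows:
    print(f"[{subreddit}] {body}")
conn.close()
```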