Author Archives: Dave Fernig

I came across an interesting problem on the website of UofT prof David Liu (whom I had the privilege of TAing for a few years back). Part of what makes the problem amusing, I think, is how we respond to it.
the theoretical urn
In physics and chemistry, better instruments and larger sample sizes raise the bar a theory has to clear. But in psychology, software usability testing, and biomedical science (any discipline that relies heavily on significance testing) they have the opposite effect.
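The excerpt cuts off there, but the punchline is easy to demonstrate with a quick simulation (mine, not the post's): hold a trivially small true effect fixed, and the p-value collapses as the sample grows, so more data makes the significance test easier to pass, not harder.

```python
# A quick simulation of the point above (mine, not the post's): fix a
# trivially small true effect and watch the p-value fall as n grows,
# so a bigger sample makes the significance test *easier* to pass.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.05  # a difference nobody would care about in practice

for n in [100, 1_000, 10_000, 100_000]:
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(true_effect, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    print(f"n = {n:>7,}: p = {p:.4f}")
```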
construct validity
For a while we were all deluded into believing that we could directly optimize for the “correct” objective, be it clickthrough, revenue, or whatever, without any regard for made-up bullshit that can’t be objectively measured. You know, like “happiness,” or “trust.”
book review: the undoing project
“It was as if he had been assigned to take apart a fiendishly complicated alarm clock to see why it wasn’t working, only to discover that an important part of the clock was inside his own mind.”
when jakob met eliza
There are two ways you can help users become proficient with your product. The first is explicit feedback, in which you literally tell them what to do; documentation and manuals are a common example. The second is implicit feedback, where the user learns by doing. Implicit feedback is great, but a whole bunch of prerequisites have to be in place in order for it to work […]
book review: search patterns
Data science projects can (very roughly) be divided into two types. The first is a study, aimed at providing quantitative insights to other business units. These typically involve building reports, calculating p-values, and […]
how to highlight autocomplete
The following question caught a couple of my colleagues off guard: If you type “lisb” into Google, how are the predicted searches rendered? […]
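The excerpt stops short of the answer, but here's a minimal sketch of one common rendering convention (my illustration, not necessarily the post's answer): emphasize the completion rather than the matched prefix, so the eye is drawn to what you haven't typed yet.

```python
# A sketch of one rendering convention (my illustration): bold the
# completion, not the matched prefix, so attention goes to the part
# the user hasn't typed yet.
def render_suggestion(query: str, suggestion: str) -> str:
    """Render one autocomplete suggestion as HTML."""
    if suggestion.lower().startswith(query.lower()):
        typed, completion = suggestion[:len(query)], suggestion[len(query):]
        return typed + "<b>" + completion + "</b>"
    return "<b>" + suggestion + "</b>"  # non-prefix match: emphasize it all

print(render_suggestion("lisb", "lisbon"))           # lisb<b>on</b>
print(render_suggestion("lisb", "lisbon portugal"))  # lisb<b>on portugal</b>
```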
elasticsearch shards and small data
In Elasticsearch, IDF values are calculated per shard; the docs cover this in an article glibly titled “Relevance is Broken”, and stress that it isn’t a cause for concern: […]
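For what it's worth, the usual workaround when it does matter is to ask for global term statistics up front. A minimal sketch (not from the post), assuming the 8.x Python client and a hypothetical index called articles:

```python
# A sketch (not from the post) using the 8.x Python client and a
# hypothetical index called "articles": dfs_query_then_fetch collects
# term statistics from every shard before scoring, trading an extra
# round trip for globally consistent IDF values.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="articles",
    search_type="dfs_query_then_fetch",
    query={"match": {"title": "relevance"}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```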
solving SAT in python
SAT is hard, but there are algorithms that tend to do okay empirically. I recently learned about the Davis-Putnam-Logemann-Loveland (DPLL) procedure and wrote up a short Python implementation.
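The post's exact code isn't reproduced here, but a compact DPLL along those lines looks something like this (clauses as frozensets of nonzero ints, with -x standing for the negation of x; pure-literal elimination omitted for brevity):

```python
def dpll(clauses, assignment=frozenset()):
    # Unit propagation: repeatedly assign literals forced by unit clauses.
    while True:
        if any(not c for c in clauses):
            return None  # an empty clause means a conflict
        units = {next(iter(c)) for c in clauses if len(c) == 1}
        if not units:
            break
        if any(-u in units for u in units):
            return None  # two unit clauses demand opposite values
        assignment |= units
        clauses = [c - {-u for u in units}
                   for c in clauses if not (c & units)]
    if not clauses:
        return assignment  # every clause satisfied
    # Branch: try a literal from the first clause, then its negation.
    lit = next(iter(clauses[0]))
    for choice in (lit, -lit):
        result = dpll([c - {-choice} for c in clauses if choice not in c],
                      assignment | {choice})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(dpll([frozenset({1, 2}), frozenset({-1, 3}), frozenset({-2, -3})]))
```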
active learning
Active learning is a subfield of machine learning that probably doesn’t receive as much attention as it should. The fundamental idea behind active learning is that some instances are more informative than others, and if a learner can choose the instances it trains on, it can learn faster than it would on an unbiased random […]
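The simplest instantiation of that idea is uncertainty sampling: query labels for the instances the current model is least sure about. A toy sketch (my own, with scikit-learn, not the post's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy pool-based setup: 10 labeled points to start, the rest unlabeled.
X, y = make_classification(n_samples=2000, random_state=0)
labeled = list(range(10))
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # 20 rounds of querying the "oracle" for a label
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # Uncertainty sampling: pick the instance whose predicted class
    # probability is closest to 0.5 (the model's least confident guess).
    query = pool[int(np.argmin(np.abs(proba[:, 1] - 0.5)))]
    labeled.append(query)
    pool.remove(query)

model.fit(X[labeled], y[labeled])
print(f"accuracy with {len(labeled)} labels:", round(model.score(X, y), 3))
```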
building a term-document matrix in spark
A year-old Stack Overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a SparkContext. Let’s start by parallelizing some familiar sentences.
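The excerpt ends before the code, but the opening step presumably looks something like this (a sketch assuming only a SparkContext named sc; the sentences and the helper are placeholders of mine):

```python
# A sketch assuming only a SparkContext named sc (the sentences are
# placeholder data of mine, not the post's).
sentences = sc.parallelize([
    "the quick brown fox",
    "the lazy dog",
    "the quick dog barks",
])
tokenized = sentences.map(lambda s: s.split())

# Assign each distinct term a column index.
vocab = sorted(tokenized.flatMap(lambda words: words).distinct().collect())
index = {term: i for i, term in enumerate(vocab)}

# One row per document, one column per term (term counts).
def to_counts(words):
    row = [0] * len(index)
    for w in words:
        row[index[w]] += 1
    return row

matrix = tokenized.map(to_counts).collect()
# [[0, 1, 0, 1, 0, 1, 1], ...] with columns in sorted(vocab) order
```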
visualizing piero scaruffi’s music database
Since the mid-1980s, Piero Scaruffi has written essays on countless topics, and published them all for free on the internet – which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website.
the lowest form of wit: modelling sarcasm on reddit
A while back Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30 GB in SQLite format, and you can still download it here.) Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and […]
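The excerpt cuts off there, but the standard first step in this kind of experiment is to pull weakly-labeled examples via reddit's “/s” convention. A sketch with sqlite3 (the table and column names, May2015, body, and subreddit, are my assumptions about the dump's schema):

```python
# A sketch of the first step (the table and column names, May2015,
# body, and subreddit, are my assumptions about the dump's schema):
# comments ending in reddit's "/s" marker serve as weak sarcasm labels.
import sqlite3

conn = sqlite3.connect("database.sqlite")
rows = conn.execute(
    "SELECT body, subreddit FROM May2015 "
    "WHERE body LIKE '%/s' LIMIT 1000"
).fetchall()
print(len(rows), "candidate sarcastic comments")
```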