book review: search patterns

Peter Morville and Jeffery Callender
O’Reilly, 2010

Data science projects can (very roughly) be divided into two types. The first is a study, aimed at providing quantitative insights to other business units. These typically involve building reports, calculating p-values, and answering product managers’ questions. Fifteen years ago the people doing this were called “statisticians” or “data analysts.” The second type of project feeds directly into customer-facing products. Examples include recommender systems, tagging/classification systems, and search engines. continue reading

active learning

Active learning is a subfield of machine learning that probably doesn’t receive as much attention as it should. The fundamental idea behind active learning is that some instances are more informative than others, and if a learner can choose the instances it trains on, it can learn faster than it would on an unbiased random sample. continue reading

visualizing piero scaruffi’s music database

Scaruffi’s music database

Since the mid-1980s, Piero Scaruffi has written essays on countless topics, and published them all for free on the internet – which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website. continue reading

the lowest form of wit: modelling sarcasm on reddit

A while back Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30Gb in SQLite format and you can still download it here). Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and displayed comments containing the /s flag, which indicates sarcasm. continue reading