All posts filed under: Uncategorized

building a term-document matrix in spark

A year-old Stack Overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a Spark context. Let’s start by parallelizing some familiar sentences. The first step is to tokenize our documents and cache the resulting RDD. In practice, by which I mean the game, we would use a real tokenizer, but for illustrative purposes I’m going to keep it simple and split on spaces. […]
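To make the tokenize-then-count idea concrete, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The sample sentences and the helper names `tokenize` and `term_document_matrix` are hypothetical, not taken from the post; in Spark the same steps would presumably be expressed with `sc.parallelize` and `RDD.map`.

```python
from collections import Counter


def tokenize(doc):
    """Split a document on spaces (a stand-in for a real tokenizer)."""
    return doc.split(" ")


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [tokenize(d) for d in docs]
    # Sorted vocabulary gives every term a stable row position.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    # One Counter per document; a Counter returns 0 for absent terms.
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


# Hypothetical sample documents, not the sentences used in the post.
docs = ["the cat sat", "the dog sat on the cat"]
vocab, matrix = term_document_matrix(docs)

for term, row in zip(vocab, matrix):
    print(term, row)
```

Rows are terms and columns are documents; transposing the result gives the document-term layout that many libraries expect.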

visualizing piero scaruffi’s music database

Since the mid-1980s, Piero Scaruffi has written essays on countless topics, and published them all for free on the internet – which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website. I was introduced to the site when a friend began referring to certain records as “Scaruffi 7s,” in reference to their ratings in Scaruffi’s music database. One of the oldest components […]

the lowest form of wit: modelling sarcasm on reddit

A while back Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30 GB in SQLite format and you can still download it here.) Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and displayed comments containing the /s flag, which indicates sarcasm. A natural follow-up question is whether we can train a model […]
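A query for /s-flagged comments might look roughly like the sketch below, which uses Python's built-in `sqlite3` module. It is not the Kaggle script itself: the table name `May2015` and the column name `body` are assumptions about the dataset's schema, and matching only comments that end in "/s" is a heuristic chosen here to avoid hits inside URLs. The demo runs against a tiny in-memory database with made-up rows.

```python
import sqlite3


def sarcastic_comments(conn, table="May2015", limit=10):
    """Return up to `limit` comment bodies that end with the /s flag.

    `table` is interpolated into the SQL text, so it must be a trusted
    name; the table and column names are assumptions about the schema.
    """
    query = f"SELECT body FROM {table} WHERE rtrim(body) LIKE ? LIMIT ?"
    return [row[0] for row in conn.execute(query, ("%/s", limit))]


# Demo on an in-memory database with made-up comments.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE May2015 (body TEXT)")
conn.executemany(
    "INSERT INTO May2015 (body) VALUES (?)",
    [
        ("Great idea, what could go wrong /s",),
        ("See https://example.com/stuff for details",),
        ("Just a plain comment",),
        ("Sure, that will definitely work /s  ",),
    ],
)
flagged = sarcastic_comments(conn)
for body in flagged:
    print(body)
```

On the real file, the in-memory connection would be replaced by `sqlite3.connect(...)` pointing at the downloaded database.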