building a term-document matrix in spark

A year-old Stack Overflow question that I'm able to answer? This is like spotting Bigfoot. I'm going to assume access to nothing more than a Spark context. Let's start by parallelizing some familiar sentences. The first step is to tokenize our documents and cache the resulting RDD. In practice, by which I mean the game, …

visualizing piero scaruffi’s music database

Since the mid-1980s, Piero Scaruffi has written essays on countless topics and published them all for free on the internet, which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website. I was introduced to the site when a friend began referring …

the lowest form of wit: modelling sarcasm on reddit

A while back, Kaggle introduced a database containing all the comments that were posted to Reddit in May 2015. (The data is 30 GB in SQLite format and you can still download it here.) Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and …