elasticsearch shards and small data

In Elasticsearch, IDF values are calculated per shard (the article is glibly titled “Relevance is Broken”). The docs stress that this isn’t a cause for concern: “In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, […]
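To make the per-shard point concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code. The two "shards", their documents, and the smoothed textbook IDF formula are all assumptions chosen for illustration, and the formula is not necessarily the one Elasticsearch uses. It only shows how the same term can get a different IDF on each shard than it would over the whole index when the data is small.

```python
import math


def idf(num_docs, doc_freq):
    # Smoothed textbook IDF (an assumption for illustration,
    # not necessarily the formula Elasticsearch applies).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def doc_freq(docs, term):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc.split())


# Hypothetical tiny index split across two shards.
shards = [
    ["red fox", "red hen", "red barn"],
    ["red sky", "blue sea", "green hill"],
]
term = "red"

# IDF computed separately on each shard, as a per-shard scorer would see it.
local_idf = [idf(len(shard), doc_freq(shard, term)) for shard in shards]

# IDF computed once over every document in the index.
all_docs = [doc for shard in shards for doc in shard]
global_idf = idf(len(all_docs), doc_freq(all_docs, term))

print("local IDF per shard:", local_idf)
print("global IDF:", global_idf)
```

With so few documents, the shard where "red" is common scores it lower than the global value and the shard where it is rare scores it higher.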

building a term-document matrix in spark

A year-old Stack Overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a Spark context. Let’s start by parallelizing some familiar sentences. The first step is to tokenize our documents and cache the resulting RDD. In practice, by which I mean the game, […]

the lowest form of wit: modelling sarcasm on reddit

A while back Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30 GB in SQLite format and you can still download it here.) Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and […]