elasticsearch shards and small data

In Elasticsearch, IDF values are calculated per shard (the article is glibly titled “Relevance is Broken”). The docs stress that this isn’t a cause for concern: In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, […]

solving SAT in python

SAT is hard, but there are algorithms that tend to do okay empirically. I recently learned about the Davis-Putnam-Logemann-Loveland (DPLL) procedure and rolled up a short Python implementation. (I can’t get no) satisfaction A boolean formula is called “satisfiable” if you can assign truth values to the underlying atoms in such a way that the […]

active learning

Active learning is a subfield of machine learning that probably doesn’t receive as much attention as it should. The fundamental idea behind active learning is that some instances are more informative than others, and if a learner can choose the instances it trains on, it can learn faster than it would on an unbiased random […]

building a term-document matrix in spark

A year-old stack overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a spark context. Let’s start by parallelizing some familiar sentences. The first step is to tokenize our documents and cache the resulting RDD. In practice, by which I mean the game, […]