Elasticsearch shards and small data

In Elasticsearch, IDF (inverse document frequency) values are calculated per shard rather than across the whole index (the docs article on this is glibly titled “Relevance is Broken”). The docs stress that this isn’t a cause for concern:

In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.

I think that this is mostly true. In a database of products, movies, or users, the cross-shard discrepancies in IDF should come out in the wash. But you don’t always get to pick your corpus, and some corpora are big enough to require search but small enough that cross-shard variance in IDF values can give rise to weird behaviour, such as the same term being weighted differently depending on which shard a document happens to land on.
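
To make the discrepancy concrete, here is a rough sketch of per-shard versus global IDF on a toy corpus. The documents, the hand-made shard split, and the helper names are all hypothetical, and the formula idf = 1 + ln(N / (df + 1)) is my assumption of a classic TF/IDF-style weighting; the exact formula Elasticsearch uses depends on the version and the configured similarity.

```python
import math


def classic_idf(doc_freq, num_docs):
    # Assumed classic TF/IDF-style formula: 1 + ln(N / (df + 1)).
    # The formula actually used depends on the configured similarity.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def doc_freq(docs, term):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc.split())


# Hypothetical toy corpus, split across two "shards" by hand.
shard_a = ["red shoes", "red hat", "red scarf", "blue hat"]
shard_b = ["blue shoes", "green hat", "green scarf", "red coat"]
all_docs = shard_a + shard_b

term = "red"
idf_a = classic_idf(doc_freq(shard_a, term), len(shard_a))
idf_b = classic_idf(doc_freq(shard_b, term), len(shard_b))
idf_global = classic_idf(doc_freq(all_docs, term), len(all_docs))

print(f"shard A IDF for {term!r}: {idf_a:.3f}")
print(f"shard B IDF for {term!r}: {idf_b:.3f}")
print(f"global IDF for {term!r}:  {idf_global:.3f}")
```

With only four documents per shard, "red" is common on one shard and rare on the other, so the two local IDFs land on either side of the global value. With thousands of documents per shard the gap would shrink, which is the docs' point.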

If you’re working with a corpus that easily fits into a single shard (or writing tests, or any other small-data scenario), be sure to set number_of_shards to 1 in your index settings. It’s important to do this explicitly on Elasticsearch versions before 7.0, where the default value is 5; from 7.0 onward the default is 1.
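
As a minimal sketch, the settings body for a single-shard index can be built and sent with nothing but the Python standard library. The base URL, the index name, and the function names here are hypothetical; the request itself is left commented out so the snippet runs without a cluster.

```python
import json
import urllib.request


def single_shard_index_body():
    # Index creation body that pins the index to one primary shard.
    return {"settings": {"index": {"number_of_shards": 1}}}


def create_index(base_url, index_name, body):
    # Sends PUT <base_url>/<index_name> with the settings as JSON.
    request = urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


body = single_shard_index_body()
print(json.dumps(body, indent=2))

# Hypothetical URL and index name; uncomment to run against a real cluster.
# create_index("http://localhost:9200", "my-small-index", body)
```

The shard count of an existing index can't simply be edited in place, so it is worth setting it when the index is created.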

Published by Dave Fernig