elasticsearch shards and small data

In Elasticsearch, IDF values are calculated per shard rather than across the whole index (the docs article on the subject is glibly titled “Relevance is Broken”). The docs stress that this isn’t a cause for concern:

In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.

I think that this is mostly true. In a database of products, movies, or users, the cross-shard discrepancies in IDF should come out in the wash. But you don’t always get to pick your corpus, and some corpora are big enough to require search yet small enough that cross-shard variance in IDF values can give rise to weird behaviour: the same term can be scored differently depending on which shard a document happens to land on.
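To make the effect concrete, here is a rough sketch in Python. It assumes a BM25-style IDF formula, ln(1 + (N - n + 0.5) / (n + 0.5)), and the shard sizes and term counts are made up for illustration; the exact formula your cluster uses depends on its version and similarity settings.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style IDF (an assumed formula, for illustration only)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical tiny corpus: 10 documents split evenly across two shards.
# The term appears in 1 document on shard A and 3 documents on shard B.
small_a = idf(doc_count=5, doc_freq=1)
small_b = idf(doc_count=5, doc_freq=3)
small_global = idf(doc_count=10, doc_freq=4)

# Hypothetical large corpus: 1,000,000 documents, same two-shard split,
# with the term spread almost (but not exactly) evenly.
large_a = idf(doc_count=500_000, doc_freq=99_900)
large_b = idf(doc_count=500_000, doc_freq=100_100)

small_gap = abs(small_a - small_b)
large_gap = abs(large_a - large_b)

print(f"small corpus: shard A={small_a:.3f} shard B={small_b:.3f} global={small_global:.3f}")
print(f"large corpus: shard A={large_a:.3f} shard B={large_b:.3f}")
print(f"gap between shards: small={small_gap:.3f} large={large_gap:.3f}")
```

With the small corpus, the two shards disagree substantially about how rare the term is, and neither matches the global value. With the large one, the gap all but disappears, which is the docs' point.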

If you’re working with a corpus that easily fits into a single shard (or writing tests, or any other small-data scenario), be sure to set number_of_shards to 1 in your index settings. It’s important to do this explicitly, because the default value was 5 in Elasticsearch versions before 7.0 (from 7.0 onward the default is 1).
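As a minimal sketch, the request body for creating such an index might be built like this. The helper name is hypothetical, and how you send the body (an HTTP PUT to the index URL, or a client library) is up to you; only the settings structure matters here.

```python
import json


def single_shard_index_body() -> dict:
    """Build a create-index body that pins the index to one primary shard."""
    return {
        "settings": {
            "index": {
                # One shard means one set of term statistics, so IDF is
                # effectively global for this index.
                "number_of_shards": 1,
            }
        }
    }


# Serialize the body; it would be sent when creating the index,
# e.g. as the payload of a PUT to a (hypothetical) /my-small-index URL.
body = json.dumps(single_shard_index_body(), indent=2)
print(body)
```

Note that number_of_shards is fixed when the index is created, so this belongs in the creation request (or an index template) rather than in a later settings update.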

Published by Dave Fernig

data@shopify
