Simple Fuzzy Search with Firestore and FlexSearch

Mar 26, 2020by Mikael Hernvall

Automatic compression

As a platform grows and harbors more content, a search tool becomes ever more important for users to navigate.

Here's how we solved it using Firestore and FlexSearch.

"Use Algolia" They Said

Since we're using Firestore, we first investigated whether fuzzy text search could be accomplished simply through Firestore queries.

Well, Firestore's own docs has a page dedicated to full-text search which essentially says "Use Algolia".

The problem with Algolia is that it's quite expensive for something that we thought shouldn't really be so complicated.

Besides, it would really be more convenient to just use Firestore, since that's where our data is. Having to bridge that data over to some external service just to have a search feature didn't feel right.

Lucky for us, our use case was rather simple. We didn't need full-text search. We only needed some good-enough, fuzzy search method for game titles and studio names, i.e. very short text strings.

So we continued looking.

FlexSearch

As an alternative to server-side searching, we could do the whole thing client-side using FlexSearch. However, this has obvious scaling issues. We couldn't expect users to download a search index for the entire platform.

Well, could the index be split up, then? We could split it through some similarity-heuristic and only fetch specific portions on-demand. But that venture started to sound really complicated for something that we thought shouldn't really be so complicated.

So, client-side search wasn't viable. At the end of the day, if we really wanted to use Firestore for this, the question would become two-fold:

  1. How do we index the text strings?
  2. How do we query the index?

In other words, we needed to get our hands dirty and learn a thing or two about text-indexing, because a ready-to-use solution for Firestore couldn't be found.

Luckily, FlexSearch is very well-documented.

Indexing

In short, indexing a text string is a two-phase process: encode and tokenize.

First, the string is encoded into a simpler form. For example, encoding Macaroni Studios yields mkrm sts. This is useful because it adds fuzziness.

By encoding the string misspellings are tolerated.

This particular encoding is called extra by FlexSearch and is the most extreme one they provide out-of-the-box.

Second, the string is tokenized. In other words, it is split up in different combinations. There are different tokenizers, but we settled on the forward tokenizer because the full one added way too much fuzziness. Tokenizing mkrm sts using the forward tokenizer yields m, mk, mkr, mkrm, mkrm s, mrkm st and mrkm sts.

As you can see, the longer the string the more tokens we get. Good thing we encode the string first!

But since the strings are relatively short, we should be able to use this in Firestore...

Querying

Since our index consists of an array of strings, we're probably looking for the array-contains query operator. So, step by step:

  1. The index of Macaroni Studios is stored in a doc.
  2. The user's search query makaroni is encoded into mkrm.
  3. We query by fb.collection("searchDocuments").where("index", "array-contains", "mkrm").

And there you go. Simple fuzzy search using only Firestore and FlexSearch.

However, while this approach works, we noticed some issues that led us to further improvements.

Improving the Tokenizer

While the encoder provides some compression and fuzziness, it's not enough, and using forward tokenization doesn't really add much fuzziness. The user still has to type their search query in a strict order, i.e. always from the beginning.

Let's say the user is looking for a game called Attack and Run Helicopter, but only remembers the Run Helicopter portion. If the user types that in, no results will be found. So we realized that the tokenizer needs to take words into account.

We solved this by splitting the string into words, forward-tokenizing each word, and then join every possible sorted combination of the resulting tokens. For example, tokenizing aa bb cc would yield the following tokens:

a aa aa b aa bb aa bb c aa bb cc

b bb bb c bb cc

c cc

Since we are sorting the words, tokenizing cc bb aa would yield the same result. The words in the user's query is also sorted.

Success! Now the user can type any word in any order and the query will match. Essentially, the query matches every entry that contains words that begin with the same characters as the words in the query, in arbitrary word-order.

Scalability

Immediately, one begins to think about how the token count may explode for longer strings. After improving the tokenizer, the token count for the simple case of aa bb cc increased from 6 to 12. However, Attack and Run Helicopter example goes from 22 to 314 tokens.

As previously mentioned, most studio names and game titles aren't very long, so 314 tokens isn't that much for a title that's probably among the longer ones.

But can Firestore handle this amount of elements for array-contains queries with good performance? According to the Firestore developer team, the answer seems to be yes. The hard cap lies at 20000 elements.

We haven't noticed any performance issues either. The results are fetched very fast, as you can see in the video above.

Conclusion

That's it! Simple fuzzy search using no external services like Algolia. Only Firestore and FlexSearch.

Read more posts at our devblog or follow us on Twitter.