As a platform grows and harbors more content, a search tool becomes ever more important for users to navigate.
Since we're using Firestore, we first investigated whether fuzzy text search could be accomplished simply through Firestore queries.
Well, Firestore's own docs have a page dedicated to full-text search which essentially says "Use Algolia".
The problem with Algolia is that it's quite expensive for something that we thought shouldn't really be so complicated.
Besides, it would really be more convenient to just use Firestore, since that's where our data is. Having to bridge that data over to some external service just to have a search feature didn't feel right.
Lucky for us, our use case was rather simple. We didn't need full-text search. We only needed some good-enough, fuzzy search method for game titles and studio names, i.e. very short text strings.
So we continued looking.
As an alternative to server-side searching, we could do the whole thing client-side using FlexSearch. However, this has obvious scaling issues. We couldn't expect users to download a search index for the entire platform.
Well, could the index be split up, then? We could split it using some similarity heuristic and only fetch specific portions on demand. But that venture started to sound really complicated for something that we thought shouldn't really be so complicated.
So, client-side search wasn't viable. At the end of the day, if we really wanted to use Firestore for this, the question became two-fold: how do we build a fuzzy search index, and how do we query it using Firestore?
In other words, we needed to get our hands dirty and learn a thing or two about text-indexing, because a ready-to-use solution for Firestore couldn't be found.
Luckily, FlexSearch is very well-documented.
In short, indexing a text string is a two-phase process: encode and tokenize.
First, the string is encoded into a simpler form. For example, encoding "Macaroni Studios" yields "mkrm sts". This is useful because it adds fuzziness: by encoding the string, misspellings are tolerated. This particular encoding is called "extra" by FlexSearch and is the most extreme one they provide out-of-the-box.
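To make the idea concrete, here is a rough sketch of this kind of aggressive encoder in TypeScript. The specific rules below (merging similar-sounding consonants, dropping vowels, collapsing repeats) are illustrative assumptions, not FlexSearch's actual "extra" implementation:

```typescript
// Illustrative approximation of an aggressive phonetic encoder.
// NOTE: these exact rules are assumptions for demonstration; they
// are not FlexSearch's real "extra" encoder.
function encode(text: string): string {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((word) =>
      word
        .replace(/[^a-z]/g, "")   // keep letters only
        .replace(/c|q/g, "k")     // merge similar-sounding consonants
        .replace(/d/g, "t")
        .replace(/n/g, "m")
        .replace(/[aeiouy]/g, "") // drop vowels
        .replace(/(.)\1+/g, "$1") // collapse repeated characters
    )
    .filter((w) => w.length > 0)
    .join(" ");
}

console.log(encode("Macaroni Studios")); // "mkrm sts"
```

Because "Makaroni", "Macaroni" and "Macarroni" all collapse to "mkrm" under rules like these, misspelled queries still hit the same encoded form.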
Second, the string is tokenized. In other words, it is split up into different combinations. There are different tokenizers, but we settled on the "forward" tokenizer because the "full" one added way too much fuzziness. Tokenizing "mkrm sts" using the forward tokenizer yields "m", "mk", "mkr", "mkrm", "s", "st" and "sts".
As you can see, the longer the string, the more tokens we get. Good thing we encode the string first!
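A minimal sketch of forward tokenization, assuming it emits every prefix of every word (which is consistent with the token counts in this post):

```typescript
// Forward tokenization: emit every prefix of every word.
function forwardTokenize(encoded: string): string[] {
  const tokens = new Set<string>();
  for (const word of encoded.split(/\s+/).filter(Boolean)) {
    for (let i = 1; i <= word.length; i++) {
      tokens.add(word.slice(0, i));
    }
  }
  return [...tokens];
}

console.log(forwardTokenize("mkrm sts"));
// [ "m", "mk", "mkr", "mkrm", "s", "st", "sts" ]
```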
But since the strings are relatively short, we should be able to use this in Firestore...
Since our index consists of an array of strings, we're probably looking for the array-contains query operator. So, step by step:
1. "Macaroni Studios" is stored in a doc, together with its index: an array field containing the tokens "m", "mk", "mkr", "mkrm", "s", "st" and "sts".
2. The user types "makaroni", which is encoded into "mkrm".
3. We query: fb.collection("searchDocuments").where("index", "array-contains", "mkrm").
And there you go. Simple fuzzy search using only Firestore and FlexSearch.
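To see the lookup itself in isolation, here is an in-memory stand-in for the Firestore query. The documents and the second title's index values are made up for illustration; only the field name and the array-contains semantics mirror the query shown above:

```typescript
// In-memory stand-in for:
//   fb.collection("searchDocuments").where("index", "array-contains", token)
interface SearchDocument {
  title: string;
  index: string[]; // encoded + forward-tokenized title
}

// Hypothetical collection contents for illustration.
const searchDocuments: SearchDocument[] = [
  { title: "Macaroni Studios", index: ["m", "mk", "mkr", "mkrm", "s", "st", "sts"] },
  { title: "Pixel Forge", index: ["p", "pk", "pks", "pksl", "f", "fr", "frk"] },
];

// Mirrors the array-contains operator: match every doc whose
// index array contains the encoded query token.
function search(token: string): SearchDocument[] {
  return searchDocuments.filter((doc) => doc.index.includes(token));
}

console.log(search("mkrm").map((d) => d.title)); // [ "Macaroni Studios" ]
```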
However, while this approach works, we noticed some issues that led us to further improvements.
While the encoder provides some compression and fuzziness, it's not enough, and using forward tokenization doesn't really add much fuzziness. The user still has to type their search query in a strict order, i.e. always from the beginning.
Let's say the user is looking for a game called "Attack and Run Helicopter", but only remembers the "Run Helicopter" portion. If the user types that in, no results will be found. So we realized that the tokenizer needs to take words into account.
We solved this by splitting the string into words, forward-tokenizing each word, and then joining every possible sorted combination of the resulting tokens. For example, tokenizing "aa bb cc" yields tokens such as "aa", "aa bb", "aa bb c" and "aa bb cc".
Since the words are sorted, tokenizing "cc bb aa" yields the same result. The words in the user's query are also sorted.
Success! Now the user can type any word in any order and the query will match. Essentially, the query matches every entry that contains words that begin with the same characters as the words in the query, in arbitrary word-order.
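One way to implement such a word-aware tokenizer is sketched below: sort the words, then emit every non-empty subset of them, with each word expanded to its forward prefixes. This is just one reading of "every possible sorted combination"; the exact combination rule is an assumption and determines the final token counts:

```typescript
// Word-aware tokenizer sketch. Assumption: emit every non-empty
// subset of the sorted words, each word expanded to its prefixes.
function prefixes(word: string): string[] {
  const out: string[] = [];
  for (let i = 1; i <= word.length; i++) out.push(word.slice(0, i));
  return out;
}

function wordAwareTokenize(encoded: string): string[] {
  const words = encoded.split(/\s+/).filter(Boolean).sort();
  const tokens = new Set<string>();

  // Iterate over all non-empty subsets of the sorted word list.
  for (let mask = 1; mask < 1 << words.length; mask++) {
    const subset = words.filter((_, i) => mask & (1 << i));
    // Expand each word in the subset to each of its prefixes.
    let combos: string[] = [""];
    for (const word of subset) {
      const next: string[] = [];
      for (const combo of combos) {
        for (const p of prefixes(word)) {
          next.push(combo ? `${combo} ${p}` : p);
        }
      }
      combos = next;
    }
    combos.forEach((c) => tokens.add(c));
  }
  return [...tokens];
}

const tokens = wordAwareTokenize("aa bb cc");
console.log(tokens.includes("aa bb cc")); // true
console.log(tokens.includes("aa bb c"));  // true
console.log(tokens.includes("bb cc"));    // true
```

On the query side, the user's words are sorted and joined the same way, so a query like "cc bb" becomes "bb cc" and matches via a single array-contains lookup.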
Immediately, one begins to think about how the token count may explode for longer strings. After improving the tokenizer, the token count for the simple "aa bb cc" case increased from 6 to 12, while the "Attack and Run Helicopter" example went from 22 to 314 tokens.
As previously mentioned, most studio names and game titles aren't very long, so 314 tokens isn't that much for a title that's probably among the longer ones.
But can Firestore handle this number of elements in array-contains queries with good performance? According to the Firestore developer team, the answer seems to be yes. The hard cap lies at 20000 elements.
We haven't noticed any performance issues either. The results are fetched very fast, as you can see in the video above.