Proof-of-concept: BloomySearch, a (JavaScript) client-side search engine for static websites

Posted on November 08, 2014 in Dev • 5 min read

Overview

Many websites and blogs are statically generated and the webserver only serves static files. It is the case of many doc websites and more and more blogs, starting from this one, as Jekyll / Pelican develops.

This is really useful to reduce the complexity of the website and the load on the webserver. All the complex logic is done at the generation.

However, this also means you do not have dynamic pages on your website to handle search queries. Then, you are left with two (or three) choices :

  1. Use an external search engine, such as an embedded Google search box. This raises some privacy concerns and make you depends on an external service.
  2. (Use a JS search engine such as the filters provided by Angular.JS. This only works on the displayed content, and is not a real solution).
  3. Stop worrying about search engine on your website and let the users wget-ing and grep-ing your website on their computers. This is not the most user-friendly solution…

There are a couple of solutions around, mostly based on Lunr.js which generates an index from the articles available, and use this index for fulltext search. This is the best solution I found so far but it is still not perfect. Although there is a stemmer and an index generation to reduce the amount of data to be transferred, the data is not stored in a very efficient way, and the full index is sent as JSON. An example implementation for Jekyll is available through the jekyll-lunr-js-search plugin.

I had the idea of a client side search engine in mind for a while, but was facing the same problem as Lunr.js: how not to send a full (very large) index over the network to every single client ? Not having an optimized data structure would basically mean sending twice the content of the website to the client. It may not be a practical problem nowadays, as transfer speed is not always the limiting resource, but it is still not to be considered as a good practice, in my opinion, especially if your website might be accessed from mobile devices.

I came accross this article from Stavros Korokithakis and thought something similar could be achieved directly in the browser. Instead of using a standard dictionary to store the index, this article proposes to use a Bloom filter per article. Bloom filters are very interesting probabilistic structures which can store whether an element is or not in a set, with a fixed number of bits. It can return false positives: if an element is in the set, it always returns True, but if an element is not in the set, it may say it is actually in, with a small probability. Wikipedia page on the subject has all the necessary stuff to understand these data structures.

I wrote it in the context of my blog, which means a Python script to generate the index at pages generation, and a client side search engine in JavaScript, running in browser.

A demo is available here. It contains all the articles of my blog, as of writing this article, totalizing 160k characters, and only 7kB of index, allowing 10% of false positives, which may be a bit too much for a really reliable search engine. Reducing the error rate will lead to an increase in the index size (11kB for 1% of false positives and the same amount of characters).

Details of the implementation

As JavaScript is not the easier language to use for hashing and binary data manipulation, I started by implementing the client side search engine. Then, it would be easier to adapt the Python code to the JS lib than doing the contrary. Actually, I found this bloomfilters.js library from Jason Davies which was doing most of the job and did not need many modifications. I edited it a bit to support a construction with a capacity and an error_rate, instead of an explicit number of bits and times to apply the hashing function. This forked version is available here.

Then, I reimplemented this library in Python, to generate readable Bloom filters for the JavaScript script.

Server side

The generation script takes every articles in a given directory and for each of them:

  1. It gets a set of all the words in this article, ignoring too short words.
  2. It applies Porter Stemming Algorithm to reduce drastically the number of words to keep.
  3. It generates a Bloom filters containing all of these words.

Finally, it concatenates all the per article Bloom filters in a binary file, to be sent to the client. It also generates a JSON index mapping the id of the Bloom filter in the binary file to the corresponding URL and title for each article.

Client side

Upon loading, the JavaScript script downloads the binary file (see this MDN doc for more details) containing the Bloom filters and the JSON index, and regenerate BloomFilters on the client side.

When the client searches for something, the JavaScript script splits the query in words and iterate over the Bloom filters to search for the words. That’s it =)

(Fun) facts found while reimplementing the Bloom filters library in Python

First problem I had to deal with : the difference between JavaScript Number type and Python int. JavaScript has only one type for all numbers (int or floats) and it is Number (see this SO thread). They are 64-bit floating point values, with a magnitude no greater than 253. However, when doing bitwise operations, they are casted to 32 bits before doing the operation. This is something to take care of, because Python’s int can be 64 bits (http://legacy.python.org/dev/peps/pep-0237/). Then, when a bitwise operation overflows in JavaScript, it may not overflow the same way in Python.

The solution to this problem was to use ctypes.c_int in Python for bitwise operations, as proposed here.

Another problem was the difference between modulo behaviour with negative numbers in Python and in JavaScript. Unlike C, C++ and JavaScript, Python’s modulo operator (%) always return a number having the same sign as the divisor (Source). Then, we have to reimplement the C behaviour in a modulo function in Python.

Finally, there was no “shift right adding zeros” (logical right shift) in Python, contrary to JS, see this SO thread.