Many websites and blogs are statically generated and the webserver only serves static files. It is the case of many doc websites and more and more blogs, starting from this one, as Jekyll / Pelican develops.
This is really useful to reduce the complexity of the website and the load on the webserver. All the complex logic is done at the generation.
However, this also means you do not have dynamic pages on your website to handle search queries. Then, you are left with two (or three) choices :
- Use an external search engine, such as an embedded Google search box. This raises some privacy concerns and make you depends on an external service.
- (Use a JS search engine such as the filters provided by Angular.JS. This only works on the displayed content, and is not a real solution).
- Stop worrying about search engine on your website and let the users
grep-ing your website on their computers. This is not the most user-friendly solution…
There are a couple of solutions around, mostly based on Lunr.js which generates an index from the articles available, and use this index for fulltext search. This is the best solution I found so far but it is still not perfect. Although there is a stemmer and an index generation to reduce the amount of data to be transferred, the data is not stored in a very efficient way, and the full index is sent as JSON. An example implementation for Jekyll is available through the jekyll-lunr-js-search plugin.
I had the idea of a client side search engine in mind for a while, but was facing the same problem as Lunr.js: how not to send a full (very large) index over the network to every single client ? Not having an optimized data structure would basically mean sending twice the content of the website to the client. It may not be a practical problem nowadays, as transfer speed is not always the limiting resource, but it is still not to be considered as a good practice, in my opinion, especially if your website might be accessed from mobile devices.
I came accross this article from Stavros Korokithakis and thought something similar could be achieved directly in the browser. Instead of using a standard dictionary to store the index, this article proposes to use a Bloom filter per article. Bloom filters are very interesting probabilistic structures which can store whether an element is or not in a set, with a fixed number of bits. It can return false positives: if an element is in the set, it always returns
True, but if an element is not in the set, it may say it is actually in, with a small probability. Wikipedia page on the subject has all the necessary stuff to understand these data structures.
A demo is available here. It contains all the articles of my blog, as of writing this article, totalizing 160k characters, and only 7kB of index, allowing 10% of false positives, which may be a bit too much for a really reliable search engine. Reducing the error rate will lead to an increase in the index size (11kB for 1% of false positives and the same amount of characters).
Details of the implementation
capacity and an
error_rate, instead of an explicit number of bits and times to apply the hashing function. This forked version is available here.
The generation script takes every articles in a given directory and for each of them:
- It gets a set of all the words in this article, ignoring too short words.
- It applies Porter Stemming Algorithm to reduce drastically the number of words to keep.
- It generates a Bloom filters containing all of these words.
Finally, it concatenates all the per article Bloom filters in a binary file, to be sent to the client. It also generates a JSON index mapping the id of the Bloom filter in the binary file to the corresponding URL and title for each article.
(Fun) facts found while reimplementing the Bloom filters library in Python
Number type and Python
floats) and it is
Number (see this SO thread). They are 64-bit floating point values, with a magnitude no greater than 253. However, when doing bitwise operations, they are casted to 32 bits before doing the operation. This is something to take care of, because Python’s
The solution to this problem was to use
ctypes.c_int in Python for bitwise operations, as proposed here.
Finally, there was no “shift right adding zeros” (logical right shift) in Python, contrary to JS, see this SO thread.