Sexy search
The goal of this MR is to solve search issue #96. Let's assume we have a user with first name Jean-François, last name Du Pont and nickname Ai'gnan. Here is a list of searches that previously did not return him but now do (search was, and still is, case-insensitive):
- jean françois (missing -)
- jean-francois (missing ç)
- jean francois (both)
- dupont (missing space)
- françois (not the start of his name)
- aignan (missing ')
You get it: there are a lot of mistakes a human can make. The search also sorts results by User.last_update to avoid putting old accounts at the top of common requests (such as first-name-only or last-name-only requests).
How it works
For those who don't know, search is handled by Xapian (the search backend) through the haystack library, which provides a Django-friendly interface to multiple search backends. Xapian maintains a partial copy of the database (only the models we want to search against), optimised for search operations. Its "models" are called "indexes" (see core.search_indexes.UserIndex for the user model).
Every time a user is created or modified, it is indexed (through a signal handler) so that Xapian knows about it. For the user search, what is indexed is the string rendered by the core/templates/search/indexes/core/user_auto.txt template. For our example above, it looks like this:
jean francois
du pont
aignan
jeanfrancois
dupont
jeanfrancoisdupont
As you can see, accents and punctuation are removed and everything is lowercased. There are also near-duplicates with different spacing because we are using an autocomplete algorithm, which matches from the beginning of words.
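The transformation can be sketched in plain Python. This is only an illustration of the output shown above, not the actual template code: the names `normalize` and `index_text` are hypothetical, and the rules (hyphens become spaces, other punctuation is dropped) are inferred from the example.

```python
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, turn hyphens into spaces, drop other punctuation.

    Assumption: these rules are inferred from the example output only.
    """
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop combining marks (e.g. the cedilla left over from "ç").
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    spaced = without_accents.replace("-", " ")
    kept = "".join(c for c in spaced if c.isalnum() or c.isspace())
    return " ".join(kept.lower().split())


def index_text(first_name: str, last_name: str, nick_name: str) -> str:
    """Build the six indexed lines: normalized names plus their space-less variants."""
    first, last, nick = (normalize(v) for v in (first_name, last_name, nick_name))
    first_joined = first.replace(" ", "")
    last_joined = last.replace(" ", "")
    lines = [first, last, nick, first_joined, last_joined, first_joined + last_joined]
    return "\n".join(lines)


print(index_text("Jean-François", "Du Pont", "Ai'gnan"))
```

Run on the example user, this prints the same six lines as listed above.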
The one I am not sure about is the last one. Its goal is to allow searching without putting a space between the firstname and lastname. Is this useful?
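To get a feel for what that last line buys, here is a rough model of prefix matching in plain Python. It is a simplification, not how Xapian or haystack actually work internally: it assumes the query is normalized the same way as the indexed text and that every query word must be a prefix of some indexed word. The helper names are hypothetical.

```python
import unicodedata

# The indexed text for the example user, one entry per line.
INDEXED_LINES = [
    "jean francois",
    "du pont",
    "aignan",
    "jeanfrancois",
    "dupont",
    "jeanfrancoisdupont",
]


def normalize(value):
    """Strip accents, turn hyphens into spaces, drop punctuation, lowercase."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    stripped = stripped.replace("-", " ")
    return "".join(c for c in stripped.lower() if c.isalnum() or c.isspace())


def matches(query, lines):
    """True if every word of the query is a prefix of some indexed word."""
    words = [word for line in lines for word in line.split()]
    return all(
        any(word.startswith(token) for word in words)
        for token in normalize(query).split()
    )


# A query that runs first name and last name together:
print(matches("jeanfrancoisdu", INDEXED_LINES))       # with the last line
print(matches("jeanfrancoisdu", INDEXED_LINES[:-1]))  # without it
```

Under this model, a query such as jeanfrancoisdu only matches when the concatenated line is indexed, while all the searches listed at the top match either way.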
Prod will have to run ./manage.py update_index; I am not sure the upgrade script does it.