What do you do when you need to filter a long list of records for your users?

We love and use Django so, most of the time, it is just as simple as:

>>> Entry.objects.filter(title='Man bites dog')

Well, actually that's not very useful. If you don't provide a lookup type it is assumed to be exact, which is not that user friendly. That query would miss 'Man Bites Dog' or "Man Bites Dog's Tail" and many more. That's why one of the most used field lookups is icontains which roughly translates to this SQL1:

SELECT ... WHERE title ILIKE '%Man bites dog%';

That is much better. It works most of the time and you get to keep your comfy ORM. But what if you are looking for an Entry titled 'Man Bites Dogs Tails'? You would still miss it! And what about looking up the Entry's body too?

Enter PostgreSQL's Full Text Search.

Rather than using regular expressions to tackle this problem, we can use the full text search facilities that PostgreSQL offers. From the manual's introduction:

Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack many essential properties required by modern information systems:

  • There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g., satisfies and satisfy. You might miss documents that contain satisfies, although you probably would like to find them when searching for satisfy. It is possible to use OR to search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives).
  • They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
  • They tend to be slow because there is no index support, so they must process all documents for every search.

One of the biggest advantages of using text search is that it can return matches for derived words. Our first example could be written like:

>>> Entry.objects.search('man is biting a tail')
[<Entry: Man Bites Dogs Tails>]

Documents and Queries

What we want is to query our database for documents.

A document is the unit of searching in a full text search system. Documents can be whatever you want them to be. In our case, we can consider a concatenation of title and body columns of entry records a single document.

To keep search fast and efficient, the database uses a compact representation of a document for matching queries, the tsvector. That is, the raw contents of the document preprocessed in a special format, with stop words removed and words reduced to their essential forms.

To query documents we should use a tsquery. These contains search terms which, like a tsvector, must be already-normalized lexemes. Queries may combine multiple terms using AND, OR, and NOT operators.

Text search works by matching a tsvector to a tsquery.

To use them in a query, just use to_tsvector and to_tsquery. We could now write our query like this:

SELECT ... WHERE to_tsvector(COALESCE(title, '') || ' ' || COALESCE(body, '')) @@ to_tsquery('man & bites & dog');

This returns just what we want.

Note that we are using COALESCE to handle possible NULL values, removing them. Also note the form of our tsquery: every word has to be separated with an operator. A simple string like 'man bites dog' would throw a syntax error.

An alternative would be to use plainto_tsquery which takes an arbitrary string, converts it to lexemes and then ANDs them together. Using plainto_tsquery is not a general solution to all search problems but it is enough for now.

A Word on Stemming

To convert the search terms and document text into lexemes, PostgreSQL uses stemming dictionaries which are super helpful if you need to process multilingual data.

To get a list of the installed dictionaries in your PostgreSQL install, use \dFd:

=> \dFd
                             List of text search dictionaries
   Schema   |      Name       |                        Description
------------+-----------------+-------------------------------
 pg_catalog | danish_stem     | snowball stemmer for danish language
 pg_catalog | dutch_stem      | snowball stemmer for dutch language
 pg_catalog | english_stem    | snowball stemmer for english language
 pg_catalog | finnish_stem    | snowball stemmer for finnish language
 pg_catalog | french_stem     | snowball stemmer for french language
 pg_catalog | german_stem     | snowball stemmer for german language
 pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
 pg_catalog | italian_stem    | snowball stemmer for italian language
 pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
 pg_catalog | romanian_stem   | snowball stemmer for romanian language
 pg_catalog | russian_stem    | snowball stemmer for russian language
 pg_catalog | simple          | simple dictionary: just lower case and check for stopword
 pg_catalog | spanish_stem    | snowball stemmer for spanish language
 pg_catalog | swedish_stem    | snowball stemmer for swedish language
 pg_catalog | turkish_stem    | snowball stemmer for turkish language

Most of the text search functions take a language configuration (or a stemmer) name as their first argument.

Stemming works very differently for each language and this can bite you down the road if you are not careful. To test how a word is stemmed:

=> SELECT ts_lexize('english_stem', 'programming');
 ts_lexize 
-----------
 {program}
(1 row)

=> SELECT ts_lexize('spanish_stem', 'programming');
   ts_lexize   
---------------
 {programming}
(1 row)

Notice how programming is not stemmed at all using the spanish stemmer. Using wrong language configurations on a query will get you poor results.

Just Scratching the Surface

We have just learned how to search text on PostgreSQL, but there is so much more topics to cover. Creating indexes to speed up text searches, gathering document statistics, ranking and highlighting results, and more.


Keep on reading the second and last post of this series


Footnotes:

1

ILIKE is not part of the SQL standard but a PostgreSQL extension. You are using PostgreSQL, right? Right?