If you missed it, read the first post of this series.


What do you do when you need to filter a long list of records for your users?

That was the question we set out to answer in a previous post. We saw that, for simple queries, the built-in filtering provided by your framework of choice (think Django) is just fine. Most of the time, though, you'll need something more powerful. This is where PostgreSQL's full text search facilities come in handy.

We also saw that the to_tsvector and to_tsquery functions alone go a long way toward filtering your records. But what about documents that contain accented characters? What can we do to optimize performance? And how do we integrate all of this with Django?

Hola, Mundo!

We have found that the need to search documents in multiple languages is fairly common. You can query your data with to_tsquery without passing a language configuration name, but remember that, under the hood, the text search functions always use one.

The default configuration is english, but you have to use the stemmer that matches your document's language or you might not get any matches.

If, for example, we search for física in Spanish documents that contain this word and its variations, we only get exact matches:

=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('física');
                                  text                                   
-------------------------------------------------------------------------
 física (aparatos e instrumentos de —)
 física (educación —)
 física (investigación en —)
 rehabilitación física (aparatos de —) para uso médico
 educación física
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(8 rows)

And worse, if we search just for fisica, unaccented:

=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('fisica');
 text 
------
(0 rows)

To get results that contain a variation of física (físicas, físico, físicamente, etc.) we have to use the right stemmer. Remember that if a stemmer doesn't have a word in its dictionary, it won't stem it:

=> SELECT ts_lexize('english_stem', 'programming');
 ts_lexize 
-----------
 {program}
(1 row)

=> SELECT ts_lexize('spanish_stem', 'programming');
   ts_lexize   
---------------
 {programming}
(1 row)

But if we use the right language, it will:

=> SELECT text FROM terms WHERE to_tsvector('spanish', text) @@ to_tsquery('spanish', 'física');
                                      text                                      
--------------------------------------------------------------------------------
 física (aparatos e instrumentos de —)
 ejercicios físicos (aparatos para —)
 entrenamiento físico (aparatos de —)
 físicos (aparatos para ejercicios —)
 física (educación —)
 preparador físico personal [mantenimiento físico] (servicios de —)
 física (investigación en —)
 ejercicio físico (aparatos de —) para uso médico
 rehabilitación física (aparatos de —) para uso médico
 aparatos para ejercicios físicos
 almacenamiento de soportes físicos de datos o documentos electrónicos
 clases de mantenimiento físico
 clubes deportivos [entrenamiento y mantenimiento físico]
 educación física
 conversión de datos o documentos de un soporte físico a un soporte electrónico
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(18 rows)

Working with multiple languages

Of course, we don't want to specify the language manually every time. A straightforward solution is to store each record's language in its own column. to_tsvector and friends accept the configuration not only as a string but also as a column reference. So we could write:

=> SELECT text FROM terms WHERE to_tsvector(language, text) @@ to_tsquery(language, 'entrena');
                           text                           
----------------------------------------------------------
 entrenamiento físico (aparatos de —)
 clubes deportivos [entrenamiento y mantenimiento físico]
(2 rows)

The only catch here is that the column must be of type regconfig. If you are using South or Django migrations to manage your database schema, remember to change the type of your language column:

=> ALTER TABLE terms ALTER COLUMN language TYPE regconfig USING language::regconfig;
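If you manage your schema with Django migrations, the same ALTER can live in a RunSQL operation. This is a sketch; the app label and the previous migration name are placeholders for your project:

```python
from django.db import migrations


class Migration(migrations.Migration):
    # "terms" and "0001_initial" are placeholders for your app and migration
    dependencies = [("terms", "0001_initial")]

    operations = [
        migrations.RunSQL(
            sql="ALTER TABLE terms ALTER COLUMN language "
                "TYPE regconfig USING language::regconfig;",
            reverse_sql="ALTER TABLE terms ALTER COLUMN language "
                        "TYPE varchar(20) USING language::text;",
        ),
    ]
```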

This way we can query records in their own language.

Even Better: Accented Characters

But what about accented words? If our users don't type them carefully (most users don't), they won't find anything.

If we search for the word fisica again, nothing shows up:

=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('fisica');
 text 
------
(0 rows)

We found this behavior funny (not in a good way): some words work and some don't. It seems to depend on the completeness of the dictionary in use, but we can't rely on a word being in the dictionary. It would be hit or miss (and, in our experience, mostly miss).

But don't worry. There is a PostgreSQL extension for that™.

CREATE EXTENSION unaccent;

The unaccent extension can be used to extend the default language configurations so that accents are stripped from characters before stemming:

CREATE TEXT SEARCH CONFIGURATION sp ( COPY = spanish );
ALTER TEXT SEARCH CONFIGURATION sp ALTER MAPPING FOR hword, hword_part, word WITH unaccent, spanish_stem;

And, behold! It works great:

=> SELECT text FROM terms WHERE to_tsvector('sp', text) @@ to_tsquery('sp', 'fisica');
                                      text                                      
--------------------------------------------------------------------------------
 física (aparatos e instrumentos de —)
 ejercicios físicos (aparatos para —)
 entrenamiento físico (aparatos de —)
 físicos (aparatos para ejercicios —)
 física (educación —)
 preparador físico personal [mantenimiento físico] (servicios de —)
 física (investigación en —)
 ejercicio físico (aparatos de —) para uso médico
 rehabilitación física (aparatos de —) para uso médico
 aparatos para ejercicios físicos
 almacenamiento de soportes físicos de datos o documentos electrónicos
 clases de mantenimiento físico
 clubes deportivos [entrenamiento y mantenimiento físico]
 educación física
 conversión de datos o documentos de un soporte físico a un soporte electrónico
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(18 rows)
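To build some intuition for what unaccent does to the lexemes before they reach the stemmer, here is a rough Python analogue: decompose each character (NFKD) and drop the combining accent marks. This is only an illustration; the extension itself uses its own rules file, not Unicode normalization:

```python
import unicodedata


def strip_accents(text):
    # Decompose accented characters into base letter + combining mark,
    # then keep only the non-combining characters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("física"))     # fisica
print(strip_accents("educación"))  # educacion
```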

Note that you must be a superuser to install this extension on a regular installation (not on Heroku Postgres). If PostgreSQL complains about not finding the extension, install the postgresql-contrib package for your distribution.

A Note on Performance

To make text search perform well on a large dataset, consider creating a GIN index:

CREATE INDEX terms_idx ON terms USING gin(to_tsvector(language, text));

GIN indexes take longer to build than GiST indexes, but lookups are a lot faster.

Wrapping Up: Using It in Django

Finally, to make all of this easy to use from Python, we wrote a custom queryset and used django-model-utils to attach it to the model's manager:

from django.db import models
from django.utils.translation import ugettext_lazy as _
from model_utils.managers import PassThroughManager


class TermSearchQuerySet(models.query.QuerySet):
    def search(self, query, raw=False):
        # plainto_tsquery parses plain text; to_tsquery expects tsquery syntax
        function = "to_tsquery" if raw else "plainto_tsquery"
        # pass the user's input as a bound parameter to avoid SQL injection
        where = "to_tsvector(language, text) @@ %s(language, %%s)" % function
        return self.extra(where=[where], params=[query])


class Term(models.Model):
    text = models.CharField(_('text'), max_length=255)
    language = models.CharField(max_length=20, default='sp', editable=False)

    objects = PassThroughManager.for_queryset_class(TermSearchQuerySet)()

Remember our original example from the first part?

>>> Entry.objects.search('man is biting a tail')
[<Entry: Man Bites Dogs Tails>]