Implementing a Search Functionality using PostgreSQL
Imagine you are building a search functionality for an e-commerce platform. Users often make typos or misspell product names, leading to poor search results. To improve user experience, you want the search engine to return relevant products even when the search query contains errors. This is where PostgreSQL similarity operations come in handy.
Example
Let’s imagine that the user wrote “samsang glaxy phone” in the search box. Despite the typos, most modern applications would return “Samsung Galaxy Phone.”
Create a table to hold the products:
CREATE TABLE product (
id SERIAL PRIMARY KEY,
name TEXT
);
- id: the primary key column is automatically incremented with each new row due to the SERIAL keyword applied.
- name: the names of the products that are inserted into this table
Now let’s create a few products:
INSERT INTO product (name) VALUES
('Samsung Galaxy Phone'),
('iPhone 14'),
('Google Pixel 7'),
('OnePlus 11'),
('Samsung Galaxy Tab');
Your table should look something like:
id | name
----+----------------------
1 | Samsung Galaxy Phone
2 | iPhone 14
3 | Google Pixel 7
4 | OnePlus 11
5 | Samsung Galaxy Tab
For the search query “samsung glaxy phone”, you might first try to use LIKE
to find products that roughly match the query:
SELECT name
FROM product
WHERE name LIKE '%samsung%galaxy%phone%'
-- find products similar to the search query "samsang glaxy phone"
Wildcards (‘%’) can match any sequence of characters. This query searches for product names containing the words “samsung”, “galaxy”, and “phone” in that order, with any characters in between. You quickly see how the query becomes harder to read and upgrade.
The result will be Samsung Galaxy Phone
.
If the user made a typo, and the search query becomes “samsang glaxy phone”, you won’t get any matching results. Why is that?
Limitations of the LIKE operator
The rigid LIKE
operator requires exact ordering and cannot handle typos:
The LIKE
pattern is strict about the ordering of substrings. It doesn’t handle variations in the order of words well. "samsung galaxy phone" must appear in the exact order, which might not match if users input variations like "phone samsung galaxy".
There is no support for fuzzy matching or approximate string comparison. As mentioned above, in case of a typo, a search string like “samsang glaxy phone” wouldn’t match “Samsung Galaxy Phone” using LIKE
.
Why would you do this in the Postgres database and not in the application layer or in Elasticsearch?
Making the application layer more complex might not always be a suitable long-term solution, especially if budgets are tight. Some companies view tools like Elasticsearch with skepticism, considering them costly due to expenses related to provisioning servers, storage, and the need for specialized personnel.
<rant>Today, many organizations prefer ambitious, almost non-technical staff willing to learn SQL with 2 years of experience, combined with Gemini or ChatGPT, lacking understanding of the basic client-server architecture. In such cases, every “new” technology is difficult to introduce because of the learning curve. </rant>
Solution in Postgres
pg_trgm
is an extension that provides functions and operators for fuzzy text searching. It uses trigrams to do similarity searches and approximate string matching.
Trigrams are sequences of three consecutive characters, and they can be used to compare and measure the similarity between strings. Trigrams are often used in text processing and similarity matching because they handle variations in text, different orders, common misspellings and they doesn’t conform to standard language rules (acronyms, jargon, abbreviations).
The trigrams from the product table “Samsung Galaxy Phone” will be:
"Sam", "ams", "msu", "sun", "ung", "nga", "gal", "aly", "lax", "axy", "xy ", "y G", " Ga", "Gal", "ala", "lax", "axy", "xy ", "y P", " Ph", "Pho", "hon", "one"
The trigrams from the search query “samsang galaxy phone” will be:
"sam", "ams", "msa", "sang", "ang", "nga", "gal", "ala", "lax", "axy", "xy ", "y p", " ph", "pho", "hon", "one"
To find matches, compare the items in the two lists above against each other. Here’s the list of trigrams that are present in both:
"ams", "nga", "gal", "ala", "lax", "axy", "xy ", "hon", "one"
If you want to do this in Postgres, you would first install the extension and create a trigram index on the product table’s name column:
Install the extension pg_trgm:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Create a trigram index on the name column
CREATE INDEX trgm_idx ON product USING gin (name gin_trgm_ops);
The pg_trgm
extension does not automatically lowercase text for you when it constructs trigrams or performs similarity searches. It is a good practice to normalize for case sensitivity, by converting the string to lowercase before creating the index:
-- Create a trigram index on a lowercase version of the text
CREATE INDEX trgm_idx ON product USING gin (LOWER(name) gin_trgm_ops);
LOWER(name)
in the index will treat the indexed text to as lowercase, and prepares further similarity matching to be case-insensitive.
You can find the records similar to your search string using trigram similarity. PostgreSQL will compute a similarity score based on the trigrams of the passed search query and compare it with the trigrams created based on the names column in the product table.
-- Perform a similarity search
SELECT *
FROM product
WHERE name % 'samsang galaxy phone';
-- Perform a similarity search with case insensitivity
SELECT *
FROM product
WHERE LOWER(name) % LOWER('Samsang galaxy pHonE');
For this, the %
operator is used and PostgreSQL will return rows where the similarity score exceeds a certain threshold.
The threshold is a number between 0 and 1, where 0 means that all results are returned regardless of the similarity, 1 would mean that only the exact matches are returned. The default threshold is 0.3, which means that PostgreSQL will return rows where the similarity score between the passed search string and the observed string is 0.3 or higher.
To make the similarity matching more or less strict, there is an option to change the threshold:
-- Display the current threshold
SHOW pg_trgm.similarity_threshold;
-- Set a custom threshold for the session
SET pg_trgm.similarity_threshold = 0.5;
-- Do similarity search
SELECT *
FROM product
WHERE name % 'samsang galaxy phone';
After this step, I recommend to play around with this. Document what you have done, drop the indexes and recreate them with a slight change, alter the threshold, make a few other changes and repeat. That is how you will achieve the best results.
Performance Analysis
Sometimes the queries will be slow. As always, you can use EXPLAIN ANALYZE to understand what is going on below the surface:
EXPLAIN ANALYZE
SELECT *
FROM products
WHERE LOWER(name) % LOWER('samsang galaxy phone');
Tip: To understand more about Explain Analyze, check out this article.
Large indexes and complex queries can consume a lot of disk space and memory. This makes sense when you think about the fact that multiple trigrams are generated for each text entry. This increases the size of your index (or indexes). Every write operation, whether it’s an INSERT, DELETE or UPDATE, will result with rebuilding the index. This process can cause slowdowns in other query executions.
Sometimes, when you have enough disk space but the query is still slow, the solution can be as simple as adjusting the similarity threshold, ending up with a smaller but more accurate result set. You might even improve query performance without compromising the quality of your results.
When adjusting the threshold doesn’t meet your business requirements, or if it isn’t enough to optimize performance, you may need to look at your memory settings in the postgresql.conf
file. Allocating more memory for operations by tuning work_mem
and shared_buffers
can improve performance, especially for larger datasets.
It is up to you to find a balance between disk space, memory allocation, and query precision.
Trigrams are just one of many powerful tools available for enhancing search capabilities. Next time, I might explore other similarity measures and techniques. I am thinking about Levenshtein distance and soundex, that can help refine search functionality even more.
Yours, BatCat