Similarity search YSQL
Similarity matching works by determining how similar two different strings are. This can be helpful when you don't know the exact spelling of your query term and can be used to design spell checkers.
Setup
Setup
To set up a local universe, refer to Set up a local YugabyteDB universe.Setup
To set up a cluster, refer to Set up a YugabyteDB Aeon cluster.Setup
To set up a universe, refer to Set up a YugabyteDB Anywhere universe.Create the following table:
CREATE TABLE words (
id SERIAL,
word TEXT NOT NULL,
PRIMARY KEY(id)
);
Load some sample words into the table as follows:
INSERT INTO words(word) VALUES
('anopisthographic'),('ambassadorship'),('ambuscader'),('ambiguity'),('ampycides'),
('anapaestically'),('anapests'),('anapsidan'),('unpsychic'),('anapsid'),
('unpessimistic'),('unepistolary'),('inabstracted'),('anapaest'),('unobstinate'),
('amphigoric'),('amebic'),('amebous'),('ambassage'),('unpacified'),('unposing');
Levenshtein
The Levenshtein distance is a measure of the difference between 2 strings. It calculates the difference by considering the number of edits (insertions, deletions, and substitutions) needed for one string to be transformed into another. This is particularly well suited to spell-checks. This function is provided by the PostgreSQL extension fuzzystrmatch.
To enable the Levenshtein function, activate the fuzzystrmatch
extension as follows:
CREATE extension IF NOT EXISTS fuzzystrmatch;
For example, to identify how the different mis-spellings of the warehouse
, do the following:
SELECT word, levenshtein('warehouse', word) FROM words WHERE levenshtein('warehouse', word) < 3 ORDER BY levenshtein('warehouse', word) ASC LIMIT 10;
word | levenshtein
------------+-------------
warehouse | 0
warmhouse | 1
warehoused | 1
warehouser | 1
warhorse | 2
tirehouse | 2
washhouse | 2
carhouse | 2
firehouse | 2
gatehouse | 2
The Levenshtein scoring for warehoused
is 1
because it has one more character (d
) than warehouse
. The scoring is 2
for wayhouse
because it needs two edits (r->y
and del e
).
Trigrams
A trigram is a group of three consecutive characters taken from a string. You can measure the similarity of two strings by counting the number of trigrams they share. The pg_trgm extension provides multiple functions like show_trgm
and similarity
, which provide a score of how similar two strings are.
For example, the trigrams for warehouse
would be as follows:
{" w"," wa",are,eho,hou,ous,reh,"se ",use,war}
To enable the trigrams functionality, activate the pg_trgm
extension:
CREATE extension IF NOT EXISTS pg_trgm;
For example, suppose you are looking for words with spelling close to geodymamist
. To get the actual word, you could do the following:
SELECT word, similarity(word, 'geodymamist') as score FROM words ORDER BY score DESC LIMIT 5;
word | score
---------------+----------
geodynamicist | 0.444444
geodesist | 0.375
geochemist | 0.352941
geodynamics | 0.333333
geologist | 0.294118
To match word boundaries by avoiding cross-word trigrams, you can use the strict_word_similarity
function. For example:
SELECT strict_word_similarity('word', 'two words'), similarity('word', 'two words');
strict_word_similarity | similarity
------------------------+------------
0.571429 | 0.363636
The strict_word_similarity
is higher than the similarity
as it gave higher importance to the presence of the exact term word
in both strings.
pg_trgm
supports the gin_trgm_ops
operator class, which is specifically designed for GIN (Generalized Inverted Index) indexes to efficiently handle trigram-based similarity searches. gin_trgm_ops
improves the performance of LIKE
and ILIKE
queries by extracting the trigrams and indexing them in the GIN index.
For example, take the following query:
SELECT * FROM my_table WHERE my_column LIKE '%search_term%';
To improve its performance, create the following index:
CREATE INDEX idx_gin_trgm ON my_table USING gin (my_column gin_trgm_ops);
gin_trgm_ops
is a good choice.