Jaro winkler and levenshtein: the right to make mistake

When you do a search on a search engine, you have probably already made a typo, or reversed two letters. Despite this fault, you still have relevant results coming out. How is it possible ? We are going to see two algorithms that allow your users to make mistakes when they do a search. Or to be able to compare two character strings and see their proximity.

In SQL, when you make database queries using a character string as search criteria, you will have several search possibilities:

  • with the exact string (= ‘theString’)
  • starting with the string (LIKE ‘theString%’)
  • ending with the string (LIKE ‘%theString’)
  • containing the string (LIKE ‘%theString%)

On the other hand, if you make a typing error or a character inversion (eg ‘laChaien’), you will not find any corresponding record. While in fact your search is quite close. There are algorithms to make string comparisons, and give you a level of similarity between two strings.

You can also use these methods if you are doing data transfer, or if you are looking to integrate data that has been entered by hand.

Jaro Winkler

With Jaro Winkler’s distance, you will be able to “measure” the similarity of two character strings. The result obtained will be between 0 and 1, 1 being a perfect equality between the two strings.

If you want to see exactly how it works under the hood, you can check out the Wikipedia entry .

If you want to use it in your developments, you can find the source codes on the net. For example the C source code , or if you use MySQL functions and procedures the MySQL function source code .

Be careful though, these character comparison functions are quite resource intensive, and can take time to execute. Especially if you want to compare one user input to thousands or millions of entries based on a string.

Here is what gives some executions of the function in MySQL with different strings:

Research Jaro Winkler

Levenshtein

The Levenshtein distance indicates the number of manipulations (addition, deletion, replacement) that must be done to get from one string to another. So 0 is a perfect equality, and the more the number increases, the more the strings are different.

If you want to know more, you can go to the Wikipedia page .

To implement it, you can find the code on the web, for example the source code based on MySQL . Same remark as for Jaro Winkler, these are consuming functions.

Here is what gives some executions of the function in MySQL with different strings:

Research Levenshtein

Leave a Reply