Khmer word spell correction using BK-Tree data structure and Levenshtein distance

Engleang Sam
4 min readFeb 20, 2021

--

How to write the spell correction in a simple code?

This question can be answered based on BK-Tree data structure and distance algorithm called “Levenshtein distance” and we will study below one by one.

What is Levenshtein distance ?

This Levenshtein distance is calculated difference between String and can be called as edit distance.

For example , to compare between “hello” and “hell” what is the distance , we can see that only one character need to be added to “hell” to become “hello”. Therefore , Levenshtein distance between “hello” and “hell” should be equal to 1.

Total distance is the total of edit, delete and add operation to the any one of string to be the same as another one. Below is the math equation :

Levenshtein equation , from Wikipedia

Another example of edit distance in Khmer word between “សូរ” and “សូម” which elaborated as “ ស +ូ+ រ” and “ស + ូ+ ម” and lead to only one edit difference by replacing between “ រ “ and “ ម “. So the result of Levenshtein distance is 1.

The above equation can be coded as Java method below:

What is BK-Tree ?

According to geeksforgeeks.org , BK-Tree is the data structure that is used to perform spell check based on Edit Distance (Levenshtein distance) concept.

The famous search engine library Apache Lucene also leverage this data structure for their fuzzy search( best matched file searching ).

It stores the word inside a tree node pointer and put the edit distance as mapping from node to either its root or its parent. The below image is for sample visualized of BK-Tree.

BK-Tree visualized node

Crucial API(s):

Adding the node to Bk-Tree , it does following steps :

Step 1: compute the Levenshtein distance between pointing Node( current Node or Root Node if it’s first calculation) ‘s word and adding word

Step 2: if the distance is not found yet in the map , it will add to the current Node ( root or parent node ) map , otherwise it will find the child of current Node ( root or parent ) with the same distance and repeat the Step 1 by trying to add the word to child node as current node ( recursive process ).

BK-Tree exposes query function for searching the nearest distance of the word based on threshold as following steps:

Step 1 : compute the Levenshtein distance between the pointing Node ( current node or Root Node if it’s first calculation)

Step 2: if the distance less than or equal threshold add it to the result collection

Step 3: for each score that greater than 0 and less than current distance + threshold , for each children mapped to this score do step 1 for this children and the word ( point child as current node , recursive process ) .

Note : BK-Tree is based on existing word dictionary in order to construct and map it.

Below is the full definition of the BK Tree in Java:

By running experiment code above, it got the result like this :

Words list that near to the some incorrect words with its distance

BK-Tree can solve spell correction and suggest the correct word instantly based on the edit distance . However , the accuracy still need to enhance as some word can be consider as very similar in term of context , word root or paired consonant in Khmer language , but calculated naively. Probability based can also be studied and experimented such as this article : http://norvig.com/spell-correct.html

In conclusion , BK-Tree introduces a very elegant data structure and algorithm to find the correct spelling which can be implemented with a great time complexity :O(L1*L2*log n) while L1 and L2 are the length of each string to be calculate respectively and n is the length of dictionary.

The full code and data source can be found here https://github.com/engleangs/bktree

References :

https://issues.apache.org/jira/secure/attachment/12431059/BKTree.java

https://www.geeksforgeeks.org/bk-tree-introduction-implementation/

Data source based on this paper : https://www.researchgate.net/publication/301407010_Khmer_word_segmentation_based_on_Bi-directional_Maximal_Matching_for_Plaintext_and_Microsoft_Word_document

--

--

Engleang Sam

Sr. Digital Development Manager @MJQE , Like to keep up with technologies such as algorithms , backend development , solution architecture.