Knuth-Morris-Pratt (KMP) Algorithm For Pattern Searching

Avishek Patra
7 min read · Mar 19, 2020

Introduction

Pattern searching is an important problem in computer science. When we search for a string in a notepad/Word file, a browser, or a database, pattern searching algorithms are used to show the search results. Here is a list of the algorithms available for this job:

1. Naive string-search algorithm:
Preprocessing time: none || Matching time: Θ(n*m).

2. Rabin–Karp algorithm:
Preprocessing time: Θ(m) || Matching time: average Θ(n + m), worst Θ((n−m)*m).

3. Knuth–Morris–Pratt algorithm (KMP):
Preprocessing time: Θ(m) || Matching time: Θ(n).

4. Boyer–Moore string-search algorithm:
Preprocessing time: Θ(m + k) || Matching time: best Ω(n/m), worst O(m*n).

5. Z-algorithm:
Time complexity: O(m + n) || Space complexity: O(m).

Generally, the best algorithm for pattern searching is the KMP algorithm, in terms of both time complexity and space complexity.

The time complexity is O(m + n),

where m = length of the pattern

and n = length of the text to be searched.

We will learn more about the KMP algorithm, along with its implementation, later in this tutorial.

Understanding the Algorithm

Let’s look at the following problem statement.

Input String: The world is completely devastated amid the outbreak of coronavirus.

Now we have to find whether the string contains the following pattern:

Pattern String: coronavirus

The traditional way of doing this is the Naive algorithm.

Let's see what the Naive algorithm says.

Naive Algorithm:
Slide the pattern over the text one position at a time and check for a match. If a match is found, slide by 1 again to check for subsequent matches. This means we start with the first character of the input string and the first character of the pattern string; whenever the characters match, we increment both indices and look at the next character. Let's look at the best and worst case of this algorithm.
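
As a quick sketch (my own illustration in TypeScript, not taken from the article's repository; the function name naiveSearch is hypothetical), the Naive approach could look like this:

function naiveSearch(txt: string, pat: string): number {
  const n = txt.length;
  const m = pat.length;
  // Try every possible starting position of the pattern in the text.
  for (let i = 0; i <= n - m; i += 1) {
    let j = 0;
    // Compare the pattern character by character at this position.
    while (j < m && txt[i + j] === pat[j]) {
      j += 1;
    }
    if (j === m) {
      return i; // full match found at index i
    }
  }
  return -1; // pattern not found
}

For example, naiveSearch("AABCCAADDEE", "FAA") returns -1, since the pattern never occurs in the text.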

What is the best case?
The best-case occurs when the first character of the pattern is not present in the text at all.

txt[] = "AABCCAADDEE";

pat[] = "FAA";

The number of comparisons in the best case is O(n).

What is the worst case?
The worst case of Naive Pattern Searching occurs in the following scenarios.
1) When all characters of the text and pattern are the same.

txt[] = "AAAAAAAAAAAAAAAAAA";

pat[] = "AAAAA";

2) The worst-case also occurs when only the last character is different.

txt[] = "AAAAAAAAAAAAAAAAAB";

pat[] = "AAAAB";

The number of comparisons in the worst case is O(m*(n-m+1)). Although strings which have repeated characters are not likely to appear in English text, they may well occur in other applications (for example, in binary texts).
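
For example, with txt[] = "AAAAAA" (n = 6) and pat[] = "AAA" (m = 3), every one of the n−m+1 = 4 windows matches all m = 3 characters, so the Naive approach performs 3 * 4 = 12 character comparisons.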

KMP (Knuth Morris Pratt) Pattern Searching

The Naive pattern searching algorithm doesn’t work well in cases where we see many matching characters followed by a mismatching character. Following are some examples.

txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)

The KMP matching algorithm uses the degenerating property of the pattern (the same sub-patterns appearing more than once in the pattern) and improves the worst-case complexity to O(n).

The basic idea behind KMP’s algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the characters that we know will anyway match. Let us consider the below example to understand this.

Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"
We compare first window of txt with pat
txt = "AAAAABAAABA"
pat = "AAAA" [Initial position]
We find a match. This is the same as Naive string matching.
In the next step, we compare the next window of txt with pat.
txt = "AAAAABAAABA"
pat = "AAAA" [Pattern shifted one position]

This is where KMP does optimization over Naive. In the second window, we only compare the fourth A of the pattern with the fourth character of the current window of the text to decide whether the current window matches or not. Since we know the first three characters will match anyway, we skip matching them.

The Need for Preprocessing

An important question arises from the above explanation: how do we know how many characters to skip? To know this, we pre-process the pattern and prepare an integer array lps[] (also called the Pi table) that tells us the count of characters to be skipped.

Preprocessing Overview:

  • The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m (the same as the size of the pattern), which is used to skip characters while matching.
  • The name lps indicates the longest proper prefix which is also a suffix. A proper prefix is a prefix that is not the whole string. For example, prefixes of “ABC” are “ ”, “A”, “AB” and “ABC”. Proper prefixes are “ ”, “A” and “AB”. Suffixes of the string are “ ”, “C”, “BC” and “ABC”.
  • We search for the lps in sub-patterns. More precisely, we focus on sub-strings of the pattern that are both a prefix and a suffix.
  • For each sub-pattern pat[0..i], where i = 0 to m-1, lps[i] stores the length of the longest proper prefix which is also a suffix of the sub-pattern pat[0..i].

Note: lps[i] could also be defined as the longest prefix which is also a proper suffix. We need to use “proper” in one of the two places to make sure that the whole substring is not considered.
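
As an illustration (my own example, following the definition above), for the pattern “ABABAC” used earlier:

pat[] = "ABABAC"
lps[] = {0, 0, 1, 2, 3, 0}

For instance, lps[4] = 3 because the sub-pattern “ABABA” has “ABA” as a proper prefix that is also a suffix, while lps[5] = 0 because no proper prefix of “ABABAC” is also a suffix of it.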

Searching Algorithm:
Unlike the Naive algorithm, where we slide the pattern by one and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to not match a character that we know will match anyway.

How to use lps[] to decide the next positions (or to know a number of characters to be skipped)?

We start comparing pat[j], with j = 0, against the characters of the current window of the text. We keep matching characters txt[i] and pat[j] and keep incrementing i and j while pat[j] and txt[i] keep matching.
When we see a mismatch:
We know that the characters pat[0..j-1] match txt[i-j…i-1] (note that j starts at 0 and is incremented only when there is a match).
We also know (from the above definition) that lps[j-1] is the count of characters of pat[0…j-1] that are both a proper prefix and a suffix.
From the above two points, we can conclude that we do not need to match these lps[j-1] characters with txt[i-j…i-1], because we know that these characters will match anyway. Let us consider the above example to understand this.

txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}

i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1

i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue like this until we find a match or reach the end of the string.

Implementation (TypeScript)

Discussed here is the implementation of the KMP algorithm in TypeScript.
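
In case the embedded gist is not visible, below is a sketch of what such an implementation could look like, written from the walkthrough above and matching the knuthMorrisPratt/buildPatternTable names used in the test file (it is not copied verbatim from the linked repository):

// Builds the lps[]/pi table: patternTable[i] is the length of the longest
// proper prefix of word[0..i] that is also a suffix of word[0..i].
function buildPatternTable(word: string): number[] {
  const patternTable: number[] = [0];
  let prefixIndex = 0;
  let suffixIndex = 1;

  while (suffixIndex < word.length) {
    if (word[prefixIndex] === word[suffixIndex]) {
      // Characters match: extend the current prefix/suffix run.
      patternTable[suffixIndex] = prefixIndex + 1;
      suffixIndex += 1;
      prefixIndex += 1;
    } else if (prefixIndex === 0) {
      // No shorter prefix to fall back to: lps value is 0.
      patternTable[suffixIndex] = 0;
      suffixIndex += 1;
    } else {
      // Fall back to the previous longest prefix-suffix length.
      prefixIndex = patternTable[prefixIndex - 1];
    }
  }

  return patternTable;
}

// Returns the index of the first occurrence of word in text, or -1 if absent.
export default function knuthMorrisPratt(text: string, word: string): number {
  if (word.length === 0) {
    return 0;
  }

  const patternTable = buildPatternTable(word);
  let textIndex = 0; // i in the walkthrough above
  let wordIndex = 0; // j in the walkthrough above

  while (textIndex < text.length) {
    if (text[textIndex] === word[wordIndex]) {
      // Match: report if the whole pattern matched, otherwise advance both pointers.
      if (wordIndex === word.length - 1) {
        return textIndex - word.length + 1;
      }
      textIndex += 1;
      wordIndex += 1;
    } else if (wordIndex > 0) {
      // Mismatch after some matches: reuse the lps table, keep textIndex in place.
      wordIndex = patternTable[wordIndex - 1];
    } else {
      // Mismatch at the very first pattern character: just move on in the text.
      textIndex += 1;
    }
  }

  return -1;
}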

The function buildPatternTable is what creates the lps array we discussed earlier. If you take a closer look at it, you can see that it builds the pi table and moves the index “j” forward or back depending on whether a match is found within the pattern string.

Test Cases to evaluate the KMP algorithm

import knuthMorrisPratt from '../app';

describe('knuthMorrisPratt', () => {
  it('should find word position in given text', () => {
    expect(knuthMorrisPratt('abcbcglx', 'abca')).toBe(-1);
    expect(knuthMorrisPratt('abcbcglx', 'bcgl')).toBe(3);
    expect(knuthMorrisPratt('abcxabcdabxabcdabcdabcy', 'abcdabcy')).toBe(15);
    expect(knuthMorrisPratt('abcxabcdabxabcdabcdabcy', 'abcdabca')).toBe(-1);
    expect(knuthMorrisPratt('abcxabcdabxaabcdabcabcdabcdabcy', 'abcdabca')).toBe(12);
    expect(knuthMorrisPratt('abcxabcdabxaabaabaaaabcdabcdabcy', 'aabaabaaa')).toBe(11);
  });
});

You should see the desired output when you execute the test cases.

The full source code is shared in the following git repository.

So I hope you're able to understand the implementation of the KMP algorithm. If you've liked the tutorial, please hit a clap on the Medium article; this will encourage me to come up with new content every weekend.

You can follow me on Twitter to get the latest updates on my tutorials.

