Home
Previous
So far our search was exciting, but not terribly efficient: each time we found the suffixes in the required interval, we had to scan this interval in the L column to find the desired ranked letters for the next interval. The example with cocoa$ database was tiny, but imagine that we need to scan intervals of millions characters at each step of the search algorithm.
To make search for pattern P of length |P| to be completed in exactly |P| steps we need to elaborate a little bit more on our index. Note that searching in |P| steps is optimal: in order to search for pattern of length |P| we have at least to read the entire pattern, which takes |P| steps anyway.
To achieve this optimal result, we pre-process our index as following.
For each row, instead of storing in column L characters with ranks, we are going to store several columns of ranks - one column for each letter of the alphabet. In each cell of this column we store the number of times that this particular character has been seen in column L before the position specified by the row number. This is nothing else but the highest rank of each character seen so far.
Row | F | L | rank $ | rank a | rank c | rank o | SA |
---|---|---|---|---|---|---|---|
0 | $1 | a1 | 0 | 1 | 0 | 0 | 5 |
1 | a1 | o1 | 0 | 1 | 0 | 1 | 4 |
2 | c1 | o2 | 0 | 1 | 0 | 2 | 2 |
3 | c2 | $1 | 1 | 1 | 0 | 2 | 0 |
4 | o1 | c1 | 1 | 1 | 1 | 2 | 3 |
5 | o2 | c2 | 1 | 1 | 2 | 2 | 1 |
6 | 1 | 1 | 2 | 2 | 1 |
Interval of all suffixes starting with: | begins at position (in suffix array): |
---|---|
$ | 0 |
a | 1 |
c | 2 |
o | 4 |
Now, let's perform the search for oco again. Because this pattern has length |P|=3, we want to perform the entire search procedure in three steps.
Our goal is to obtain a consecutive interval in the suffix array for all suffixes that start with oco.
To initialize, we consider the entire interval [0,5] as a starting interval. Start s=0, end e=N-1=5.
As before, we search backwards.
Step 1. Find the interval of all suffixes that start with o. From LF table we know that the first such suffix is at position 4, and rank(at pos e=5, of o) = 2. So there are total two suffixes starting with o. The new interval becomes: s=4, e=5. Result so far: [4,5].
Step 2. In the interval [4,5] what are the suffixes that have a letter c before o? to answer this, we look at rank (at pos s=4, of c) and rank (at pos e=5, of c):
rank(4,c) = 1, rank (5, c) = 2.
That means that the preceding cs are between the first and the second c in the suffix array.
The start and end of a new interval is computed by:
s = LF (c) + rank(4,c) - 1 = 2 + 1 -1=2
e = LF (c) + rank (5,c) - 1= 2 + 2 - 1=3
And the narrowed interval after step 2 becomes: [2, 3].
Step 3. In the interval [2,3] what are the suffixes that are preceded by o? The rank of o at position 2 is 2, and the rank of o at position 3 is 2, so the new values of start and end become:
s = LF (o) + rank(2,o) -1 = 4 +2 -1 =5
e = LF (o) + rank (3,o) - 1 = 4 + 2 -1=5
Our final interval is [5,5].
This is an amazing result: not only we can compress our input string, but we can search for a pattern in optimal time, without decompression.