Home
Previous
Next
In 2005 Paolo Ferragina and Giovanni Manzini proposed to use the BWT with LF-table as an index for pattern search, and called the new structure the FM-index.
That is how it works.
Row | F | L | SA |
---|---|---|---|
0 | $1 | a1 | 5 |
1 | a1 | o1 | 4 |
2 | c1 | o2 | 2 |
3 | c2 | $1 | 0 |
4 | o1 | c1 | 3 |
5 | o2 | c2 | 1 |
Interval of all suffixes starting with: | begins at position (in suffix array): |
---|---|
$ | 0 |
a | 1 |
c | 2 |
o | 4 |
Let's search for pattern oco.
As in case of reconstructing the original string, the search proceeds backwards.
First, we find an interval of all suffixes which start with o, to locate query xxo. That is easy: we know it from the LF-table: it tells us that these are suffixes in rows [4-5] of the suffix array.
Among these suffixes, we want to know which ones are preceded with c. We scan rows 4-5 of the L-column, and see that these both suffixes are preceded with c - with c1 to c2.
So where is the interval in the SA that corresponds to [c1 - c2]? It is interval [2-3] according to the LF-table. So we successfully located all suffixes that start with co, resolving by this two last characters of the query: xco
Among all suffixes that begin with co, which ones are preceded with o? There is only one such suffix, and the preceding character is o2. Where is o2 located? It is located at row 5 of the suffix array, because, according to the LF-table, all o-suffixes start in row 4, and o2 suffix is the second one, that is in row 5. We successfully located the entire query pattern oco: all the suffixes that start with oco are in the row 5 of the suffix array.
By that we can already answer an existence query: yes, our dataset contains substring oco.
In order to find at which position it is, we need to look at the value of the suffix array in row 5. This value points to the position 1 in the original string, and thus pattern oco occurs at position 1 in the cocoa$ dataset.
The amazing thing is that we found the correct position using the compressed BWT string without reconstructing the original string. All we needed to know was the rank of each character in the BWT column and the LF-mapping.
To make it stick, try now to locate, following the same logic, pattern coc.
And let's finally see what happens if we search for another pattern, such as aoa.
All suffixes that start with a are in row 1. Among these (there is only one in this tiny dataset), there is one suffix that is preceded by o, and it refers to o1. Going to row 4 for o1, we are now looking for all suffixes preceded by a. And there are no such suffixes! That means that pattern aoa is not present in our dataset. We are done, can go drink coffee.
It remains to resolve only a small technical detail: when we have located an interval in the middle of the search, we need to know the smallest and the biggest rank of the next (preceding) query character. If this interval is very large (and it is mostly the case), how do we find the required ranks quickly? We will resolve this in the next post.
No comments:
Post a Comment