Monday, September 7, 2015

FM-index: 5. Optimal search

Home

Previous


So far our search was exciting, but not terribly efficient: each time we found the suffixes in the required interval, we had to scan this interval in the L column to find the desired ranked letters for the next interval. The example with cocoa$ database was tiny, but imagine that we need to scan intervals of millions characters at each step of the search algorithm.

To make search for pattern P of length |P| to be completed in exactly |P| steps we need to elaborate a little bit more on our index. Note that searching in |P| steps is optimal: in order to search for pattern of length |P| we have at least to read the entire pattern, which takes |P| steps anyway.

To achieve this optimal result, we pre-process our index as following.

For each row, instead of storing in column L characters with ranks, we are going to store several columns of ranks - one column for each letter of the alphabet. In each cell of this column we store the number of times that this particular character has been seen in column L before the position specified by the row number. This is nothing else but the highest rank of each character seen so far.

Extended index
Row F L rank $ rank a rank c rank o SA
0 $1 a1 0 1 0 0 5
1 a1 o1 0 1 0 1 4
2 c1 o2 0 1 0 2 2
3 c2 $1 1 1 0 2 0
4 o1 c1 1 1 1 2 3
5 o2 c2 1 1 2 2 1
6

1 1 2 2 1

LF-mapping table
Interval of all suffixes starting with: begins at position (in suffix array):
$ 0
a 1
c 2
o 4

Now, let's perform the search for oco again. Because this pattern has length |P|=3, we want to perform the entire search procedure in three steps.

Our goal is to obtain a consecutive interval in the suffix array for all suffixes that start with oco.

To initialize, we consider the entire interval [0,5] as a starting interval. Start s=0, end e=N-1=5.
As before, we search backwards.


Step 1. Find the interval of all suffixes that start with o. From LF table we know that the first such suffix is at position 4, and rank(at pos e=5, of o) = 2. So there are total two suffixes starting with o. The new interval becomes: s=4, e=5.  Result so far: [4,5].

Step 2. In the interval [4,5] what are the suffixes that have a letter c before o? to answer this, we look at rank (at pos s=4, of c) and rank (at pos e=5, of c):

rank(4,c) = 1, rank (5, c) = 2.

That means that the preceding cs are between the first and the second c in the suffix array. 

The start and end of a new interval is computed by:
s = LF (c) + rank(4,c) - 1 = 2 + 1 -1=2
e = LF (c) + rank (5,c) - 1= 2 + 2 - 1=3

And the narrowed interval after step 2 becomes: [2, 3].

Step 3. In the interval [2,3] what are the suffixes that are preceded by o? The rank of o at position 2 is 2, and the rank of o at position 3 is 2, so the new values of start and end become:

s = LF (o) + rank(2,o) -1 = 4 +2 -1 =5
e = LF (o) + rank (3,o) - 1 = 4 + 2 -1=5

Our final interval is [5,5].

This is an amazing result: not only we can compress our input string, but we can search for a pattern in optimal time, without decompression.

Home

Previous

FM-index. 4. Index!

Home

Previous

Next


In 2005 Paolo Ferragina and Giovanni Manzini proposed to use the BWT with LF-table as an index for pattern search, and called the new structure the FM-index.

That is how it works.

BWT (only L-column)
Row F L SA
0 $1 a1 5
1 a1 o1 4
2 c1 o2 2
3 c2 $1 0
4 o1 c1 3
5 o2 c2 1
LF-mapping table
Interval of all suffixes starting with: begins at position (in suffix array):
$ 0
a 1
c 2
o 4

Let's search for pattern oco.

As in case of reconstructing the original string, the search proceeds backwards.

First, we find an interval of all suffixes which start with o, to locate query xxo. That is easy: we know it from the LF-table: it tells us that these are suffixes in rows [4-5] of the suffix array.

Among these suffixes, we want to know which ones are preceded with c. We scan rows 4-5 of the L-column, and see that these both suffixes are preceded with c - with c1 to c2.

So where is the interval in the SA that corresponds to [c1 - c2]? It is interval [2-3] according to the LF-table. So we successfully located all suffixes that start with co, resolving by this two last characters of the query: xco

Among all suffixes that begin with co, which ones are preceded with o? There is only one such suffix, and the preceding character is o2. Where is o2 located? It is located at row 5 of the suffix array, because, according to the LF-table, all o-suffixes start in row 4, and o2 suffix is the second one, that is in row 5. We successfully located the entire query pattern oco: all the suffixes that start with oco are in the row 5 of the suffix array.

By that we can already answer an existence query: yes, our dataset contains substring oco.

In order to find at which position it is, we need to look at the value of the suffix array in row 5. This value points to the position 1 in the original string, and thus pattern oco occurs at position 1 in the cocoa$ dataset.

The amazing thing is that we found the correct position using the compressed BWT string without reconstructing the original string. All we needed to know was the rank of each character in the BWT column and the LF-mapping.

To make it stick, try now to locate, following the same logic, pattern coc.

And let's finally see what happens if we search for another pattern, such as aoa.
All suffixes that start with a are in row 1. Among these (there is only one in this tiny dataset), there is one suffix that is preceded by o, and it refers to o1. Going to row 4 for o1, we are now looking for all suffixes preceded by a. And there are no such suffixes! That means that pattern aoa is not present in our dataset. We are done, can go drink coffee.

It remains to resolve only a small technical detail: when we have located an interval in the middle of the search, we need to know the smallest and the biggest rank of the next (preceding) query character. If this interval is very large (and it is mostly the case), how do we find the required ranks quickly? We will resolve this in the next post.

Home

Previous

Next


Tuesday, December 30, 2014

FM-index. 3. Ranking chararcters

Home

Previous

Next


As we declared in the previous chapter, the sequence of characters in the last row L of the Burrows-Wheeler Matrix represents a Burrows-Wheeler Transform (BWT) of the original cocoa$.

On the other hand, this is nothing else but the sequence of characters preceding sorted suffixes in the suffix array.

Let's give each character in the first column F a rank: c with rank 1, c with rank 2 etc., in the order in which they appear in the first column of BWM. Obviously, each character of the alphabet appears consecutively in the first column, because it represents the first letter of lexicographically sorted suffixes. As we mentioned before, there are no ties in the sorting of circular strings permuted from cocoa$, due to the unique sentinel added at the end, so the rank attached to each character of cocoa$ is unique.


Burrows-Wheeler Matrix
Row F 1 2 3 4 L
0 $1 c o c o a1
1 a1 $ c o c o1
2 c1 o a $ c o2
3 c2 o c o a $1
4 o1 a $ c o c1
5 o2 c o a $ c2
Now, if we rank characters in the order in which they appear in the last column L, we may notice that the ranked characters of L correspond exactly to the ranked characters of F: for example, c2 in L is the character before suffix ocoa$, and in fact c2 in row 3 of column F is the first character of suffix cocoa$.

The ranked characters in F and L columns always refer to the same characters of the original string. Why so? Because F corresponds to the first letter of the alphabetically sorted suffixes, and L corresponds to the characters preceding alphabetically sorted suffixes, so if we have for example two c-s (not necessarily sequential) in L they in fact correspond to sorted suffixes c1... and c2... and thus the ranks for each letter in F and L are in exactly the same order.

We stated that having only BWT (column L) we can reconstruct the original string. All we need for this is the BWT string itself with ranked characters and the LF-mapping table, which basically tells us, for each letter of the alphabet, the interval for all sorted suffixes which start with this letter (consecutive intervals in column F). That is why it is called LF-mapping - because it maps ranks in the Last column to the ranks in the First column.

LF-mapping table
All suffixes starting with: are at positions (in suffix array):
$ [0-0]
a [1-1]
c [2-3]
o [4-5]

We reconstruct the original string backwards. We know that the first in the SA of sorted suffixes is always the sentinel character $, so we trivially conclude that the last character of the original string is $ (as if we did not add it ourselves). So for now we know that our original string is xxxxx$.

Which letter is before $? We look at row 0 column L and discover that before $ it was a1. In what row of the suffix array is the first a? It is in row 1, according to the LF table. So our recovering original string becomes xxxxa$, and in search for the next (preceding) letter we jump to row 1 of column L.

Here, in row 1 of column L, we see that the previous letter was o1. The LF-table tells us that o1 (the first among all o-s) is the starting character of the suffix in row 4. We update our recovering string to xxxoa$, and jump to row 4.

From here we jump to c1 and our recovering string becomes xxcoa$. c1 corresponds to the suffix in row 2, according to the LF-table, so our next stop is in row 2 of column L.

Here we see the sign: go to o2. We determine that o2 is in row 5 (we adding 1 to the start interval of all letters o in the LF-table). Our growing string becomes xocoa$, and we go to row 5.

In row 5 we see the hint: go to c2, which is in row 3 according to the LF-table. Our string becomes the perfect cocoa$, and the sentinel in row 3 of column L signifies that the tour is over.

What does it mean? It means that having only BWT string and small LF-table, we do not need to store the original string anymore, as it can always be easily reconstructed whenever needed.

In a similar spirit we can use BWT for pattern search. That makes it an index. The details are in the next post.


Home

Previous

Next







Friday, November 21, 2014

FM-index. 1. Suffix array

Home

Next


Lately, more and more people use the FM-index for practical string searches, especially in bioinformatics.

The good explanation of FM-index could be found in this blog.

However, when you follow an example in this blog, you can't help wondering - why does it work this way? Some black magic.

Here we are going to explain the magic trick of the FM-index in (hopefully) simple words and with some code examples.

Today we start with the Burrows-Wheeler Transform (BWT).

Before explaining BWT, however, we should remind ourselves that in order to perform efficient search for anything, the collection of keys to be searched should be sorted. Here we are talking about texts, so the sorting is alphabetical (lexicographical).

Suppose you want to know if there is a cob in the cocoa powder, that is you want to answer if cocoa data set contains keyword cob.  Because our dataset contains only a sequence of 5 characters, it is easy to see that cob is not a part of cocoa, so our cocoa is cob-free.

Now imagine that our dataset consists of a much longer but still single non-divided sequence of characters (called string): cocoacoacoacoacoacoa... up to 3,000,000,000 characters, and you want to know whether the cob substring is somewhere in this dataset. That reflects the typical situation with DNA sequence databases, where each sequence can be prohibitively long. If scanning and checking the entire dataset is out of question, you need to come up with some sort of a cocoa index.

So let's treat cocoa as a very long string, and see how we can index this string so we can then perform search for pattern P in the most efficient way.

First, let's look at the following data structure: suffix array.

The suffix of a given string is a substring starting at some position and running till the end of the string. For example, -coa is a suffix of cocoa starting at position 2 (our position count starts at 0).

Take all suffixes of the input string, and sort them alphabetically (lexicographically). Then record a start position of each sorted suffix - and you obtain a suffix array SA.

Indexing cocoa database with suffix arrays.

First, let's add a terminal character $ with a beautiful name sentinel at the end of our cocoa: cocoa$.  By convention, $ is lexicographically the smallest symbol in the alphabet, and thus the collection of sorted suffixes becomes:

String data
0 1 2 3 4 5
c o c o a $

Suffix array (SA) index
Row Sorted suffixes SA (start positions)
0 $ 5
1 a$ 4
2 coa$ 2
3 cocoa$ 0
4 oa$ 3
5 ocoa$ 1

If the entire size of the database is N characters (N=6 for cocoa$), we can now search for any query in time O(log N), using a binary search through these sorted suffixes. And, basically, the space for this index is linear in N, because we store only start positions - one per character (Only the last column of Table 1 is called a suffix array). However, we need to ensure that we can easily access the original string at any random position, so we can compare our query pattern to the actual string.

The Burrows Wheeler Transform also indexes all substrings of a string database, but in a different rather ingenious way.


Home

Next


FM-index. 2. Burrows-Wheeler Matrix

Home

Previous

Next



We are back to our cocoa$ dataset.

First let's make our string circular: cocoa$cocoa$... That means that after we reach sentinel $, we start from the beginning of the never-ending string. Something like serpent or dragon eating its own tail: see Ouroboros.

From this circular cocoa$, let's create N different permuted strings starting at each of N possible positions: cocoa$, ocoa$c, coa$co, oa$coc, a$coco, $cocoa. Next, let's sort these strings lexicographically:

Original data
0 1 2 3 4 5
c o c o a $

Circular strings sorted
Row Sorted circular strings Sorted suffixes SA (start positions)
0 $cocoa $ 5
1 a$coco a$ 4
2 coa$co coa$ 2
3 cocoa$ cocoa$ 0
4 oa$coc oa$ 3
5 ocoa$c ocoa$ 1
Note that the order of these sorted circular strings, with regard to their start positions, is exactly the same as in the suffix array. Why so? Because after we added the unique sentinel symbol at the end of cocoa, there cannot be two suffixes with the same lexicographical rank (our sorted list has no ties). That means that the order of circular strings is uniquely determined by the part which comes before and includes sentinel $. But this is exactly the order of the suffixes - suffixes run up-to and including the sentinel!

Now the fun part begins. We concentrate only on the table of sorted circular strings, and in particular on its first and last columns: F and  L:

Burrows-Wheeler Matrix
Row F 1 2 3 4 L
0 $ c o c o a
1 a $ c o c o
2 c o a $ c o
3 c o c o a $
4 o a $ c o c
5 o c o a $ c
In fact, in the last column of the Burrows-Wheeler Matrix (BWM) we see nothing else but the characters preceding each sorted suffix in the suffix array, except that we made our cocoa$ circular and thus before character at position zero there is the sentinel (going around).

The last column L of the BWM is the famous Burrows-Wheeler Transform (BWT). If we only store this last column, it occupies space not larger than the original string. In fact, it often occupies much less space because the characters in BWT tend to be grouped into runs of equal characters, and as such the BWT string can be "compressed". But compression is not the topic of this tutorial.

What is amazing about the Burrows-Wheeler Transform of cocoa$ is that if we store only the permuted string of BWT (column L), we do not actually need our original string anymore, because we can always reconstruct the original string from the BWT string.

How? Coming soon.

Home

Previous

Next