Tuesday, December 30, 2014

FM-index. 3. Ranking chararcters

Home

Previous

Next


As we declared in the previous chapter, the sequence of characters in the last row L of the Burrows-Wheeler Matrix represents a Burrows-Wheeler Transform (BWT) of the original cocoa$.

On the other hand, this is nothing else but the sequence of characters preceding sorted suffixes in the suffix array.

Let's give each character in the first column F a rank: c with rank 1, c with rank 2 etc., in the order in which they appear in the first column of BWM. Obviously, each character of the alphabet appears consecutively in the first column, because it represents the first letter of lexicographically sorted suffixes. As we mentioned before, there are no ties in the sorting of circular strings permuted from cocoa$, due to the unique sentinel added at the end, so the rank attached to each character of cocoa$ is unique.


Burrows-Wheeler Matrix
Row F 1 2 3 4 L
0 $1 c o c o a1
1 a1 $ c o c o1
2 c1 o a $ c o2
3 c2 o c o a $1
4 o1 a $ c o c1
5 o2 c o a $ c2
Now, if we rank characters in the order in which they appear in the last column L, we may notice that the ranked characters of L correspond exactly to the ranked characters of F: for example, c2 in L is the character before suffix ocoa$, and in fact c2 in row 3 of column F is the first character of suffix cocoa$.

The ranked characters in F and L columns always refer to the same characters of the original string. Why so? Because F corresponds to the first letter of the alphabetically sorted suffixes, and L corresponds to the characters preceding alphabetically sorted suffixes, so if we have for example two c-s (not necessarily sequential) in L they in fact correspond to sorted suffixes c1... and c2... and thus the ranks for each letter in F and L are in exactly the same order.

We stated that having only BWT (column L) we can reconstruct the original string. All we need for this is the BWT string itself with ranked characters and the LF-mapping table, which basically tells us, for each letter of the alphabet, the interval for all sorted suffixes which start with this letter (consecutive intervals in column F). That is why it is called LF-mapping - because it maps ranks in the Last column to the ranks in the First column.

LF-mapping table
All suffixes starting with: are at positions (in suffix array):
$ [0-0]
a [1-1]
c [2-3]
o [4-5]

We reconstruct the original string backwards. We know that the first in the SA of sorted suffixes is always the sentinel character $, so we trivially conclude that the last character of the original string is $ (as if we did not add it ourselves). So for now we know that our original string is xxxxx$.

Which letter is before $? We look at row 0 column L and discover that before $ it was a1. In what row of the suffix array is the first a? It is in row 1, according to the LF table. So our recovering original string becomes xxxxa$, and in search for the next (preceding) letter we jump to row 1 of column L.

Here, in row 1 of column L, we see that the previous letter was o1. The LF-table tells us that o1 (the first among all o-s) is the starting character of the suffix in row 4. We update our recovering string to xxxoa$, and jump to row 4.

From here we jump to c1 and our recovering string becomes xxcoa$. c1 corresponds to the suffix in row 2, according to the LF-table, so our next stop is in row 2 of column L.

Here we see the sign: go to o2. We determine that o2 is in row 5 (we adding 1 to the start interval of all letters o in the LF-table). Our growing string becomes xocoa$, and we go to row 5.

In row 5 we see the hint: go to c2, which is in row 3 according to the LF-table. Our string becomes the perfect cocoa$, and the sentinel in row 3 of column L signifies that the tour is over.

What does it mean? It means that having only BWT string and small LF-table, we do not need to store the original string anymore, as it can always be easily reconstructed whenever needed.

In a similar spirit we can use BWT for pattern search. That makes it an index. The details are in the next post.


Home

Previous

Next







No comments:

Post a Comment