Friday, November 21, 2014

FM-index. 2. Burrows-Wheeler Matrix

Home

Previous

Next



We are back to our cocoa$ dataset.

First let's make our string circular: cocoa$cocoa$... That means that after we reach sentinel $, we start from the beginning of the never-ending string. Something like serpent or dragon eating its own tail: see Ouroboros.

From this circular cocoa$, let's create N different permuted strings starting at each of N possible positions: cocoa$, ocoa$c, coa$co, oa$coc, a$coco, $cocoa. Next, let's sort these strings lexicographically:

Original data
0 1 2 3 4 5
c o c o a $

Circular strings sorted
Row Sorted circular strings Sorted suffixes SA (start positions)
0 $cocoa $ 5
1 a$coco a$ 4
2 coa$co coa$ 2
3 cocoa$ cocoa$ 0
4 oa$coc oa$ 3
5 ocoa$c ocoa$ 1
Note that the order of these sorted circular strings, with regard to their start positions, is exactly the same as in the suffix array. Why so? Because after we added the unique sentinel symbol at the end of cocoa, there cannot be two suffixes with the same lexicographical rank (our sorted list has no ties). That means that the order of circular strings is uniquely determined by the part which comes before and includes sentinel $. But this is exactly the order of the suffixes - suffixes run up-to and including the sentinel!

Now the fun part begins. We concentrate only on the table of sorted circular strings, and in particular on its first and last columns: F and  L:

Burrows-Wheeler Matrix
Row F 1 2 3 4 L
0 $ c o c o a
1 a $ c o c o
2 c o a $ c o
3 c o c o a $
4 o a $ c o c
5 o c o a $ c
In fact, in the last column of the Burrows-Wheeler Matrix (BWM) we see nothing else but the characters preceding each sorted suffix in the suffix array, except that we made our cocoa$ circular and thus before character at position zero there is the sentinel (going around).

The last column L of the BWM is the famous Burrows-Wheeler Transform (BWT). If we only store this last column, it occupies space not larger than the original string. In fact, it often occupies much less space because the characters in BWT tend to be grouped into runs of equal characters, and as such the BWT string can be "compressed". But compression is not the topic of this tutorial.

What is amazing about the Burrows-Wheeler Transform of cocoa$ is that if we store only the permuted string of BWT (column L), we do not actually need our original string anymore, because we can always reconstruct the original string from the BWT string.

How? Coming soon.

Home

Previous

Next

No comments:

Post a Comment