Home
Previous
Next
We are back to our cocoa$ dataset.
First let's make our string circular: cocoa$cocoa$... That means that after we reach sentinel $, we start from the beginning of the never-ending string. Something like serpent or dragon eating its own tail: see Ouroboros.
From this circular cocoa$, let's create N different permuted strings starting at each of N possible positions: cocoa$, ocoa$c, coa$co, oa$coc, a$coco, $cocoa. Next, let's sort these strings lexicographically:
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
c | o | c | o | a | $ |
Row | Sorted circular strings | Sorted suffixes | SA (start positions) |
---|---|---|---|
0 | $cocoa | $ | 5 |
1 | a$coco | a$ | 4 |
2 | coa$co | coa$ | 2 |
3 | cocoa$ | cocoa$ | 0 |
4 | oa$coc | oa$ | 3 |
5 | ocoa$c | ocoa$ | 1 |
Now the fun part begins. We concentrate only on the table of sorted circular strings, and in particular on its first and last columns: F and L:
Row | F | 1 | 2 | 3 | 4 | L |
---|---|---|---|---|---|---|
0 | $ | c | o | c | o | a |
1 | a | $ | c | o | c | o |
2 | c | o | a | $ | c | o |
3 | c | o | c | o | a | $ |
4 | o | a | $ | c | o | c |
5 | o | c | o | a | $ | c |
The last column L of the BWM is the famous Burrows-Wheeler Transform (BWT). If we only store this last column, it occupies space not larger than the original string. In fact, it often occupies much less space because the characters in BWT tend to be grouped into runs of equal characters, and as such the BWT string can be "compressed". But compression is not the topic of this tutorial.
What is amazing about the Burrows-Wheeler Transform of cocoa$ is that if we store only the permuted string of BWT (column L), we do not actually need our original string anymore, because we can always reconstruct the original string from the BWT string.
How? Coming soon.
No comments:
Post a Comment