Home
Next
Lately, more and more people use the FM-index for practical string searches, especially in bioinformatics.
The good explanation of FM-index could be found in this blog.
However, when you follow an example in this blog, you can't help wondering - why does it work this way? Some black magic.
Here we are going to explain the magic trick of the FM-index in (hopefully) simple words and with some code examples.
Today we start with the Burrows-Wheeler Transform (BWT).
Before explaining BWT, however, we should remind ourselves that in order to perform efficient search for anything, the collection of keys to be searched should be sorted. Here we are talking about texts, so the sorting is alphabetical (lexicographical).
Suppose you want to know if there is a cob in the cocoa powder, that is you want to answer if cocoa data set contains keyword cob. Because our dataset contains only a sequence of 5 characters, it is easy to see that cob is not a part of cocoa, so our cocoa is cob-free.
Now imagine that our dataset consists of a much longer but still single non-divided sequence of characters (called string): cocoacoacoacoacoacoa... up to 3,000,000,000 characters, and you want to know whether the cob substring is somewhere in this dataset. That reflects the typical situation with DNA sequence databases, where each sequence can be prohibitively long. If scanning and checking the entire dataset is out of question, you need to come up with some sort of a cocoa index.
So let's treat cocoa as a very long string, and see how we can index this string so we can then perform search for pattern P in the most efficient way.
First, let's look at the following data structure: suffix array.
The suffix of a given string is a substring starting at some position and running till the end of the string. For example, -coa is a suffix of cocoa starting at position 2 (our position count starts at 0).
Take all suffixes of the input string, and sort them alphabetically (lexicographically). Then record a start position of each sorted suffix - and you obtain a suffix array SA.
Indexing cocoa database with suffix arrays.
First, let's add a terminal character $ with a beautiful name sentinel at the end of our cocoa: cocoa$. By convention, $ is lexicographically the smallest symbol in the alphabet, and thus the collection of sorted suffixes becomes:0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
c | o | c | o | a | $ |
Row | Sorted suffixes | SA (start positions) |
---|---|---|
0 | $ | 5 |
1 | a$ | 4 |
2 | coa$ | 2 |
3 | cocoa$ | 0 |
4 | oa$ | 3 |
5 | ocoa$ | 1 |
If the entire size of the database is N characters (N=6 for cocoa$), we can now search for any query in time O(log N), using a binary search through these sorted suffixes. And, basically, the space for this index is linear in N, because we store only start positions - one per character (Only the last column of Table 1 is called a suffix array). However, we need to ensure that we can easily access the original string at any random position, so we can compare our query pattern to the actual string.
The Burrows Wheeler Transform also indexes all substrings of a string database, but in a different rather ingenious way.
No comments:
Post a Comment