Wednesday, April 29, 2015

Burrows-Wheeler Transform

The Burrows–Wheeler transform is a reversible method of rearranging text used to improve the performance of compression algorithms, such as bzip2.

We will implement transform, bwt, and inverse transform, ibwt, in both Python and Factor. First with a slow and simple algorithm, and then second with a faster version.

Version 1

We start with the pseudocode suggested in the Wikipedia article:

function BWT (string s)
   append an 'EOF' character to s
   create a table, rows are all possible rotations of s
   sort rows alphabetically
   return (last column of the table)

In Python, this might look like:

def bwt(s):
    s = s + '\0'
    n = len(s)
    m = sorted(s[i:] + s[:i] for i in range(n))
    return ''.join(x[-1] for x in m)

In Factor, using all-rotations, it might look like this:

: bwt ( seq -- seq' )
    0 suffix all-rotations natural-sort [ last ] map ;

The pseudocode to perform the inverse transform:

function inverseBWT (string s)
   create empty table 
       
   repeat length(s) times
       // first insert creates first column
       insert s as a column of table before first column
       sort rows of the table alphabetically
   return (row that ends with the 'EOF' character)

In Python, this might look like:

def ibwt(s):
    n = len(s)
    m = [''] * n
    for _ in range(n):
        m = sorted(s[i] + m[i] for i in range(n))
    return [x for x in m if x.endswith('\0')][0][:-1]

In Factor, we could implement it like this:

: ibwt ( seq -- seq' )
    [ length [ "" <array> ] keep ] keep
    '[ _ [ prefix ] 2map natural-sort ] times
    [ last 0 = ] find nip but-last ;

Unfortunately, this is very slow, with most of the performance loss in the invert transform.

Version 2

Another way to increase the speed of BWT inverse is to use an algorithm that returns an index into the sorted rotations along with the transform.

In Python, it looks like this:

def bwt(s):
    n = len(s)
    m = sorted(s[i:] + s[:i] for i in range(n))
    return m.index(s), ''.join(x[-1] for x in m)

In Factor, it might look like this:

: bwt ( seq -- i seq' )
    dup all-rotations natural-sort
    [ index ] [ [ last ] map ] bi ;

In Python, the inverse transform looks like this:

def ibwt(k, s):
    def row(k):
        permutation = sorted((t, i) for i, t in enumerate(s))
        for _ in s:
            t, k = permutation[k]
            yield t
    return ''.join(row(k))

In Factor, that roughly translates to:

: ibwt ( i seq -- seq' )
    [ length ] [ <enum> sort-values ] bi
    '[ _ nth first2 ] replicate nip ;

An improved version 2 is available in the development version. In particular, it uses rotated virtual sequences for increased performance and returns transformations that match the type of the input sequence.

No comments: