Sunday, 28 December 2014

Boyer Moore Fast string searching algorithm

“We have seen a strstr implementation in previous articles. It takes O(n·m) time in the worst case and is not the best string-search algorithm for large texts. Say you need to search a 1 GB file: we need to optimize the string search in some way.”
See also
http://www.movsd.com/bm.htm
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
http://analgorithmaday.blogspot.com/2011/03/string-of-string-search.html
Introduction
   I didn’t find many examples or much easy-to-follow code on the internet apart from the links given above. Read through them first to get an idea of what lies behind this optimized algorithm.
  The main optimization lies in the character-jump heuristic. We skip unnecessary comparisons by starting the comparison between the source and the pattern from the end of the pattern instead of from its start.
  In our str_str implementation we started with the first character of the pattern. Here we start from the end. What do we gain by moving backwards? We can skip further comparisons toward the start of the pattern. But how do we decide that we can skip them?
Concept
   The skip we make when the last-character comparison mismatches follows a small size heuristic. Most text search/match algorithms build on this strict and very simple heuristic.
  Example,
Source:   I think am in love with you
Pattern:   love
Since we compare from the end of the pattern, “e” and “h” are compared first. They do not match; because we matched a pattern of size 4 against a source window of size 4 and found that the final “e” and “h” differ (and “h” does not occur in the pattern at all), we can jump the full pattern length!!.. This is what this article calls the good-suffix case, and it uses the simple size heuristic.
What is the character-jump heuristic, then? Sometimes some characters match and others do not.
Example,
Source:   aabaaabbbbbbaaaaabbabaaaaaaaaaa
Pattern:   bab
If we start comparing the two strings above, “ab” matches in many places, but the third comparison fails each time. This is the bad-suffix (partial-shift) problem.
In this case we must take the position of the mismatch into account: in the first run it is an ‘a’ in the source against the ‘b’ in the pattern, and we shift accordingly. For this we build a table, called the character-heuristic table.
This table is an ASCII table of size 128, one entry per ASCII character (‘a’ = 97, ‘z’ = 122, ‘A’ = 65, ‘Z’ = 90). For this pattern, counting from the end, the values come out as:
  b, a, b = 2, 1, 0
We create an integer array of size 128 and use the character itself as the index, since this is C++; other languages can use hash tables etc :)
So table[‘b’] = 2 and table[‘a’] = 1. If a mismatch occurs against an ‘a’ in the source, jump 1; if against a ‘b’, jump 2. In this example the jumps are small, but whenever a ‘b’ is compared against an ‘a’ we shift further. :)
This example is close to a worst case for our algorithm :)
The full Boyer-Moore algorithm runs in linear O(n) time in the worst case, and in most cases examines fewer than n characters; the simplified variant coded here can be slower on pathological inputs.
Code
 
#include <cstdlib>
#include <cstring>
#include <iostream>

int boyer_moore(const char* sstr, const char* pattern)
{
    const char* inits = sstr;
    const char* initp = pattern;

    int spatt = strlen(pattern);
    int slen = strlen(sstr);
    if (spatt == 0)
        return 0;       // empty pattern matches at the start
    pattern += spatt;   // move to the terminating '\0'

    // this algorithm is tested for printable ASCII characters
    // (codes 65-90 and 97-122)
    int* jump_table = (int*) calloc(128, sizeof(int));
    int count = 0;

    // record each character's distance from the end of the pattern;
    // later assignments (leftward characters) overwrite earlier ones
    while (pattern != initp) {
        pattern--;
        jump_table[(unsigned char)*pattern] = count;
        count++;
    }

    const char* stmp = 0;
    const char* iter = 0;
    int shift = 0;
    int bad_count = 0;
    int found = -1;

    // stop when fewer than spatt characters remain in the source
    while (sstr - inits + spatt <= slen)
    {
        bad_count = spatt;
        stmp = sstr + (spatt - 1);
        iter = pattern + (spatt - 1);
        // compare from the end of the pattern backwards
        while (*stmp == *iter) {
            bad_count--;
            stmp--;
            iter--;
            if (bad_count == 0) {
                found = sstr - inits;   // full match
                break;
            }
        }
        if (found >= 0)
            break;

        // jump table: shift by the remaining size when the mismatched
        // character does not occur in the pattern
        if (jump_table[(unsigned char)*stmp] == 0)
            shift = bad_count;
        else
            shift = jump_table[(unsigned char)*stmp];
        sstr += shift;
    }
    free(jump_table);
    return found;   // -1 when not found
}

int main()
{
    const char* source = "aabaaabbbbbbaaaaabbabaaaaaaaaaa";
    const char* pattern = "bab";

    std::cout << boyer_moore(source, pattern) << std::endl;  // prints 18
    return 0;
}
Important points
  • This is a plain copy of the concepts extracted from http://www.movsd.com/bm.htm.
  • The terms good suffix and bad suffix are used informally here:
    • good suffix = a full shift, of size equal to the pattern (the mismatched source character does not occur in the pattern at all)
    • bad suffix = a partial shift (the mismatched character occurs somewhere in the pattern)
  • These cases are really simple, and we do not use the separate good-suffix table of the classic algorithm: the full shift is computed from bad_count, and the partial shift is read from jump_table[mismatch].
  • Only when the mismatched character is not in the pattern do we fall back to the bad_count shift, which incurs more comparisons.
  • “Not found” is the case where the search runs off the end of the source.
  • The main difficulty in writing the code is building the table; the reverse comparison and the main loop are almost the same as in str_str.
  • The above code has a bug with repeated characters in patterns.
  • Try the above code for the below input, it FAILS
    • Source: aabaaabbbbbbaaaaabbabaaaaaaaaaa
    • Pattern: baaaa
The below code is the bug fix, adding an additional heuristic for repeated characters in the pattern. It is based on the concept given under “The additional heuristic” in http://www.movsd.com/bm.htm.
We maintain a count of the number of characters already matched (bcount) and subtract it from the shift found in the pattern jump table. In the normal case the result never drops below 1; in our bug case it drops to 0 or below, so we shift by just 1 instead of over-shooting with the raw jump_table value. :)
For the pattern above, the jump table comes out as table[‘a’] = 3 and table[‘b’] = 4; since each assignment overwrites the previous one, only the leftmost occurrence of a repeated character is kept.
It is better to include this additional heuristic as well, so the search works for all inputs.
#include <cstdlib>
#include <cstring>
#include <iostream>

int boyer_moore(const char* sstr, const char* pattern)
{
    const char* inits = sstr;
    const char* initp = pattern;

    int spatt = strlen(pattern);
    int slen = strlen(sstr);
    if (spatt == 0)
        return 0;       // empty pattern matches at the start
    pattern += spatt;   // move to the terminating '\0'

    // this algorithm is tested for printable ASCII characters
    // (codes 65-90 and 97-122)
    int* jump_table = (int*) calloc(128, sizeof(int));
    int count = 0;

    // record each character's distance from the end of the pattern;
    // later assignments (leftward characters) overwrite earlier ones
    while (pattern != initp) {
        pattern--;
        jump_table[(unsigned char)*pattern] = count;
        count++;
    }

    const char* stmp = 0;
    const char* iter = 0;
    int shift = 0;
    int bad_count = 0;
    int bcount = 0;     // characters matched so far at this alignment
    int found = -1;

    // stop when fewer than spatt characters remain in the source
    while (sstr - inits + spatt <= slen)
    {
        bcount = 0;
        bad_count = spatt;
        stmp = sstr + (spatt - 1);
        iter = pattern + (spatt - 1);
        // compare from the end of the pattern backwards
        while (*stmp == *iter) {
            bad_count--;
            bcount++;
            stmp--;
            iter--;
            if (bcount == spatt) {
                found = sstr - inits;   // full match
                break;
            }
        }
        if (found >= 0)
            break;

        // jump table with the additional heuristic
        if (jump_table[(unsigned char)*stmp] == 0) {
            // the mismatched character is not found in the pattern
            shift = bad_count;
        } else {
            // subtract the characters already matched,
            // but always advance by at least one
            shift = jump_table[(unsigned char)*stmp] - bcount;
            if (shift < 1)
                shift = 1;
        }
        sstr += shift;
    }
    free(jump_table);
    return found;   // -1 when not found
}

int main()
{
    const char* source = "aabaaabbbbbbaaaaabbabaaaaaaaaaa";
    const char* pattern = "baaaa";

    std::cout << boyer_moore(source, pattern) << std::endl;  // prints 11
    return 0;
}

The Boyer-Moore Fast String Searching Algorithm

This algorithm, which Bob Boyer and I invented in about 1975, is the basis of the fastest known ways to find one string of characters in another. How might you look for the following pattern in the text below?
pattern: EXAMPLE
   text: HERE IS A SIMPLE EXAMPLE
We can show you the Knuth-Morris-Pratt algorithm and we can show you the Boyer-Moore algorithm.
Our algorithm has the peculiar property that, roughly speaking, the longer the pattern is, the faster the algorithm goes. Furthermore, the algorithm is "sublinear" in the sense that it generally looks at fewer characters than it passes. The algorithm is described in

A Fast String Searching Algorithm, with R.S. Boyer. Communications of the Association for Computing Machinery, 20(10), 1977, pp. 762-772.
The classic Boyer-Moore algorithm suffers from the phenomenon that it tends not to work so efficiently on small alphabets like DNA. The skip distance tends to stop growing with the pattern length because substrings re-occur frequently. By remembering more of what has already been matched, one can get larger skips through the text. One can even arrange "perfect memory" and thus look at each character at most once, whereas the Boyer-Moore algorithm, while linear, may inspect a character from the text multiple times. This idea of remembering more has been explored in the literature by others. It suffers from the need for very large tables or state machines.
However, in 2007, Matyas Sustik and I thought of an apparently new variation that gives very good performance on small alphabets and has a small table. The algorithm is described in UTCS Tech Report TR-07-62, "String Searching over Small Alphabets."

   text: HERE IS A SIMPLE EXAMPLE
pattern: EXAMPLE

By fetching the S lying over the last character of the pattern, we gain more information about matches in this area of the text.
  • We are not standing on a match (because S isn't E).
  • We wouldn't find a match even if we slid the pattern down by 1 (because S isn't L).
  • We wouldn't find a match even if we slid the pattern down by 2 (because S isn't P).
  • Etc.
In general, since S doesn't occur in the pattern at all we can slide the pattern down by its own length without missing a match. This will align the S we just found with an imaginary one just beyond the front of the pattern. This shift can be precalculated for every element of the alphabet and stored in a table.

Wednesday, 24 December 2014

Deterministic Memory-Efficient String Matching Algorithms for Intrusion Detection

Abstract
Intrusion Detection Systems (IDSs) have become widely recognized as powerful tools for identifying, deterring and deflecting malicious attacks over the network. Essential to almost every intrusion detection system is the ability to search through packets and identify content that matches known attacks. Space- and time-efficient string matching algorithms are therefore important for identifying these packets at line rate.

In this paper we examine string matching algorithms and their use for Intrusion Detection. In particular, we focus our efforts on providing worst-case performance that is amenable to hardware implementation. We contribute modifications to the Aho-Corasick string-matching algorithm that drastically reduce the amount of memory required and improve its performance on hardware implementations. We also show that these modifications do not drastically affect software performance on commodity processors, and therefore may be worth considering in these cases as well.

Keywords: System Design, Network Algorithms
I. INTRODUCTION
With each passing day there is more critical data accessible in some form over the network. Any publicly accessible system on the Internet today will be rapidly subjected to break-in attempts. These attacks can range from email viruses, to corporate espionage, to general destruction of data, to attacks that hijack servers from which to spread additional attacks. Even when a system cannot be directly broken into, denial of service attacks can be just as harmful to individuals, and can cause nearly equal damage to the reputations of companies that provide services over the Internet. Because of the increasing stakes held by the various users of the Internet, there has been widespread interest in combating these attacks at every level, from end hosts and network taps to edge and core routers.

Intrusion Detection Systems (or IDSs) are emerging as one of the most promising ways of providing protection to systems on the network. The IDS market has been estimated at $100 million by the Aberdeen Group, with expectations that it will double in 2004 and keep growing in future years. By automatically monitoring network traffic in real time, intrusion detection systems can alert administrators of suspicious activities, keep logs to aid in forensics, and assist in the detection of new worms and denial of service attacks.

As with firewalls, intrusion detection systems are growing in popularity because they provide a site resilience to attacks without modifying end-node software. While firewalls only limit entry to a network based on packet headers, intrusion detection systems go beyond this by identifying possible attacks that use valid packet headers that pass through firewalls. Intrusion detection systems gain this capability by searching both packet headers and payloads to identify attack signatures.

To define suspicious activities, an IDS makes use of a set of rules which are applied to matching packets. A rule consists at minimum of a type of packet to search, a string of content to match, a location where that string is to be searched for, and an associated action to take if all the conditions of the rule are met. An example rule might match packets that look like a known buffer overflow exploit in a web server; the corresponding action might be to log the packet information and alert the administrator.

Because of the utility of IDSs they are beginning to be deployed in a wide range of operating environments. End-hosts use them to monitor and prevent attacks from incoming traffic. They can be found in network-tap devices that are inserted into key points of the network for diagnostic purposes. They will soon even find their way into edge and core routers to protect the network infrastructure from distributed attacks.

The challenge is that increasing line-rates and an explosion in the number of attacks mounted as well as plummet-

The Boyer-Moore Approach to Pattern Matching

We will assume that the input x (y) is stored in the array text[1:n]
(pattern[1:m]).
The obvious way to locate all occurrences of y in x is by repeated aligning and
checking from left to right. One innovative feature of the BM strategy is that, for
each alignment of the two strings, character comparisons are performed from right to
left, starting at the right end of the pattern. As is well known, this contributes a
significant speed-up in cases of mismatch (cf. [6]), even though it leads to a quadratic
worst-case behavior. A compact presentation of the BM algorithm is given in

Description of Boyer-Moore


The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are m - n + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible.
Previous to the introduction of this algorithm, the usual way to search within text was to examine each character of the text for the first character of the pattern. Once that was found the subsequent characters of the text would be compared to the characters of the pattern. If no match occurred then the text would again be checked character by character in an effort to find a match. Thus almost every character in the text needs to be examined. The key insight in this algorithm is that if the end of the pattern is compared to the text then jumps along the text can be made rather than checking every character of the text. The reason that this works is that in lining up the pattern against the text, the last character of the pattern is compared to the character in the text. If the characters do not match there is no need to continue searching backwards along the pattern. If the character in the text does not match any of the characters in the pattern, then the next character to check in the text is located n characters farther along the text, where n is the length of the pattern. If the character is in the pattern then a partial shift of the pattern along the text is done to line up along the matching character and the process is repeated. The movement along the text in jumps to make comparisons rather than checking every character in the text decreases the number of comparisons that have to be made, which is the key to the increase of the efficiency of the algorithm.
More formally, the algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index n in P and k in T, moving backward: the strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs upon which the alignment is shifted to the right according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.
The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.