From: friedman@cli.com (Noah Friedman)
To: kfogel@floss.life.uiuc.edu
Cc: jimb@gnu.ai.mit.edu
Subject: Re: boyer-moore string searching
Date: Wed, 8 Feb 95 23:01:49 CST

>        Hmm, Knuth doesn't seem to discuss it... is it recent?

Yeah, it postdates AOCP, though Knuth commented on an optimization to B-M
which the latter acknowledge in the last version of the paper.
Here's the reference:

    Robert S. Boyer, J Strother Moore
    "A Fast String Searching Algorithm"
    Communications of the ACM, V.20, No. 10, pp. 762-772; October 1977.

I only have a paper copy which is faded somewhat, but I could copy it for
you if you can't find it easily.

Basically, the idea is that you do your search starting from the end of
your pattern and work backwards.  If your pattern is patlen chars long,
start by looking at the patlen'th char in your search string.  If that char
isn't in the pattern anywhere, you can skip forward patlen chars without
examining the rest, because you know that no part of the string before that
can match.  (If it is, you have to compute how far forward you can
skip, but you build a table of "deltas" before you start).

For instance, consider searching for the string "string" in "the substring
to be found":


    string
    the substring to be found
         ^

`u' doesn't appear anywhere in "string", so skip forward 6 positions:

          string
    the substring to be found
               ^

This time the char `n' does appear in "string" at the patlen-1 position, so
you can only move forward one char.  But this time you've found the match,
and it only took 2+patlen comparisons instead of 9+patlen for the
traditional "search forward in both pattern and search-string" algorithm.
        
There are more details for further optimization that guarantee its
worst-case behavior is linear in patlen+i, but I'll let you read the paper
for yourself. :-)  Basically, it just amazes me how anything so obvious
could have been so unapparent before.