String searching using Aho-Corasick

At Sidelines, we frequently need to quickly identify which athletes a news article may be talking about. Most of the time, a number of machine learning classifiers make sure that the article gets correctly labeled, but sometimes we need to run the title or even the entire text of the article against our database of athlete names. Using pre-compiled regular expressions does not scale (we have thousands of athletes and staff names in our database) and running each keyword against the target text would be too slow.

Today we talk about an algorithm that is O(n + m), where n is the length of the target text and m is the number of matches. This algorithm is ideal if you have a large number of keywords and you want to search for all of them against a corpus of documents. The algorithm, called Aho-Corasick after its authors, first appeared in a 1975 paper titled Efficient string matching: an aid to bibliographic search. The downside is that you have to pay a one-time hit to construct a trie-like data structure which is then used to efficiently search.

Let’s look at some code. Assume we have some sort of Node class/struct defined as follows

Node = MutableNamedTuple('Node', ['ch', 'transitions', 'results', 'fail'])

The ch variable will hold the character at the current node, transitions will hold all nodes we can transition to, results will be a list of results in case this node is a keyword we stored and fail are transitions in case of failure (more on this later).  For example, storing the keyword Manning will create 7 nodes and for the node where ch=’g’, results will also hold Manning. One detail that I omitted is that we also need a root node whose fail transition is itself and where all other nodes fail transitions point to by default. Now let’s look at some very verbose code  that creates the trie-like data structure without the fail transitions.

root = Node(ch=None,transitions=[],results=[],fail=None)
root.fail = root
queue = deque([root])

for keyword in keywords:
    current_node = root
    for char in keyword:
        new_node = None
        for transition in current_node.transitions:
            if transition.ch == char:
                new_node = transition
                break

        if new_node is None:
            new_node = Node(ch=char)
            current_node.transitions.append(new_node)
            if current_node is root:
                new_node.fail = root

        current_node = new_node
    current_node.results.append(keyword)

This first part is simple. We go keyword by keyword and character by character. For each new character we need to add, we check if our current node already has a transition to that character (lines 9-11). If it doesn’t we create a new node and add it to the list of transitions (lines 14-18). Finally, after we have gone through all characters in a keyword, we add that keyword to the results list of the node we are currently at.
The second part is where the coolness behind Aho-Corasick lies. Constructing the failure transitions is simple, because we can simply perform a Breadth-First Search update the failure links as we go. Note that a couple of things in the code sample below differ from the original paper. First, we already set the failure transition to root for all nodes with root as the parent, as well as the root node itself — the original paper calls these states with depth d=1; also, we do not update the “output function” during our BFS as the original paper does because the results list is created during the previous step.

while queue:
    current_node = queue.popleft()
    for node in current_node.transitions:
        queue.append(node)
        fail_state_node = current_node.fail
        while not any(x for x in fail_state_node.transitions if node.ch == x.ch) and fail_state_node is not root:
            fail_state_node = fail_state_node.fail

        node.fail = next((x for x in fail_state_node.transitions if node.ch == x.ch and x is not node), root)

return root

Now, given our completed data structure, searching is trivial and lends itself nicely to a generator construction.

current_node = tree_root
for c in text:
    transition = None
    while transition is None and transition is not tree_root:
        transition = next((x for x in current_node.transitions if x.ch == c), current_node.fail)

    current_node = transition

    if len(transition.results) > 0:
        for result in transition.results:
            yield result
        current_node = transition.fail

For each character in text, we move through our data structure and if there is a suitable transition we make it; otherwise we transition to the failure link (line 5). If we are at a node with a non-empty results list (output function in the original paper), then we return that result and continue. Let’s see it in action:

In [1]: import tree as ac
In [2]: aho = ac.build_tree(['Brady', 'Manning', 'Johnson', 'Ochochinco'])
In [3]: gen = ac.search(aho, "Brady is a better QB than Manning.")
In [4]: gen.next()
Out[4]: 'Brady'
In [5]: gen.next()
Out[5]: 'Manning'
In [6]: gen.next()
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
projects/aho/<ipython-input-6-b2c61ce5e131> in <module>()
----> 1 gen.next()
StopIteration

The algorithm has been around since 1975, so there are plenty of efficient implementations out there — the Wikipedia page mentions a few. Happy searching!

About these ads

3 thoughts on “String searching using Aho-Corasick

  1. Anonymous

    This algorithm has an error as you’ve coded it.

    Instead of continuing if there is a result, it acts like there was a failure. So if I was looking for ‘Pat’ and ‘Patton’ in a block of text that had the word ‘Patton’, it would only find ‘Pat’.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s