At Sidelines, we frequently need to quickly identify which athletes a news article may be talking about. Most of the time, a number of machine learning classifiers make sure that the article gets correctly labeled, but sometimes we need to run the title or even the entire text of the article against our database of athlete names. Using pre-compiled regular expressions does not scale (we have thousands of athletes and staff names in our database) and running each keyword against the target text would be too slow.
Today we talk about an algorithm that is O(n + m), where n is the length of the target text and m is the number of matches. This algorithm is ideal if you have a large number of keywords and you want to search for all of them against a corpus of documents. The algorithm, called Aho-Corasick after its authors, first appeared in a 1975 paper titled Efficient string matching: an aid to bibliographic search. The downside is that you have to pay a one-time hit to construct a trie-like data structure which is then used to efficiently search.
Let’s look at some code. Assume we have some sort of Node class/struct defined as follows
Node = MutableNamedTuple('Node', ['ch', 'transitions', 'results', 'fail'])
The ch variable will hold the character at the current node, transitions will hold all nodes we can transition to, results will be a list of results in case this node is a keyword we stored and fail are transitions in case of failure (more on this later). For example, storing the keyword Manning will create 7 nodes and for the node where ch=’g’, results will also hold Manning. One detail that I omitted is that we also need a root node whose fail transition is itself and where all other nodes fail transitions point to by default. Now let’s look at some very verbose code that creates the trie-like data structure without the fail transitions.
root = Node(ch=None,transitions=[],results=[],fail=None)
root.fail = root
queue = deque([root])
for keyword in keywords:
current_node = root
for char in keyword:
new_node = None
for transition in current_node.transitions:
if transition.ch == char:
new_node = transition
break
if new_node is None:
new_node = Node(ch=char)
current_node.transitions.append(new_node)
if current_node is root:
new_node.fail = root
current_node = new_node
current_node.results.append(keyword)
This first part is simple. We go keyword by keyword and character by character. For each new character we need to add, we check if our current node already has a transition to that character (lines 9-11). If it doesn’t we create a new node and add it to the list of transitions (lines 14-18). Finally, after we have gone through all characters in a keyword, we add that keyword to the results list of the node we are currently at.
The second part is where the coolness behind Aho-Corasick lies. Constructing the failure transitions is simple, because we can simply perform a Breadth-First Search update the failure links as we go. Note that a couple of things in the code sample below differ from the original paper. First, we already set the failure transition to root for all nodes with root as the parent, as well as the root node itself — the original paper calls these states with depth d=1; also, we do not update the “output function” during our BFS as the original paper does because the results list is created during the previous step.
while queue:
current_node = queue.popleft()
for node in current_node.transitions:
queue.append(node)
fail_state_node = current_node.fail
while not any(x for x in fail_state_node.transitions if node.ch == x.ch) and fail_state_node is not root:
fail_state_node = fail_state_node.fail
node.fail = next((x for x in fail_state_node.transitions if node.ch == x.ch and x is not node), root)
return root
Now, given our completed data structure, searching is trivial and lends itself nicely to a generator construction.
current_node = tree_root
for c in text:
transition = None
while transition is None and transition is not tree_root:
transition = next((x for x in current_node.transitions if x.ch == c), current_node.fail)
current_node = transition
if len(transition.results) > 0:
for result in transition.results:
yield result
current_node = transition.fail
For each character in text, we move through our data structure and if there is a suitable transition we make it; otherwise we transition to the failure link (line 5). If we are at a node with a non-empty results list (output function in the original paper), then we return that result and continue. Let’s see it in action:
In [1]: import tree as ac In [2]: aho = ac.build_tree(['Brady', 'Manning', 'Johnson', 'Ochochinco']) In [3]: gen = ac.search(aho, "Brady is a better QB than Manning.") In [4]: gen.next() Out[4]: 'Brady' In [5]: gen.next() Out[5]: 'Manning' In [6]: gen.next() --------------------------------------------------------------------------- StopIteration Traceback (most recent call last) projects/aho/<ipython-input-6-b2c61ce5e131> in <module>() ----> 1 gen.next() StopIteration
The algorithm has been around since 1975, so there are plenty of efficient implementations out there — the Wikipedia page mentions a few. Happy searching!