If you are doing any sort of classification with thousands of features (as is common in text classification) and need to update your Bag of Words models online, then you’ll find this rather simple technique very handy. At some point while developing Sidelines, we were using LDA to cluster tens of thousands of articles in topics (this is no longer the case however). After extracting the text we would: 1) Remove stop words, 2) Construct a dictionary to map each word to a unique id, frequency tuple and 3) Prune that dictionary to remove very rarely used and very often used words. The biggest problem we faced was running out of memory as we’d have to make a pass over all our data to construct this dictionary. Even if we tried to segment our corpus and split them into different machines, keeping these huge dictionaries in memory was a pain, while swapping to disk was extremely slow. Continue reading The Hashing Trick
Most introductory textbooks on supervised learning algorithms examine the algorithm from a single classification point of view, meaning we only need to decide if an item belongs to a certain class or not. Multi-class classification, an instance of the same problem is typically done by training multiple such binary classifiers for each pair of classes. There are various improvements over these techniques, but very few papers examine how to get meaningful probability estimates out of a learning algorithm.
At Sidelines, we use the probability output of some of our classifiers to predict how relevant a sports news article is to a specific team. We then use that input as part of our ranking algorithm. For example, this article is mostly about the Boston Celtics and it should rank high in a Celtics feed, but Mavericks’ fans would also be somewhat interested although it should not rank as high in their feed. Continue reading Making sense out of Naive Bayes probability estimates
At Sidelines, we frequently need to quickly identify which athletes a news article may be talking about. Most of the time, a number of machine learning classifiers make sure that the article gets correctly labeled, but sometimes we need to run the title or even the entire text of the article against our database of athlete names. Using pre-compiled regular expressions does not scale (we have thousands of athletes and staff names in our database) and running each keyword against the target text would be too slow.
Today we talk about an algorithm that is O(n + m), where n is the length of the target text and m is the number of matches. This algorithm is ideal if you have a large number of keywords and you want to search for all of them against a corpus of documents. The algorithm, called Aho-Corasick after its authors, first appeared in a 1975 paper titled Efficient string matching: an aid to bibliographic search. The downside is that you have to pay a one-time hit to construct a trie-like data structure which is then used to efficiently search. Continue reading String searching using Aho-Corasick
This is one of my favorite results of all times. A really simple neat trick that any developer can use in a variety of situations without really having to understand how it works. Think of the following hypothetical: you have a bunch of bins and a bunch of balls and you want to load up the bins as evenly as possible. If the bins and the balls were of same size, a round-robin strategy would work well. If they were not, you’d expect that just putting a ball in a bin at random would work well, however here we propose a different scheme that will perform better. Pick two bins at random and put the ball in the least loaded bin.
Why this performs better is out of the scope of this post, there are a lot of papers that discuss complex proofs on the topic. Instead lets take an example from the systems domain. You have N servers that accept web requests. Let us assume that there are m requests. Each server is stateless, so you can route each request to any server you want. Here are some options on how to do it: (1) Round-robin, (2) Route to a random server, (3) Route to the least loaded server, (4) Pick k servers at random and route to the least loaded one. The last approach will outperform the first three, especially when:
- Requests have different resource demands
- The server farm is heterogeneous
- The information about which server is least loaded is stale
- Any combination of the above
Azar shows the above strategy works better than simple random choice. Talwar and Wieder of MSR, show in their paper that in case #1 the load gap, meaning the difference between the average and most loaded server, does not depend on the number of requests and thus scales infinitely with the number of servers. Mitzenmacher in his paper, shows how increasing the number k of randomly selected servers can help with stale information about server load.
Continuing on the same note as the Greenwald-Khanna post, we discuss another novel data structure for order statistics called Q-Digest. Compared to GK, Q-Digest is much simpler, easier to implement and extends to a distributed solution very easily. The data structure was designed with sensor networks in mind, where minimizing radio transmission overhead is of paramount importance. It originally appeared in this paper by Shrivastava et al
Q-Digest is simply a complete binary tree over the range of values, where the leaf nodes are the values themselves. In the example below, we have a digest over the values 1 to 4 inclusive, and we have seen the value 1 once and the value 4 seven times.
The novelty is the compression algorithm which allows values with low frequencies to propagate up the tree and be combined. In the example below, we have some information loss. We know that there are 10 values between 1 and 2 inclusive, but we don’t know the exact counts of each
Assume you have a web farm and you are collecting response times for user requests. You could have hundreds of millions or even billions of data points and you want to know how your service is doing. You could maintain a running average, but averages are sensitive to outliers, not to mention that you’d need to also look at the standard deviation. Quantiles are a better way to summarize your data. They allow you to answer questions such as “what percent of requests finished faster than x milliseconds” or “what is the maximum response time for 90% of my requests”. Problem with quantiles is that you normally need to keep all the data in memory. If you lets say have 1 billion data points, you’d need about 3.5GB of RAM to store each as an integer.
Here we examine an algorithm that continuously maintains summary information and can answer queries within a small bounded error using only a fraction of the memory that would normally be required. There are numerous algorithms that solve this problem, but the one we will examine is due to Greenwald and Khanna, in their 2001 paper “Space efficient online computation of quantile summaries“. It is more complex to understand than a lot of the other algorithms but it is, as far as I know, the one with the best space usage. Also, a lot of other papers we will examine later use it to solve interesting problems.