Assume you have a web farm and you are collecting response times for user requests. You could have hundreds of millions or even billions of data points and you want to know how your service is doing. You could maintain a running average, but averages are sensitive to outliers, not to mention that you’d need to also look at the standard deviation. Quantiles are a better way to summarize your data. They allow you to answer questions such as “what percent of requests finished faster than x milliseconds” or “what is the maximum response time for 90% of my requests”. Problem with quantiles is that you normally need to keep all the data in memory. If you lets say have 1 billion data points, you’d need about 3.5GB of RAM to store each as an integer.
Here we examine an algorithm that continuously maintains summary information and can answer queries within a small bounded error using only a fraction of the memory that would normally be required. There are numerous algorithms that solve this problem, but the one we will examine is due to Greenwald and Khanna, in their 2001 paper “Space efficient online computation of quantile summaries“. It is more complex to understand than a lot of the other algorithms but it is, as far as I know, the one with the best space usage. Also, a lot of other papers we will examine later use it to solve interesting problems.
Greenwald-Khanna (GK from now on), is based on a simple data structure. We simply maintain tuples of (v, g, Δ) and use those tuples when querying to find an estimate. In the tuple, v is our value from the data stream. The value for g is such that adding all values of g from the first tuple to any other tuple would give the minimum rank of the value for that tuple. Then adding Δ to that sum would give the maximum rank for that tuple. The algorithm does insertions and compressions carefully to keep as few tuples around as possible while always maintaining that invariant that for any tuple we have g+Δ <= 2 * ε * n. As long as we can keep that in mind, the algorithm will always give correct ε-approximate answers. Valid values for ε largely depend on what kind of queries you want to do. For example, if you want the 90th percentile then ε=0.1 is ok, for the 99th percentile ε=0.01 is ok and so on.
The GK Insert() algorithm is straightforward. It scans the list of tuples we maintain to find the right place where the new tuple should be inserted. That tuple is always {value, 1, floor(2 * ε * n) }, where ε is the precision we are aiming for and n is the number of elements seen so far. An exception to the above is when inserting the new maximum or minimum. In that case, the tuple is {value, 1, 0}.
Insert(double v) {
Tuple newValue = new Tuple(v, 1);
int idx = 0;
foreach (Tuple t in S) {
if (v < t.value)
break;
idx++;
}
if (idx == 0 || idx == S.Count) // the new element is the new min or max
newValue.delta = 0;
else
newValue.delta = Floor(2 * epsilon * N);
S.Insert(idx, newValue);
++N;
Compress();
}
The heart of the GK algorithm though lies with the Compress() function. This is where we get rid of unnecessary tuples while maintaining the invariants necessary to ensure that we can give correct answers.
for (int i = 0; i < S.Count - 1; ++i) {
if (S[i].g + S[i + 1].g + S[i + 1].delta <= Floor(2 * epsilon * N)) {
S[i + 1].g += S[i].g;
S.RemoveAt(i);
}
}
This simplified compress code walks the list and merges pairs as long as the value of g for the new pair plus Δ does not violate our invariant. Finally, we can now easily query our data structure. Recall that for any tuple, the minimum rank for that value is the sum of g up to that tuple and the maximum rank is found by adding Δ to that sum. So simply iterate over the list of tuples and stop once you find a tuple having a value greater than the element you are searching followed by a tuple having a value less than the element. To algorithm to compute the quantile is given below:
for (int i = 0; i < S.Count - 1; ++i) {
rMin += S[i].g;
rMax = rMin + S[i].delta;
if (r - rMin <= epsilon*N && rMax - r <= epsilon*N) {
return S[i].value;
}
}
The variable r above is the rank we are looking for. The above works by continuously calculating the minimum and maximum possible ranks for each tuple as per our initial definitions. If our wanted rank falls in that range, then we return the value for that tuple.
For more information on GK, I encourage you to read the original paper.