approxdc

The approxdc aggregate produces a result array containing approximate counts of the number of distinct values of an attribute.

Synopsis

AFL% aggregate(array,approxdc(attribute)[,dimension_1,dimension_2,...])

Summary

The approxdc aggregate takes a set of values from an array attribute and returns an approximate count of the number of distinct values present. You can optionally specify one or more dimensions to group by.
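
The grouping behavior described above can be pictured with a small Python sketch. It is not SciDB code: the function name distinct_per_group and the (coords, value) cell representation are hypothetical, and the counting here is exact rather than approximate. It only illustrates that grouping by a dimension yields one count per coordinate of that dimension, and that null values are skipped.

```python
from collections import defaultdict

def distinct_per_group(cells, group_dim):
    """Count distinct non-null values per coordinate of one dimension.

    cells is an iterable of (coords, value) pairs, where coords is a dict
    mapping dimension names to coordinates (a hypothetical representation).
    """
    seen = defaultdict(set)
    for coords, value in cells:
        group = seen[coords[group_dim]]  # make the group appear even if all its values are null
        if value is not None:            # nulls are not counted
            group.add(value)
    return {coord: len(values) for coord, values in sorted(seen.items())}

# A 2 x 3 array with dimensions "row" and "col", grouped by "row".
cells = [
    ({"row": 0, "col": 0}, 5), ({"row": 0, "col": 1}, 5), ({"row": 0, "col": 2}, 7),
    ({"row": 1, "col": 0}, 9), ({"row": 1, "col": 1}, None), ({"row": 1, "col": 2}, 9),
]
print(distinct_per_group(cells, "row"))  # {0: 2, 1: 1}
```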

Getting an exact distinct count of an attribute is computationally expensive, and on live, changing data sets (with concurrent updates, inserts, and deletes) the computed answer is imprecise anyway.

SciDB therefore offers a faster, approximate method. In many cases, especially when the numbers involved are very large, an approximate count is just as useful as an exact one. Often, it is better to get a number that is accurate to within 1-2% in one-tenth of the time.

  • The approxdc aggregate does not count null values.
  • The approxdc aggregate is a hybrid. If the number of distinct values is fewer than about 3,000, approxdc returns a precise count. If the number of distinct values is more than about 3,000, the result is accurate to within about 1-2%.
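
To make the hybrid behavior concrete, here is a minimal Python sketch of one well-known technique with the same shape, a k-minimum-values (KMV) estimator. This is an assumption chosen for illustration, not a description of how SciDB implements approxdc; the names hash01 and approx_distinct and the choice of SHA-1 are hypothetical. The sketch keeps only the k smallest distinct hash values: with fewer than k distinct values the count is exact, and beyond that the k-th smallest hash is used to estimate the total, with a relative error of roughly 1/sqrt(k).

```python
import hashlib
import heapq

def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) using a stable hash."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def approx_distinct(values, k=3000):
    """Exact distinct count below k distinct values, KMV estimate otherwise."""
    heap = []        # negated hashes: a max-heap over the k smallest distinct hashes
    members = set()  # the same hashes, for fast duplicate checks
    for value in values:
        if value is None:  # nulls are not counted
            continue
        h = hash01(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.add(h)
            members.discard(evicted)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: the count is exact
    kth_smallest = -heap[0]
    return round((k - 1) / kth_smallest)

print(approx_distinct([1, 2, 2, 3, None, 3]))  # 3
print(approx_distinct(range(100000)))          # an estimate close to 100000
```

Memory use is bounded by k regardless of input size, which is why this kind of estimator can run much faster than a full sort on large inputs.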

Examples

Find the Approximate Number of Distinct Numbers in an Array

To quickly estimate the number of distinct values in an array:

  1. Create an array filled with 75 random numbers between 0 and 99:

    AFL% create array A <a:int64>[i=1:75];
    AFL% set no fetch;
    AFL% store(build(A, random()%100), A);
    AFL% set fetch;

    The output of store is suppressed with set no fetch because it is long and varies with each execution.

  2. Get an estimate of the number of distinct values of a:

    AFL% aggregate(A, approxdc(a)); 


    The output is similar to the following (the array is random, so results vary with each execution):

    {i} a_approxdc
    {0} 53

  3. Compare this to an exact count of the unique values of a:

    AFL% aggregate(uniq(sort(A)), count(*));


    The output is:

    {i} count
    {0} 53

    Here, the answers are identical. On large arrays, however, aggregate(A, approxdc(a)) gives an approximation but runs much faster than aggregate(uniq(sort(A)), count(*)).
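
As a rough cross-check of this example, the following Python sketch mirrors the sort, uniq, count pipeline and computes how many distinct values to expect. It is not SciDB code: the function names are hypothetical, and it assumes the 75 values are independent, uniform draws from 100 possibilities. Under that assumption the expected number of distinct values is 100 * (1 - 0.99^75), or about 52.9, which is consistent with the count of 53 shown above.

```python
import random
from itertools import groupby

def exact_distinct(values):
    """Sort, collapse runs of equal neighbors, then count the remaining runs."""
    return sum(1 for _key, _run in groupby(sorted(values)))

def expected_distinct(draws, universe):
    """Expected distinct values among `draws` uniform picks from `universe` values."""
    return universe * (1 - (1 - 1 / universe) ** draws)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = [rng.randrange(100) for _ in range(75)]

print(exact_distinct(sample))                # exact count for this particular sample
print(round(expected_distinct(75, 100), 1))  # 52.9
```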

Find the Approximate Number of Distinct Words in an Array

To find the approximate number of distinct words in an array or in various subsets of the array, do the following:

  1. Load a file containing the 271 words of Lincoln's Gettysburg Address, one word per line, into an array:

    AFL% create array W <w:string>[i=1:271];
    AFL% load(W, 'gettysburgAddress.word-per-line.txt', -2, 'csv');


    The first few lines of output are:

    {i} w
    {1} 'Four'
    {2} 'score'
    {3} 'and'
    {4} 'seven'
    ...

  2. Show the approximate count of distinct words in the array:

    AFL% aggregate(W, approxdc(w)); 


    The output is:

    {i} w_approxdc
    {0} 142

  3. Compare this to an exact count of the unique words:

    AFL% aggregate(uniq(sort(W)), count(*));


    The output is:

    {i} count
    {0} 142

    Here, the answers are identical; however, on large arrays, aggregate(W, approxdc(w)) gives an approximation, but runs much faster than aggregate(uniq(sort(W)), count(*)).
