builtin_grouped_aggregate

Operator for aggregation grouped by attributes, dimensions or combinations thereof.

Changes from the grouped_aggregate plugin

  • The result is a DataFrame.

  • Support for the LogicalFlow API, which enables projection pushdown: input attributes
    that do not affect any aggregate, or affect only aggregates whose results are not used
    later in the query, can be eliminated before they reach the operator.
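The attribute-elimination rule above amounts to simple set arithmetic: the operator only needs attributes that feed an aggregate whose result is used downstream, plus the group-by names. The sketch below is a hypothetical illustration of that rule, not the LogicalFlow API itself; the names `Aggregate` and `needed_attributes` are assumptions.

```python
# Hypothetical sketch of projection pushdown for a grouped aggregate.
# Names here are illustrative, not SciDB's actual API.
from dataclasses import dataclass

@dataclass
class Aggregate:
    func: str        # e.g. "sum"
    input_attr: str  # attribute the aggregate reads
    alias: str       # name of the aggregate's output

def needed_attributes(aggregates, groups, used_downstream):
    """Attributes the aggregation actually needs: inputs of aggregates
    whose results are used later in the query, plus all group-by names.
    Everything else can be projected away upstream."""
    live = {a.input_attr for a in aggregates if a.alias in used_downstream}
    return live | set(groups)

# Example: 'c' feeds only an aggregate that is never used downstream,
# so 'c' can be eliminated; 'a' and the group 'b' must be kept.
aggs = [Aggregate("sum", "a", "total"), Aggregate("avg", "c", "c_avg")]
print(needed_attributes(aggs, groups=["b"], used_downstream={"total"}))
```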

Example

$ iquery -anq "store(apply(build(<a:double>[x=1:10,5,0,y=1:10,5,0], random()%10), b, string(random()%2)), test)"
Query was executed successfully

$ iquery -aq "builtin_grouped_aggregate(test, sum(a) as total, b)"
b,total
'0',203
'1',282

$ iquery -aq "builtin_grouped_aggregate(test, avg(a), count(*), b, x, output_chunk_size:10000)"
b,x,a_avg,count
'1',6,4.77778,9
'0',10,5.4,5
'0',7,5,4
'0',6,7,1
'0',5,3.85714,7
'1',1,5.5,8
'1',8,5,4
'1',10,5.8,5
'0',1,0.5,2
'0',9,5.5,4
'1',7,4,6
'1',3,5.16667,6
'1',9,6.33333,6
'1',5,5.33333,3
'0',3,5.75,4
'0',8,5,6
'0',4,4,7
'1',4,5.66667,3
'0',2,3,6
'1',2,5,4

More formally

builtin_grouped_aggregate(input_array,
                          aggregate_1(input_1) [as alias_1],
                          group_1
                          [, aggregate_2(input_2),...]
                          [, group_2,...]
                          [, setting:value])

Where

  input_array    :: any SciDB array
  aggregate_1..N :: any SciDB-registered aggregate
  input_1..N     :: any attribute in input
  group_1..M     :: any attribute or dimension in input

The operator must be invoked with at least one aggregate and at least one group.

Optional tuning settings:

  input_sorted:<true/false> :: a hint that the input array is sorted by group, or, more
    generally, that group values repeat often in consecutive cells. Defaults to true if
    aggregating by a non-last dimension, false otherwise.

  max_table_size:MB :: the amount of memory (in MB) that the operator's hash table may
    consume. Once the table exceeds this size, new aggregate groups are placed into a
    spillover array. Defaults to the merge-sort-buffer configuration setting.

  num_hash_buckets:N :: the number of hash buckets to allocate in the hash table. Larger
    values improve speed but use more memory. Should be a prime. Default is based on
    max_table_size.

  spill_chunk_size:C :: the chunk size of the spillover array. Defaults to 100,000.
    Should be smaller if there are many group-by attributes or aggregates. TBD: automate.

  merge_chunk_size:C :: the chunk size of the array used to transfer data between
    instances. Defaults to 100,000. Should be smaller if there are many group-by
    attributes or instances. TBD: automate.

  output_chunk_size:C :: the chunk size of the final output array. Defaults to 100,000.
    TBD: automate.

The returned array contains one attribute for each group and one attribute for each
aggregated value; the dimensions are superfluous. When grouping by attributes, an
attribute value of null (or any missing code) constitutes an invalid group that is not
included in the output, and all inputs associated with invalid groups are ignored. When
grouping by multiple attributes, a null or missing value in any one of the attributes
makes the entire group invalid.
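The memory behavior described above — an in-memory hash table that spills newly seen groups once it reaches its size limit, with null group keys discarded outright — can be sketched as follows. This is an illustrative model only; the name `max_groups` and the spill list stand in for `max_table_size` and the spillover array and are assumptions, not the operator's actual internals.

```python
# Illustrative sketch of grouped aggregation with a bounded hash table.
# Groups that do not fit in memory are "spilled" for a later merge pass
# (mirroring max_table_size's role); a null anywhere in a group key makes
# the group invalid and its inputs are ignored.

def grouped_sum(rows, max_groups):
    """rows: iterable of (group_key_tuple, value). Returns (table, spilled)."""
    table = {}    # in-memory hash table: group key -> running sum
    spilled = []  # rows whose new group did not fit; merged in a later pass
    for key, value in rows:
        if any(k is None for k in key):
            continue  # null in any group attribute: invalid group, ignored
        if key in table:
            table[key] += value          # existing group: aggregate in place
        elif len(table) < max_groups:
            table[key] = value           # room left: start a new group
        else:
            spilled.append((key, value)) # table full: spill the new group
    return table, spilled

rows = [(("x",), 1), (("y",), 2), ((None,), 5), (("x",), 3), (("z",), 4)]
table, spilled = grouped_sum(rows, max_groups=2)
print(table)    # {('x',): 4, ('y',): 2}
print(spilled)  # [(('z',), 4)]
```

Note that existing groups keep aggregating in place even after the table is full; only previously unseen groups are spilled, which is why a sorted input (few live groups at a time) makes the `input_sorted` hint worthwhile.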