builtin_grouped_aggregate
Operator for aggregation grouped by attributes, dimensions or combinations thereof.
Changes from the plugin grouped_aggregate:
The result is a DataFrame.
Support for the LogicalFlow API, which is especially useful for projection pushdown:
eliminating input attributes that either do not affect any of the aggregates, or affect
only aggregates that are not used later in the query.
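For example, in a query like the following (a hypothetical illustration reusing the test array built below), the attribute b feeds neither the aggregate nor the group, so projection pushdown can eliminate it from the input scan:

$ iquery -aq "builtin_grouped_aggregate(test, sum(a) as total, x)"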
Example
$ iquery -anq "store(apply(build(<a:double>[x=1:10,5,0,y=1:10,5,0], random()%10), b, string(random()%2)),test)"
Query was executed successfully
$ iquery -aq "builtin_grouped_aggregate(test, sum(a) as total, b)"
b,total
'0',203
'1',282
$ iquery -aq "builtin_grouped_aggregate(test, avg(a), count(*), b, x, output_chunk_size:10000)"
b,x,a_avg,count
'1',6,4.77778,9
'0',10,5.4,5
'0',7,5,4
'0',6,7,1
'0',5,3.85714,7
'1',1,5.5,8
'1',8,5,4
'1',10,5.8,5
'0',1,0.5,2
'0',9,5.5,4
'1',7,4,6
'1',3,5.16667,6
'1',9,6.33333,6
'1',5,5.33333,3
'0',3,5.75,4
'0',8,5,6
'0',4,4,7
'1',4,5.66667,3
'0',2,3,6
'1',2,5,4
More formally:
builtin_grouped_aggregate(input_array, aggregate_1(input_1) [as alias_1], group_1,
[, aggregate_2(input_2),...]
[, group_2,...]
[, setting:value])
Where
input_array :: any SciDB array
aggregate_1...N :: any SciDB-registered aggregate
input_1..N :: any attribute in input
group_1..M :: any attribute or dimension in input
The operator must be invoked with at least one aggregate and at least one group.
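Grouping purely by a dimension is also allowed. A sketch reusing the test array from the example above (output omitted):

$ iquery -aq "builtin_grouped_aggregate(test, min(a) as lo, max(a) as hi, y)"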
Optional tuning settings:
input_sorted:<true/false> :: a hint that the input array is sorted by the group keys or,
                             more generally, that aggregate group values are likely to
                             repeat often. Defaults to true if aggregating by a non-last
                             dimension, false otherwise.
max_table_size:MB :: the amount of memory (in MB) that the operator's hash table structure
may consume. Once the table exceeds this size, new aggregate groups
are placed into a spillover array. Defaults to the merge-sort-buffer
configuration setting.
num_hash_buckets:N :: the number of hash buckets to allocate in the hash table. Larger
                      values improve speed but also use more memory. Should be a prime
                      number. The default is derived from max_table_size.
spill_chunk_size:C :: the chunk size of the spill-over array. Defaults to 100,000. Should
                      be smaller if there are many group-by attributes or aggregates.
                      TBD: automate.
merge_chunk_size:C :: the chunk size of the array used to transfer data between instances.
Defaults to 100,000. Should be smaller if there are many group-by
attributes or instances. TBD: automate.
output_chunk_size:C :: the chunk size of the final output array. Defaults to 100,000.
TBD: automate.
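A sketch combining several tuning settings in one call (the values are illustrative, not recommendations; this assumes multiple setting:value pairs may be supplied, as the syntax above suggests):

$ iquery -aq "builtin_grouped_aggregate(test, sum(a), b, max_table_size:256, spill_chunk_size:50000, output_chunk_size:50000)"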
The returned array contains one attribute for each group and one attribute for each
aggregated value. The dimensions are superfluous.
When grouping by attributes, an attribute value of null (or any missing code) constitutes an invalid
group that is not included in the output. All inputs associated with invalid groups are ignored.
When grouping by multiple attributes, a null or missing value in any one of the attributes makes the
entire group invalid.
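The null-group rule can be sketched in Python. This is a hypothetical model of the semantics only, not the SciDB implementation; the function grouped_sum and the sample rows are invented for illustration:

```python
# Hypothetical model of the null-group rule above (not SciDB code):
# a group whose key contains a null in ANY group-by attribute is invalid,
# and every input value belonging to it is ignored.
def grouped_sum(rows, key_fields, value_field):
    """rows: list of dicts; returns {group_key_tuple: sum_of_values}."""
    out = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if any(k is None for k in key):  # null in any group attribute
            continue                     # -> whole group is excluded
        out[key] = out.get(key, 0.0) + row[value_field]
    return out

rows = [
    {"b": "0", "a": 1.0},
    {"b": "1", "a": 2.0},
    {"b": None, "a": 5.0},  # null group key: value is ignored entirely
    {"b": "0", "a": 3.0},
]
print(grouped_sum(rows, ["b"], "a"))
```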