builtin_grouped_aggregate
Operator for aggregation grouped by attributes, dimensions or combinations thereof.
Changes from the plugin grouped_aggregate:
The result is a DataFrame.
Support for the LogicalFlow API, which is especially useful for projection pushdown:
eliminating input attributes that either do not affect any of the aggregates, or affect
only aggregates that are not used later in the query.
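For example, in a query like the following (a hypothetical illustration reusing the test array built below), the attribute b feeds neither the aggregate nor the group, so projection pushdown can eliminate it from the input scan:

$ iquery -aq "builtin_grouped_aggregate(test, sum(a) as total, x)"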
Example
$ iquery -anq "store(apply(build(<a:double>[x=1:10,5,0,y=1:10,5,0], random()%10), b, string(random()%2)),test)"
Query was executed successfully
$ iquery -aq "builtin_grouped_aggregate(test, sum(a) as total, b)"
b,total
'0',203
'1',282
$ iquery -aq "builtin_grouped_aggregate(test, avg(a), count(*), b, x, output_chunk_size:10000)"
b,x,a_avg,count
'1',6,4.77778,9
'0',10,5.4,5
'0',7,5,4
'0',6,7,1
'0',5,3.85714,7
'1',1,5.5,8
'1',8,5,4
'1',10,5.8,5
'0',1,0.5,2
'0',9,5.5,4
'1',7,4,6
'1',3,5.16667,6
'1',9,6.33333,6
'1',5,5.33333,3
'0',3,5.75,4
'0',8,5,6
'0',4,4,7
'1',4,5.66667,3
'0',2,3,6
'1',2,5,4
More formally:
builtin_grouped_aggregate(input_array, aggregate_1(input_1) [as alias_1], group_1,
[, aggregate_2(input_2),...]
[, group_2,...]
[, setting:value])
Where
input_array :: any SciDB array
aggregate_1...N :: any SciDB-registered aggregate
input_1..N :: any attribute in input
group_1..M :: any attribute or dimension in input
The operator must be invoked with at least one aggregate and at least one group.
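Grouping purely by a dimension is also allowed. A sketch reusing the test array from the example above (output omitted):

$ iquery -aq "builtin_grouped_aggregate(test, min(a) as lo, max(a) as hi, y)"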
Optional tuning settings:
input_sorted:<true/false> :: a hint that the input array is sorted by the group keys or,
                             more generally, that aggregate group values are likely to
                             repeat often. Defaults to true if aggregating by a non-last
                             dimension, false otherwise.
max_table_size:MB :: the amount of memory (in MB) that the operator's hash table structure
may consume. Once the table exceeds this size, new aggregate groups
are placed into a spillover array. Defaults to the merge-sort-buffer
configuration setting.
num_hash_buckets:N :: the number of hash buckets to allocate in the hash table. Larger
                      values improve speed but also use more memory. Should be a prime
                      number. The default is derived from max_table_size.
spill_chunk_size:C :: the chunk size of the spill-over array. Defaults to 100,000. Should
                      be smaller if there are many group-by attributes or aggregates.
                      TBD: automate.
merge_chunk_size:C :: the chunk size of the array used to transfer data between instances.
Defaults to 100,000. Should be smaller if there are many group-by
attributes or instances. TBD: automate.
output_chunk_size:C :: the chunk size of the final output array. Defaults to 100,000.
TBD: automate.
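A sketch combining several tuning settings in one call (the values are illustrative, not recommendations; this assumes multiple setting:value pairs may be supplied, as the syntax above suggests):

$ iquery -aq "builtin_grouped_aggregate(test, sum(a), b, max_table_size:256, spill_chunk_size:50000, output_chunk_size:50000)"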
The returned array contains one attribute for each group and one attribute for each
aggregated value. The dimensions are superfluous.
When grouping by attributes, an attribute value of null (or any missing code) constitutes an invalid
group that is not included in the output. All inputs associated with invalid groups are ignored.
When grouping by multiple attributes, a null or missing value in any one of the attributes makes the
entire group invalid.
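The null-group rule can be sketched in Python. This is a hypothetical model of the semantics only, not the SciDB implementation; the function grouped_sum and the sample rows are invented for illustration:

```python
# Hypothetical model of the null-group rule above (not SciDB code):
# a group whose key contains a null in ANY group-by attribute is invalid,
# and every input value belonging to it is ignored.
def grouped_sum(rows, key_fields, value_field):
    """rows: list of dicts; returns {group_key_tuple: sum_of_values}."""
    out = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if any(k is None for k in key):  # null in any group attribute
            continue                     # -> whole group is excluded
        out[key] = out.get(key, 0.0) + row[value_field]
    return out

rows = [
    {"b": "0", "a": 1.0},
    {"b": "1", "a": 2.0},
    {"b": None, "a": 5.0},  # null group key: value is ignored entirely
    {"b": "0", "a": 3.0},
]
print(grouped_sum(rows, ["b"], "a"))
```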