subarray
The subarray operator selects cells of an input array according to coordinates specified by one or more secondary arrays, called pick arrays.
Synopsis
subarray ( INPUT, PICK_0 [ , PICK_n ]* [ , join:BOOLEAN ] [ , strict:BOOLEAN ] [ , inverse:BOOLEAN ] [ , algorithm:STRING ] )
Summary
subarray produces a sparse result array with the same dimension specification as its input array (but without overlaps). Only input cells selected by the pick array(s) appear in the output.
If the join:true option is present, the output cells will have additional attributes from the pick array cell(s) that selected them.
If the strict:true option is present:
    Out-of-bounds or null pick values will cause an error. By default, null or out-of-bounds picks are ignored.
    If join:true is also set, an ambiguous joined attribute will cause an error. See Duplicated Or Out-of-order Picks below.
If the inverse:true option is present, the result contains those input cells not selected by the pick array(s). If set, join:true is not allowed.
The algorithm: option forces use of a particular algorithm regardless of system configuration. Ordinarily you should not need to specify this option. See Algorithm Configuration below.
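The effect of strict:true and inverse:true can be pictured with a small one-dimensional model. The Python sketch below is illustrative only and is not SciDB code: the function name select_1d, the dictionary representation of the input array, and the low/high bounds parameters are all assumptions made for this example.

```python
def select_1d(cells, picks, low, high, strict=False, inverse=False):
    """Toy model of subarray on a one-dimensional input (hypothetical helper).

    cells: {coordinate: value} for the non-empty input cells.
    picks: iterable of int coordinates; None stands in for a null pick.
    low, high: inclusive bounds of the input dimension.
    """
    valid = set()
    for p in picks:
        if p is None or not (low <= p <= high):
            if strict:
                # strict:true turns null or out-of-bounds picks into an error
                raise ValueError(f"null or out-of-bounds pick: {p!r}")
            continue  # default behavior: such picks are ignored
        valid.add(p)
    if inverse:
        # inverse:true keeps the input cells that were NOT picked
        return {x: v for x, v in cells.items() if x not in valid}
    return {x: v for x, v in cells.items() if x in valid}


cells = {0: "a", 1: "b", 2: "c", 3: "d"}
picks = [1, None, 3, 99]  # one null pick and one out-of-bounds pick

selected = select_1d(cells, picks, low=0, high=3)
complement = select_1d(cells, picks, low=0, high=3, inverse=True)
```

In this model, selected holds the cells at coordinates 1 and 3, complement holds the cells at 0 and 2, and passing strict=True with the same picks raises an error.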
Pick Arrays
A pick array specifies input cell coordinates by:
Having an int64 attribute with the same name as a dimension of the input array. A list of coordinates is built from the (non-null) values of the attribute.
Having a dimension with the same name as a dimension of the input array. A list of coordinates is built from the locations of non-empty cells along that pick array dimension.
A particular input array dimension name may be matched by zero or one pick arrays. If the name appears in more than one pick array schema, an error occurs.
For example, given an input array with schema A<v:int64>[row=0:*; col=0:*], the pick array P<row:int64>[i=0:*]
AFL% scan(P);
{i} row
{0} 5
{1} 17
selects rows five and seventeen by matching an attribute name with one of the dimensions of the input array. Likewise, the sparse pick array Q<s:string>[row=0:*]
AFL% scan(Q);
{row} s
{5} 'Hello'
{17} 'world'
selects the same rows by having non-empty cells along a matching dimension name.
Each pick array is processed in turn to build up internal lists of coordinates, called picks, for each input array dimension. If no picks are given for a particular input array dimension, then all coordinates along that input dimension are treated as picked. An input array dimension with no picks is called a wildcard dimension.
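The way picks are gathered from pick arrays can be sketched in Python. This is a simplified model, not the operator's implementation: the function collect_picks and the dictionary layout used to describe a pick array (attribute values by name, non-empty cell coordinates by dimension name) are assumptions introduced for illustration, and only per-dimension picks are modeled.

```python
def collect_picks(input_dims, pick_arrays):
    """Return {input dimension: sorted picks, or None for a wildcard dimension}.

    pick_arrays: {array name: {"attributes": {name: [values]},
                               "dimensions": {name: [coords of non-empty cells]}}}
    (a hypothetical description of each pick array's contents).
    """
    picks = {dim: None for dim in input_dims}  # None marks a wildcard dimension
    matched_by = {}
    for array_name, pa in pick_arrays.items():
        for dim in input_dims:
            if dim in pa.get("attributes", {}):
                # Attribute match: use the non-null attribute values
                values = [v for v in pa["attributes"][dim] if v is not None]
            elif dim in pa.get("dimensions", {}):
                # Dimension match: use the coordinates of non-empty cells
                values = pa["dimensions"][dim]
            else:
                continue
            if dim in matched_by:
                raise ValueError(
                    f"dimension {dim!r} matched by both "
                    f"{matched_by[dim]} and {array_name}"
                )
            matched_by[dim] = array_name
            picks[dim] = sorted(set(values))
    return picks


# Contents of the pick arrays P and Q shown above
P = {"attributes": {"row": [5, 17]}, "dimensions": {"i": [0, 1]}}
Q = {"attributes": {"s": ["Hello", "world"]}, "dimensions": {"row": [5, 17]}}

picks_from_p = collect_picks(["row", "col"], {"P": P})
picks_from_q = collect_picks(["row", "col"], {"Q": Q})
```

Both calls produce the same picks for row and leave col as a wildcard; passing P and Q together raises an error because row would be matched by more than one pick array.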
Interpreting Picks: By Grid vs. By Cell
The fields (that is, the attributes and dimensions) of a pick array can match one or more dimensions of the input array. When each of the provided pick arrays matches only a single input array dimension, subarray selects input cells by grid. If any of the pick arrays matches more than one input dimension, then subarray selects input cells by cell position.
Grid Selection
Selection by grid means that the positions of the selected input cells are elements of the grid formed by the Cartesian product of the per-dimension picks.
For example, consider the input array A<v:int64>[row=0:*; col=0:*]
Given a pick array R for the rows
AFL% show(R);
{i} schema,distribution,etcomp
{0} 'R<row:int64> [i=0:*:0:1000000]','hashed','none'
AFL% scan(R);
{i} row
{0} 1
{1} 2
and C for the columns
then subarray(A, R, C) uses "grid selection" to choose the input cells:
R picks rows 1 and 2 by row attribute value. C picks columns 0 and 2 because cells are present at those coordinates along the col dimension. So subarray(A, R, C) yields
where ε denotes an empty cell position.
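Grid selection can be modeled in a few lines of Python. The sketch below is only an illustration of the Cartesian-product rule: the function grid_select and the contents of the 4x4 array A are hypothetical and are not taken from the example arrays above, although the row picks (1, 2) and column picks (0, 2) are.

```python
from itertools import product


def grid_select(cells, row_picks, col_picks):
    """Keep the input cells whose (row, col) lies on the grid row_picks x col_picks."""
    wanted = set(product(row_picks, col_picks))
    return {pos: value for pos, value in cells.items() if pos in wanted}


# Hypothetical dense 4x4 input; the value encodes the position as 10*row + col
A = {(r, c): 10 * r + c for r in range(4) for c in range(4)}

result = grid_select(A, row_picks=[1, 2], col_picks=[0, 2])
```

Here result contains exactly the four cells at (1, 0), (1, 2), (2, 0), and (2, 2); every other position is empty in the output.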
Cell-wise Selection
When a pick array has fields matching more than one input array dimension, each cell of the pick array is interpreted as a fixed tuple of picks for those dimensions. These pick tuples don't define a grid; instead, they enumerate particular full or partial coordinates of the desired input cells.
Consider the two pick arrays R and C from the previous example. Used independently in subarray(A, R, C) (or subarray(A, C, R)), they invoke grid selection using the Cartesian product of the picked indices:
However, if the picks are combined in the same array, they denote a set of input cell coordinates rather than a Cartesian product. Here the single array RC is not equivalent to the two arrays R and C above:
Since the pick fields row and col of RC cover all of the dimensions of the input array, RC serves as an explicit list of input cells to be selected.
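The contrast between a combined pick array and separate ones can be sketched in Python. The function cell_select, the contents of A, and the two (row, col) tuples are hypothetical values chosen for illustration; they are not the contents of the RC array from this page.

```python
from itertools import product


def cell_select(cells, pick_tuples):
    """Keep only the input cells whose full coordinates appear in pick_tuples."""
    wanted = set(pick_tuples)
    return {pos: value for pos, value in cells.items() if pos in wanted}


# Hypothetical dense 4x4 input; the value encodes the position as 10*row + col
A = {(r, c): 10 * r + c for r in range(4) for c in range(4)}

# Hypothetical combined picks: each tuple is one (row, col) pair
rc_tuples = [(1, 0), (2, 2)]
by_cell = cell_select(A, rc_tuples)

# The same row and column values used as independent picks form a grid instead
rows = sorted({r for r, _ in rc_tuples})
cols = sorted({c for _, c in rc_tuples})
by_grid = cell_select(A, product(rows, cols))
```

In this model by_cell has two cells while by_grid has four, which is why a single two-field pick array is not interchangeable with two one-field pick arrays.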
Hybrid Grid/Cell Selection
Pick fields within the same pick array may constitute a set of full coordinates, as in the previous section. They may also constitute a set of partial coordinates.
For example, given an input array VARIANT with schema <ref:string, alt:string>[varid; chrom; pos], a pick array CHROM_POS with schema <chrom:int64, pos:int64>[i] can be used to select particular (chrom, pos) pairs, either across all variant ids (no other pick arrays specified, so varid is a wildcard dimension) or across a selection of varid values provided in another pick array.
This hybrid approach selects input array cells from the Cartesian product of the pick fields as grouped by pick array. In this example the product is {varid} × {(chrom, pos)} and not {varid} × {chrom} × {pos}.
If no pick array with a varid field is provided, these products are the same: the {varid} set is just larger, because the dimension is wildcarded.
Given a CHROM_POS array and subarray invocation like this
the resulting selection would be wildcarded on the varid dimension, and look something like
CHROM_POS selects particular points in the chrom-pos plane, and the varid dimension is a wildcarded term in the Cartesian product, so all variant ids in the input array are selected.
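Hybrid selection can be modeled as a filter that treats (chrom, pos) as a single unit and varid as an independent term. The Python below is a sketch under assumptions: hybrid_select is a hypothetical helper, and the VARIANT cells and pick values are invented for illustration.

```python
def hybrid_select(cells, chrom_pos_picks, varid_picks=None):
    """Select cells keyed by (varid, chrom, pos).

    chrom_pos_picks: (chrom, pos) tuples, matched as whole pairs.
    varid_picks: varid values, or None to treat varid as a wildcard dimension.
    """
    pairs = set(chrom_pos_picks)
    varids = None if varid_picks is None else set(varid_picks)
    return {
        (varid, chrom, pos): value
        for (varid, chrom, pos), value in cells.items()
        if (chrom, pos) in pairs and (varids is None or varid in varids)
    }


# Hypothetical VARIANT cells: (varid, chrom, pos) -> (ref, alt)
VARIANT = {
    (1, 1, 100): ("A", "T"),
    (2, 1, 200): ("C", "G"),
    (3, 2, 100): ("G", "A"),
    (4, 2, 200): ("T", "C"),
    (5, 1, 100): ("A", "G"),
}
CHROM_POS = [(1, 100), (2, 200)]

wildcarded = hybrid_select(VARIANT, CHROM_POS)
restricted = hybrid_select(VARIANT, CHROM_POS, varid_picks=[1, 2, 3])
```

With varid wildcarded, the cells for variant ids 1, 4, and 5 are selected. The cells at (chrom 1, pos 200) and (chrom 2, pos 100) are not, even though each of those chrom and pos values appears somewhere in CHROM_POS, because the pairs are not expanded into a chrom-by-pos grid.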
Duplicated Or Out-of-order Picks
During subarray execution, pick arrays are scanned to obtain the pick values prior to any scanning of the input array. The index values (or, for hybrid selection, value tuples) picked for each input array dimension are sorted and duplicate values are removed. Unlike array indexing in R or NumPy, subarray cannot be used to introduce duplicate cells, or to rearrange the rows or columns of the input array.
When there are duplicate picks and the join:true option is present, the choice of pick array cell to use to obtain the joined attributes is non-deterministic. For example, suppose a pick dataframe <x:int64, who:string> contains two cells (17, 'Alice') and (17, 'Bob') that both pick the x dimension value 17. There is no way to know which who value, 'Alice' or 'Bob', will be used to generate the result array. If the strict:true option is present, this situation is considered an error and the query is aborted.
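The sorting, de-duplication, and join ambiguity described in this section can be illustrated with a short Python model. The helper names dedup_picks and join_candidates are hypothetical, and the sketch assumes that a pick counts as ambiguous when the duplicate pick cells carry different attribute values; how the operator treats duplicates with identical values is not stated here.

```python
from collections import defaultdict


def dedup_picks(values):
    """Picks are sorted and duplicates dropped before the input is scanned."""
    return sorted(set(values))


def join_candidates(pick_cells, strict=False):
    """Map each picked x to the set of 'who' values that could be joined.

    With strict=True, more than one candidate for the same pick is an error
    (assumption: only differing values count as ambiguous in this sketch).
    """
    candidates = defaultdict(set)
    for x, who in pick_cells:
        candidates[x].add(who)
    if strict:
        ambiguous = sorted(x for x, whos in candidates.items() if len(whos) > 1)
        if ambiguous:
            raise ValueError(f"ambiguous joined attribute for picks {ambiguous}")
    return dict(candidates)


pick_cells = [(42, "Carol"), (17, "Alice"), (17, "Bob")]

ordered = dedup_picks(x for x, _ in pick_cells)
print(ordered)  # [17, 42]
candidates = join_candidates(pick_cells)
```

For pick 17 the candidate set contains both 'Alice' and 'Bob'; without strict:true either one may end up in the result, and with strict=True the model raises an error instead.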
Dataframes As Pick Arrays
SciDB dataframes can be used as pick arrays, but not as input arrays. Like ordinary pick arrays, pick indices and tuples taken from dataframe attributes are sorted and de-duplicated prior to scanning the input array. (Since dataframe cells are unordered, this behavior is especially necessary to avoid non-deterministic results.)
Dataframe dimensions are a hidden internal detail and cannot be used for pick indices.
Input Array Overlaps Are Ignored
If the input array contains overlap areas, they are ignored. No overlap areas appear in subarray output, and the overlap parameter for all subarray result schema dimensions is always zero. For example, the three-cell overlap areas from an input array with dimensions [x=0:*:3:1024; y=0:*:3:128] are dropped, and the result schema dimensions will be [x=0:*:0:1024; y=0:*:0:128].
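The schema change can be expressed as a small string transformation. The Python below is a sketch, not part of SciDB: drop_overlaps is a hypothetical helper, and it assumes each dimension is written as name=low:high:overlap:chunk, which is the field order implied by the example above.

```python
def drop_overlaps(dim_spec):
    """Return the dimension specification with every overlap field set to 0.

    Assumes the form "[name=low:high:overlap:chunk; ...]"; dimensions written
    without an overlap field are passed through unchanged.
    """
    inner = dim_spec.strip()[1:-1]  # strip the surrounding brackets
    rewritten = []
    for dim in inner.split(";"):
        name, bounds = dim.strip().split("=", 1)
        fields = bounds.split(":")
        if len(fields) >= 3:
            fields[2] = "0"  # third field is the overlap
        rewritten.append(f"{name}={':'.join(fields)}")
    return "[" + "; ".join(rewritten) + "]"


print(drop_overlaps("[x=0:*:3:1024; y=0:*:3:128]"))
# [x=0:*:0:1024; y=0:*:0:128]
```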
Algorithm Configuration
Subarray chooses its algorithm based on an estimate of whether or not the pick array data will fit in memory. By default, subarray computes the total estimated in-memory size of all pick array data and compares it to the configured value of subarray-arena-limit or, if that value isn’t configured, to merge-sort-buffer. If the total is over the allowed limit, then the largest pick array will be loaded into a “MemArray” data structure that can spill to disk, and a new total is computed for the remaining pick arrays. When all remaining pick arrays are within the limit, they will be loaded into faster in-memory hash tables. This hybrid approach is suitable for most queries.
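The default selection procedure described above amounts to a greedy loop, sketched here in Python. The functions effective_limit and plan_loading, the dictionary used for configuration, and the size figures are assumptions for illustration; the units of the real configuration values and the way sizes are estimated are not modeled.

```python
def effective_limit(config):
    """Use subarray-arena-limit when configured, else fall back to merge-sort-buffer."""
    limit = config.get("subarray-arena-limit")
    return limit if limit is not None else config["merge-sort-buffer"]


def plan_loading(estimated_sizes, limit):
    """Decide, per pick array, between a spill-to-disk MemArray and a hash table.

    estimated_sizes: {pick array name: estimated in-memory size}, in the same
    (unspecified) units as limit.
    """
    remaining = dict(estimated_sizes)
    plan = {}
    # While the total is over the limit, move the largest pick array to a MemArray
    while remaining and sum(remaining.values()) > limit:
        largest = max(remaining, key=remaining.get)
        plan[largest] = "memarray"
        del remaining[largest]
    # Everything that now fits within the limit goes into in-memory hash tables
    for name in remaining:
        plan[name] = "hash"
    return plan


config = {"merge-sort-buffer": 64}  # hypothetical value; no arena limit configured
sizes = {"R": 10, "C": 500, "RC": 40}  # hypothetical size estimates

plan = plan_loading(sizes, effective_limit(config))
```

In this example the total of 550 exceeds the limit of 64, so C is assigned to a MemArray; the remaining total of 50 fits, so R and RC are assigned to hash tables.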
You can force all pick arrays to load in a particular way by using the algorithm: option. If specified, the value of this option must be one of the following strings:
'hash' – All pick arrays will be loaded into in-memory hash tables regardless of size. If they do not actually fit in memory, an error will occur and the query will abort. You may need to adjust the subarray-arena-limit configuration parameter to use this algorithm.
'memarray' – All pick arrays will be loaded into spill-to-disk MemArray data structures. Used primarily during testing for ensuring correctness of the MemArray representation.
'shapedpickarray' – A legacy algorithm used during early development. Not recommended.