subarray

The subarray operator selects cells of an input array according to coordinates specified by one or more secondary arrays, called pick arrays.

Synopsis

subarray( INPUT, PICK_0 [ , PICK_n ]* [ , join:BOOLEAN ] [ , strict:BOOLEAN ] [ , inverse:BOOLEAN ] [ , algorithm:STRING ] )

Summary

subarray produces a sparse result array with the same dimension specification as its input array (but without overlaps). Only input cells selected by the pick array(s) appear in the output.

  • If the join:true option is present, the output cells will have additional attributes from the pick array cell(s) that selected them.

  • If the strict:true option is present:

    • Out-of-bounds or null pick values will cause an error. By default, null or out-of-bounds picks are ignored.

    • If join:true is also set, an ambiguous joined attribute will cause an error. See Duplicated Or Out-of-order Picks below.

  • If the inverse:true option is present, the result contains those input cells not selected by the pick array(s). When inverse:true is set, join:true is not allowed.

  • The algorithm: option forces use of a particular algorithm regardless of system configuration. Ordinarily you should not need to specify this option. See Algorithm Configuration below.
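The effect of the strict: and inverse: options can be modelled in a few lines of Python. This is only a sketch of the documented semantics on a one-dimensional array, not SciDB's implementation; the helper names resolve_picks and select, and the representation of an array as a dict from coordinate to value, are assumptions made for illustration.

```python
def resolve_picks(picks, low, high, strict=False):
    """Return the usable picks within [low, high], sorted and de-duplicated.

    Hypothetical helper: by default null (None) or out-of-bounds picks are
    ignored; with strict=True they raise an error instead.
    """
    valid = set()
    for p in picks:
        if p is None or not (low <= p <= high):
            if strict:
                raise ValueError(f"invalid pick: {p!r}")
            continue  # default behaviour: ignore the bad pick
        valid.add(p)
    return sorted(valid)


def select(cells, picks, inverse=False):
    """Keep the picked cells, or the unpicked ones when inverse=True."""
    picked = set(picks)
    return {c: v for c, v in cells.items() if (c in picked) != inverse}


# A hypothetical sparse 1-D input array with coordinates 0..3.
cells = {0: "a", 1: "b", 2: "c", 3: "d"}
picks = resolve_picks([2, None, 99, 0], low=0, high=3)

print(picks)                               # [0, 2]
print(select(cells, picks))                # {0: 'a', 2: 'c'}
print(select(cells, picks, inverse=True))  # {1: 'b', 3: 'd'}
```

Calling resolve_picks with strict=True on the same pick list raises an error at the first null or out-of-bounds value, mirroring the strict:true behavior described above.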

Pick Arrays

A pick array specifies input cell coordinates by:

  • Having an int64 attribute with the same name as a dimension of the input array. A list of coordinates is built from the (non-null) values of the attribute.

  • Having a dimension with the same name as a dimension of the input array. A list of coordinates is built from the locations of non-empty cells along that pick array dimension.

A particular input array dimension name may be matched by zero or one pick arrays. If the name appears in more than one pick array schema, an error occurs.

For example, given an input array with schema A<v:int64>[row=0:*; col=0:*], the pick array P<row:int64>[i=0:*]

AFL% scan(P);
{i} row
{0} 5
{1} 17

selects rows five and seventeen by matching an attribute name with one of the dimensions of the input array. Likewise, the sparse pick array Q<s:string>[row=0:*]

AFL% scan(Q);
{row} s
{5} 'Hello'
{17} 'world'

selects the same rows by having non-empty cells along a matching dimension name.

Each pick array is processed in turn to build up internal lists of coordinates, called picks, for each input array dimension. If no picks are given for a particular input array dimension, then all coordinates along that input dimension are treated as picked. An input array dimension with no picks is called a wildcard dimension.
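The matching rules above can be sketched in Python. This is a simplified model under stated assumptions, not SciDB code: each pick array is represented as a dict with "attrs" (attribute name to list of values) and "dims" (dimension name to list of occupied coordinates), and collect_picks is a hypothetical helper. A value of None in the result stands for a wildcard dimension.

```python
def collect_picks(input_dims, pick_arrays):
    """Build per-dimension pick lists from a sequence of pick arrays."""
    picks = {}
    for pa in pick_arrays:
        # Both attributes and dimensions of a pick array are candidate fields.
        fields = {**pa["dims"], **pa["attrs"]}
        for name, values in fields.items():
            if name not in input_dims:
                continue  # field does not match any input dimension
            if name in picks:
                raise ValueError(f"dimension {name!r} matched by more than one pick array")
            # Null attribute values contribute no picks.
            picks[name] = sorted({v for v in values if v is not None})
    # Dimensions with no picks are wildcards (None means "all coordinates").
    return {d: picks.get(d) for d in input_dims}


# P<row:int64>[i=0:*] picks by attribute value.
P = {"attrs": {"row": [5, 17]}, "dims": {"i": [0, 1]}}
# Q<s:string>[row=0:*] picks by non-empty cell position.
Q = {"attrs": {"s": ["Hello", "world"]}, "dims": {"row": [5, 17]}}

print(collect_picks(["row", "col"], [P]))  # {'row': [5, 17], 'col': None}
print(collect_picks(["row", "col"], [Q]))  # {'row': [5, 17], 'col': None}
```

Passing both P and Q together raises an error in this model, because the row dimension would then be matched by two pick arrays.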

Interpreting Picks: By Grid vs. By Cell

The fields (that is, the attributes and dimensions) of a pick array can match one or more dimensions of the input array. When each of the provided pick arrays matches only a single input array dimension, subarray selects input cells by grid. If any of the pick arrays matches more than one input dimension, then subarray selects input cells by cell position.

Grid Selection

Selection by grid means that the positions of the selected input cells are elements of the grid formed by the Cartesian product of the per-dimension picks.

For example, consider the input array A<v:int64>[row=0:*; col=0:*]

Given a pick array R for the rows

AFL% show(R);
{i} schema,distribution,etcomp
{0} 'R<row:int64> [i=0:*:0:1000000]','hashed','none'
AFL% scan(R);
{i} row
{0} 1
{1} 2

and C for the columns

then subarray(A, R, C) uses "grid selection" to choose the input cells:

R picks rows 1 and 2 by row attribute value. C picks columns 0 and 2 because cells are present at those coordinates along the col dimension. So subarray(A, R, C) yields


where ε denotes an empty cell position.
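Grid selection can be modelled as a filter against a Cartesian product. The sketch below uses the row picks 1 and 2 and column picks 0 and 2 from the text; the contents of the input array A are hypothetical (a dense 3x3 array invented for illustration), and the dict-based representation is an assumption, not how SciDB stores arrays.

```python
from itertools import product

# Hypothetical dense 3x3 input array: (row, col) -> v
A = {(r, c): 10 * r + c for r in range(3) for c in range(3)}

row_picks = [1, 2]  # picked by R's row attribute values
col_picks = [0, 2]  # picked by C's non-empty cells along col

# The grid is the Cartesian product of the per-dimension picks.
grid = set(product(row_picks, col_picks))

# Only input cells whose position lies on the grid survive; every other
# position is empty in the result.
result = {pos: v for pos, v in A.items() if pos in grid}

print(sorted(result.items()))
# [((1, 0), 10), ((1, 2), 12), ((2, 0), 20), ((2, 2), 22)]
```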

Cell-wise Selection

When a pick array has fields matching more than one input array dimension, each cell of the pick array is interpreted as a fixed tuple of picks for those dimensions. These pick tuples don't define a grid. Instead, they enumerate particular full or partial coordinates of the desired input cells.

Consider the two pick arrays R and C from the previous example. Used independently in subarray(A, R, C) (or subarray(A, C, R)), they invoke grid selection using the Cartesian product of the picked indices:

However, if the picks are combined in the same array, they denote a set of input cell coordinates rather than a Cartesian product. Here the single array RC is not equivalent to the two arrays R and C above:

Since the pick fields row and col of RC cover all of the dimensions of the input array, RC serves as an explicit list of input cells to be selected.
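The difference between the two interpretations can be sketched in Python. The tuples in rc_cells and the contents of A below are hypothetical values chosen for illustration, and the dict-based array model is an assumption rather than SciDB's representation.

```python
from itertools import product

# Hypothetical dense 3x3 input array: (row, col) -> v
A = {(r, c): 10 * r + c for r in range(3) for c in range(3)}

# Hypothetical contents of a combined pick array with row and col fields:
# each pick array cell is one (row, col) tuple.
rc_cells = [(1, 0), (2, 2)]

# Cell-wise selection: the tuples are an explicit list of positions.
cellwise = {pos: A[pos] for pos in rc_cells if pos in A}

# Grid selection over the same index values would instead take the product.
rows = sorted({r for r, _ in rc_cells})
cols = sorted({c for _, c in rc_cells})
grid = {pos: A[pos] for pos in product(rows, cols) if pos in A}

print(sorted(cellwise))  # [(1, 0), (2, 2)]
print(sorted(grid))      # [(1, 0), (1, 2), (2, 0), (2, 2)]
```

The same index values yield two cells when read as tuples and four when read as a grid, which is why a combined pick array is not interchangeable with separate per-dimension pick arrays.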

Hybrid Grid/Cell Selection

Pick fields within the same pick array may constitute a set of full coordinates, as in the previous section. They may also constitute a set of partial coordinates.

For example, given an input array VARIANT with schema <ref:string, alt:string>[varid; chrom; pos], a pick array CHROM_POS with schema <chrom:int64, pos:int64>[i] can be used to select particular (chrom, pos) pairs, either across all variant ids (no other pick arrays specified, so varid is a wildcard dimension) or across a selection of varid values provided in another pick array.

This hybrid approach selects input array cells from the Cartesian product of the pick fields as grouped by pick array. In this example the product is

{varid} × {(chrom, pos)}

and not

{varid} × {chrom} × {pos}

If no pick array with a varid field is provided, these products are the same: the {varid} set is just larger, because the dimension is wildcarded.

Given a CHROM_POS array and subarray invocation like this

the resulting selection would be wildcarded on the varid dimension, and look something like

CHROM_POS selects particular points in the chrom-pos plane, and the varid dimension is a wildcarded term in the Cartesian product, so all variant ids in the input array are selected.
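Hybrid selection can be modelled as a per-group membership test: the (chrom, pos) pair must appear as a tuple in CHROM_POS, and the varid must appear among the varid picks unless that dimension is wildcarded. The cell values, the pick tuples, and the helper hybrid_select below are all hypothetical, invented for this sketch.

```python
def hybrid_select(cells, chrom_pos, varids=None):
    """Select cells keyed by (varid, chrom, pos).

    chrom_pos: iterable of (chrom, pos) tuples taken from one pick array.
    varids:    iterable of varid picks, or None for a wildcard dimension.
    """
    pairs = set(chrom_pos)
    wanted = None if varids is None else set(varids)
    return {
        key: val
        for key, val in cells.items()
        if (wanted is None or key[0] in wanted) and (key[1], key[2]) in pairs
    }


# Hypothetical VARIANT cells: (varid, chrom, pos) -> (ref, alt)
cells = {
    (0, 1, 100): ("A", "T"),
    (0, 1, 200): ("C", "G"),
    (1, 2, 100): ("T", "C"),
    (1, 2, 200): ("G", "A"),
    (2, 1, 100): ("A", "G"),
}
chrom_pos = [(1, 100), (2, 200)]  # hypothetical CHROM_POS contents

print(sorted(hybrid_select(cells, chrom_pos)))
# [(0, 1, 100), (1, 2, 200), (2, 1, 100)]
print(sorted(hybrid_select(cells, chrom_pos, varids=[0])))
# [(0, 1, 100)]
```

Note that the cell at (0, 1, 200) is not selected even though chrom 1 and pos 200 each appear in some pick tuple: the pair (1, 200) itself was never picked.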

Duplicated Or Out-of-order Picks

During subarray execution, pick arrays are scanned to obtain the pick values before the input array is scanned. The index values (or, for hybrid selection, value tuples) picked for each input array dimension are sorted, and duplicate values are removed. Unlike array indexing in R or NumPy, subarray cannot be used to introduce duplicate cells, or to rearrange the rows or columns of the input array.

When there are duplicate picks and the join:true option is present, the choice of pick array cell to use to obtain the joined attributes is non-deterministic. For example, suppose a pick dataframe <x:int64, who:string> contains two cells (17, 'Alice') and (17, 'Bob') that both pick the x dimension value 17. There is no way to know which who value, 'Alice' or 'Bob', will be used to generate the result array. If the strict:true option is present, this situation is considered an error and the query is aborted.
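The duplicate-pick situation can be illustrated with a small Python model. The helper build_join_map is hypothetical, and two of its choices are assumptions of this sketch rather than documented behavior: it keeps the first pick cell it sees when strict is off (SciDB makes no such guarantee), and it treats any repeated pick as ambiguous when strict is on.

```python
def build_join_map(pick_cells, strict=False):
    """Map each picked x value to the 'who' attribute joined to it.

    pick_cells: iterable of (x, who) pairs from a pick array or dataframe.
    """
    joined = {}
    for x, who in pick_cells:
        if x in joined:
            if strict:
                raise ValueError(f"ambiguous joined attribute for x={x}")
            continue  # sketch keeps the first cell seen; SciDB may keep either
        joined[x] = who
    # Picks are sorted and de-duplicated before the input array is scanned.
    return dict(sorted(joined.items()))


pick_cells = [(17, "Alice"), (17, "Bob"), (3, "Carol")]

print(build_join_map(pick_cells))  # {3: 'Carol', 17: 'Alice'}
```

With strict=True the same pick cells raise an error, corresponding to the aborted query described above.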

Dataframes As Pick Arrays

SciDB dataframes can be used as pick arrays, but not as input arrays. Like ordinary pick arrays, pick indices and tuples taken from dataframe attributes are sorted and de-duplicated prior to scanning the input array. (Since dataframe cells are unordered, this behavior is especially necessary to avoid non-deterministic results.)

Dataframe dimensions are a hidden internal detail and cannot be used for pick indices.

Input Array Overlaps Are Ignored

If the input array contains overlap areas, they are ignored. No overlap areas appear in subarray output, and the overlap parameter for all subarray result schema dimensions is always zero. For example, the three-cell overlap areas from an input array with dimensions [x=0:*:3:1024; y=0:*:3:128] are dropped, and the result schema dimensions will be [x=0:*:0:1024; y=0:*:0:128].
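The schema rewrite in that example can be reproduced with a short Python sketch. It assumes each dimension is written in the name=low:high:overlap:chunk form used in the example; the function drop_overlaps is hypothetical and handles only that textual form.

```python
import re


def drop_overlaps(dims):
    """Set the overlap field of every dimension spec in the string to zero.

    Assumes each dimension is written as name=low:high:overlap:chunk.
    """
    return re.sub(r"(\w+=[^:;\]]+:[^:;\]]+):\d+:", r"\1:0:", dims)


print(drop_overlaps("[x=0:*:3:1024; y=0:*:3:128]"))
# [x=0:*:0:1024; y=0:*:0:128]
```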

Algorithm Configuration

Subarray chooses its algorithm based on an estimate of whether or not the pick array data will fit in memory. By default, subarray computes the total estimated in-memory size of all pick array data and compares it to the configured value of subarray-arena-limit or, if that value isn’t configured, to merge-sort-buffer. If the total is over the allowed limit, then the largest pick array will be loaded into a “MemArray” data structure that can spill to disk, and a new total is computed for the remaining pick arrays. When all remaining pick arrays are within the limit, they will be loaded into faster in-memory hash tables. This hybrid approach is suitable for most queries.
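The default decision procedure described above can be sketched as a simple loop. This is a model of the prose only, not SciDB source code: plan_pick_loading is a hypothetical helper, the sizes are invented estimates in arbitrary units, and limit stands for the configured subarray-arena-limit (or merge-sort-buffer) value.

```python
def plan_pick_loading(sizes, limit):
    """Decide how each pick array is loaded.

    sizes: dict of pick array name -> estimated in-memory size.
    limit: the allowed total for in-memory hash tables.
    Returns a dict of pick array name -> 'hash' or 'memarray'.
    """
    plan = {}
    remaining = dict(sizes)
    # While the total is over the limit, spill the largest pick array.
    while remaining and sum(remaining.values()) > limit:
        largest = max(remaining, key=remaining.get)
        plan[largest] = "memarray"
        del remaining[largest]
    # Whatever still fits is loaded into in-memory hash tables.
    for name in remaining:
        plan[name] = "hash"
    return plan


print(plan_pick_loading({"R": 10, "C": 500, "Q": 40}, limit=100))
# {'C': 'memarray', 'R': 'hash', 'Q': 'hash'}
```

In this example only C is spilled: once it is removed, the remaining total of 50 is within the limit of 100, so R and Q use hash tables.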

You can force all pick arrays to load in a particular way by using the algorithm: option. If specified, the value of this option must be one of the following strings:

  • 'hash' – All pick arrays will be loaded into in-memory hash tables regardless of size. If they do not actually fit in memory, an error will occur and the query will abort. You may need to adjust the subarray-arena-limit configuration parameter to use this algorithm.

  • 'memarray' – All pick arrays will be loaded into spill-to-disk MemArray data structures. Used primarily during testing for ensuring correctness of the MemArray representation.

  • 'shapedpickarray' – A legacy algorithm used during early development. Not recommended.