Array Dimensions
SciDB arrays require at least one dimension. Dimensions form the coordinate system for a SciDB array.
Data frames are SciDB arrays whose dimensions and distribution are managed implicitly by the database. You do not need to specify any dimensions when working with data frames.
Each dimension consists of:
- A name.
- If you specify only the name, SciDB leaves the chunk length unspecified, uses the default overlap of zero, and makes the dimension unbounded. Otherwise you must specify at least the low value and high value.
- Just like attributes, you must name each dimension without repeating dimension names in the same array. The maximum length of a dimension name is 1024 bytes. A dimension name may include alphanumeric characters and the underscore (_) character, but must begin with a non-numeric character.
- A low value. An expression for the dimension start value.
- A high value. Supply either an expression or an asterisk (*) as the dimension end value. An asterisk indicates that the dimension has no limit (referred to as an unbounded dimension). Together, the starting and ending values define the range of possible values that the dimension coordinate can take. This range includes both the starting and ending values themselves. For example,
[i=1:1000]
defines a dimension size of 1000.
- A chunk overlap. The optional number of overlapping dimension values for adjacent chunks.
- A chunk length. The optional number of dimension values between consecutive chunk boundaries. For a one-dimensional array, the chunk length determines the maximum number of cells in each chunk. For an n-dimensional array, the maximum number of cells in each chunk is the product of the chunk length parameters of each dimension. If you do not specify a chunk length, SciDB chooses a chunk length based on the schema's context.
Chunk size is the product of the chunk lengths of all the dimensions. Chunk size choice has a very significant effect on SciDB performance. In most cases, any given chunk of an array should contain between 100,000 and 10,000,000 non-empty cells.
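As a rough illustration of the arithmetic above, the following Python sketch multiplies the chunk lengths of a schema and compares the result with the recommended range. The function names are hypothetical helpers, not part of SciDB. Note that the product is the maximum number of cells per chunk; for sparse arrays the number of non-empty cells can be much lower.

```python
from math import prod

# Guideline range quoted in the text (non-empty cells per chunk).
MIN_CELLS = 100_000
MAX_CELLS = 10_000_000

def chunk_size(chunk_lengths):
    """Maximum cells per chunk: the product of the chunk lengths of all dimensions."""
    return prod(chunk_lengths)

def within_guideline(chunk_lengths):
    """True if the maximum chunk size falls inside the recommended range."""
    return MIN_CELLS <= chunk_size(chunk_lengths) <= MAX_CELLS

print(chunk_size([1000, 1000]))        # 1000000
print(within_guideline([1000, 1000]))  # True
print(within_guideline([10, 10]))      # False (only 100 cells per chunk)
```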
Additionally:
- Dimension size is determined by the range from the dimension start to end, so, for example, ranges of 0:99 and 1:100 both create a dimension of size 100.
- You can use expressions to define dimension size, or let SciDB assign default values. Expressions must evaluate to a scalar value. For example, an expression might be 100, or 500 * 4, and so on.
- You can supply a ? for any dimension parameter, and SciDB will assign a default value.
For example, this statement:
AFL% CREATE ARRAY A <x: double, err: double> [i=0:99:0:10; j=-9:10];
creates an array named A with two attributes x and err, both of type double, and two dimensions, i and j.
- For dimension i the starting and ending coordinates are 0 and 99, the chunk overlap is 0, and the chunk length is 10.
- For dimension j the starting and ending coordinates are -9 and 10, the chunk overlap is unspecified but defaults to 0, and the chunk length is also unspecified. The chunk length of a newly created array is set when data is first stored into the array.
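Because the range includes both endpoints, the size of a bounded dimension is high - low + 1. The short Python sketch below applies this to the ranges mentioned above; dimension_size is a hypothetical helper used only for illustration.

```python
def dimension_size(low, high):
    """Number of coordinates in the inclusive range low..high."""
    if high < low:
        raise ValueError("high value must not be less than low value")
    return high - low + 1

print(dimension_size(0, 99))    # 100 (dimension i of array A)
print(dimension_size(1, 100))   # 100 (same size, different start)
print(dimension_size(-9, 10))   # 20  (dimension j of array A)

# Number of logical cells in array A: 100 * 20
print(dimension_size(0, 99) * dimension_size(-9, 10))  # 2000
```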
Specifying Dimensions
The CREATE ARRAY statement includes a list of dimensions. The syntax is as follows:
dim_separator ::= , | ;

dimensions ::= dimension [ dim_separator dimensions ]

dimension ::= dim_name |                                                    # Lone identifier, same as dim_name=0:*:0:*
              dim_name = dim_lo : dim_hi [ : overlap [ : chunk_length ] ] | # Preferred syntax
              dim_name = dim_lo : dim_hi , chunk_length , overlap           # Pre-16.9 compatibility syntax

dim_lo ::= expression | ?

dim_hi ::= expression | * | ?

chunk_length ::= expression | ?

overlap ::= expression | ?
In SciDB 16.9 and later releases, providing overlap and chunk_length values in the dimension syntax is optional. The comma-separated dimension syntax of previous releases remains supported for backward compatibility.
In the new dimension syntax:
- the order of overlap and chunk_length is reversed
- use a semicolon (;) to separate individual dimension descriptions
- use a colon (:) to separate parameters within a dimension description
- overlap is optional unless a chunk_length is given
- chunk_length is optional, and if not specified, SciDB chooses one based on the context in which the schema is used
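To make the defaulting rules of the preferred syntax concrete, here is a minimal Python sketch that splits a dimension list on semicolons and colons and fills in omitted parameters. It is an illustration only, not SciDB's parser: the Dimension class and the function names are hypothetical, the pre-16.9 comma syntax is not handled, and parameter values are kept as unevaluated strings.

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    low: str = "0"           # default start when only the name is given
    high: str = "*"          # unbounded
    overlap: str = "0"       # default overlap
    chunk_length: str = "*"  # unspecified

def parse_dimension(spec):
    """Parse name[=lo:hi[:overlap[:chunk_length]]] (preferred syntax only)."""
    spec = spec.strip()
    if "=" not in spec:
        return Dimension(name=spec)  # lone identifier: all defaults
    name, params = spec.split("=", 1)
    parts = [p.strip() for p in params.split(":")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected lo:hi[:overlap[:chunk_length]] in {spec!r}")
    dim = Dimension(name=name.strip(), low=parts[0], high=parts[1])
    if len(parts) >= 3:
        dim.overlap = parts[2]
    if len(parts) == 4:
        dim.chunk_length = parts[3]
    return dim

def parse_dimensions(text):
    """Parse a bracketed, semicolon-separated dimension list."""
    return [parse_dimension(s) for s in text.strip().strip("[]").split(";")]

def format_dimension(d):
    """Render a dimension with all four parameters spelled out."""
    return f"{d.name}={d.low}:{d.high}:{d.overlap}:{d.chunk_length}"

for d in parse_dimensions("[i=0:99:0:10; j=-9:10; k]"):
    print(format_dimension(d))
# i=0:99:0:10
# j=-9:10:0:*
# k=0:*:0:*
```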
Dimension Defaults
If you specify only the dimension name, SciDB creates defaults for the starting index and chunk overlap. SciDB creates the dimension as an unbounded dimension (a dimension without a specified ending index). The chunk length is left unspecified. If the schema is used for a create array statement, the chunk length will become fixed when data is first stored into the array. If the schema is used in a reshaping operator like redimension, the chunk length will be calculated dynamically.
This example uses default values for the dimensions (default starting index, chunk length, and overlap):
$ iquery -a
AFL% create array default_1 <val:double>[i];
AFL% show(default_1);
{i} schema
{0} 'default_1<val:double> [i=0:*:0:*]'
AFL%
Array default_1 has one dimension named i, and was created with the defaults for all of the dimension values.
This example creates three dimensions, where two of them use the defaults:
AFL% create array default_2 <val:double>[i=0:999:0:200; j; k];
AFL% show(default_2);
{i} schema
{0} 'default_2<val:double> [i=0:999:0:200; j=0:*:0:*; k=0:*:0:*]'
Array default_2 dimensions j and k are left with unspecified chunk lengths, denoted by asterisks (*). When data is stored into the array, these chunk lengths will be set according to the chunk lengths of the incoming data.
Use a question mark (?) to have SciDB statically compute chunk lengths so that the overall chunk size is close to the target-cells-per-chunk configuration option (default: 1,000,000). In the next example, SciDB computes the chunk lengths so that their product is 980,000: 200 x 70 x 70.
AFL% create array default_3 <val:double>[i=0:999:0:200; j=0:*:0:?; k=0:*:0:?];
Query was executed successfully
AFL% show (default_3);
{i} schema
{0} 'default_3<val:double> [i=0:999:0:200; j=0:*:0:70; k=0:*:0:70]'
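The text does not describe how SciDB derives these values. The Python sketch below is one plausible reconstruction, offered as an assumption rather than SciDB's actual algorithm: divide the target cell count by the product of the fixed chunk lengths, then give each ? dimension the largest equal integer length whose product stays within that budget. It reproduces the 200 x 70 x 70 result above; estimate_chunk_length is a hypothetical name.

```python
from math import prod

def estimate_chunk_length(fixed_lengths, unknown_count, target_cells=1_000_000):
    """Largest equal length L such that prod(fixed) * L**unknown_count <= target_cells.

    Assumption: every '?' dimension receives the same length. This is a guess
    that matches the documented example, not a description of SciDB internals.
    """
    budget = target_cells // prod(fixed_lengths)
    length = round(budget ** (1.0 / unknown_count))
    # Correct any floating-point rounding in either direction.
    while length ** unknown_count > budget:
        length -= 1
    while (length + 1) ** unknown_count <= budget:
        length += 1
    return max(length, 1)

length = estimate_chunk_length([200], unknown_count=2)
print(length)                    # 70
print(200 * length * length)     # 980000
```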
Chunk Overlap
You can specify the chunk overlap for each dimension of an array. For example, consider an array A with the following schema:
A <a: int32>[i=1:10:1:5; j=1:30:5:10]
Array A has two dimensions, i and j. Dimension i has size 10, chunk overlap 1, and chunk length 5. Dimension j has size 30, chunk overlap 5, and chunk length 10. Specifying a non-zero overlap causes SciDB to store adjoining cells in each dimension from the overlap area in neighboring chunks. Storing some adjoining cells in this way can reduce communications overhead for some queries. In particular, overlap regions can:
- Speed up nearest-neighbor queries, where each chunk may need access to a few elements from its neighboring chunks.
- Detect data clusters or data features that straddle more than one chunk.
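The following Python sketch shows which coordinates each chunk would hold along one dimension of array A. It assumes that overlap extends every chunk by the given number of coordinates on both sides, clipped to the dimension bounds; that reading and the chunk_ranges helper are assumptions made for illustration, not a statement of SciDB's storage layout.

```python
def chunk_ranges(low, high, chunk_length, overlap):
    """Return a list of (core, stored) inclusive ranges for one bounded dimension.

    core   : coordinates that belong to the chunk
    stored : core plus the overlap region copied from neighboring chunks
    """
    result = []
    start = low
    while start <= high:
        end = min(start + chunk_length - 1, high)
        stored = (max(start - overlap, low), min(end + overlap, high))
        result.append(((start, end), stored))
        start += chunk_length
    return result

# Dimension i=1:10:1:5
print(chunk_ranges(1, 10, 5, 1))
# [((1, 5), (1, 6)), ((6, 10), (5, 10))]

# Dimension j=1:30:5:10
print(chunk_ranges(1, 30, 10, 5))
# [((1, 10), (1, 15)), ((11, 20), (6, 25)), ((21, 30), (16, 30))]
```

Under this assumption, a nearest-neighbor query on coordinate 5 of dimension i can read coordinate 6 from the same chunk instead of fetching the neighboring chunk.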
Unbounded Dimensions
Unbounded dimensions are dimensions without a specified ending index. You create unbounded dimensions by declaring the high boundary as '*'. When the high boundary is set as *, the array boundaries dynamically update as new data enters the array. This is useful when the dimension size is not known at CREATE ARRAY time. For example, this statement creates an array named open with two dimensions:
- Bounded dimension I of size 10, chunk overlap 0, and chunk length 10.
- Unbounded dimension J of size *, chunk overlap 0, and chunk length 10.
AFL% CREATE ARRAY open <val:double>[I=0:9:0:10; J=0:*:0:10];