SciDB Database Arrays

A SciDB database is organized into arrays. Each array has:

  • A name. Each array in a SciDB database has an identifier that distinguishes it from all other arrays in the same database.  Names can only include alphanumeric characters and the underscore (_), and they must begin with a non-numeric character.
  • A schema, which is the array structure. The schema contains array attributes and dimensions.
    • Each attribute contains data being stored in the cells of the array. A cell can contain multiple attributes.
    • Each dimension consists of a list of index values. At the most basic level, the index values of a dimension are 64-bit signed integers. The number of index values in a dimension is referred to as the size of the dimension.
  • A distribution. The array distribution determines where in the SciDB cluster the chunks of data are located.  For details see Array Distribution.
  • A compression algorithm for the empty tag attribute.  Valid values are 'zlib' and 'bzlib'.  The default is no compression, and most users should keep the default.
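
The naming rule and the definition of dimension size can be checked mechanically. The following Python sketch is only an illustration: the helper names is_valid_array_name and dimension_size are hypothetical and not part of SciDB, "alphanumeric" is assumed to mean ASCII letters and digits, and dimension bounds are assumed to be inclusive, as in a dimension written i=1:10.

```python
import re

# Hypothetical helpers for illustration; they are not part of SciDB.
# Assumption: "alphanumeric" means ASCII letters and digits.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_array_name(name: str) -> bool:
    """Return True if name contains only letters, digits, and underscores
    and does not begin with a digit."""
    return _NAME_RE.fullmatch(name) is not None


def dimension_size(low: int, high: int) -> int:
    """Number of index values in a dimension with inclusive bounds low:high."""
    if high < low:
        raise ValueError("high bound must not be below low bound")
    return high - low + 1


print(is_valid_array_name("tempArray"))  # True
print(is_valid_array_name("2fast"))      # False: begins with a digit
print(dimension_size(1, 10))             # 10
```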

Creating Arrays 

You create SciDB arrays and data frames with the CREATE ARRAY statement (in either AFL or AQL) or the create_array operator (AFL only). The CREATE ARRAY statement syntax is as follows:

create_array_statement ::= CREATE [ TEMP ] ARRAY
                                            new_array_name
                                            schema
                                            [ DISTRIBUTION distribution ]
 
create_array_statement ::= CREATE [ TEMP ] ARRAY
                                            new_array_name
                                            schema
                                            DISTRIBUTION distribution
                                            [ EMPTYTAG COMPRESSION 'zlib' ]
 
schema                 ::= < attributes >  [ \[ dimensions \] ]

The keywords CREATE, ARRAY, TEMP, DISTRIBUTION, and EMPTYTAG COMPRESSION are allowed in both AFL and AQL.  They need not be in all caps.

In the grammar above, square brackets [ ] surround optional elements.  Backslash-escaped square brackets \[ \] are literal square brackets.
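
To see how the grammar productions combine, the following Python sketch assembles a CREATE ARRAY statement from its parts. It is a minimal illustration, not an official client: the function name create_array_statement is hypothetical, and the distribution name hashed in the usage example is an assumption (see Array Distribution for the names your installation accepts).

```python
from typing import Optional


def create_array_statement(name: str,
                           schema: str,
                           temp: bool = False,
                           distribution: Optional[str] = None,
                           emptytag_compression: Optional[str] = None) -> str:
    """Assemble a CREATE ARRAY statement (hypothetical helper, not a SciDB API)."""
    if emptytag_compression is not None and distribution is None:
        # The second production lists DISTRIBUTION without brackets,
        # so it is required whenever EMPTYTAG COMPRESSION is present.
        raise ValueError("EMPTYTAG COMPRESSION requires a DISTRIBUTION clause")
    parts = ["CREATE"]
    if temp:
        parts.append("TEMP")
    parts += ["ARRAY", name, schema]
    if distribution is not None:
        parts += ["DISTRIBUTION", distribution]
    if emptytag_compression is not None:
        # The compression name is a quoted string in the grammar.
        parts += ["EMPTYTAG COMPRESSION", f"'{emptytag_compression}'"]
    return " ".join(parts)


print(create_array_statement("tempArray", "<a:int32>[i=1:10]", temp=True))
# CREATE TEMP ARRAY tempArray <a:int32>[i=1:10]

# "hashed" is an assumed distribution name used only for illustration.
print(create_array_statement("A", "<a:int32>[i=1:10]",
                             distribution="hashed",
                             emptytag_compression="zlib"))
# CREATE ARRAY A <a:int32>[i=1:10] DISTRIBUTION hashed EMPTYTAG COMPRESSION 'zlib'
```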

When creating arrays inside transactions, you must use the create_array AFL operator.  Its syntax is:

create_array ( new_array_name ,  schema , isTempArray
               [ , distribution [ , emptyTagCompression ] ] )

where new_array_name and schema are as before, isTempArray is either true or false, distribution is the unquoted distribution name, and emptyTagCompression is the quoted name of the compression method (for example, 'zlib').
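
The argument rules for the operator form can be sketched in Python as follows. The helper name create_array_call is hypothetical, and the distribution name hashed in the test of this sketch's behavior is an assumption rather than something this page specifies. The sketch only builds the AFL text; it does not contact SciDB.

```python
from typing import Optional


def create_array_call(name: str,
                      schema: str,
                      is_temp: bool,
                      distribution: Optional[str] = None,
                      emptytag_compression: Optional[str] = None) -> str:
    """Build the text of a create_array(...) AFL call (hypothetical helper)."""
    if emptytag_compression is not None and distribution is None:
        # emptyTagCompression is nested inside the optional distribution
        # group, so it cannot appear without a distribution.
        raise ValueError("emptyTagCompression requires a distribution")
    args = [name, schema, "true" if is_temp else "false"]
    if distribution is not None:
        args.append(distribution)                   # unquoted name
    if emptytag_compression is not None:
        args.append(f"'{emptytag_compression}'")    # quoted name
    return "create_array(" + ", ".join(args) + ")"


print(create_array_call("tempArray", "<a:int32>[i=1:10]", True))
# create_array(tempArray, <a:int32>[i=1:10], true)
```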

Temporary Arrays

Temporary arrays can improve performance, but they do not offer the transactional guarantees of persistent arrays. Like other arrays, temporary arrays are visible only to SciDB users who have permission to see them, and they remain available until they are deleted. Unlike other arrays, temporary arrays are not persistent: they are not saved to disk. As a result, temporary arrays become corrupted if a SciDB instance fails. When a SciDB cluster restarts, all temporary arrays are marked as unavailable but are not deleted; you must delete them explicitly. In addition, temporary arrays do not have versions: any update to a temporary array overwrites existing attribute values.

Use a temporary array when you are willing to sacrifice atomicity, consistency, isolation, and durability (ACID) guarantees for speed. For example, suppose you are using SciDB to multiply two matrices from within a Python program, and you are sending the operand matrices to SciDB through SciDB-Py. You want the resulting matrix product sent back to the Python program, but you don't need the matrix product persisted in the SciDB database. In this case, a temporary array is appropriate. Similarly, use temporary arrays when performing iterative algorithms whose intermediate results are arrays.

Creating Temporary Arrays 

To create a temporary array, use the optional TEMP keyword with the CREATE ARRAY statement syntax shown above, or set the isTempArray argument of the create_array operator to true.

Example:

 $ iquery -aq "create temp array tempArray <a:int32>[i=1:10]"
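
If you issue the same query from a script, one possible approach is to call iquery as a subprocess. The Python sketch below mirrors the command line above and assumes iquery is installed and on the PATH; the function names iquery_command and run_afl are hypothetical. Passing the query as a single list element avoids shell quoting problems with the < and > characters in the schema.

```python
import shutil
import subprocess
from typing import List


def iquery_command(afl_query: str) -> List[str]:
    """Build the argument list for `iquery -aq "<query>"` (hypothetical helper)."""
    return ["iquery", "-aq", afl_query]


def run_afl(afl_query: str) -> str:
    """Run one AFL query through iquery and return its standard output.

    Assumes iquery is installed and can reach a running SciDB instance.
    """
    if shutil.which("iquery") is None:
        raise RuntimeError("iquery was not found on PATH")
    result = subprocess.run(iquery_command(afl_query),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Building the command does not require a SciDB installation.
cmd = iquery_command("create temp array tempArray <a:int32>[i=1:10]")
print(cmd)
```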

Data Frames

Data frames are SciDB arrays whose dimensions do not have to be specified.  They are similar to relational tables.  SciDB manages the dimensions of a data frame implicitly and does not display the internal dimension coordinates.  Think of a data frame as an unordered collection of SciDB cells.

Since data frames are a kind of SciDB array, most remarks in this documentation that refer to arrays also apply to data frames.

SciDB data frames can be used nearly anywhere a SciDB array can be used.  They are primarily a notational convenience, simplifying import of linear data (such as from a CSV file) that will later be reshaped into an array.

Since data frame cells are unordered, operators that depend on cell position, such as between or slice, won't work with data frames.

Creating Data Frames

To create a data frame, use the same syntax as for creating an array or temporary array, but omit the dimension portion of the schema.

Example:

$ iquery -a
AFL% create array dataFrame <a:int32>;
Query was executed successfully
AFL% create temp array tempDataFrame <value:double>;
Query was executed successfully
AFL% 
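
The difference between an array schema and a data frame schema is only the presence of the dimension portion. The following Python sketch makes that explicit; the helper name build_schema is hypothetical, and the dimension specification is passed through as an already-written string such as i=1:10.

```python
from typing import List, Optional, Tuple


def build_schema(attributes: List[Tuple[str, str]],
                 dimensions: Optional[str] = None) -> str:
    """Build a schema string (hypothetical helper, not a SciDB API).

    attributes: (name, type) pairs, for example [("a", "int32")].
    dimensions: text that goes inside the square brackets, for example
                "i=1:10"; leave it as None to get a data frame schema.
    """
    attr_part = ", ".join(f"{name}:{type_}" for name, type_ in attributes)
    schema = f"<{attr_part}>"
    if dimensions is not None:
        schema += f"[{dimensions}]"
    return schema


array_schema = build_schema([("a", "int32")], "i=1:10")
frame_schema = build_schema([("a", "int32")])

print("create array someArray " + array_schema)
# create array someArray <a:int32>[i=1:10]
print("create array dataFrame " + frame_schema)
# create array dataFrame <a:int32>
```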

You can create a data frame from an array using the flatten operator, and add cells to it using the append operator.