redimension

The redimension operator produces a result array using some or all variables of a source array, potentially changing some or all of those variables from dimensions to attributes or vice versa, and optionally calculating aggregates to include in the result array.

Synopsis

redimension(source_array,
            template_array|schema_definition
            [, isStrict]
            [, aggregate_call_n (source_attribute) [as result_attribute]] ...
            [, cells_per_chunk: N] [, phys_chunk_size: N] )

Summary

The redimension operator reshapes its source array according to a supplied schema definition.  If a template array name is given rather than a schema definition, SciDB reshapes the source array according to the schema of the template array, without altering the template array itself.

SciDB maps the attributes and dimensions of the source array – its variables – to the like-named variables of the result array. 

SciDB may:

  • preserve a source array attribute as an attribute of the result array with the same data type
  • convert a source array attribute into a dimension of the result array, if its integer data type is convertible to int64
  • omit a source array attribute altogether

Similarly, SciDB may:

  • preserve a source array dimension as a dimension of the result array
  • convert a source array dimension into an int64 attribute of the result array
  • omit a source array dimension altogether.

SciDB converts source array variables to result array variables according to these rules, matching solely by variable name.  Source array variables can appear in the result schema in any order.  Even when SciDB omits a source array variable from the result array schema, that variable can still serve as input to an aggregate call.

A single use of the redimension operator can make all of the above modifications to the variables in the source array.
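As a rough illustration, the name-based matching can be simulated in a few lines of Python (a sketch of the rules above, not SciDB itself; the `map_variables` helper and its dictionary representation are our own):

```python
# Sketch of redimension's name-based variable matching (a simulation, not
# SciDB itself).

def map_variables(source_vars, result_var_names):
    """For each result variable, find the like-named source variable, if any."""
    mapping = {}
    for name in result_var_names:
        # Preserved/converted if the name exists in the source; otherwise the
        # result variable must be a synthetic dimension or an aggregate output.
        mapping[name] = source_vars.get(name)
    return mapping

# Source variables of a hypothetical raw_data array:
# attributes device, pos, val; dimension row.
source = {"device": "attribute", "pos": "attribute",
          "val": "attribute", "row": "dimension"}

# Result schema <val>[device; pos]: row is simply omitted.
print(map_variables(source, ["device", "pos", "val"]))
```

A result variable that maps to `None` here must be either the single allowed synthetic dimension or an attribute produced by an aggregate call.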

Cell Collisions

Depending on how the source array's variables appear in the result array, the redimension operator might encounter cell collisions. Cell collisions occur when the redimension operator generates multiple candidate cell values for a single cell location of the result array.  If the redimension operator produces a collision at a cell location, there are four possible cell collision actions:

  • Aggregate: The redimension operator performs an aggregation over all the candidate cells using the aggregate function you supply. That is, if you named a result array attribute as the value of an aggregate function, the attribute becomes the value of that function calculated over the set of candidate cells for that cell location. Each aggregate you calculate requires a nullable attribute in the result array's schema, with a data type appropriate to that aggregate function.
  • Expand along the synthetic dimension: If the result schema includes a synthetic dimension (a dimension whose name is not present in the source array, either as an attribute or a dimension), the redimension operator uses that dimension to accommodate all of the cells that collide on the other result dimension(s).
  • Fail: If the isStrict flag is true (the default), the redimension operation fails unless (a) the colliding cells are aggregated or (b) the result schema includes a synthetic dimension.
  • Arbitrarily choose: If isStrict is false, the result schema does not include a synthetic dimension, and you did not specify any aggregations, the result array cell is an arbitrarily-chosen candidate cell for that location.
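The four collision actions can be sketched in Python (a simulation of the semantics above, not SciDB itself; the `redim` helper and its data layout are our own):

```python
# Simulate the four collision actions: aggregate, synthetic dimension,
# strict failure, and arbitrary choice.
from collections import defaultdict

def redim(cells, dims, attr, aggregate=None, synthetic=False, is_strict=True):
    """Group source cells by result-dimension coordinates, then resolve
    each group according to the collision rules."""
    groups = defaultdict(list)
    for cell in cells:
        groups[tuple(cell[d] for d in dims)].append(cell[attr])
    result = {}
    for loc, vals in groups.items():
        if aggregate is not None:            # 1. Aggregate over candidates
            result[loc] = aggregate(vals)
        elif synthetic:                      # 2. Expand along synthetic dim
            for i, v in enumerate(vals):
                result[loc + (i,)] = v
        elif len(vals) > 1 and is_strict:    # 3. Fail on collision
            raise ValueError("Duplicate values at %s" % (loc,))
        else:                                # 4. Arbitrarily choose one
            result[loc] = vals[0]
    return result

# Two cells collide at x=1.
cells = [{"x": 1, "v": 10.0}, {"x": 1, "v": 20.0}, {"x": 2, "v": 30.0}]
print(redim(cells, ("x",), "v", synthetic=True))   # both survive along synth
print(redim(cells, ("x",), "v", aggregate=max))    # one aggregated value
```

With the default `is_strict=True` and no aggregate or synthetic dimension, the collision at x=1 raises an error, mirroring SciDB's behavior.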

The result array schema may not include any variables other than:

  • variables that appear in the source array schema
  • attributes named as the result of an aggregation
  • at most one synthetic dimension.

The dimensions of the result array may receive new chunk length and chunk overlap values (see Array Dimensions).  Thus redimension includes all of the functionality of the repart operator.

Inputs

The redimension operator takes the following arguments:

  • source_array - A source array with one or more attributes and one or more dimensions.
  • template_array | schema_definition - An array or schema from which the output attributes and dimensions are acquired. Every dimension in the target must exist among the source array's attributes or dimensions, with one exception: a single new, synthetic dimension is allowed. Every target attribute (other than those that are the results of aggregate calls) must exist among the source array's attributes or dimensions.
  • isStrict - Optional.  A Boolean, true by default. If false, the redimension operation allows collisions in the result array.  See Cell Collisions above.
  • aggregate_call_n - Optional.  One or more aggregate calls.
  • cells_per_chunk: N - Optional.  Use N as the goal when estimating chunk lengths by cell count.
  • phys_chunk_size: N - Optional.  Use N mebibytes as the goal when estimating chunk lengths by physical chunk size.

Limitations

  • You can only change integer-typed attributes to dimensions, and only if the attribute type is convertible to int64 without loss. A uint64 source attribute cannot become a result dimension.
  • Except for newly-added aggregate values and the synthetic dimension, result array variables must be a subset of the variables in the source array.
  • If a dimension of the new array corresponds to an attribute of the source array:
    • The dimension must be large enough to accommodate all distinct values of that attribute present in the source array. 
    • The attribute in the source array can allow null values, but cells for which the attribute is null are not included in the output.
  • When using aggregates as part of the redimension operator, the destination attributes that hold the aggregate values must allow null values.

Automatic Chunk Size Calculation

If you provide a schema_definition argument, the redimension operator automatically estimates chunk lengths for dimensions whose chunk lengths are either left unspecified or given as asterisk, '*'.  Choosing the best performing chunk lengths depends on the query mix and the nature of the data, so consider the values redimension chooses as reasonable starting points for possible further optimization.  The redimension operator estimates chunk lengths either by desired cell count (the default) or by desired physical chunk size.  Both estimation algorithms assume dense data.

Estimation by cell count

In estimation by cell count, redimension chooses chunk lengths to achieve N cells per chunk, with N = 1,000,000 by default.  You can change the value of N system-wide by setting the target-cells-per-chunk parameter in the config.ini file.  To change the value of N for one redimension operation only, use the cells_per_chunk: N keyword argument.
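The idea can be sketched with a simple equal-split heuristic (an illustration of the goal, not SciDB's actual algorithm; the `estimate_chunk_lengths` helper is our own):

```python
# Illustrative sketch: give every unspecified dimension the same chunk
# length, so the product of the lengths approximates the target cell count.
# Assumes dense data, as the real estimator does.

def estimate_chunk_lengths(n_dims, target_cells=1_000_000):
    """Equal chunk length per dimension, near the n_dims-th root of target."""
    side = round(target_cells ** (1.0 / n_dims))
    return [side] * n_dims

print(estimate_chunk_lengths(2))   # [1000, 1000] -> 1,000,000 cells per chunk
```

SciDB's real estimator also accounts for dimension bounds and data distribution, so treat its chosen lengths as starting points, as noted above.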

Estimation by physical chunk size

In estimation by physical chunk size, redimension chooses chunk lengths to achieve a physical chunk size of N mebibytes per chunk, based on the average size of the largest attribute in the output schema.  You can change the value of N system-wide by setting the target-mb-per-chunk parameter in the config.ini file.  To change the value of N for one redimension operation only, use the phys_chunk_size: N keyword argument.

If both target-mb-per-chunk and target-cells-per-chunk appear in the same config.ini file, target-mb-per-chunk takes precedence.
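Physical-size estimation reduces to the cell-count case once the mebibyte goal is converted into a cell count (an illustration of the arithmetic, not SciDB's actual algorithm; the helper name is our own):

```python
# Convert a physical chunk-size goal (MiB) into a cell-count goal, using
# the average size of the largest attribute in the output schema.

def target_cells_from_mib(target_mib, largest_attr_bytes):
    """How many cells of the largest attribute fit in target_mib mebibytes."""
    return (target_mib * 1024 * 1024) // largest_attr_bytes

# e.g. an 8-byte double attribute with a 16 MiB goal:
print(target_cells_from_mib(16, 8))   # 2097152 cells per chunk
```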


Examples

Load a 2-dimensional array from a flat (1-D) file

  1. Create an array to hold the one-dimensional data:

    AFL% CREATE ARRAY raw_data <device:int32,pos:int32,val:float>[row=0:*];
    Query was executed successfully
  2. Create a small data file in comma-separated value format and load it into the 1-D array:

    $ cat /tmp/data.csv
    1, 1, 1.334
    1, 2, 1.334
    1, 3, 1.334
    1, 4, 1.334
    2, 1, 2.445
    2, 3, 2.445
    2, 4, 2.445
    3, 1, 0.998
    3, 2, 1.998
    3, 3, 1.667
    3, 4, 2.335
    4, 1, 2.004
    4, 2, 2.006
    4, 3, 2.889
    $ iquery -naq "load(raw_data, '/tmp/data.csv', -2, 'csv')"
    Query was executed successfully
  3. Use redimension to transform the "device" and "pos" attributes of the raw_data array into dimensions of the DP result array.  These attributes have type int32, which is convertible without loss to the int64 dimension type.  The "row" dimension of the raw_data array is omitted from the result array.

    $ iquery -a
    AFL% store(redimension(raw_data, <val:float>[device=1:*; pos=1:*]), DP);
    Query was executed successfully
    AFL% scan(DP);
    {device,pos} val
    {1,1} 1.334
    {1,2} 1.334
    {1,3} 1.334
    {1,4} 1.334
    {2,1} 2.445
    {2,3} 2.445
    {2,4} 2.445
    {3,1} 0.998
    {3,2} 1.998
    {3,3} 1.667
    {3,4} 2.335
    {4,1} 2.004
    {4,2} 2.006
    {4,3} 2.889
    AFL% 
  4. Use the input operator to perform all these steps in a single query, without having to create the intermediate raw_data array.  This query creates an array DP2 with schema and contents identical to DP above:

    AFL% store(
           redimension(
             input(<device:int32,pos:int32,val:float>[row=0:*], '/tmp/data.csv', -2, 'csv'),
             <val:float>[device=1:*; pos=1:*]),
           DP2);

Using redimension with specific chunk sizes

In the example above, chunk sizes and chunk overlaps were omitted from the array schemas for simplicity.  The redimension operator calculates unspecified chunk sizes based on the input array's data values.  If particular chunk sizes are desired, they can be explicitly specified and redimension will obey them.

For the DP array above, the show operator output includes the calculated chunk sizes (they are 4 in this toy example):


AFL% show(DP);
{i} schema,distribution,etcomp
{0} 'DP<val:float> [device=1:*:0:4; pos=1:*:0:4]','hashed','none'



Suppose that storing 10,000 pos values per physical chunk would improve the performance of later queries on the DP array.  We can specify this chunk size when redimensioning the raw_data array:

AFL% store(redimension(raw_data, <val:float>[device=1:*; pos=1:*:0:10000]), DP3);
...same output as above...
AFL% show(DP3);
{i} schema,distribution,etcomp
{0} 'DP3<val:float> [device=1:*:0:1; pos=1:*:0:10000]','hashed','none'

The explicit chunk size of 10,000 is preserved and redimension calculates any remaining unspecified chunk sizes based on the input data.

Using redimension with aggregates

Using the DP "device/position" array created above, you can compute some aggregates along the device axis to see the minimum, average, and maximum values reported by the devices at particular positions.  Because device is omitted from the result schema, all of the source cells at a particular pos value collide, and the colliding cells feed the aggregate calls.

AFL% redimension(DP, <minVal:float,avgVal:double,maxVal:float>[pos=1:*],
                 min(val) as minVal, avg(val) as avgVal, max(val) as maxVal);
{pos} minVal,avgVal,maxVal
{1} 0.998,1.69525,2.445
{2} 1.334,1.77933,2.006
{3} 1.334,2.08375,2.889
{4} 1.334,2.038,2.445
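The aggregate values above can be reproduced outside SciDB. This sketch (ours, not SciDB code) groups the DP cells by pos, the surviving dimension, and computes min/avg/max over each group of colliding cells:

```python
# Verify the redimension-with-aggregates result by hand: drop the device
# dimension and aggregate the colliding values at each pos.
from collections import defaultdict

dp = {(1, 1): 1.334, (1, 2): 1.334, (1, 3): 1.334, (1, 4): 1.334,
      (2, 1): 2.445, (2, 3): 2.445, (2, 4): 2.445,
      (3, 1): 0.998, (3, 2): 1.998, (3, 3): 1.667, (3, 4): 2.335,
      (4, 1): 2.004, (4, 2): 2.006, (4, 3): 2.889}

by_pos = defaultdict(list)
for (device, pos), val in dp.items():
    by_pos[pos].append(val)   # device is dropped, so cells collide on pos

for pos in sorted(by_pos):
    vals = by_pos[pos]
    print(pos, min(vals), round(sum(vals) / len(vals), 5), max(vals))
```

The printed rows match the minVal, avgVal, and maxVal columns in the query output above.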


Using redimension with a synthetic dimension

Suppose you have a file of image data containing latitude, longitude, and pixel color values (red/green/blue or "RGB" values).  You know that the file actually contains data from several images of the same area, so that a particular pair of (latitude, longitude) values might appear several times in the file, with different RGB color values. That is, there might be cell collisions.  This example shows how redimension's synthetic dimension feature helps you load the data despite the cell collisions.

  1. Load this small example CSV file, which contains the kind of data described and includes a collision at (-42332,70944):

    $ cat /tmp/images.csv
    -42332,70944,255
    -42333,70944,16777210
    -42332,70944,254
    -42320,70943,253
    -42325,70941,252
    -42321,70942,240
    $ iquery -naq "store(input(<lat:int64,lon:int64,rgb:uint32>[row=0:*], '/tmp/images.csv', -2, 'csv'), raw_image)"
    Query was executed successfully
    $
  2. Try to create a two-dimensional array using redimension, this time without specifying a synthetic dimension:

    $ iquery -a
    AFL% store(redimension(raw_image, <rgb:uint32>[lat=-90000:90000; lon=-180000:180000]), IMAGE1);
    UserException in file: src/query/ops/redimension/RedimensionCommon.cpp function: redimensionArray line: 1073
    Error id: scidb::SCIDB_SE_OPERATOR::SCIDB_LE_DATA_COLLISION
    Error description: Operator error. Duplicate values at '{-42332, 70944}'.
    AFL% 
  3. Try again, still without the synthetic dimension, but this time relax the isStrict constraint.  The result is that one of the colliding cells is chosen; however, there is no way to tell which cell is picked:

    AFL% store(redimension(raw_image,
                           <rgb:uint32>[lat=-90000:90000; lon=-180000:180000],
                           false),  -- isStrict constraint is relaxed
               IMAGE1);
    Query was executed successfully
    AFL% scan(IMAGE1);
    {lat,lon} rgb
    {-42333,70944} 16777210
    {-42332,70944} 255
    {-42325,70941} 252
    {-42321,70942} 240
    {-42320,70943} 253
    AFL%
  4. By adding a synthetic dimension – a dimension whose name is different from any source array attribute or dimension – you instruct redimension to resolve cell collisions by storing all colliding cells along the synthetic dimension.  Here the synthetic dimension is named "synth", but you can use any name not found in the source array.

    AFL% store(redimension(raw_image, <rgb:uint32>[lat=-90000:90000; lon=-180000:180000; synth=0:3]), IMAGE2);
    Query was executed successfully
    AFL% scan(IMAGE2);
    {lat,lon,synth} rgb
    {-42333,70944,0} 16777210
    {-42332,70944,0} 255
    {-42332,70944,1} 254
    {-42325,70941,0} 252
    {-42321,70942,0} 240
    {-42320,70943,0} 253
    AFL%
  5. Remove the example arrays:

    AFL% remove(raw_data); remove(DP); remove(DP2); remove(DP3); remove(raw_image); remove(IMAGE1); remove(IMAGE2);