Binary Format Strings
The load() macro, input() operator, and save() operator all manipulate data in files. These files can take many forms, one of which is the SciDB binary file format. When you invoke load(), input(), or save() to operate on a SciDB binary file, you must include a binary format string as the value of the format parameter.
For example, the following AFL statements create an array and load data from a binary file into that array.
AFL% create array A <a1:int64 NOT NULL,a2:int16,a3:string,a4:string NOT NULL>[i=0:*];
AFL% load ( A, 'DIRECTORY_NAME/file.bin', -2, '(int64, int16 null, string null, string)' );
In the example, the binary format string appears as the fourth parameter of the load statement. In this case, each entry of the binary format string matches the corresponding attribute of the array's schema.
Types in a schema's attribute list are nullable by default, but types in a binary format string disallow nulls by default. Thus attribute a1 is "int64 NOT NULL" in the schema but "int64" in the format string. Similarly, attribute a3 is "string" in the schema but "string null" in the format string.
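The reversed defaults can be sketched as a small mapping from schema attributes to format string entries. This is an illustrative Python helper, not part of SciDB; the function names and the `(base_type, nullable)` tuple representation are assumptions made for the example.

```python
def format_entry(base_type: str, schema_nullable: bool) -> str:
    """Return the binary format string entry for one schema attribute.

    Schema attributes are nullable unless declared NOT NULL, while format
    string entries disallow nulls unless marked "null", so the marker moves
    to the opposite case.
    """
    return f"{base_type} null" if schema_nullable else base_type


def binary_format_string(attributes):
    """Build a format string from (base_type, schema_nullable) pairs."""
    entries = [format_entry(t, nullable) for t, nullable in attributes]
    return "(" + ", ".join(entries) + ")"


# Attributes of array A from the example: a1 and a4 are declared NOT NULL.
schema_A = [
    ("int64", False),   # a1:int64 NOT NULL
    ("int16", True),    # a2:int16
    ("string", True),   # a3:string
    ("string", False),  # a4:string NOT NULL
]

print(binary_format_string(schema_A))
# (int64, int16 null, string null, string)
```

The helper covers only the exact-match case; it does not account for type conversion or SKIP directives.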
Types in the binary format string can diverge from the corresponding types in the array schema in the following ways.
- The default nullability of types in the two contexts is reversed (as noted above).
- The base type of an entry in the binary format string need not match the corresponding base type in the schema exactly. The format string entry can specify any base type that is convertible to the base type of the corresponding attribute.
- The binary format string can contain SKIP directives, which indicate where to ignore bytes (during load or input), or where to insert padding (during save).
Type Conversion During Binary Load
A binary format string typically consists of n individual entries where n is the number of attributes in the corresponding array. Each entry refers to a data type.
A data type in a binary format string corresponds to an attribute in the array, but it need not match the attribute's data type exactly. The entry in the format string can specify any type that is convertible to the corresponding attribute's type. For example:
| Attribute base type | Format string base type | Result |
|---|---|---|
| int64 | int64 | All three operators – load(), input(), and save() – work because the attribute and format string base types match exactly. |
| int64 | int32 | The operation works for load() and input(). For save(), the operation produces unpredictable results because the format string base type cannot represent all values of the attribute base type. |
If the array includes an attribute with a user-defined data type, the corresponding entry in the format string can be string or binary, provided you implemented the appropriate data type converters: string-to-UDT, binary-to-UDT, UDT-to-string, and UDT-to-binary.
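The asymmetry in the int64/int32 case comes down to value ranges: every int32 value fits in an int64, but not the other way around. The following Python sketch illustrates that range argument only. It does not reproduce what SciDB does internally on save(), and the helper names and the little-endian byte order are assumptions.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits_int32(value: int) -> bool:
    """True if value can be represented as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def pack_as_int32(value: int) -> bytes:
    """Pack value as 4 little-endian bytes, refusing values that do not fit."""
    if not fits_int32(value):
        raise OverflowError(f"{value} does not fit in int32")
    return struct.pack("<i", value)


small = 1_000
large = 5_000_000_000  # a valid int64 value, but greater than 2**31 - 1

# Load direction (int32 in the file, int64 in the array): always lossless.
(as_int64,) = struct.unpack("<q", struct.pack("<q", small))
print(as_int64 == small)   # True

# Save direction (int64 in the array, int32 in the file): only safe in range.
print(fits_int32(small))   # True
print(fits_int32(large))   # False
```

A range check of this kind, run over the attribute's values before calling save(), is one way to confirm that a narrower format string type is safe for a particular array.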
The SKIP Directive in Binary Format Strings
When you load or input a binary file, you can instruct SciDB to skip some of the data in the file. Analogously, when you save a one-dimensional array into a binary file, you can instruct SciDB to insert padding bytes into the file. The following two sections describe each case.
SKIP Directive When Loading Or Inputting Files
When you load or input a binary file, you can instruct SciDB to skip some data in the file. This is useful for excluding entire fields from the load operation, or for skipping over padding bytes added by an external program to a field when it produced the binary file. (Some programming languages always align field values with 32-bit word boundaries.)
Skipping entire fields
From a binary file with n attributes, you can load a one-dimensional SciDB array that has m attributes, where m < n. You do this with the SKIP keyword. Compare the following three pairs of AFL statements, which create and populate arrays while excluding zero, one, and two fields of the same load file.
The first pair of statements loads all fields. (Schema attributes are nullable by default, but format string types disallow null by default.)
AFL% create array intensityFlat <exposure:string NOT NULL,elapsedTime:int64 NOT NULL,measuredIntensity:int64>[i=0:*];
AFL% load(intensityFlat, 'DIRECTORY_NAME/intensity_data.bin', -2, '(string, int64, int64 null)');
The second pair of statements excludes a string field:
AFL% create array intensityFlat_NoExposure <elapsedTime:int64 NOT NULL,measuredIntensity:int64>[i=0:*];
AFL% load(intensityFlat_NoExposure, 'DIRECTORY_NAME/intensity_data.bin', -2, '(skip, int64, int64 null)');
The third pair of statements excludes two int64 fields, one of which allows null values:
AFL% create array intensityFlat_NoTime_NoMeasurement <exposure:string NOT NULL>[i=0:*];
AFL% load(intensityFlat_NoTime_NoMeasurement, 'DIRECTORY_NAME/intensity_data.bin', -2, '(string, skip(8), skip(8) null)');
The preceding pairs of AFL statements illustrate the following characteristics of the SKIP keyword:
- For variable-length fields, you can use the SKIP keyword without a byte count.
- For fixed-length fields, you can use the SKIP keyword with a byte count in parentheses.
- To skip a field that can contain null values, use the NULL keyword after the SKIP keyword.
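To make the three skip forms concrete, the sketch below builds a small in-memory file in the shape `(string, int64, int64 null)` and decodes it with each of the three format strings used above. It is a Python illustration, not SciDB code. The byte layout is an assumption: little-endian integers, a string stored as a 4-byte length followed by NUL-terminated bytes, and a nullable field prefixed by one byte that is -1 for a non-null value or a missing reason code otherwise. The sample values are made up.

```python
import io
import re
import struct


def build_file(records):
    """Encode (exposure, elapsedTime, measuredIntensity) rows under the assumed layout."""
    buf = io.BytesIO()
    for exposure, elapsed, intensity in records:
        text = exposure.encode("ascii") + b"\x00"         # assumed: length counts the NUL
        buf.write(struct.pack("<I", len(text)) + text)    # string
        buf.write(struct.pack("<q", elapsed))             # int64
        if intensity is None:
            buf.write(struct.pack("<bq", 0, 0))           # reason code 0, 8 placeholder bytes
        else:
            buf.write(struct.pack("<bq", -1, intensity))  # assumed: -1 marks non-null
    return buf.getvalue()


def read_records(data, entries):
    """Decode data with a list of format string entries; skipped fields are dropped."""
    buf = io.BytesIO(data)
    rows = []
    while buf.tell() < len(data):
        row = []
        for entry in entries:
            if entry == "string":
                (length,) = struct.unpack("<I", buf.read(4))
                row.append(buf.read(length).rstrip(b"\x00").decode("ascii"))
            elif entry == "int64":
                row.append(struct.unpack("<q", buf.read(8))[0])
            elif entry == "int64 null":
                flag, value = struct.unpack("<bq", buf.read(9))
                row.append(value if flag == -1 else None)
            elif entry == "skip":
                # Variable-length field: read its length, then step over the payload.
                (length,) = struct.unpack("<I", buf.read(4))
                buf.seek(length, io.SEEK_CUR)
            else:
                # Fixed-length field: skip(n) is n bytes, skip(n) null is n + 1 bytes.
                m = re.fullmatch(r"skip\((\d+)\)( null)?", entry)
                if m is None:
                    raise ValueError(f"unsupported entry: {entry}")
                buf.seek(int(m.group(1)) + (1 if m.group(2) else 0), io.SEEK_CUR)
        rows.append(tuple(row))
    return rows


data = build_file([("High", 777, 100), ("Low", 120, None)])

all_fields = read_records(data, ["string", "int64", "int64 null"])
no_exposure = read_records(data, ["skip", "int64", "int64 null"])
exposure_only = read_records(data, ["string", "skip(8)", "skip(8) null"])

print(all_fields)      # [('High', 777, 100), ('Low', 120, None)]
print(no_exposure)     # [(777, 100), (120, None)]
print(exposure_only)   # [('High',), ('Low',)]
```

The bare `skip` has to read the length prefix before it can step over the string, which is why no byte count is needed for variable-length fields, whereas a fixed-length field carries no length in the file and needs the count spelled out.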
SKIP Directive When Saving Files
When you save a one-dimensional array into a binary file, you can instruct SciDB to insert padding bytes into the file. This is useful if the file is intended for an application that expects data values to be aligned in a particular way – for example, on 32-bit word boundaries.
When saving an array into a binary file, you can use the SKIP directive in several ways:
- skip. Inserts four NUL (ASCII 0x00) bytes. (This corresponds to a variable-length field of length zero.)
- skip(n). Inserts n NUL bytes. (This corresponds to a fixed-length n-byte field.)
- skip null. Inserts five NUL bytes. (This corresponds to a SciDB missing reason code of zero, followed by four NUL bytes for the length.)
- skip(n) null. Inserts n+1 NUL bytes. (This corresponds to a SciDB missing reason code of zero followed by a fixed-length n-byte field.)
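The four cases above reduce to one rule: an optional one-byte missing reason code, followed by either n bytes or a four-byte zero length. A minimal Python sketch of that rule follows; the function name and the regular expression are illustrative assumptions, not SciDB's parser.

```python
import re

# Matches "skip", "skip(n)", "skip null", and "skip(n) null".
_SKIP = re.compile(r"skip(?:\((\d+)\))?( null)?")


def skip_padding(entry: str) -> bytes:
    """Return the NUL bytes that a SKIP entry contributes to a saved record."""
    m = _SKIP.fullmatch(entry.strip())
    if m is None:
        raise ValueError(f"not a skip entry: {entry!r}")
    # Bare skip stands for a variable-length field of length zero,
    # which is just its four-byte length prefix.
    width = int(m.group(1)) if m.group(1) is not None else 4
    # "null" adds one leading byte for a missing reason code of zero.
    prefix = 1 if m.group(2) else 0
    return b"\x00" * (prefix + width)


for entry in ("skip", "skip(3)", "skip null", "skip(3) null"):
    print(entry, len(skip_padding(entry)))
# skip 4
# skip(3) 3
# skip null 5
# skip(3) null 4
```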
Using skip during save() is symmetric with its use during load() or input(). That is, if you save a binary file with a particular format string, you can load it later using the same format string. Likewise, if an external application generates a binary file and you load it using a particular format string, you can later save it with the same format string, and the result is compatible with the format used by the external application.
For example, suppose an external application pads its output so that integers are always aligned on word boundaries. To read the output file:
AFL% store(input(<code:char NOT NULL, value:int32 NOT NULL>[i=0:*], '/home/bob/appout.bin', -2, '(char, skip(3), int32)'), A);
Use the same binary format string to write an output file with compatible padding:
AFL% save(A, '/home/alice/result.bin', -2, '(char, skip(3), int32)');
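For reference, here is roughly what an external program might produce or expect for the `(char, skip(3), int32)` layout. This is a Python sketch using the standard `struct` module, where `x` denotes a pad byte. Little-endian byte order and the sample rows are assumptions made for the illustration.

```python
import struct

# One record: 1-byte char, 3 padding bytes, 4-byte int32 (8 bytes in total).
RECORD = struct.Struct("<c3xi")


def encode(rows):
    """Pack (code, value) rows into padded 8-byte records."""
    return b"".join(RECORD.pack(code, value) for code, value in rows)


def decode(data):
    """Unpack padded records back into (code, value) rows."""
    return [RECORD.unpack_from(data, off) for off in range(0, len(data), RECORD.size)]


rows = [(b"A", 1), (b"B", -2)]
blob = encode(rows)

print(RECORD.size)           # 8
print(blob[:8].hex())        # 4100000001000000
print(decode(blob) == rows)  # True
```

Because the three padding bytes are written on encode and ignored on decode, the same record description serves both directions, which mirrors how one format string serves both input() and save().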