load

The load operator is a macro that loads data from a file into a named array. It combines the functionality of the input operator and the store operator.

In Enterprise Edition, file I/O restrictions may apply.

Synopsis

load( output_array, input_file
      [,instance_id [,format [,max_errors [,isStrict ]]]] )

load( output_array, input_file
      [, instance: instance_id ]  [, format: format ]
      [, max_errors: max_errors ]  [, strict: isStrict ]
)

Summary

The AFL load macro loads data from input_file into the cells of a SciDB array, output_array. The output_array can be either a temporary array or a persistent array.

Use the second form to specify parameters with named keywords.  For example, you can say load(A, '/tmp/mydata', format: 'tsv') to load TSV data without having to remember that -2 is the default value of instance_id. 

load() is not technically an operator, but rather a macro that expands to store(input()). Because macro expansion precedes other query processing, you can use load() as you would any operator. The output of list('macros') includes load and the output of list('operators') does not.

Inputs

The load operator takes the following arguments: 

  • output_array: the name of a SciDB array to hold the data.
  • input_file: the complete path to the file containing the source data for the array.
  • instance_id: Optional. Specifies the instance or instances for performing the load. The default is to load all data from the coordinator instance, that is, the instance to which the client program is connected. The value must be one of the following:
    • -2 – Load all data using the coordinator instance of the query. This is the default.
    • -1 – Initiate the load from all instances. That is, distribute the load to all instances, and load the data concurrently. If you use this option, you must prepare individual load files for each instance. That is, the load() macro does not split your loaded file into multiple files for you.
    • 0, 1, ... – Load all data using the specified physical instance ID.
    • (x, y) – Load data using the instance specified by the (server_id, server_instance_id) pair (x, y).  For example, if the cluster config.ini file contains the lines server-1=srv42.example.com,2-3 and base-path=/vdisk/scidb you can load data from the data directory /vdisk/scidb/1/3 using load(ARRAY, 'filename', (1,3)) .
       
  • format: Optional. The default is to load data from a SciDB-formatted text file. The format string has two parts. The first part indicates the type of file that you are loading.
    • Binary load: When loading binary data, the load operator uses the format string as a guide for interpreting the contents of the binary file. For a complete description of the binary file format and binary format strings, see Binary Files.
    • CSV load:The string must be csv or CSV and the supplied file must conform to the comma-separated-value format.
    • OPAQUE load: The string must be opaque or OPAQUE and you must have previously saved the array data in the OPAQUE format.
    • SciDB-formatted text load: If your text file is in SciDB format, use the string text or TEXT. This is the default.
    • TSV load: The string must be tsv or TSV and the supplied file must conform to the tab-separated-value format Linear TSV, extended to accommodate several ways to express nulls: \N, null, and ?0 through ?127.

For CSV and TSV formats, the second part of the format string consists of a colon followed by one or two characters that controls how the input file is interpreted.

    • p– Interpret the pipe character (ASCII 0x7C) as the field delimiter. 
    • c– interpret comma (ASCII 0x2C) as the field delimiter. 
    • t– interpret tab (ASCII 0x09) as the field delimiter. 
    • d– use double quote (ASCII 0x22) as the CSV quote character.
    • s– use single quote (ASCII 0x27) as the CSV quote character.

    • l (lowercase L, ASCII 0x6C)– first line of input is a label line, ignore it.


When loading in CSV format, the load() macro guesses the correct quote character based on the first buffer of input data. If the macro finds no quote character (either single-quote or double-quote) in the first buffer, it assumes single-quoting. To force a particular quote character, use :d or :s.


  • max_errors: Optional. Specifies the maximum number of errors before the operator fails. The default value is 0, meaning that if any errors are encountered, the operation fails.
  • isStrict: Optional. If true, this flag does two things. It restricts the incoming data to contain no out-of-order cell values, where order is row-major as defined by the left-to-right declaration of dimensions in the array schema. It also prevents collisions. If the flag is true and either of these conditions occurs, the load operation fails. By default, this flag is true.

Example

  1. Create a text file /tmp/intensity_data.csv with the following content:

    'High',0,100
    'High',2,97
    'Medium',2,null
    'Medium',3,67
    'Low',3,null
    'Low',4,55 
  2. Create an array to hold the array data:

    AFL% CREATE ARRAY intensityFlat <exposure:string,elapsedTime:int64,measuredIntensity:int64>[i=0:5];
  3. Load the data:

    AFL% load(intensityFlat ,'/tmp/intensity_data.csv', -2, 'CSV'); 


  4. Scan the loaded array:

    AFL% scan(intensityFlat);
    {i} exposure,elapsedTime,measuredIntensity
    {0} 'High',0,100
    {1} 'High',2,97
    {2} 'Medium',2,null
    {3} 'Medium',3,67
    {4} 'Low',3,null
    {5} 'Low',4,55 

     

  5. Suppose your schema has a measuredIntensity attribute that does not allow nulls.  In that case, you would expect errors when loading /tmp/intensity_data.csv.  You can add a value for the max_errors parameter.

    AFL% CREATE ARRAY intensityFlatNotNull<exposure:string,elapsedTime:int64,measuredIntensity:int64 NOT NULL>[i=0:5];
    AFL% load( intensityFlatNotNull,
               '/tmp/intensity_data.csv',
               -2,
               'csv',
               99);
  6. Scan the loaded array:

    AFL% scan(intensityFlatNotNull);
    {i} exposure,elapsedTime,measuredIntensity
    {0} 'High',0,100
    {1} 'High',2,97
    {2} 'Medium',2,0
    {3} 'Medium',3,67
    {4} 'Low',3,0
    {5} 'Low',4,55 

     

  7. Remove the arrays:

    AFL% remove(intensityFlat); remove(intensityFlatNotNull);