input

The input operator populates an array with data from a file.

In SciDB Enterprise Edition, file I/O restrictions may apply.

Synopsis

input( existing_array|anonymous_schema, input_file
       [, instance_id [, format [, max_errors [, isStrict ]]]] )

input( existing_array|anonymous_schema, input_file
       [, instance: instance_id ]  [, format: format ]
       [, max_errors: max_errors ]  [, strict: isStrict ]
)

Summary

The input operator interprets its arguments similarly to the way the load operator does, but rather than storing its result in the database, it returns a result array.

Use the second form to specify parameters with named keywords. For example, you can write input(A, '/tmp/mydata', format: 'tsv') to read TSV data without having to remember that -2 is the default value of instance_id.
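As an illustration of the named-keyword form, the following Python sketch assembles an input() call as a string. The helper name build_input_query and its argument names are hypothetical. The keyword names (instance, format, max_errors, strict) come from the synopsis above; writing the boolean as true/false is an assumption.

```python
def build_input_query(target, input_file, instance=None, fmt=None,
                      max_errors=None, strict=None):
    """Return an AFL input() call in the named-keyword form.

    target is an existing array name or an anonymous schema.
    Only the keywords that are explicitly set are emitted, so the
    operator's defaults apply to everything else.
    Note: input_file is not escaped; a path containing a single quote
    would need extra handling.
    """
    args = [target, "'{}'".format(input_file)]
    if instance is not None:
        args.append("instance: {}".format(instance))
    if fmt is not None:
        args.append("format: '{}'".format(fmt))
    if max_errors is not None:
        args.append("max_errors: {}".format(max_errors))
    if strict is not None:
        # Assumption: AFL boolean literals are written true/false.
        args.append("strict: {}".format("true" if strict else "false"))
    return "input({})".format(", ".join(args))


print(build_input_query("A", "/tmp/mydata", fmt="tsv"))
# prints: input(A, '/tmp/mydata', format: 'tsv')
```

The resulting string can be pasted into the AFL shell or passed to whatever client you use to run queries.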

Inputs

The input operator takes the following parameters:

  • existing_array | anonymous_schema: You can specify an existing array or provide an array schema matching the file data to load.
  • input_file: The complete path to the file containing the source data to load.
  • instance_id: Optional.  Specifies the instance or instances for performing the input. The default reads all data from the query coordinator instance, that is, the instance to which the client program is connected.  The value must be one of the following:
    • -2 –  Load all data using the coordinator instance of the query. This is the default.
    • -1 – Load in parallel from all instances. That is, distribute the input operation to all instances, and load data from each instance concurrently. If you use this option, you must prepare individual files for each instance. The input() operator does not automatically split your input file into multiple files for you.
    • 0, 1, ...  –  Load all data using the specified physical instance ID.
    • (x, y) – Load data using the instance specified by the (server_id, server_instance_id) pair (x, y). For example, if the cluster config.ini file contains the lines server-1=srv42.example.com,2-3 and base-path=/vdisk/scidb, you can load data from the data directory /vdisk/scidb/1/3 using input(SCHEMA, 'filename', (1,3)).
  • format: Optional. The default reads data from a SciDB-formatted text file. The format string has two parts. The first part indicates the type of file to load:
    • Binary load. When loading binary data, the input() operator uses the format string as a guide for interpreting the contents of the binary file. For a complete description of the binary file format and binary format strings, see Binary Files.
    • CSV load. The string must be csv or CSV, and the supplied file must conform to the comma-separated-value format.
    • OPAQUE load. The string must be opaque or OPAQUE, and you must have previously saved the array data in the OPAQUE format.
    • SciDB-formatted text load. If your text file is in SciDB format, use the string text or TEXT. This is the default.
    • TSV load. The string must be tsv or TSV, and the supplied file must conform to the tab-separated-value format described in the DataProtocols article on Linear TSV, extended to accommodate several ways to express nulls: \N, null, and ?0 through ?127.
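The (server_id, server_instance_id) form of instance_id can be sketched in Python as follows. The directory layout base-path/server_id/server_instance_id is inferred only from the single example given for the (x, y) form and is an assumption, as are the helper names instance_data_dir and input_on_instance.

```python
import posixpath


def instance_data_dir(base_path, server_id, server_instance_id):
    """Data directory of one instance.

    Assumption: the layout is <base-path>/<server_id>/<server_instance_id>,
    as suggested by the /vdisk/scidb/1/3 example.
    """
    return posixpath.join(base_path, str(server_id), str(server_instance_id))


def input_on_instance(target, input_file, server_id, server_instance_id):
    """AFL input() call that reads input_file on one specific instance."""
    return "input({}, '{}', ({},{}))".format(
        target, input_file, server_id, server_instance_id)


print(instance_data_dir("/vdisk/scidb", 1, 3))
# prints: /vdisk/scidb/1/3
print(input_on_instance("SCHEMA", "filename", 1, 3))
# prints: input(SCHEMA, 'filename', (1,3))
```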

For CSV and TSV formats, the second part of the format string is a colon followed by one or two characters that control how SciDB interprets the input.

  • p – Interpret the pipe character (ASCII 0x7C) as the field delimiter.
  • c – Interpret the comma (ASCII 0x2C) as the field delimiter.
  • t – Interpret the tab character (ASCII 0x09) as the field delimiter.
  • d – Use the double quote (ASCII 0x22) as the CSV quote character.
  • s – Use the single quote (ASCII 0x27) as the CSV quote character.
  • l (lowercase L, ASCII 0x6C) – Treat the first line of input as a label line and ignore it.


When loading in CSV format, the input() operator guesses the correct quote character based on the first buffer of input data. If the operator finds no quote character (either single-quote or double-quote) in the first buffer, it assumes single-quoting. To force a particular quote character, use :d or :s.
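The two-part format string for CSV and TSV can be composed and checked with a small Python sketch like the one below. The function name csv_tsv_format is hypothetical. Rejecting two delimiter characters or two quote characters in the same string is an assumption made here for safety; the text above only states that one or two option characters follow the colon.

```python
DELIMITERS = set("pct")   # pipe, comma, tab
QUOTES = set("ds")        # double quote, single quote
LABEL = {"l"}             # first line is a label line


def csv_tsv_format(file_type, options=""):
    """Build a format string such as 'csv:pl' from its two parts."""
    if file_type not in ("csv", "CSV", "tsv", "TSV"):
        raise ValueError("option characters apply only to csv and tsv")
    if not options:
        return file_type
    if len(options) > 2:
        raise ValueError("use one or two option characters")
    unknown = set(options) - (DELIMITERS | QUOTES | LABEL)
    if unknown:
        raise ValueError("unknown option character(s): "
                         + "".join(sorted(unknown)))
    # Assumption: at most one character from each group makes sense.
    for group in (DELIMITERS, QUOTES, LABEL):
        if sum(ch in group for ch in options) > 1:
            raise ValueError("conflicting option characters: " + options)
    return file_type + ":" + options


print(csv_tsv_format("csv", "pl"))  # pipe-delimited, skip the label line
# prints: csv:pl
print(csv_tsv_format("csv", "d"))   # force double-quote as the quote character
# prints: csv:d
```

The returned value is what you would pass as the format parameter, for example format: 'csv:pl' in the named-keyword form.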

  • max_errors: Optional. Specifies the maximum number of errors tolerated before the operator fails. The default value is 0, so any error causes the operation to fail.
  • isStrict: Optional. If true, this flag does two things. It restricts the incoming data to contain no out-of-order cell values, where order is row-major as defined by the left-to-right declaration of dimensions in the array schema. It also prevents collisions. If the flag is true and either of these conditions occurs, the input operation fails. By default, this flag is true.
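The two conditions that the strict flag rejects can be illustrated with a Python check over a sequence of cell coordinates. This is a simplified sketch, not SciDB's implementation: it assumes that a collision means the same cell coordinates appearing more than once, and it ignores chunking. The function name strict_violations is hypothetical.

```python
def strict_violations(coords):
    """Report collisions and out-of-order cells in a coordinate sequence.

    coords holds coordinate tuples in the order they appear in the input,
    with dimensions listed left to right as in the array schema.
    Python compares tuples lexicographically, which matches row-major
    order (the rightmost dimension varies fastest).
    """
    problems = []
    seen = set()
    highest = None  # largest coordinate seen so far
    for position, cell in enumerate(coords):
        cell = tuple(cell)
        if cell in seen:
            # Assumption: a collision is a repeated cell coordinate.
            problems.append((position, cell, "collision"))
        elif highest is not None and cell < highest:
            problems.append((position, cell, "out of order"))
        seen.add(cell)
        if highest is None or cell > highest:
            highest = cell
    return problems


print(strict_violations([(0, 0), (0, 1), (1, 0)]))
# prints: []
print(strict_violations([(0, 0), (1, 0), (0, 1)]))
# prints: [(2, (0, 1), 'out of order')]
```

With the flag left at its default of true, either kind of finding would cause the input operation to fail.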

Example

Create a SciDB-formatted file with the following contents:

$ cat /tmp/m4x4_missing.txt 
[
[(0,100),(1,99),(2,98),(3,97)],
[(4),(5,95),(6,94),(7,93)],
[(8,92),(9,91),(),(11,89)],
[(12,88),(13),(14,86),(15,85)]
]

From the "iquery -a" AFL shell, create a two-dimensional array of int32 pairs to describe the data, then run the input operator against the file:

AFL% create array m4x4 <val1:int32,val2:int32>[i=0:3; j=0:3];
Query was executed successfully
AFL% input(m4x4,'/tmp/m4x4_missing.txt');
{i,j} val1,val2
{0,0} 0,100
{0,1} 1,99
{0,2} 2,98
{0,3} 3,97
{1,0} 4,null
{1,1} 5,95
{1,2} 6,94
{1,3} 7,93
{2,0} 8,92
{2,1} 9,91
{2,3} 11,89
{3,0} 12,88
{3,1} 13,null
{3,2} 14,86
{3,3} 15,85

Note that the m4x4 array remains empty: the input operation did not store the data. The m4x4 array serves only as a template that describes the schema of the result returned to the operator calling input().

AFL% scan(m4x4);
{i,j} val1,val2
AFL%

You can substitute an anonymous schema for the array name in the input operator:

AFL% input(<val1:int32,val2:int32>[i=0:3; j=0:3],'/tmp/m4x4_missing.txt'); 

The output of this input operation is identical to the output of the input operation that used the m4x4 array name.
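To make the mapping between the example file and the result cells concrete, the following Python sketch parses the same text and prints the cells in the same layout. It is not a SciDB text-format parser: it handles only the constructs that appear in this example (integer values, a missing trailing attribute shown as null, and () for an empty cell), and it assumes both dimensions start at 0. The function name parse_dense is hypothetical.

```python
import re

SAMPLE = """[
[(0,100),(1,99),(2,98),(3,97)],
[(4),(5,95),(6,94),(7,93)],
[(8,92),(9,91),(),(11,89)],
[(12,88),(13),(14,86),(15,85)]
]"""


def parse_dense(text, attr_count=2):
    """Map {(i, j): (val1, val2)} for the cells present in the text."""
    cells = {}
    row = -1
    for line in text.splitlines():
        groups = re.findall(r"\(([^()]*)\)", line)
        if not groups:
            continue  # the outer "[" and "]" lines contain no cells
        row += 1
        for col, body in enumerate(groups):
            if body.strip() == "":
                continue  # "()" marks an empty cell: nothing is produced
            values = [int(v) for v in body.split(",")]
            # Attributes missing from the cell are reported as null.
            values += [None] * (attr_count - len(values))
            cells[(row, col)] = tuple(values)
    return cells


cells = parse_dense(SAMPLE)
print("{i,j} val1,val2")
for (i, j), values in sorted(cells.items()):
    shown = ",".join("null" if v is None else str(v) for v in values)
    print("{%d,%d} %s" % (i, j, shown))
```

As in the operator output, cell {2,2} is absent and the cells written as (4) and (13) carry null for val2.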