Tab Separated Value Files

Tab separated values, or TSV, is a well-defined format for textual data representation and interchange.  The SciDB load, save, and input operators support the LinearTSV dialect of TSV, aimed at serializing data from database systems such as Postgres and MySQL.

LinearTSV was originally promulgated by Open Knowledge International, but has since been folded into their CSV Dialect specification.  SciDB does not support full CSV Dialect (our operators use the GNU libcsv parser).  SciDB TSV users will find the original LinearTSV specification, reproduced below, easier to follow.  The archival link for the specification is https://github.com/frictionlessdata/specs/blob/c8ac05c56cd00695fb215454c0e5a6f3d9c6e899/linear-tsv/index.md .

SciDB Extensions to LinearTSV

SciDB TSV extends LinearTSV by adding support for SciDB missing reason codes.  These are a special case of database null values.  LinearTSV encodes a database null as \N .  This corresponds to a SciDB null value with a missing reason code of zero.  For non-zero missing reason codes, SciDB TSV encodes the null value as ?n for missing reason code n.  You can modify this default behavior using format option specifiers, described in the save documentation.

LinearTSV 1.0-beta

(This section is copied by permission from the archival link above, as a precaution against future OKI github changes.)

title: Linear TSV
subtitle: simple, line-oriented, tabular data
layout: spec
version: 1.0-beta
last_update: 6 May 2014
created: 6 May 2014
well_defined_keywords: true
summary: This document defines a format for tabular data.

Summary


This document defines a line-oriented, tabular data format. The format is intended to be easy to implement, space efficient, streamable, amenable to processing with line-oriented tools and broadly interoperable. The "text" output mode common to many relational databases is line-oriented, tabular, streamable, and easily described; it is here presented as a candidate for standardization.

If you have comments or suggestions please file them in the issue tracker at: https://github.com/dataprotocols/dataprotocols/issues .

Changelog

  • 1.0-beta: initial revision

Table of Contents

  • Will be replaced with the ToC, excluding the "Contents" header {:toc}

Specification

A tabular data file consists of zero or more records consisting of fields.

Records are separated by ASCII newlines  (0x0a). Fields within a record are separated with ASCII tab (0x09). It is permitted but discouraged to separate records with carriage-return-newline (0x0d and 0x0a). (A literal carriage return in any other position is non-conforming.)

Linear TSV provides for escape sequences, instead of quoting, so that implementations can naively split on the byte values of the separators. To include newlines, tabs, carriage returns and backslashes in field data, the following escape sequences must be used:

  • \n for newline,
  • \t for tab,
  • \r for carriage return,
  • \\ for backslash.

Records must contain at least one field. All fields must be present in every record. To indicate missing data for a field, the character sequence \N (bytes 0x5c and 0x4e) is used. Note that the N is capitalized. This character sequence is exactly that used by SQL databases to indicate SQL NULL in their tab-separated output mode.

A zero-length file is a valid linear TSV file, containing no records. Empty lines are ignored in linear TSV files.

If a single backslash is encountered at the end of a field, it is an error. If a backslash precedes another character but does not form one of the escape sequences above, it is a "superfluous backslash" and is removed from the field on read. Such a "superfluous backslash" must never be written by a conforming implementation.

Header Lines

In CSV files, there are commonly header lines, naming the fields; but in this format no provision is made for header lines. One reason is that naive filtering, sorting and concatenation with line-oriented tools could easily mix headers with data.

Another is that the amount of information provided by a header line is rarely sufficient for interoperable parsing. The Data Package spec provides a mechanism for bundling column names, types and format information with data.

Motivation

In advocating a shift to a line-oriented, tab-separated serialization format, we are endorsing an existing format: the default serialization format of both Postgres and MySQL. We propose to standardize a subset of the format common to both database systems.

A truly line-oriented format for tabular data, where newline, carriage return and the separator are always represented by escape sequences, offers many practical advantages, among them:

  • The parsers are simple and fast.
  • First pass filtering and sorting for line-oriented formats is easy to implement in high-level languages, like Python and Java.
  • Analysis and transformation of line-oriented data with command line tools is simple, dependable and often surprisingly efficient.
  • By requiring escape sequences when newlines and tabs are in field text, the format allows parsers to naively and efficiently split data on raw byte values: 0x09 for fields and 0x0a for records.

CSV is almost right and it's worth talking about the disadvantages of CSV that motivate the author to promote another tabular data format:

  • In some locales, , is the decimal separator; whereas the ASCII tab never collides with the decimal separator. More generally, the tab is not a centuries old glyph that one encounters in natural language.
  • CSV is not truly line-oriented – newlines are quoted, not escaped. A single record can span multiple physical lines. In consequence, line-oriented processing almost works until it doesn't, and then simple tricks – sorting on the first column to optimize insertion order or batching records in to groups of a few thousand to get better insert performance – require relatively complicated code to get right.
  • CSV's quoting style requires one to mingle field data parsing and record splitting. Taking every third record still requires one to parse the prior two, since a newline inside quotes is not a record separator.
  • CSV is ambiguous in many small areas – the presence or absence of a header line, the choice of quote character (single or double?) and even the choice of separator character are all axes of variability.