Array Versions
SciDB arrays and dataframes are versioned. Each time data is stored, inserted, or appended to an array, a new array version is created. Successive versions of an array can share data chunks to reduce storage overhead. Older versions can be accessed by version number or by timestamp. They remain accessible until they are manually removed.
In the following example we will create several versions of an array, list the versions and inspect their contents, and remove older versions to reclaim storage space.
Creating several array versions
Here we create a one-dimensional array of strings and populate it from a tab-separated-values file.
$ iquery -a
AFL% create array WORDS <word:string>[i=0:*];
Query was executed successfully
AFL% load(WORDS, '/tmp/words.tsv', format:'tsv');
Query was executed successfully
AFL% scan(WORDS);
{i} word
{0} 'May'
{1} 'I'
{2} 'have'
{3} 'a'
{4} 'large'
{5} 'container'
{6} 'of'
{7} 'coffee'
{8} 'cream'
{9} 'and'
{10} 'sugar'
AFL%
Listing array versions
We can examine the versions of this array using the versions operator:
AFL% versions(WORDS);
{No} version_id,timestamp
{1} 1,'2020-06-17 18:07:12'
AFL%
So far there is only one version of stored data. The associated timestamp is in ISO-8601 notation and is displayed in Coordinated Universal Time (UTC).
(You can get more detailed information about an array’s versions, including the initial empty version that versions does not show, using a variant of the list operator: list('arrays', true).)
Now we create a few more versions of WORDS by adding and populating a new attribute:
AFL% add_attributes(WORDS, <len:int64>);
Query was executed successfully
AFL% versions(WORDS);
{No} version_id,timestamp
{1} 1,'2020-06-17 18:07:12'
{2} 2,'2020-06-17 19:11:36'
AFL%
AFL% store(project(apply(WORDS, tmp, int64(strlen(word))), word, tmp), WORDS);
Query was executed successfully
AFL% scan(WORDS);
{i} word,len
{0} 'May',3
{1} 'I',1
{2} 'have',4
{3} 'a',1
{4} 'large',5
{5} 'container',9
{6} 'of',2
{7} 'coffee',6
{8} 'cream',5
{9} 'and',3
{10} 'sugar',5
AFL% versions(WORDS);
{No} version_id,timestamp
{1} 1,'2020-06-17 18:07:12'
{2} 2,'2020-06-17 19:11:36'
{3} 3,'2020-06-17 19:16:54'
AFL%
Accessing array versions
You can retrieve a particular past version of an array using an at-sign and the version id. Here’s the example array just after the new len attribute was added:
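A minimal sketch of such a query, assuming the version id is appended directly to the array name after the at-sign. Version 2 is the one created by add_attributes in the versions listing above, so its len values would be expected to be null, since the attribute had not yet been populated:

```afl
AFL% scan(WORDS@2);
```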
The same version can be accessed using a timestamp:
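As an illustrative sketch, the query below passes the timestamp recorded for version 2 in the versions listing above. The exact form of the original example is an assumption:

```afl
AFL% scan(WORDS@datetime('2020-06-17 19:11:36'));
```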
This query uses the built-in data type datetime to cast a string into an internal representation of the time, and uses that to locate the desired array version. The timestamp need not exactly match any of the timestamps in the versions output. SciDB returns the latest version of the array that is equal to or precedes the given timestamp.
Remember, the timestamps are in Coordinated Universal Time (UTC). You must specify the complete time down to the second: ‘YYYY-MM-DD hh:mm:ss’.
Removing array versions
Array versions can accumulate over time, resulting in increased disk usage. You can use the remove_versions operator to periodically remove unneeded array versions and recover storage space. The simplest form of the operator specifies the number of recent versions to keep:
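For instance, a sketch that keeps the two most recent versions of the example array (the count 2 is chosen here purely for illustration) and then lists what remains:

```afl
AFL% remove_versions(WORDS, keep:2);
AFL% versions(WORDS);
```

With the three versions created earlier, this would be expected to leave only versions 2 and 3.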
The query remove_versions(A) is the same as remove_versions(A, keep:1).