scidb_backup.py

Use the scidb_backup.py script to upgrade SciDB or to save and restore small amounts of array data. By default, the script is located at /opt/scidb/<version>/bin/scidb_backup.py.

Single-file backups of multiple terabytes may take a long time. For such cases, two options exist:

  1. Use parallel backup and restore.  See below for details.
  2. In some circumstances a backup may not be necessary if /wiki/spaces/ESD169/pages/50856133 / /wiki/spaces/ESD169/pages/50856126 is enabled.

Contact Paradigm4 Customer Solutions to discuss a strategy for your environment.


Usage

scidb_backup.py [-h] [-A AUTH_FILE] [-c HOST] [--port PORT] [--ssh-port SSH_PORT] [-v]
                              (-s {binary,opaque,tsv+} | -r undef | --ping | -d | --verify)
                              [-a] [-f PATTERN] [-p] [-z] [--force] [--use-public-ns] [--no-rbac]
                              [--include-ns NAMESPACES | --exclude-ns NAMESPACES]
                              [FOLDER]

Option

Explanation

-s, --save FORMAT

Back up array data to a location (given by the FOLDER parameter, or derived from it when the --parallel option is used; see below) using the specified FORMAT.  Supported formats for saving array data are opaque, binary, and tsv+.

  • Use opaque when upgrading a SciDB installation to the next available version and/or when backing up arrays for other reasons.
  • Use binary or tsv+ when upgrading a SciDB installation if you have skipped more than one version.  Using binary format is faster than tsv+, but is harder for non-SciDB applications to decode (see Binary Files).

-r, --restore FORMAT

Restore array data from a location (given by the FOLDER parameter, or derived from it when the --parallel option is used; see below).

The FORMAT parameter is a legacy of older versions of the backup script.  In recent releases the format used to store a backup archive is encoded in the archive's metadata.  Use undef as the FORMAT parameter on restore to avoid "format mismatch" warnings.  (You can also use any of the supported -s/--save formats, but your choice should match the format encoded in the backup archive.)

-d, --delete

Delete backup folder(s) on all hosts.

--ping

Check SSH connectivity to all instance servers.  Ensures that all cluster instances are reachable before starting a backup or restore.

--verify

Verify that an existing backup archive folder is internally consistent.  Done automatically after --save and prior to --restore.

-p, --parallel

When --parallel is NOT specified (the default), scidb_backup.py creates, in the FOLDER directory on the coordinator, a single file for each array that contains all of the array's data.

When specified, the --parallel option causes scidb_backup.py to create a number of sub-directories, one per SciDB instance, on each instance's physical server, using FOLDER as the base name. Then scidb_backup.py writes each instance's portion of the array data into that instance's directory. The metadata files are written to the coordinator instance's directory.

The --parallel option is the fastest way to save and restore SciDB array data.

-a, --allVersions

Save or restore all versions of SciDB arrays. Potentially a lot of data.

-f, --filter PATTERN

Save or restore arrays whose names match the Python regular expression specified by PATTERN. Escape or quote as necessary to avoid shell expansion.

--include-ns NSLIST

Save or restore only those arrays contained in NSLIST, a comma-separated list of namespace names.  Enterprise Edition only.

--exclude-ns NSLIST

Save or restore only those arrays NOT contained in NSLIST, a comma-separated list of namespace names.  Enterprise Edition only.

--no-rbac

Do not restore role-based access control metadata when restoring from a backup archive.  Enterprise Edition only.

--use-public-ns

Ignore namespace information in a backup archive and try to restore all arrays into the public namespace.  Use to restore an Enterprise Edition backup archive into a Community Edition cluster.

-z, --zip

Compress data with gzip when saving, and decompress when restoring.

--force

Force silent removal of arrays before restoring them.

  • If --force is specified all arrays kept in the backup archive are removed from SciDB prior to the restore. 
  • If --force is NOT specified and there are one or more arrays present in both the backup archive and in SciDB, scidb_backup.py lists the arrays that would have been overwritten and quits.  Remove or rename the arrays and re-run the restore operation.

--auth-file AUTHFILE

Specifies a SciDB user authentication file.  For Enterprise Edition namespaces mode only.  See Security.

--port PORT

Use to specify a SciDB connection port other than the default port.

--ssh-port PORT

Network port to use for SSH communication, if other than the default.

--host HOST

Network name of the SciDB coordinator host.

FOLDER

When creating a backup without the --parallel option (the default), FOLDER specifies the backup archive directory into which SciDB writes the backup data files (one per array) and the per-backup metadata files (files whose names begin with a percent sign, %). On restore, FOLDER specifies the directory from which SciDB should read the metadata and per-array data files.

When using the --parallel option, FOLDER is used as the prefix for several per-instance directory names, one for each instance participating in the parallel save or restore.  For example, if you do a parallel save on a four-instance cluster using the FOLDER name /tmp/myBkp, then four directories will be created: /tmp/myBkp.0, /tmp/myBkp.1, /tmp/myBkp.2, and /tmp/myBkp.3.  If you later do a parallel restore using the FOLDER name /tmp/myBkp, the script will look for those four directory names.

You must run scidb_backup.py (in either -s/--save or -r/--restore mode) from the coordinator host.
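The per-instance naming described for FOLDER under --parallel can be sketched in a few lines of Python. This is only an illustration of the naming pattern from the /tmp/myBkp example, not code taken from scidb_backup.py. The helper name parallel_backup_dirs is hypothetical, and the assumption that suffixes run consecutively from 0 is drawn from that four-instance example alone.

```python
def parallel_backup_dirs(folder, instance_count):
    """Return the directory names a parallel backup is expected to use.

    Assumption: suffixes are consecutive integers starting at 0, as in the
    four-instance /tmp/myBkp example. Real clusters may number instances
    differently.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return [f"{folder}.{i}" for i in range(instance_count)]


if __name__ == "__main__":
    for path in parallel_backup_dirs("/tmp/myBkp", 4):
        print(path)
```

A list like this can be handy when copying a parallel archive off the cluster, since every per-instance directory has to be collected for a later restore to find it.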

Access Control Considerations

Role-based access control (RBAC) is only available in SciDB Enterprise Edition.

Each --save operation makes a complete copy of access control records in the backup archive.  Subsequent --restore operations restore all the access control records if restoring into a newly initialized cluster.  When restoring into a cluster that has pre-existing access control records, some access control records will not be restored.  The rules for deciding which records are restored and which not are described in the rbactool.py section.  If these rules do not work for your site security policy, you can restore a backup archive with the --no-rbac option and then use rbactool.py and the %rbac.json file from the archive to set up access control as you like.

Backing Up All SciDB Arrays

To back up all array data in a SciDB installation into the /tmp directory, do the following:

  1. Locate scidb_backup.py in the tools and utilities directory: /opt/scidb/<version>/bin/scidb_backup.py by default. 
  2. Run scidb_backup.py from the coordinator host using the -s/--save option, specifying that backup files are to be written in opaque format to a folder in /tmp on the coordinator instance.

    $ iquery -aq "project ( list('arrays'), name)"
    {No} name
    {0} 'Bar'
    {1} 'Foo'
    $
    $ scidb_backup.py -s opaque /tmp/backup
    [scidb_backup] Archiving public.Bar
    [scidb_backup] Archiving public.Foo
    [scidb_backup] Saved 2 of 2 arrays
    [scidb_backup] Verifying /tmp/backup
    [scidb_backup] Backup /tmp/backup verified
    $
    $ ls -lat /tmp/backup/
    total 24
    drwxrwxrwt. 23 root root 4096 Apr  8 17:42 ../
    drwx------   2 scidb  scidb    63 Apr  8 17:38 ./
    -rw-rw-r--   1 scidb  scidb   705 Apr  8 17:38 Foo
    -rw-rw-r--   1 scidb  scidb   705 Apr  8 17:38 Bar
    -rw-------   1 scidb  scidb   346 Apr  8 17:38 %rbac.json
    -rw-------   1 scidb  scidb   282 Apr  8 17:38 %manifest
    $ 

  3. At this point, you can copy the backup files off the coordinator from the /tmp/backup directory.
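If you want to script the save step, for example from a scheduled job on the coordinator host, a small wrapper can assemble the command line. The sketch below uses only options documented on this page (--save, --parallel, --zip, --allVersions, --filter). The function name build_save_command and its keyword arguments are hypothetical, and the sketch assumes scidb_backup.py is on the PATH.

```python
import shlex


def build_save_command(folder, fmt="opaque", parallel=False, compress=False,
                       all_versions=False, name_filter=None):
    """Build an argument list for a scidb_backup.py save operation."""
    # Formats listed on this page for -s/--save.
    if fmt not in ("opaque", "binary", "tsv+"):
        raise ValueError(f"unsupported save format: {fmt!r}")
    cmd = ["scidb_backup.py", "--save", fmt]
    if parallel:
        cmd.append("--parallel")
    if compress:
        cmd.append("--zip")
    if all_versions:
        cmd.append("--allVersions")
    if name_filter is not None:
        cmd += ["--filter", name_filter]
    cmd.append(folder)
    return cmd


if __name__ == "__main__":
    cmd = build_save_command("/tmp/backup")
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))
    # To run it for real (on the coordinator host):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell expansion of a --filter pattern.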


Backing Up Selected Arrays

To save only some arrays, specify their names with the -f/--filter option. For example, to save only the Foo array:


    $ scidb_backup.py -s opaque -f Foo /tmp/backup
    [scidb_backup] Archiving public.Foo
    [scidb_backup] Saved 1 of 1 arrays
    [scidb_backup] Verifying /tmp/backup
    [scidb_backup] Backup /tmp/backup verified
    $
    $ ls -lat /tmp/backup/
    total 20
    drwx------   2 scidb  scidb    52 Apr  8 17:48 ./
    -rw-rw-r--   1 scidb  scidb   705 Apr  8 17:48 Foo
    drwxrwxrwt. 23 root root 4096 Apr  8 17:48 ../
    -rw-------   1 scidb  scidb   346 Apr  8 17:48 %rbac.json
    -rw-------   1 scidb  scidb   236 Apr  8 17:48 %manifest
    $

The -f/--filter option parameter can be a Python regular expression. 
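Because PATTERN is a Python regular expression, you can preview which array names it would select before running a backup. This page does not say whether the script anchors the pattern to the whole name, so the sketch below shows both behaviors side by side. The helper matching_arrays and the sample array names are hypothetical, and this is not the script's own matching code.

```python
import re


def matching_arrays(pattern, names, full_match=False):
    """Return the names selected by a Python regular expression.

    full_match=False finds the pattern anywhere in the name (re.search);
    full_match=True requires the whole name to match (re.fullmatch).
    Which of the two scidb_backup.py uses is an open assumption here.
    """
    rx = re.compile(pattern)
    test = rx.fullmatch if full_match else rx.search
    return [name for name in names if test(name)]


if __name__ == "__main__":
    names = ["Foo", "Foo_2019", "Bar", "BarFoo"]
    print(matching_arrays("Foo", names))
    print(matching_arrays("Foo", names, full_match=True))
    print(matching_arrays("^Foo", names))
```

If the difference matters for your arrays, anchoring the pattern explicitly (for example ^Foo$) removes the ambiguity.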

Restoring SciDB Arrays

To restore arrays from a backup archive, do the following:

  1. Locate scidb_backup.py (by default /opt/scidb/<version>/bin/scidb_backup.py) and the directory containing the backup to be restored.
  2. Run scidb_backup.py from the coordinator host using the restore option.  You can use format undef, since the backup archive remembers the format it was stored with.

    $ iquery -aq "project(list('arrays'), name)"
    {No} name
    $
    $ scidb_backup.py -r undef /tmp/backup
    [scidb_backup] Verifying /tmp/backup
    [scidb_backup] Backup /tmp/backup verified
    [scidb_backup] Restoring public.Foo
    [scidb_backup] Restored 1 of 1 arrays
    $ 
    $ iquery -aq "project(list('arrays'), name)"
    {No} name
    {0} 'Foo'
    $ 
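The restore above succeeds without --force because SciDB holds no arrays that are also in the archive. The rule described for --force can be pictured as a set intersection. The sketch below is an illustration of that rule under stated assumptions, not the script's implementation, and the names restore_conflicts and may_proceed are hypothetical.

```python
def restore_conflicts(archive_arrays, existing_arrays):
    """Arrays present both in the backup archive and in SciDB."""
    return sorted(set(archive_arrays) & set(existing_arrays))


def may_proceed(archive_arrays, existing_arrays, force=False):
    """Mirror the documented rule: without --force, any overlap stops the
    restore; with --force, the overlapping arrays are removed first."""
    conflicts = restore_conflicts(archive_arrays, existing_arrays)
    return force or not conflicts, conflicts


if __name__ == "__main__":
    ok, conflicts = may_proceed(["Foo"], ["Foo", "Bar"])
    if not ok:
        print("Remove or rename before restoring:", ", ".join(conflicts))
```

In practice the existing array names would come from the iquery list query shown above.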

Parallel Backup of SciDB Arrays

This example shows how to use the parallel save option.

  1. We start out with two arrays.  For demonstration purposes we compute sums on a few of their attributes.  After restoring, we'll see that these sums are the same.

    $ iquery -aq "project(list(), name)"
    {No} name
    {0} 'POINTS'
    {1} 'SAMPLES'
    $ 
    $ iquery -aq "aggregate(POINTS, sum(w))"
    {i} w_sum
    {0} 500
    $ 
    $ iquery -aq "aggregate(SAMPLES, sum(v))"
    {i} v_sum
    {0} 1.08548e+12
    $ 
  2. Run scidb_backup.py from the coordinator host to save all arrays in parallel, using compressed opaque format.

    $ scidb_backup.py --save opaque --zip --parallel /tmp/Backup
    [scidb_backup] Archiving zipped public.POINTS
    [scidb_backup] Archiving zipped public.SAMPLES
    [scidb_backup] Saved 2 of 2 arrays
    [scidb_backup] Verifying /tmp/Backup
    [scidb_backup] Backup /tmp/Backup verified
    $ 
  3. Examining the created archive folders, we see there is one for each instance.  This cluster consists of four instances running on a single server.  Had the cluster contained multiple servers, each server's /tmp directory would contain only the per-instance folders for the data on that server.  Only the per-instance folder on the coordinator contains the %manifest and %rbac.json metadata files.

    $ ls -ld  /tmp/Backup*
    drwx------ 2 scidb scidb 70 Apr  9 15:12 /tmp/Backup.0/
    drwx------ 2 scidb scidb 35 Apr  9 15:12 /tmp/Backup.1/
    drwx------ 2 scidb scidb 35 Apr  9 15:12 /tmp/Backup.2/
    drwx------ 2 scidb scidb 35 Apr  9 15:12 /tmp/Backup.3/
    $ 
    $ ls -l  /tmp/Backup.0
    total 16
    -rw------- 1 scidb scidb  320 Apr  9 15:09 %manifest
    -rw------- 1 scidb scidb 2366 Apr  9 15:09 POINTS
    -rw------- 1 scidb scidb  346 Apr  9 15:09 %rbac.json
    -rw------- 1 scidb scidb 1731 Apr  9 15:09 SAMPLES
    $ 
    $ ls -l  /tmp/Backup.3
    total 8
    -rw------- 1 scidb scidb 3346 Apr  9 15:09 POINTS
    -rw------- 1 scidb scidb 2411 Apr  9 15:09 SAMPLES
    $ 
    $ file /tmp/Backup.3/POINTS
    /tmp/Backup.3/POINTS: gzip compressed data, from Unix, last modified: Tue Apr  9 15:09:20 2019
    $ 
  4. Now we remove the arrays from SciDB.

    $ iquery -aq "remove(POINTS); remove(SAMPLES)"
    Query was executed successfully
    Query was executed successfully
    $ 
  5. Now restore the array data from the archive.  When restoring, you do not need to remember the archive format, or the other options used to create the archive.  All that information is encoded in the archive metadata.  So long as you remember the base name of the archive (/tmp/Backup), the scidb_backup.py script figures out the rest.

    $ scidb_backup.py --restore undef /tmp/Backup
    [scidb_backup] Verifying /tmp/Backup
    [scidb_backup] Backup /tmp/Backup verified
    [scidb_backup] Restoring zipped public.POINTS
    [scidb_backup] Restoring zipped public.SAMPLES
    [scidb_backup] Restored 2 of 2 arrays
    $ 
  6. Finally we run the summation aggregates again as a quick check that the array data is intact.

    $ iquery -aq "aggregate(POINTS, sum(w))"
    {i} w_sum
    {0} 500
    $ 
    $ iquery -aq "aggregate(SAMPLES, sum(v))"
    {i} v_sum
    {0} 1.08548e+12
    $
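The before-and-after comparison in steps 1 and 6 can be automated. The sketch below assumes the two-line iquery output format shown in this example (a header line followed by one value line) and compares the values as displayed text, so it is a quick sanity check rather than proof that the data is identical. The helper names parse_single_value, snapshot, and iquery are hypothetical. The iquery invocation uses the same -aq options as the example.

```python
import subprocess

# The same check queries used in the example above.
CHECKS = ["aggregate(POINTS, sum(w))", "aggregate(SAMPLES, sum(v))"]


def parse_single_value(iquery_output):
    """Extract the value from output such as '{i} w_sum' / '{0} 500'."""
    lines = [ln.strip() for ln in iquery_output.splitlines() if ln.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected a header and one row, got {lines!r}")
    # The row looks like '{0} 500': drop the coordinate, keep the value.
    return lines[1].split(None, 1)[1]


def snapshot(run_query, queries):
    """Run each query and record its displayed value."""
    return {q: parse_single_value(run_query(q)) for q in queries}


def iquery(query):
    """Run a query through the iquery client and return its output."""
    result = subprocess.run(["iquery", "-aq", query], check=True,
                            capture_output=True, text=True)
    return result.stdout


# Typical use on the coordinator host:
#   before = snapshot(iquery, CHECKS)
#   ... save, remove, and restore the arrays ...
#   after = snapshot(iquery, CHECKS)
#   assert before == after, "aggregates changed across backup/restore"
```

Taking the query runner as a parameter keeps the comparison logic testable without a running SciDB cluster.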