scidb_backup.py
Use the scidb_backup.py script for upgrading SciDB or for saving and restoring small amounts of array data. By default, this script is located at /opt/scidb/<version>/bin/scidb_backup.py.
Single-file backups of multiple terabytes may take a long time. For such cases, two options exist:
- Use parallel backup and restore. See below for details.
- In some circumstances a backup may not be necessary if /wiki/spaces/ESD169/pages/50856133 / /wiki/spaces/ESD169/pages/50856126 is enabled.
Contact Paradigm4 Customer Solutions to discuss a strategy for your environment.
Usage: scidb_backup.py [-h] [-A AUTH_FILE] [-c HOST] [--port PORT] [--ssh-port SSH_PORT] [-v]

Option | Explanation |
---|---|
-s, --save FORMAT | Back up array data to some location (defined either by the directory_name parameter or as a side effect of choosing the --parallel option, see below) using the specified FORMAT. Supported formats for saving array data are opaque, binary, and tsv+. |
-r, --restore FORMAT | Restore array data from some location (defined by the directory_name parameter or as a side effect of choosing the --parallel option, see below). The FORMAT parameter is a legacy of older versions of the backup script. In recent releases the format used to store a backup archive is encoded in the archive's metadata. Use undef as the FORMAT parameter on restore to avoid "format mismatch" warnings. (You can also use any of the supported -s/--save formats, but your choice should match the format encoded in the backup archive.) |
-d, --delete | Delete backup folder(s) on all hosts. |
--ping | Check SSH connectivity to all instance servers. Ensures that all cluster instances are reachable before starting a backup or restore. |
--verify | Verify that an existing backup archive folder is internally consistent. Done automatically after --save and prior to --restore. |
-p, --parallel | When parallel is NOT specified (default behavior), scidb_backup.py creates a single file for each array that contains all of the array's data in the directory_name directory on the coordinator. When specified, the parallel option causes scidb_backup.py to create a number of sub-directories, one per SciDB instance, on each instance's physical server using the directory_name as a base. Then scidb_backup.py writes each instance's portion of the array data into its per-server directory. The metadata files are written to the coordinator instance's directory. The --parallel option is the fastest way to save and restore SciDB array data. |
-a, --allVersions | Save or restore all versions of SciDB arrays. Potentially a lot of data. |
-f, --filter PATTERN | Save or restore arrays whose names match the Python regular expression specified by PATTERN. Escape or quote as necessary to avoid shell expansion. |
--include-ns NSLIST | Save or restore only those arrays contained in NSLIST, a comma-separated list of namespace names. Enterprise Edition only. |
--exclude-ns NSLIST | Save or restore only those arrays NOT contained in NSLIST, a comma-separated list of namespace names. Enterprise Edition only. |
--no-rbac | Do not restore role-based access control metadata when restoring from a backup archive. Enterprise Edition only. |
--use-public-ns | Ignore namespace information in a backup archive and try to restore all arrays into the public namespace. Use to restore an Enterprise Edition backup archive into a Community Edition cluster. |
-z, --zip | Compress data with gzip when saving, and decompress when restoring. |
--force | Force silent removal of arrays before restoring them. |
--auth-file AUTHFILE | Specifies a SciDB user authentication file. For Enterprise Edition namespaces mode only. See Security. |
--port PORT | Use to specify a SciDB connection port other than the default port. |
--ssh-port PORT | Network port to use for SSH communication, if other than the default. |
--host HOST | Network name of the SciDB coordinator host. |
FOLDER | When creating a backup without the --parallel option (the default), FOLDER specifies the backup archive directory name into which SciDB writes the backup data files (one per array) and the per-backup metadata files (files whose names begin with a percent sign, such as %manifest and %rbac.json). When using the --parallel option, FOLDER is used as the prefix for several per-instance directory names, one for each instance participating in the parallel save or restore. For example, a parallel save on a four-instance cluster with the FOLDER name /tmp/Backup creates the directories /tmp/Backup.0 through /tmp/Backup.3. |
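The per-instance naming scheme used by --parallel can be sketched in Python (the tool itself is a Python script; the four-instance count here is illustrative, matching the parallel example later on this page):

```python
def parallel_dirs(folder, num_instances):
    """Return the per-instance backup directory names derived from a FOLDER prefix."""
    return ["{0}.{1}".format(folder, i) for i in range(num_instances)]

# A four-instance cluster with FOLDER /tmp/Backup yields Backup.0 .. Backup.3.
print(parallel_dirs("/tmp/Backup", 4))
```

This is only an illustration of the naming convention; the actual directories are created by scidb_backup.py on each instance's physical server.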
You must run scidb_backup.py (in either -s/--save or -r/--restore mode) from the coordinator host.
Access Control Considerations
Role-based access control (RBAC) is only available in SciDB Enterprise Edition.
Each --save operation makes a complete copy of the access control records in the backup archive. A subsequent --restore operation restores all of the access control records if it is restoring into a newly initialized cluster. When restoring into a cluster that has pre-existing access control records, some access control records will not be restored. The rules for deciding which records are restored and which are not are described in the rbactool.py section. If these rules do not work for your site security policy, you can restore a backup archive with the --no-rbac option and then use rbactool.py and the %rbac.json file from the archive to set up access control as you like.
Backing Up All SciDB Arrays
To back up all array data in a SciDB installation into the /tmp directory, do the following:
- Locate scidb_backup.py in the tools and utilities directory: /opt/scidb/<version>/bin/scidb_backup.py by default.
- Run scidb_backup.py from the coordinator host using the -s/--save option, specifying that backup files are to be written in opaque format to a folder in /tmp on the coordinator instance.
```
$ iquery -aq "project(list('arrays'), name)"
{No} name
{0} 'Bar'
{1} 'Foo'
$
$ scidb_backup.py -s opaque /tmp/backup
[scidb_backup] Archiving public.Bar
[scidb_backup] Archiving public.Foo
[scidb_backup] Saved 2 of 2 arrays
[scidb_backup] Verifying /tmp/backup
[scidb_backup] Backup /tmp/backup verified
$
$ ls -lat /tmp/backup/
total 24
drwxrwxrwt. 23 root  root  4096 Apr  8 17:42 ../
drwx------   2 scidb scidb   63 Apr  8 17:38 ./
-rw-rw-r--   1 scidb scidb  705 Apr  8 17:38 Foo
-rw-rw-r--   1 scidb scidb  705 Apr  8 17:38 Bar
-rw-------   1 scidb scidb  346 Apr  8 17:38 %rbac.json
-rw-------   1 scidb scidb  282 Apr  8 17:38 %manifest
$
```
- At this point, you can copy the backup files off the coordinator from the /tmp/backup directory.
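One convenient way to move a non-parallel backup off the coordinator is to bundle the archive directory into a single tarball first. A minimal sketch using Python's standard library (the directory contents here are fabricated stand-ins for a real /tmp/backup archive):

```python
import pathlib
import shutil
import tempfile

# Fabricated stand-in for a non-parallel backup archive directory.
work = pathlib.Path(tempfile.mkdtemp())
backup = work / "backup"
backup.mkdir()
for name in ("Foo", "Bar", "%manifest", "%rbac.json"):
    (backup / name).write_text("placeholder\n")

# Bundle the whole directory into backup.tar.gz, ready for scp or rsync.
bundle = shutil.make_archive(str(work / "backup"), "gztar",
                             root_dir=work, base_dir="backup")
print(bundle)
```

Restore expects the archive directory laid out exactly as the save produced it, so unpack the tarball back to the same FOLDER path on the coordinator before running --restore.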
Backing Up Selected Arrays
If you want to save only some arrays, specify those array names with the -f/--filter option. For example, suppose you wanted to save only the Foo array.
```
$ scidb_backup.py -s opaque -f Foo /tmp/backup
[scidb_backup] Archiving public.Foo
[scidb_backup] Saved 1 of 1 arrays
[scidb_backup] Verifying /tmp/backup
[scidb_backup] Backup /tmp/backup verified
$
$ ls -lat /tmp/backup/
total 20
drwx------   2 scidb scidb   52 Apr  8 17:48 ./
-rw-rw-r--   1 scidb scidb  705 Apr  8 17:48 Foo
drwxrwxrwt. 23 root  root  4096 Apr  8 17:48 ../
-rw-------   1 scidb scidb  346 Apr  8 17:48 %rbac.json
-rw-------   1 scidb scidb  236 Apr  8 17:48 %manifest
$
```
The -f/--filter option parameter can be a Python regular expression.
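Because PATTERN is a Python regular expression, you can try a candidate pattern against your array names before running the backup. A quick sketch (whether the script anchors the match like re.match or re.fullmatch may vary by version, so treat this as an approximation):

```python
import re

# Candidate --filter pattern: the array Foo, or any name starting with SAM.
pattern = re.compile(r"Foo|SAM.*")

arrays = ["POINTS", "SAMPLES", "Foo", "Bar"]
selected = [name for name in arrays if pattern.fullmatch(name)]
print(selected)  # ['SAMPLES', 'Foo']
```

Remember to escape or quote the pattern on the command line (for example `-f 'Foo|SAM.*'`) so the shell does not expand it.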
Restoring SciDB Arrays
To restore all SciDB arrays, do the following:
- Locate scidb_backup.py in: /opt/scidb/<version>/bin/scidb_backup.py, and locate the directory containing the backup to be restored.
- Run scidb_backup.py from the coordinator host using the -r/--restore option. You can use format undef, since the backup archive remembers the format it was stored with.
```
$ iquery -aq "project(list('arrays'), name)"
{No} name
$
$ scidb_backup.py -r undef /tmp/backup
[scidb_backup] Verifying /tmp/backup
[scidb_backup] Backup /tmp/backup verified
[scidb_backup] Restoring public.Foo
[scidb_backup] Restored 1 of 1 arrays
$
$ iquery -aq "project(list('arrays'), name)"
{No} name
{0} 'Foo'
$
```
Parallel Backup of SciDB Arrays
This example shows how to use the parallel save option.
We start out with two arrays. For demonstration purposes we compute sums on a few of their attributes. After restoring, we'll see that these sums are the same.
```
$ iquery -aq "project(list(), name)"
{No} name
{0} 'POINTS'
{1} 'SAMPLES'
$
$ iquery -aq "aggregate(POINTS, sum(w))"
{i} w_sum
{0} 500
$
$ iquery -aq "aggregate(SAMPLES, sum(v))"
{i} v_sum
{0} 1.08548e+12
$
```
Run scidb_backup.py from the coordinator host to save all arrays in parallel, using compressed opaque format.
```
$ scidb_backup.py --save opaque --zip --parallel /tmp/Backup
[scidb_backup] Archiving zipped public.POINTS
[scidb_backup] Archiving zipped public.SAMPLES
[scidb_backup] Saved 2 of 2 arrays
[scidb_backup] Verifying /tmp/Backup
[scidb_backup] Backup /tmp/Backup verified
$
```
Examining the created archive folders, we see there is one for each instance. This cluster consists of four instances running on a single server. Had the cluster contained multiple servers, each server's /tmp directory would contain only the per-instance folders for the data on that server. Only the per-instance folder on the coordinator contains the %manifest and %rbac.json metadata files.
```
$ ls -ld /tmp/Backup*
drwx------ 2 scidb scidb 70 Apr  9 15:12 /tmp/Backup.0/
drwx------ 2 scidb scidb 35 Apr  9 15:12 /tmp/Backup.1/
drwx------ 2 scidb scidb 35 Apr  9 15:12 /tmp/Backup.2/
drwx------ 2 scidb scidb 35 Apr  9 15:12 /tmp/Backup.3/
$
$ ls -l /tmp/Backup.0
total 16
-rw------- 1 scidb scidb  320 Apr  9 15:09 %manifest
-rw------- 1 scidb scidb 2366 Apr  9 15:09 POINTS
-rw------- 1 scidb scidb  346 Apr  9 15:09 %rbac.json
-rw------- 1 scidb scidb 1731 Apr  9 15:09 SAMPLES
$
$ ls -l /tmp/Backup.3
total 8
-rw------- 1 scidb scidb 3346 Apr  9 15:09 POINTS
-rw------- 1 scidb scidb 2411 Apr  9 15:09 SAMPLES
$
$ file /tmp/Backup.3/POINTS
/tmp/Backup.3/POINTS: gzip compressed data, from Unix, last modified: Tue Apr  9 15:09:20 2019
$
```
Now we remove the arrays from SciDB.
```
$ iquery -aq "remove(POINTS); remove(SAMPLES)"
Query was executed successfully
Query was executed successfully
$
```
Now restore the array data from the archive. When restoring, you do not need to remember the archive format, or the other options used to create the archive. All that information is encoded in the archive metadata. So long as you remember the base name of the archive (/tmp/Backup), the scidb_backup.py script figures out the rest.
```
$ scidb_backup.py --restore undef /tmp/Backup
[scidb_backup] Verifying /tmp/Backup
[scidb_backup] Backup /tmp/Backup verified
[scidb_backup] Restoring zipped public.POINTS
[scidb_backup] Restoring zipped public.SAMPLES
[scidb_backup] Restored 2 of 2 arrays
$
```
Finally we run the summation aggregates again as a quick check that the array data is intact.
```
$ iquery -aq "aggregate(POINTS, sum(w))"
{i} w_sum
{0} 500
$
$ iquery -aq "aggregate(SAMPLES, sum(v))"
{i} v_sum
{0} 1.08548e+12
$
```
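The before/after comparison can itself be scripted. A minimal sketch of just the comparison logic (the aggregate values here are copied from the sessions above; in practice you would capture them from iquery output before the backup and after the restore):

```python
import math

# Aggregates recorded before the backup and again after the restore.
before = {"POINTS.sum(w)": 500.0, "SAMPLES.sum(v)": 1.08548e+12}
after = {"POINTS.sum(w)": 500.0, "SAMPLES.sum(v)": 1.08548e+12}

def sums_match(a, b):
    """True when both runs report the same aggregates, within float tolerance."""
    return a.keys() == b.keys() and all(
        math.isclose(a[key], b[key], rel_tol=1e-9) for key in a
    )

print(sums_match(before, after))  # True
```

A matching set of aggregates is a quick sanity check, not a byte-level verification; the --verify option remains the authoritative consistency check for the archive itself.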