Fault Tolerance

Fault tolerance is an Enterprise Edition feature that lets a SciDB cluster continue its intended operation when one or more instances fail. 

A permanent failure of the Postgres server containing the SciDB System Catalog is unrecoverable. Regular backups of your SciDB data, for example with the scidb_backup.py utility, mitigate this risk.
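As a minimal sketch, a periodic backup and a later restore might look like the following. The backup directory is illustrative, and the exact formats and options accepted by scidb_backup.py can vary by release; run scidb_backup.py --help to confirm.

  scidb_backup.py --save opaque /backup/scidb-nightly     # back up all arrays in the opaque format
  scidb_backup.py --restore opaque /backup/scidb-nightly  # restore them after a loss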

To maintain the highest level of fault tolerance:

  1. Configure SciDB to run as a service, as described in Running SciDB.
  2. Enable replication, as described in Enabling Replication (see the configuration sketch after this list).
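Replication is controlled by the redundancy parameter in the cluster's config.ini. The excerpt below is a minimal sketch, assuming a hypothetical cluster named mydb with four instances on each of four hosts; the host names, the instance counts, and the server-N=host,count convention (one instance plus count more per server) are illustrative, and Enabling Replication remains the authoritative reference.

  [mydb]
  server-0=host1,3    # 4 instances on host1 (instance 0 plus 3 more)
  server-1=host2,3
  server-2=host3,3
  server-3=host4,3
  redundancy=3        # keep 3 extra replicas of every chunk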

Degraded Mode Operation

When one or more instances of your SciDB cluster have failed, WRITE queries are disallowed but READ queries can still succeed; the cluster is then said to be in degraded mode. To determine whether your cluster is in degraded mode, perform either of the following actions:

  • Attempt to run a WRITE query; for example, attempt to create a simple array, as in the probe sketched after this list. If it succeeds, all instances are online. If any instance is offline, an error occurs. If you run the query immediately after an instance failure (the liveness-timeout parameter is 120 seconds by default), you may get a connection error. After the timeout period, the "Instance liveness has changed" error message appears.
  • Use list_instances() or list_array_residency(), as described in Managing SciDB Instances, to determine which instances are offline and which arrays are in degraded mode.
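As a sketch, the following iquery commands perform both checks from the command line. The array name degraded_probe and its one-cell schema are illustrative throwaways.

  iquery -aq "create array degraded_probe <v:int64>[i=0:0,1,0]"  # WRITE probe; errors out in degraded mode
  iquery -aq "remove(degraded_probe)"                            # clean up if the create succeeded
  iquery -aq "list_instances()"                                  # shows which instances are offline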

If an array's liveness falls below the READ-quorum threshold, the following error appears when the array is accessed:

Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_NO_QUORUM2
Error description: Error during query execution. Not enough online instances to execute query.

When Instances Fail Permanently

When an instance goes offline, all currently running queries fail rather than running to completion. After the liveness-timeout period expires, you can issue read-only queries (assuming that you still have a READ quorum). If the failed instance does not recover, perform the following steps to ensure that you do not lose data:

  1. If you are close to losing your READ quorum, save all of your data using the SciDB operator save() or the scidb_backup.py utility, in any format that you are comfortable with.
  2. Add or replace instances, as described in Managing SciDB Instances, or simply remove the dead instances from the cluster membership (assuming enough live servers remain to maintain the required redundancy).
  3. Copy the arrays that are in danger of becoming inaccessible to the new set of instances (that is, the new membership).

Performing the backup or save (step 1) is not required, but it may be faster than steps 2 and 3. You can also perform the save after step 2. Once the liveness of the cluster matches its membership, you can reload the data saved in step 1 with load(), as in the sketch below.
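As a minimal sketch, saving an at-risk array and reloading it later might look like the following. The array name important_array and the file path are illustrative; the arguments -2 (write a single file on the coordinator, per save()'s conventional defaults) and 'opaque' follow the usual save()/load() argument order, which can vary by release.

  iquery -aq "save(important_array, '/backup/important_array.opaque', -2, 'opaque')"
  # repair the membership (step 2); then, once liveness matches membership:
  iquery -aq "load(important_array, '/backup/important_array.opaque', -2, 'opaque')"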

Degradation Example

For this example, assume the following scenario:

  • Your SciDB cluster consists of 16 instances distributed across 4 servers.
  • Your redundancy parameter is set to redundancy=3.
  • Three of your worker instances have failed, and you have not been able to bring them back online.
  • You have a large amount of important array data that you cannot afford to lose.
  • All of your data is available, and you can still run read-only queries.
  • You cannot run any update transactions, and if another worker instance fails, you may lose the READ quorum. A READ quorum no longer exists when the number of failed servers exceeds the specified redundancy value. For redundancy purposes, a single instance failure marks the entire server as failed, even if healthy instances are still running on it. Each instance failure on a previously healthy server therefore costs one level of redundancy. In the worst case, four instance failures, one on each server, may result in four failed servers and a complete loss of service and data.

To prevent data loss, perform the steps in the preceding section while you still have a READ quorum.