Failure Handling

Enabling replication in the Enterprise Edition ensures that SciDB handles most system failures in a way that guarantees data consistency. This topic discusses temporary failures, in which the failed instances eventually come back online without any user intervention. For situations where one or more instances fail, see Fault Tolerance.

Failure Types

During a failure, all in-progress queries abort and the cluster changes to one of the following degraded modes of service.

Member Instances Fail but Cluster has READ Quorum

  • Read queries use replica chunks as necessary to access copies of data whose primary chunks were on the failed instances. Any read-only query, such as an array scan or an analytic query, therefore proceeds as long as there is READ quorum. READ quorum for a cluster with redundancy parameter r is the minimum number of online servers that can successfully execute read queries. This value is the number of servers in the cluster minus r. An online server is one that contains no failed instances. If more instances or entire servers fail and the number of online servers drops below this value, the cluster cannot support read queries on array data.
  • Quorum restores automatically when the failed instances recover and become live again.
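
The READ quorum rule above is simple arithmetic, and a short sketch can make it concrete. The following Python is illustrative only: the function names read_quorum and has_read_quorum are hypothetical and are not part of SciDB, and the 4-server cluster is an invented example. The bounds check on the redundancy value is an assumption of this sketch, not a documented SciDB rule.

```python
def read_quorum(total_servers: int, redundancy: int) -> int:
    """Return the minimum number of online servers needed for read queries.

    Follows the rule stated in the text: number of servers minus r.
    """
    if total_servers < 1:
        raise ValueError("total_servers must be at least 1")
    # Assumption of this sketch: redundancy is non-negative and smaller
    # than the number of servers.
    if not 0 <= redundancy < total_servers:
        raise ValueError("redundancy must satisfy 0 <= r < total_servers")
    return total_servers - redundancy


def has_read_quorum(online_servers: int, total_servers: int, redundancy: int) -> bool:
    """Return True if the number of online servers has not dropped below READ quorum."""
    return online_servers >= read_quorum(total_servers, redundancy)


if __name__ == "__main__":
    # Hypothetical cluster: 4 servers, redundancy r = 1.
    print(read_quorum(4, 1))         # 3
    print(has_read_quorum(3, 4, 1))  # True: one server offline, reads proceed
    print(has_read_quorum(2, 4, 1))  # False: two servers offline, reads fail
```

Remember that a server counts as online only if none of its instances have failed, so a single failed instance removes its whole server from the online count.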

Member Instances Fail and Cluster Size Drops below READ Quorum

  • All transactions fail with a "Not enough online instances to execute query" error.
  • Quorum restores automatically when a failed instance recovers and rejoins the cluster.

System Catalog Server Fails or is Unreachable

  • Query execution cannot proceed if the system catalog server fails or is unreachable. Once the system catalog server is back online, the system starts accepting queries again.

If you try to run a write query on a degraded cluster, even one that has READ quorum, an error similar to the following appears:

Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_NO_QUORUM
Error description: Query processor error. Instance liveness has changed.
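
The failure types described in this topic can be summarized as a small decision procedure. The sketch below is a reading aid under stated assumptions, not SciDB code: the ServiceMode enum and the service_mode function are hypothetical names, and the caller is assumed to already know how many servers are online and whether the system catalog server is reachable.

```python
from enum import Enum


class ServiceMode(Enum):
    """Hypothetical labels for the modes of service described in this topic."""
    NORMAL = "read and write queries run"
    READ_ONLY = "read queries run, write queries fail"
    UNAVAILABLE = "no queries run"


def service_mode(catalog_reachable: bool, online_servers: int,
                 total_servers: int, redundancy: int) -> ServiceMode:
    """Classify the cluster according to the rules in the text."""
    # Query execution cannot proceed without the system catalog server.
    if not catalog_reachable:
        return ServiceMode.UNAVAILABLE
    # Below READ quorum (servers minus r), all transactions fail.
    if online_servers < total_servers - redundancy:
        return ServiceMode.UNAVAILABLE
    # Degraded but with READ quorum: reads proceed, writes are rejected.
    if online_servers < total_servers:
        return ServiceMode.READ_ONLY
    return ServiceMode.NORMAL


if __name__ == "__main__":
    # Hypothetical cluster: 4 servers, redundancy r = 1.
    for online in (4, 3, 2):
        print(online, service_mode(True, online, 4, 1).name)
    print(service_mode(False, 4, 4, 1).name)
```

In every degraded mode the cluster recovers on its own once the failed instances or the system catalog server come back online, so a client only needs to wait and resubmit the query.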