  1. Dev - Nexus Repo
  2. NEXUS-18185

Database out of sync in HA cluster - mitigation and recovery


    • Type: Story
    • Status: Done
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.16.0
    • Component/s: HA


      In certain circumstances, database records on a node in an HA cluster can become out of sync with the other nodes in ways that OrientDB is unable to reconcile.  The symptom is that messages like these appear in the log:


      2018-09-10 12:49:52,698-0400 ERROR [nexus_QuartzSchedulerThread] EnterpriseSandbox_Secondary_2 *SYSTEM com.orientechnologies.orient.core.db.OPartitionedDatabasePool$DatabaseDocumentTxPooled - Error on transaction commit `009A44E1`
      com.orientechnologies.orient.server.distributed.ODistributedException: Quorum (2) cannot be reached on server 'DE8A74B3-C1002895-1912C78F-3FC2A144-CC0335B3' database 'config' because it is major than available nodes (1)


      2018-09-19 00:00:05,016-0400 ERROR [Timer-1] EnterpriseSandbox_Tertiary_3 *SYSTEM com.sonatype.nexus.hazelcast.internal.orient.SharedHazelcastPlugin - [8268DBA0-AA8B5518-9A85E312-EDF0E6F3-E03887CA]<-[DE8A74B3-C1002895-1912C78F-3FC2A144-CC0335B3] Error on installing database 'config' in /binrepo/nexus/nexus-work3/nexus3/db/config
      com.orientechnologies.orient.server.distributed.ODistributedException: Skipped request id=2.214812 task=deploy_db on database 'config' because LSN{segment=40, position=7991} < current LSN{segment=41, position=17780}

      Additionally, the node in question will likely start returning error responses.
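
As a rough illustration of how these symptoms could be spotted, the sketch below scans log lines for the two ODistributedException messages quoted above. It is not part of Nexus: the names `find_symptoms`, `Symptom`, `QUORUM_RE` and `STALE_LSN_RE` are hypothetical, and the patterns are modelled only on the two excerpts in this ticket, so other wordings of the same errors would be missed.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Patterns modelled on the two log excerpts in this ticket (assumption:
# real logs may contain other variants that these do not match).
QUORUM_RE = re.compile(
    r"Quorum \((?P<quorum>\d+)\) cannot be reached on server '(?P<server>[^']+)' "
    r"database '(?P<db>[^']+)' because it is major than available nodes "
    r"\((?P<available>\d+)\)"
)
STALE_LSN_RE = re.compile(
    r"Skipped request id=(?P<req>\S+) task=(?P<task>\S+) "
    r"on database '(?P<db>[^']+)' "
    r"because LSN\{segment=(?P<seg>\d+), position=(?P<pos>\d+)\} < "
    r"current LSN\{segment=(?P<cur_seg>\d+), position=(?P<cur_pos>\d+)\}"
)


class Symptom(NamedTuple):
    kind: str      # "quorum" or "stale_lsn" (hypothetical labels)
    database: str  # database name taken from the log message
    detail: str    # short human-readable summary


def find_symptoms(lines: Iterable[str]) -> Iterator[Symptom]:
    """Yield one Symptom for every line matching either pattern."""
    for line in lines:
        m = QUORUM_RE.search(line)
        if m:
            yield Symptom(
                "quorum",
                m["db"],
                f"quorum {m['quorum']} > available nodes {m['available']}",
            )
            continue
        m = STALE_LSN_RE.search(line)
        if m:
            yield Symptom(
                "stale_lsn",
                m["db"],
                f"LSN {m['seg']}:{m['pos']} behind current "
                f"{m['cur_seg']}:{m['cur_pos']}",
            )
```

Because `find_symptoms` accepts any iterable of strings, an open log file can be passed to it directly. A match only indicates that one of these messages was logged; it does not by itself prove that the node has unresolvable conflicts.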

      We have identified some of the cases where this can occur, and work is underway to fix those.

      But there are other potential cases where a node can get out of sync, and realistically it may not be possible to prevent all of them, since some involve external conditions (such as clock jumps) that are outside our control.  So in addition to fixing the known causes of unresolvable record conflicts, we should also provide Nexus administrators with the tools needed to identify when a node has conflicts that can't be resolved automatically, and a UI where an administrator can manually resynchronize nodes.






              Unassigned
              Rich Seddon (rseddon)
              Last Updated By: Michael Prescott