Dev - Nexus Repo / NEXUS-18185

Database out of sync in HA cluster - mitigation and recovery

    Details

    • Type: Story
    • Status: Done
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.16.0
    • Component/s: HA

      Description

      In certain circumstances, database records on one node of an HA cluster can become out of sync with the other nodes in ways that OrientDB is unable to reconcile.  The symptom is that messages like these appear in the log:

      2018-09-10 12:49:52,698-0400 ERROR [nexus_QuartzSchedulerThread] EnterpriseSandbox_Secondary_2 *SYSTEM com.orientechnologies.orient.core.db.OPartitionedDatabasePool$DatabaseDocumentTxPooled - Error on transaction commit `009A44E1`
      com.orientechnologies.orient.server.distributed.ODistributedException: Quorum (2) cannot be reached on server 'DE8A74B3-C1002895-1912C78F-3FC2A144-CC0335B3' database 'config' because it is major than available nodes (1)

       

      2018-09-19 00:00:05,016-0400 ERROR [Timer-1] EnterpriseSandbox_Tertiary_3 *SYSTEM com.sonatype.nexus.hazelcast.internal.orient.SharedHazelcastPlugin - [8268DBA0-AA8B5518-9A85E312-EDF0E6F3-E03887CA]<-[DE8A74B3-C1002895-1912C78F-3FC2A144-CC0335B3] Error on installing database 'config' in /binrepo/nexus/nexus-work3/nexus3/db/config
      com.orientechnologies.orient.server.distributed.ODistributedException: Skipped request id=2.214812 task=deploy_db on database 'config' because LSN{segment=40, position=7991} < current LSN{segment=41, position=17780}

      In addition, the affected node will likely start returning error responses.
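
As a rough illustration of how the two symptoms above could be detected automatically, here is a minimal Python sketch that scans log lines for them. The regular expressions are derived only from the two excerpts quoted in this issue; the function name `scan_log_lines` and the labels `quorum` and `stale-lsn` are hypothetical, and other out-of-sync failures may log different messages.

```python
import re

# Patterns derived from the two log excerpts in this issue (assumption:
# the message wording is stable across versions; it may not be).
QUORUM_RE = re.compile(
    r"ODistributedException: Quorum \((\d+)\) cannot be reached on server "
    r"'([^']+)' database '([^']+)'"
)
LSN_RE = re.compile(
    r"ODistributedException: Skipped request id=\S+ task=(\S+) on database "
    r"'([^']+)' because LSN\{segment=(\d+), position=(\d+)\} < current "
    r"LSN\{segment=(\d+), position=(\d+)\}"
)


def scan_log_lines(lines):
    """Return (kind, database, detail) tuples for lines matching either symptom."""
    findings = []
    for line in lines:
        quorum = QUORUM_RE.search(line)
        if quorum:
            required, server, database = quorum.groups()
            findings.append(
                ("quorum", database, f"required={required} server={server}")
            )
            continue
        lsn = LSN_RE.search(line)
        if lsn:
            task, database, seg, pos, cur_seg, cur_pos = lsn.groups()
            findings.append(
                ("stale-lsn", database,
                 f"task={task} lsn={seg}:{pos} current={cur_seg}:{cur_pos}")
            )
    return findings
```

The function accepts any iterable of strings, so an open log file can be passed directly. A non-empty result only suggests that a node may be out of sync; it does not confirm it.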

      We have identified some of the cases where this can occur, and work is underway to fix those.

      However, there are other potential cases where a node can get out of sync, and realistically it may not be possible to prevent all of them, since some involve external conditions (such as clock jumps) that are outside our control.  So, in addition to fixing the known causes of unresolvable record conflicts, we should provide Nexus administrators with the tools needed to identify when a node has conflicts that can't be resolved automatically, and a UI where an administrator can manually resynchronize nodes.


              People

              Assignee:
              Unassigned
              Reporter:
              Rich Seddon (rseddon)
              Last Updated By:
              Michael Prescott
              Votes:
              1
              Watchers:
              11
