NEXUS-14969 (Dev - Nexus Repo)

HA-C nodes do not rejoin their cluster after cluster shutdown

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.6.1
    • Fix Version/s: 3.8.0
    • Component/s: HA
    • Labels:
    • Story Points: 3

      Description

      In an HA-C cluster, nodes that were previously members of the cluster do not always rejoin it after being shut down. The following error has been encountered in the log files:

      2017-11-21 07:18:04,448+0000 ERROR [FelixStartLevel] *SYSTEM com.sonatype.nexus.hazelcast.internal.orient.SharedHazelcastPlugin - [F7E2EC33-6D42028D-074D8BDF-2EED45CB-ABC97A45] No LSN found for delta sync for database 'accesslog'. Asking for full database sync...
      

      Additionally, there may be error messages similar to the following:

      Caused by: com.orientechnologies.orient.server.distributed.ODistributedException: Quorum (1) cannot be reached on server 'XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX' database 'config' because it is major than the nodes in quorum (0)
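
      To check whether a node is affected, its log can be scanned for the two messages quoted in this issue. The following is a minimal sketch, not a Nexus tool: the function name find_rejoin_errors and the result layout are hypothetical, and the patterns are derived only from the log lines in this report, so other message variants may exist.

```python
import re

# Patterns derived from the log messages quoted in this issue.
NO_LSN = re.compile(r"No LSN found for delta sync for database '([^']+)'")
QUORUM = re.compile(
    r"Quorum \((\d+)\) cannot be reached on server '([^']+)' "
    r"database '([^']+)' because it is major than "
    r"(?:the nodes in quorum|available nodes) \((\d+)\)"
)


def find_rejoin_errors(lines):
    """Return (kind, details) tuples for log lines matching either message."""
    hits = []
    for line in lines:
        m = NO_LSN.search(line)
        if m:
            hits.append(("no_lsn", {"database": m.group(1)}))
            continue
        m = QUORUM.search(line)
        if m:
            hits.append(("quorum", {
                "required": int(m.group(1)),
                "server": m.group(2),
                "database": m.group(3),
                "available": int(m.group(4)),
            }))
    return hits


# Message text copied from the log excerpts in this issue.
sample = [
    "No LSN found for delta sync for database 'accesslog'. "
    "Asking for full database sync...",
    "com.orientechnologies.orient.server.distributed.ODistributedException: "
    "Quorum (2) cannot be reached on server "
    "'F7E2EC33-6D42028D-074D8BDF-2EED45CB-ABC97A45' database 'config' "
    "because it is major than available nodes (1)",
]

hits = find_rejoin_errors(sample)
```

      In practice the lines would come from the node's log file rather than the inline sample; the location of that file depends on the installation.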
      

      Reproduce

      Steps taken to reproduce:

      1) Start up Nexus1 in the 3-node cluster
      2) Start up Nexus2 in the 3-node cluster
      3) Start up Nexus3 in the 3-node cluster
      4) Add some config to Nexus like new repositories, new users, etc.
      5) Stop Nexus1
      6) Stop Nexus2
      7) Stop Nexus3
      8) Try to start Nexus1 or Nexus2. Startup fails with an error:

      2017-11-21 16:29:09,561+0000 INFO [FelixStartLevel] *SYSTEM org.sonatype.nexus.extender.NexusLifecycleManager - Start TASKS 
      2017-11-21 16:29:09,656+0000 WARN [FelixStartLevel] *SYSTEM org.sonatype.nexus.quartz.internal.orient.JobStoreImpl - Execution failed 
      com.orientechnologies.orient.server.distributed.ODistributedException: Quorum (2) cannot be reached on server 'F7E2EC33-6D42028D-074D8BDF-2EED45CB-ABC97A45' database 'config' because it is major than available nodes (1) 
      at com.orientechnologies.orient.server.distributed.impl.ODistributedDatabaseImpl.calculateQuorum(ODistributedDatabaseImpl.java:1061)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedDatabaseImpl.send2Nodes(ODistributedDatabaseImpl.java:430) 
      at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.sendRequest(ODistributedAbstractPlugin.java:584)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedTransactionManager.commit(ODistributedTransactionManager.java:162)
      

      9) Start Nexus3 (the last node stopped): it starts up.
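
      The numbers in the step 8 error, "Quorum (2)" against "available nodes (1)", are what a majority quorum would give when one node of a three-node cluster is up. The sketch below only illustrates that arithmetic. It assumes a majority rule (more than half of the cluster), which this issue does not confirm, and the function names are hypothetical. It does not explain the root cause or why Nexus3 is able to start on its own.

```python
def majority_quorum(cluster_size: int) -> int:
    """Smallest node count that is more than half of the cluster (assumed rule)."""
    return cluster_size // 2 + 1


def quorum_reachable(cluster_size: int, available_nodes: int) -> bool:
    """True when enough nodes are available to satisfy the assumed majority quorum."""
    return available_nodes >= majority_quorum(cluster_size)


# Three-node cluster restarted one node at a time.
for available in range(1, 4):
    status = "reachable" if quorum_reachable(3, available) else "not reachable"
    print(f"available={available} quorum={majority_quorum(3)} -> {status}")
```

      Under this assumption, the first node to restart after a full shutdown cannot reach quorum by itself, which matches the message logged in step 8.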

      Expected

      Nodes should be able to rejoin the cluster in any order after being shut down.

      Workaround

      The only workaround seems to be to abandon the two nodes that will not start and add two new nodes to the one working node.

            People

            Assignee:
            wwannemacher Wes Wannemacher
            Reporter:
            wwannemacher Wes Wannemacher
            Last Updated By:
            Peter Lynch
            Team:
            Nexus - Platform
            Votes:
            0
            Watchers:
            9
