  Dev - Nexus Repo / NEXUS-19117

Quorum lost in 3-node cluster when one node ran out of heap space.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.14.0
    • Fix Version/s: None
    • Component/s: HA
    • Labels:
    • Notability: 2

      Description

      In a 3-node HA cluster running 3.14.0-04, one of the nodes got an OutOfMemoryError due to NEXUS-17896:

      2019-02-13 15:24:53,653+0000 WARN  [Timer-1] az-1a *SYSTEM com.orientechnologies.orient.server.distributed.impl.ODistributedDatabaseImpl - [7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] Timeout (14717ms) on waiting for synchronous responses from nodes=[3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807, 634B8375-ED8006DD-5647C452-267D010D-92B8E0C4] responsesSoFar=[3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807] request=(id=0.12304303 task=gossip timestamp: 1550071478935 lockManagerServer: 3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807)
      2019-02-13 15:24:53,664+0000 WARN  [Timer-1] az-1a *SYSTEM com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker - [7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F]->[634B8375-ED8006DD-5647C452-267D010D-92B8E0C4] Server '634B8375-ED8006DD-5647C452-267D010D-92B8E0C4' did not respond to the gossip message (db=analytics, timeout=10000ms), but cannot be set OFFLINE by configuration
      2019-02-13 15:25:09,556+0000 INFO  [elasticsearch[7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F][scheduler]T#1] az-1a *SYSTEM org.elasticsearch.monitor.jvm - [7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] [gc][old][8627056][312] duration [15.8s], collections [3]/[15.9s], total [15.8s]/[15.2m], memory [3gb]>[3.1gb]/[3.9gb], all_pools {[young] [440.3mb]>[452.3mb]/[455mb]}{[survivor] [0b]>[0b]/[455mb]}{[old] [2.6gb]>[2.6gb]/[2.6gb]}
      2019-02-13 15:28:50,437+0000 ERROR [qtp1160832246-1624636] az-1a GRZ0 org.sonatype.nexus.extdirect.internal.ExtDirectServlet - Failed to invoke action method: rapture_State.rapture_State_get, java-method: org.sonatype.nexus.rapture.internal.state.StateComponent.getState
      java.lang.OutOfMemoryError: GC overhead limit exceeded
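
      The GC log above shows the old generation pinned at its 2.6gb limit with the heap close to its ~3.9gb maximum, which matches the "GC overhead limit exceeded" failure that follows. As a diagnostic sketch only (the vmoptions file location, the dump path, and the 4g sizing are assumptions about this deployment, not confirmed settings), the affected node could be restarted with standard HotSpot options that capture a heap dump and exit promptly the next time the OOM recurs:

      # sketch for bin/nexus.vmoptions (assumed location); adjust sizes and paths for the actual environment
      -Xms4g
      -Xmx4g
      -XX:+HeapDumpOnOutOfMemoryError
      -XX:HeapDumpPath=/var/log/nexus/heapdump.hprof
      -XX:+ExitOnOutOfMemoryError

      Exiting on the first OutOfMemoryError (supported on JDK 8u92 and later) would also remove the half-dead member from the cluster quickly, instead of leaving it timing out on gossip as seen above.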

      One of the two remaining nodes kept functioning, but the other started logging many errors. This caused quorum to be lost and made recovery difficult.

      Losing one node in a 3-node HA cluster should not cause failures in the other two nodes. The errors below were logged on one of the surviving nodes (az-1b):

      2019-02-13 15:33:46,221+0000 ERROR [OrientDB DistributedWorker node=3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807 db=config id=-4] az-1b *SYSTEM com.orientechnologies.orient.server.distributed.impl.ODistributedWorker - [3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807]->[7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] Error on sending response '3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807' back (reqId=0.12304330 err=com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F')
      com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F'
      at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getClusterMemberByName(OHazelcastPlugin.java:705)
      at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getRemoteServer(OHazelcastPlugin.java:622)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:433)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:413)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.onMessage(ODistributedWorker.java:399)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.run(ODistributedWorker.java:127)
      2019-02-13 15:33:53,725+0000 INFO [hz.nexus.HealthMonitor] az-1b *SYSTEM com.hazelcast.internal.diagnostics.HealthMonitor - [10.116.16.39]:5701 [nexus] [3.10.3] processors=4, physical.memory.total=15.7G, physical.memory.free=1.7G, swap.space.total=0, swap.space.free=0, heap.memory.used=2.9G, heap.memory.free=1.1G, heap.memory.total=4.0G, heap.memory.max=4.0G, heap.memory.used/total=72.18%, heap.memory.used/max=72.18%, minor.gc.count=5294, minor.gc.time=96588ms, major.gc.count=6, major.gc.time=1834ms, load.process=0.35%, load.system=0.09%, load.systemAverage=0.04, thread.count=211, thread.peakCount=307, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=239974629, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=0, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1
      2019-02-13 15:33:56,223+0000 ERROR [OrientDB DistributedWorker node=3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807 db=component id=-4] az-1b *SYSTEM com.orientechnologies.orient.server.distributed.impl.ODistributedWorker - [3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807]->[7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] Error on sending response '3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807' back (reqId=0.12304331 err=com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F')
      com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F'
      at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getClusterMemberByName(OHazelcastPlugin.java:705)
      at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getRemoteServer(OHazelcastPlugin.java:622)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:433)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:413)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.onMessage(ODistributedWorker.java:399)
      at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.run(ODistributedWorker.java:127)
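
      For context on the expectation above, here is a minimal sketch of the majority-quorum arithmetic (plain Java for illustration, not Nexus or OrientDB code; it assumes the cluster uses a simple majority write quorum, which is the usual default for a 3-node HA setup):

      public class QuorumCheck {
          // Majority quorum for an N-node cluster: floor(N/2) + 1.
          static int majorityQuorum(int clusterSize) {
              return clusterSize / 2 + 1;
          }

          public static void main(String[] args) {
              int nodes = 3;
              int quorum = majorityQuorum(nodes);   // 2
              int surviving = nodes - 1;            // one node lost to the OOM
              System.out.printf("quorum=%d, surviving=%d, quorum held=%b%n",
                  quorum, surviving, surviving >= quorum);  // quorum held=true
          }
      }

      With two of three members still up, a majority quorum of 2 should still be reachable, which is why the cascade of errors on the surviving nodes reads as a bug rather than expected behaviour.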

        Attachments

          Activity

            People

            Assignee: Unassigned
            Reporter: Rich Seddon (rseddon)
            Last Updated By: Rich Seddon
            Votes: 0
            Watchers: 2

              Dates

              Created:
              Updated:
