Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.14.0
Fix Version/s: None
Component/s: HA
Labels: None
Notability: 2
Description
In a 3-node HA cluster running 3.14.0-04, one of the nodes hit an OutOfMemoryError due to NEXUS-17896:
2019-02-13 15:24:53,653+0000 WARN [Timer-1] az-1a *SYSTEM com.orientechnologies.orient.server.distributed.impl.ODistributedDatabaseImpl - [7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] Timeout (14717ms) on waiting for synchronous responses from nodes=[3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807, 634B8375-ED8006DD-5647C452-267D010D-92B8E0C4] responsesSoFar=[3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807] request=(id=0.12304303 task=gossip timestamp: 1550071478935 lockManagerServer: 3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807)
2019-02-13 15:24:53,664+0000 WARN [Timer-1] az-1a *SYSTEM com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker - [7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F]->[634B8375-ED8006DD-5647C452-267D010D-92B8E0C4] Server '634B8375-ED8006DD-5647C452-267D010D-92B8E0C4' did not respond to the gossip message (db=analytics, timeout=10000ms), but cannot be set OFFLINE by configuration
2019-02-13 15:25:09,556+0000 INFO [elasticsearch[7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F][scheduler]T#1] az-1a *SYSTEM org.elasticsearch.monitor.jvm - [7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] [gc][old][8627056][312] duration [15.8s], collections [3]/[15.9s], total [15.8s]/[15.2m], memory [3gb]>[3.1gb]/[3.9gb], all_pools {[young] [440.3mb]>[452.3mb]/[455mb]}{[survivor] [0b]>[0b]/[455mb]}{[old] [2.6gb]>[2.6gb]/[2.6gb]}
2019-02-13 15:28:50,437+0000 ERROR [qtp1160832246-1624636] az-1a GRZ0 org.sonatype.nexus.extdirect.internal.ExtDirectServlet - Failed to invoke action method: rapture_State.rapture_State_get, java-method: org.sonatype.nexus.rapture.internal.state.StateComponent.getState
java.lang.OutOfMemoryError: GC overhead limit exceeded
One of the two other nodes was able to keep functioning, but the other started logging many errors. This caused quorum to be lost and made recovery difficult.
Losing one node in a 3-node HA cluster is not expected to cause failures in the other two nodes.
2019-02-13 15:33:46,221+0000 ERROR [OrientDB DistributedWorker node=3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807 db=config id=-4] az-1b *SYSTEM com.orientechnologies.orient.server.distributed.impl.ODistributedWorker - [3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807]->[7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] Error on sending response '3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807' back (reqId=0.12304330 err=com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F')
com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F'
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getClusterMemberByName(OHazelcastPlugin.java:705)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getRemoteServer(OHazelcastPlugin.java:622)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:433)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:413)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.onMessage(ODistributedWorker.java:399)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.run(ODistributedWorker.java:127)
2019-02-13 15:33:53,725+0000 INFO [hz.nexus.HealthMonitor] az-1b *SYSTEM com.hazelcast.internal.diagnostics.HealthMonitor - [10.116.16.39]:5701 [nexus] [3.10.3] processors=4, physical.memory.total=15.7G, physical.memory.free=1.7G, swap.space.total=0, swap.space.free=0, heap.memory.used=2.9G, heap.memory.free=1.1G, heap.memory.total=4.0G, heap.memory.max=4.0G, heap.memory.used/total=72.18%, heap.memory.used/max=72.18%, minor.gc.count=5294, minor.gc.time=96588ms, major.gc.count=6, major.gc.time=1834ms, load.process=0.35%, load.system=0.09%, load.systemAverage=0.04, thread.count=211, thread.peakCount=307, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=239974629, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=0, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1
2019-02-13 15:33:56,223+0000 ERROR [OrientDB DistributedWorker node=3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807 db=component id=-4] az-1b *SYSTEM com.orientechnologies.orient.server.distributed.impl.ODistributedWorker - [3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807]->[7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F] Error on sending response '3766F7B2-503514B0-B3A6A073-AFEE3A0C-0DF24807' back (reqId=0.12304331 err=com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F')
com.orientechnologies.orient.server.distributed.ODistributedException: Cannot find node '7ECE19CB-79D855DA-DE61B9E6-1ED747CE-4B2C114F'
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getClusterMemberByName(OHazelcastPlugin.java:705)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.getRemoteServer(OHazelcastPlugin.java:622)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:433)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.sendResponseBack(ODistributedWorker.java:413)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.onMessage(ODistributedWorker.java:399)
at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.run(ODistributedWorker.java:127)
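For reference, the expectation stated above follows from simple majority-quorum arithmetic (quorum = floor(n/2) + 1). The sketch below is illustrative only and is not Nexus or OrientDB code; it assumes a plain majority quorum and just shows why a 3-node cluster that loses one node should still hold quorum:

// Illustrative only: not Nexus/OrientDB code, just the majority-quorum
// arithmetic behind the expectation above.
public class QuorumCheck {
    // Majority quorum for a cluster of `nodes` members: floor(n/2) + 1.
    static int majorityQuorum(int nodes) {
        return nodes / 2 + 1;
    }

    public static void main(String[] args) {
        int nodes = 3;                       // 3-node HA cluster
        int quorum = majorityQuorum(nodes);  // = 2
        int healthyAfterOneLoss = nodes - 1; // one node lost to the OOM

        // With 2 of 3 nodes healthy, a majority quorum of 2 is still reachable,
        // so losing a single node should not take the cluster down by itself.
        System.out.printf("quorum=%d, healthy=%d, quorum held=%b%n",
                quorum, healthyAfterOneLoss, healthyAfterOneLoss >= quorum);
    }
}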