
[BUG] Egeria Metadata Server crashes silently while processing massive amount of elements

Open juergenhemelt opened this issue 3 years ago • 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

While processing 500+ elements (KafkaTopics from the StrimziMonitorIntegrationConnector), the Egeria Metadata Server crashes silently. The last few lines of the log:

Tue Aug 16 12:02:43 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:43 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:46 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Relationship
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:48 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Classification
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: Element Classified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:50 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Element
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:53 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Relationship
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:53 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Classification
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: Element Classified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:57 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:57 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Element
Tue Aug 16 12:02:57 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:58 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:58 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances

Expected Behavior

Should have processed every KafkaTopic element.

Steps To Reproduce

  • Install metadata server and view server
  • Configure StrimziMonitorIntegrationConnector so that it connects to a Strimzi installation with 500+ topics
  • Start StrimziMonitorIntegrationConnector
  • Wait about 40 minutes

Environment

- Egeria:3.10
- OS: Kubernetes (Openshift 4.9)
- Java: 11.x

Any Further Information?

No response

juergenhemelt avatar Aug 16 '22 12:08 juergenhemelt

Do you have any further indication of what may have happened?

Was anything recorded in the StatefulSet or pod status (kubectl describe)?

Was the topic connector the only workload?

planetf1 avatar Aug 16 '22 17:08 planetf1

Also, how was the behaviour after the crash? Did the system recover? Was any data lost? Did the connector resume retrieving topics?

planetf1 avatar Aug 16 '22 17:08 planetf1

For example, I had an issue with my Jupyter container (a bug) which led to the container restarting and I'd see something like:

Containers:
  jupyter:
    Container ID:   cri-o://3e6a2f6d264f9e8c6f052f23e49cb2da6a1cf35c869104c0686da689ffd2604f
    Image:          quay.io/odpi/jupyter:lab-3.4.5
    Image ID:       quay.io/odpi/jupyter@sha256:34f7ddbdebf759e4121e25c1ebe06929a118346ecd8a3ea06dbd17ca75fb6c42
    Port:           8888/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 15 Aug 2022 17:47:49 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    128
      Started:      Mon, 15 Aug 2022 17:47:46 +0100
      Finished:     Mon, 15 Aug 2022 17:47:48 +0100
    Ready:          True
    Restart Count:  1

I'm not sure how useful it would be, but the exit code may help us confirm whether running out of heap (for example) could be the cause.

What about the pod itself? Rather than kubectl logs, how about kubectl logs --previous - is there any extra information there?
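For example, this is the kind of information gathering I mean (the pod name mds1-0 and namespace egeria below are placeholders, not taken from your deployment):

# Last State / Reason / Exit Code of the crashed container
kubectl describe pod mds1-0 -n egeria
# Logs from the previous (crashed) container instance
kubectl logs mds1-0 -n egeria --previous
# Recent scheduler/kubelet events in the namespace
kubectl get events -n egeria --sort-by=.lastTimestamp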

planetf1 avatar Aug 16 '22 17:08 planetf1

If none of that helps, we probably need to try one of the following:

  • examine the code to guess at the failure
  • analyse/debug: check allocations while the integration connector is working normally, via an agent (local or in k8s)
  • capture a heap dump on failure - complicated by the fact that we'd need some k8s logic to ensure the dump is stored/copied to a persistent volume
  • enable debug logging (in case it helps; it may not)

planetf1 avatar Aug 16 '22 17:08 planetf1

Just checking the log again - I presume this is the metadata server -- not where you are running the integration connector?

What repository connector are you using? I'm guessing it could be XTDB?

If we can't get any info from the suggestions above quickly, I wonder if it's worth testing also with local graph or in-memory - that wouldn't completely rule a core issue out (as this could relate to timing, concurrent requests etc) but may give some extra insight.
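As a rough sketch only (the platform URL, user id and server name below are placeholders - adjust to your deployment), switching the server's local repository to in-memory for such a comparison test could look like:

# Reconfigure the local repository mode via the admin services, then restart the server
curl -k -X POST \
  "https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/mds1/local-repository/mode/in-memory-repository"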

If it is XTDB, what's the XTDB configuration like?

planetf1 avatar Aug 17 '22 08:08 planetf1

We have limited licenses for the 'YourKit' Java profiler (there are other tools available, including IntelliJ + async-profiler for example), which can help pin down memory issues - it may be an option if the other approaches don't get us answers.

planetf1 avatar Aug 17 '22 08:08 planetf1

We use a local JanusGraph repository. On the first restart after the crash, the server is unable to connect to JanusGraph, which looks like a problem with JanusGraph itself. The oc describe does not show any problems. The previous log file is the one I attached to this issue. I will try to reproduce the error and check the events of the StatefulSet and the pod.

Would you recommend using XTDB instead of JanusGraph?

juergenhemelt avatar Aug 18 '22 10:08 juergenhemelt

@juergenhemelt I think generally we would suggest xtdb - it looks like a more scalable, reliable approach, and some performance analysis has been done on it.

That being said, there's no evidence one way or the other yet to suggest your issue is specific to the graph repository - we don't know. So whilst you could try XTDB, equally it would be good to understand why it broke even with graph.

Were you using the default JanusGraph setup (which uses BerkeleyDB)?

I think capturing that data would help - it would be interesting to figure out whether, for example, the JVM failed due to heap exhaustion, or whether OpenShift killed it due to resource limits. Perhaps you could also check the OpenShift dashboard to see what it's reporting?

I'd be interested in trying to reproduce this, though I may struggle to do so until later next week. I presume you just pointed your topic integrator at a Strimzi deployment hosting lots of topics? Do you have any assets to share that may speed up my attempt to reproduce (server config, chart config etc.)?

planetf1 avatar Aug 18 '22 15:08 planetf1

I'm also intrigued by the fact that you said the graph failed to work after the first restart.

I think you are using customised helm charts? It's possibly a problem there, or perhaps not. Do you have any logs from this failure to work on restart? For example, if the metadata collection id stored in the server config document gets out of sync with the actual repository (graph), then Egeria will throw an error and, at least in >=3.10, fail to start the server.

planetf1 avatar Aug 18 '22 15:08 planetf1

This is an interesting kind of issue - how do we debug the JVM apparently going away? If the cause isn't external and nothing shows up in the k8s logs, I wonder if we need to do something like:

  • set a JVM flag to capture a heap dump (see the sketch below)
  • modify the helm charts so that not only is the flag set, but the output either goes to, or is copied to, a persistent volume
  • we can then retrospectively analyze the heap dump

^ I'll raise an issue on this
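As a very rough sketch of what that flag setting could look like (the use of JAVA_TOOL_OPTIONS and the dump path are assumptions for illustration, not the current chart settings):

# Picked up by the JVM at startup; the dump path would need to sit on a persistent volume
export JAVA_TOOL_OPTIONS="-XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/deployments/dumps \
  -XX:+ExitOnOutOfMemoryError"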

OR maybe we can somehow modify the startup script of the container so that it does not exit on this error condition (as you'd normally want), so that we could then connect interactively and analyze.

But it still might be easier to just connect to the JVM using an agent, which should give us similar information (i.e. using the YourKit profiler or similar).

For your issue, another alternative might be to try running just the mds server locally.

planetf1 avatar Aug 18 '22 15:08 planetf1

I have just switched to in-memory XTDB. The good news is that the performance is tremendously better, so I can test faster than before. The bad news is that I still have the same problem. I will continue analyzing.

juergenhemelt avatar Aug 19 '22 14:08 juergenhemelt

Good to hear the performance is better. I'm also completing/fixing/merging https://github.com/odpi/egeria-charts/pull/127, which allows additional connectors to be specified for our lab demo (notebook) chart. I'll look at doing the same in the base chart. This would help me/us set up a Strimzi connector environment - something I've intended to do for a while.

Please do share any data you capture, especially from 'describe' against the k8s objects.

planetf1 avatar Aug 19 '22 14:08 planetf1

Indeed, the oc describe of the pod gives a lastState:

    State:          Running
      Started:      Mon, 22 Aug 2022 10:49:56 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 22 Aug 2022 10:42:06 +0200
      Finished:     Mon, 22 Aug 2022 10:49:55 +0200

Obviously an out of memory issue.

juergenhemelt avatar Aug 22 '22 10:08 juergenhemelt

The resource limits in Kubernetes were set to 1GB of memory. I increased them and now it works. For the 547 topics it requires 3.4GB of memory. I think that's expected behaviour as I am still on in-memory XTDB.
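For reference, a minimal sketch of how such a limit can be raised on the StatefulSet (the object name and values here are placeholders for illustration, not my exact settings; kubectl works the same way):

oc set resources statefulset/mds1 --requests=memory=1Gi --limits=memory=4Gi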

juergenhemelt avatar Aug 23 '22 06:08 juergenhemelt

Did you have explicit 'resources' entries in your modified charts? Is that where you were setting the 1GB limit? The standard charts don't specify a limit by default.

When the pod restarted, did it continue processing OK? In a k8s environment we need to expect pods to fail - for example, nodes may fail or be rebooted, and the pod rescheduled elsewhere. It's important that processing then continues normally - including resuming where it left off.

Are there any aspects of that which were problematic in the testing you have done? What happens if you go far beyond 500 topics? I am also wondering whether our JVM limits are set such that a failure could occur before aggressive garbage collection has kicked in. I can't see any explicit setting of heap in the container, though we do set MaxMetaspaceSize to 1G - it's extremely unlikely we could get close to that, but it may be worth checking/removing. By default the heap is 1/2 of physical RAM, which should be the allocated size of the container.
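One quick way to see what the JVM has actually settled on inside the container (assuming a java binary on the PATH in the pod; the pod name and namespace are placeholders):

# Print the effective JVM sizing flags from inside the running container
kubectl exec mds1-0 -n egeria -- java -XX:+PrintFlagsFinal -version | grep -Ei 'MaxHeapSize|MaxRAMPercentage|MaxMetaspaceSize'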

I think this aspect is worth pursuing - we'd mostly expect the Java heap to be exhausted (for example if we had an effective memory leak), raising an exception, before the allocated memory of the container ran out. We need to look at whether we are following best practice and/or how to do better.

planetf1 avatar Aug 24 '22 08:08 planetf1

Yes, I have explicitly set resource limits; this is a policy in my organization. It was set to 1GB. I have now switched from in-memory XTDB to RocksDB persistence, which brings the used memory back down to less than 1GB. So it was the in-memory XTDB that took all the memory. It would be good to check that XTDB is not using more than the available memory and log an exception if it is, but I can imagine that this is not easy to implement. For me this issue is fixed and you could close it.

juergenhemelt avatar Aug 24 '22 10:08 juergenhemelt

If we get killed by k8s for exceeding limits, there's nothing we can do in the Java application - only ensure recovery through the k8s deployment. If we hit a memory exception (i.e. out of heap) in the JVM there's not much we can do either (especially nothing that allocates memory). An output message, maybe (and agreed).

I'll keep the issue open to look into this behaviour. There may be something about the container/image/settings we use that allows the JVM to continue unhindered (i.e. not running aggressive GC) even where physical memory is restricted. Could we do better? I don't know, but I see it as an important question for production. There are also questions such as best practice for monitoring our heap, and which GC algorithm to use. We also need to check that operations continue normally after the pod is killed and restarted elsewhere, including, for example, how we handle commits of the Kafka offsets.

Likely a combination of understanding, documenting and improving.

I'll modify the title.

planetf1 avatar Aug 24 '22 13:08 planetf1