
OCIS Server Startup Delay & NATS Service Issues

Open vishalaswanispark opened this issue 9 months ago • 6 comments

Hello Team,

I have a question regarding the OwnCloud Infinite Scale (OCIS) server and would appreciate some guidance.

Below are details about our environment.

Environment Details:

  • We are running OCIS server 6.6.1 as a Docker container in a Docker Swarm.
  • Authentication is handled through Keycloak, and the OCIS URL is exposed via Traefik, running in the Swarm manager.
  • We have multiple OCIS server containers in our Swarm.
  • For each container:
      • OCIS configuration and metadata are stored on a POSIX device.
      • Data blobs are stored in an S3 bucket.

Issue Description: We have one container that has been heavily used over the past few months. Recently, we have started experiencing performance issues with it:

1. Slow Startup Time

  • When restarting the container, it takes around 10 minutes to complete the server startup.
  • Based on the logs, the delay is due to message recovery.

Logs from the startup process:

2025-01-27T10:53:04Z WRN Filestore [KV_activitylog] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-01-27T10:58:34Z INF Restored 751,369 messages for stream '$G > KV_activitylog' in 5m30.377s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T10:58:35Z WRN Filestore [KV_eventhistory] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-01-27T11:00:17Z INF Restored 8,969,992 messages for stream '$G > KV_eventhistory' in 1m43.399s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T11:00:19Z WRN Filestore [KV_postprocessing] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-01-27T11:00:36Z INF Restored 3,760,786 messages for stream '$G > KV_postprocessing' in 18.421s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T11:00:36Z INF Restored 24,409,270 messages for stream '$G > main-queue' in 20ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T11:00:36Z INF Recovering 11 consumers for stream - '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats

From the logs, message restoration alone took roughly 7.5 minutes (5m30.377s + 1m43.399s + 18.421s + 20ms ≈ 7m32s).

To mitigate the issue, we stopped the container and removed all files from /var/lib/ocis/nats/jetstream/$G/streams/KV_activitylog.

After restarting the container, the startup time significantly improved to about a minute.

2025-03-03T15:04:12Z WRN Filestore [KV_activitylog] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:04:19Z INF Restored 82,344 messages for stream '$G > KV_activitylog' in 7.36s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:04:19Z WRN Filestore [KV_eventhistory] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:04:19Z INF Restored 44,432 messages for stream '$G > KV_eventhistory' in 190ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:04:20Z WRN Filestore [KV_postprocessing] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:04:38Z INF Restored 4,731,315 messages for stream '$G > KV_postprocessing' in 18.744s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:04:38Z WRN Filestore [main-queue] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:05:03Z INF Restored 29,812,809 messages for stream '$G > main-queue' in 24.973s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:05:03Z INF Recovering 11 consumers for stream - '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats

This suggests that the message backlog is significantly impacting startup time.

Below are the current storage sizes of different NATS streams under /var/lib/ocis/nats/jetstream/$G/streams/:

# du -sh *
2.2G	KV_activitylog
51M	KV_eventhistory
781M	KV_postprocessing
33G	main-queue

The main-queue has grown to 33GB, which seems excessive and may contribute to performance issues.
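
For reference, the per-stream message counts and storage usage can also be inspected with the NATS CLI. This is only a sketch: it assumes the nats CLI is installed and that the embedded JetStream server is reachable on the default port 9233 without TLS or authentication, which may not match every deployment.

# list all JetStream streams with their message counts and storage usage
nats --server nats://localhost:9233 stream ls

# show detailed state of the largest stream
nats --server nats://localhost:9233 stream info main-queue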

2. Slow Consumer Alerts

Additionally, we are continuously seeing the following alert in our logs:

nats: slow consumer, messages dropped on connection [60] for subscription on "main-queue"

This suggests potential message processing delays or overload in the main-queue.
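
To see which consumer is falling behind, the NATS CLI can report per-consumer lag on the stream. Again only a sketch, with the same assumptions as above (CLI installed, embedded server reachable on port 9233 without auth):

# outstanding/unprocessed message counts per consumer on main-queue
nats --server nats://localhost:9233 consumer report main-queue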

Questions & Request for Guidance:

  • Is it possible to disable these services (e.g., KV_activitylog, KV_eventhistory, main-queue) if they are not required?
  • If disabling is not an option, can we configure these services to automatically clear messages after a certain period instead of storing them indefinitely?
  • Are there any recommended optimizations to prevent such large message backlogs and improve startup performance?
  • How do we fix the NATS slow consumer and message drop situation?

Any insights or recommendations would be greatly appreciated.

Thank you in advance.

Regards, Vishal

vishalaswanispark avatar Mar 05 '25 13:03 vishalaswanispark

You could try adjusting the OCIS_PERSISTENT_STORE_TTL. The default value is 336h (2 weeks).
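
In a compose or stack file this would typically go in the OCIS service's environment, for example (the value below is for illustration only):

services:
  ocis:
    environment:
      # keep persisted events for 3 days instead of the default 336h
      OCIS_PERSISTENT_STORE_TTL: "72h"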

2403905 avatar Mar 07 '25 15:03 2403905

thanks @2403905 , we will test the TTL parameter and post the results.

vishalaswanispark avatar Mar 11 '25 18:03 vishalaswanispark

@vishalaswanispark Did it work for you? Can I close the issue?

2403905 avatar Mar 17 '25 09:03 2403905

@2403905 Thanks for following up. We have implemented the change and are currently monitoring our containers to assess its impact.

OCIS_PERSISTENT_STORE_TTL: "168h"

We kindly request that you reopen the issue for a couple of weeks to allow for further observation.

vishalaswanispark avatar Mar 17 '25 13:03 vishalaswanispark

Hi team,

After introducing OCIS_PERSISTENT_STORE_TTL: "168h", we restarted the container after 9 days. Below are our observations:

18 March 2025

2025-03-18T12:15:26Z INF   Starting restore for stream '$G > KV_activitylog' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF   Restored 137,693 messages for stream '$G > KV_activitylog' in 1m47.763s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF   Starting restore for stream '$G > KV_eventhistory' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF   Restored 44,379 messages for stream '$G > KV_eventhistory' in 202ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF   Starting restore for stream '$G > KV_postprocessing' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:33Z INF   Restored 4,731,429 messages for stream '$G > KV_postprocessing' in 18.906s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:33Z INF   Starting restore for stream '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:20:53Z INF   Restored 29,859,800 messages for stream '$G > main-queue' in 3m19.851s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats

27 March 2025

2025-03-27T11:10:44Z INF   Starting restore for stream '$G > KV_activitylog' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF   Restored 174,595 messages for stream '$G > KV_activitylog' in 3m53.386s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF   Starting restore for stream '$G > KV_eventhistory' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF   Restored 45,179 messages for stream '$G > KV_eventhistory' in 277ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF   Starting restore for stream '$G > KV_postprocessing' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:56Z INF   Restored 4,731,429 messages for stream '$G > KV_postprocessing' in 18.251s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:56Z INF   Starting restore for stream '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:15:49Z INF   Restored 29,886,168 messages for stream '$G > main-queue' in 52.863s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats

Observations

  • The message restore time remains approximately 5 minutes in both cases, which affects overall OCIS server startup time.
  • The number of messages restored has slightly increased over time.

Service              18 Mar 2025    27 Mar 2025
KV_activitylog           137,693        174,595
KV_eventhistory           44,379         45,179
KV_postprocessing      4,731,429      4,731,429
main-queue            29,859,800     29,886,168

Kindly suggest how we can control these messages and improve the OCIS service startup time.

vishalaswanispark avatar Mar 27 '25 11:03 vishalaswanispark

I can observe the same startup behavior on my OCIS 7.1.1 deployment.

We made the same change, OCIS_PERSISTENT_STORE_TTL: "168h", and observed no change in the message restore time impacting OCIS service startup.

We were able to improve startup time slightly by moving NATS off HDD SAN storage onto SSD storage, but we feel this is just a stopgap.

bitbytenybble110 avatar Mar 27 '25 14:03 bitbytenybble110

@kobergj Could you take a look at the issue?

2403905 avatar Apr 01 '25 17:04 2403905

Hi @kobergj, could you please take a look and help us with this issue? Our customer is consistently running into it every time their OCIS container restarts.

Although we've set a memory limit of 32 GB for the container, when the customer tries to sync a large number of files, the container consumes all available memory, triggering a restart by Docker Swarm. After each restart, there's a delay of around 10 minutes while OCIS restores messages during startup.
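
A memory limit like this is typically applied in a Swarm stack file via deploy.resources (a minimal sketch; the service name and value are illustrative, not our exact configuration):

services:
  ocis:
    deploy:
      resources:
        limits:
          # the container is OOM-killed and rescheduled once it exceeds this limit
          memory: 32G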

Below are the recent logs for your reference.

2025-04-07T18:50:17Z INF   Starting restore for stream '$G > KV_activitylog' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:35Z INF   Restored 219,652 messages for stream '$G > KV_activitylog' in 8m17.54s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:35Z INF   Starting restore for stream '$G > KV_eventhistory' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:43Z INF   Restored 698,704 messages for stream '$G > KV_eventhistory' in 8.014s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:43Z INF   Starting restore for stream '$G > KV_postprocessing' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:59:04Z INF   Restored 4,849,476 messages for stream '$G > KV_postprocessing' in 20.634s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:59:04Z INF   Starting restore for stream '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T19:03:30Z INF   Restored 30,577,146 messages for stream '$G > main-queue' in 4m26.053s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats

vishalaswanispark avatar Apr 07 '25 19:04 vishalaswanispark

Hi @kobergj @2403905, can anyone please have a look and suggest how we can improve the OCIS server startup time in our case?

Thanks. Vishal

vishalaswanispark avatar May 05 '25 16:05 vishalaswanispark

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 10 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 18 '25 16:07 stale[bot]


@vishalaswanispark Sorry for the delay. The only idea I have is to run NATS separately from OCIS. You could try adding another service to the ocis Docker stack. For the binary, I run two servers: first NATS, then the rest:

./ocis/bin/ocis nats server
OCIS_EXCLUDE_RUN_SERVICES=nats ./ocis/bin/ocis server

2403905 avatar Jul 23 '25 11:07 2403905

Or you could deploy NATS some other way, e.g. on bare metal or as a pure NATS container, as in this example: https://github.com/owncloud/ocis-charts/blob/main/deployments/ocis-nats/helmfile.yaml#L6
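
A standalone NATS container would look roughly like the sketch below. The image tag, port, and JetStream store path are assumptions, and any authentication or TLS your deployment requires would still need to be added; OCIS would then be pointed at it via the OCIS_EVENTS_ENDPOINT / OCIS_PERSISTENT_STORE_NODES / OCIS_CACHE_STORE_NODES variables shown in the compose example in the next comment.

services:
  nats:
    image: nats:2.10
    # enable JetStream and persist stream data outside the container
    command: ["--jetstream", "--store_dir", "/data"]
    ports:
      - "4222:4222"
    volumes:
      - nats-data:/data

volumes:
  nats-data: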

2403905 avatar Jul 23 '25 12:07 2403905

@vishalaswanispark Here is an example of how to extend the Docker Compose deployment at https://github.com/owncloud/ocis/blob/master/deployments/examples/ocis_full/ocis.yml:

services:

  ocis:
    environment:
      # additional configuration 
      OCIS_EXCLUDE_RUN_SERVICES: nats
      OCIS_ADD_RUN_SERVICES: auth-app
      PROXY_ENABLE_APP_AUTH: true
      MICRO_REGISTRY_ADDRESS: nats:9233
      OCIS_PERSISTENT_STORE_NODES: nats:9233
      OCIS_CACHE_STORE_NODES: nats:9233
      OCIS_EVENTS_ENDPOINT: nats:9233
      NATS_NATS_HOST: nats
      NATS_NATS_PORT: 9233


  nats:
    image: ${OCIS_DOCKER_IMAGE:-owncloud/ocis}:${OCIS_DOCKER_TAG:-latest}
    # changelog: https://github.com/owncloud/ocis/tree/master/changelog
    # release notes: https://doc.owncloud.com/ocis_release_notes.html
    networks:
      ocis-net:
    entrypoint:
      - /bin/sh
    command: ["-c", "ocis nats server" ]
    environment:
      OCIS_LOG_LEVEL: ${LOG_LEVEL:-info}
      # bind the NATS server to the hostname and port the other services expect
      NATS_NATS_HOST: nats
      NATS_NATS_PORT: 9233
    volumes:
      - ${OCIS_CONFIG_DIR:-ocis-config}:/etc/ocis
      - ${OCIS_DATA_DIR:-ocis-data}:/var/lib/ocis
    logging:
      driver: ${LOG_DRIVER:-local}
    restart: always

2403905 avatar Jul 23 '25 16:07 2403905