OCIS Server Startup Delay & NATS Service Issues
Hello Team,
I have a question regarding the ownCloud Infinite Scale (OCIS) server and would appreciate some guidance.
Below are details about our environment.
Environment Details:
- We are running OCIS server 6.6.1 as a Docker container in a Docker Swarm.
- Authentication is handled through Keycloak, and the OCIS URL is exposed via Traefik, running in the Swarm manager.
- We have multiple OCIS server containers in our Swarm.
- For each container:
  - OCIS configuration and metadata are stored on a POSIX device.
  - Data blobs are stored in an S3 bucket.
Issue Description: We have one container that has been heavily used in the past few months. Recently, we started experiencing performance issues:
1. Slow Startup Time
- When restarting the container, it takes around 10 minutes to complete the server startup.
- Based on the logs, the delay is due to message recovery.
Logs from the startup process:
2025-01-27T10:53:04Z WRN Filestore [KV_activitylog] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-01-27T10:58:34Z INF Restored 751,369 messages for stream '$G > KV_activitylog' in 5m30.377s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T10:58:35Z WRN Filestore [KV_eventhistory] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-01-27T11:00:17Z INF Restored 8,969,992 messages for stream '$G > KV_eventhistory' in 1m43.399s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T11:00:19Z WRN Filestore [KV_postprocessing] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-01-27T11:00:36Z INF Restored 3,760,786 messages for stream '$G > KV_postprocessing' in 18.421s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T11:00:36Z INF Restored 24,409,270 messages for stream '$G > main-queue' in 20ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-01-27T11:00:36Z INF Recovering 11 consumers for stream - '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
From the logs, message restoration alone took approximately 7.5 minutes (5m30.377s + 1m43.399s + 18.421s + 20ms ≈ 7m32s).
To mitigate the issue, we stopped the container and removed all files from /var/lib/ocis/nats/jetstream/$G/streams/KV_activitylog.
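For reference, the steps were roughly the following (the container name is illustrative; the path is single-quoted so the shell does not expand the literal $G directory name):

```sh
# stop the container before touching JetStream files on disk
docker stop ocis
# drop the persisted KV_activitylog stream data ($G is a literal directory)
rm -rf '/var/lib/ocis/nats/jetstream/$G/streams/KV_activitylog'
docker start ocis
```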
After restarting the container, the startup time significantly improved to about a minute.
2025-03-03T15:04:12Z WRN Filestore [KV_activitylog] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:04:19Z INF Restored 82,344 messages for stream '$G > KV_activitylog' in 7.36s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:04:19Z WRN Filestore [KV_eventhistory] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:04:19Z INF Restored 44,432 messages for stream '$G > KV_eventhistory' in 190ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:04:20Z WRN Filestore [KV_postprocessing] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:04:38Z INF Restored 4,731,315 messages for stream '$G > KV_postprocessing' in 18.744s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:04:38Z WRN Filestore [main-queue] Recovering stream state from index errored: prior state file line=/ocis/services/nats/pkg/logging/nats.go:27 service=nats
2025-03-03T15:05:03Z INF Restored 29,812,809 messages for stream '$G > main-queue' in 24.973s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-03T15:05:03Z INF Recovering 11 consumers for stream - '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
This suggests that message backlog is significantly impacting startup time.
Below are the current storage sizes of different NATS streams under /var/lib/ocis/nats/jetstream/$G/streams/:
# du -sh *
2.2G KV_activitylog
51M KV_eventhistory
781M KV_postprocessing
33G main-queue
The main-queue has grown to 33 GB, which seems excessive and may contribute to the performance issues.
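For anyone reproducing this: per-stream state can also be inspected with the NATS CLI, assuming it can reach the embedded server (OCIS's built-in NATS listens on port 9233 by default):

```sh
# list all streams with message counts and on-disk sizes
nats --server nats://localhost:9233 stream ls
# detailed state of the largest stream
nats --server nats://localhost:9233 stream info main-queue
```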
2. Slow Consumer Alerts
Additionally, we are continuously seeing the following alert in our logs:
nats: slow consumer, messages dropped on connection [60] for subscription on "main-queue"
This suggests potential message processing delays or overload in the main-queue.
Questions & Request for Guidance:
- Is it possible to disable these services (e.g., KV_activitylog, KV_eventhistory, main-queue) if they are not required?
- If disabling is not an option, can we configure these services to automatically clear messages after a certain period instead of storing them indefinitely?
- Are there any recommended optimizations to prevent such large message backlogs and improve startup performance?
- How do we fix the NATS slow consumer and message drop situation?
Any insights or recommendations would be greatly appreciated.
Thank you in advance.
Regards, Vishal
You could try adjusting OCIS_PERSISTENT_STORE_TTL. The default value is 336h (two weeks).
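In a Compose or stack file this would look something like:

```yaml
services:
  ocis:
    environment:
      # reduce event retention from the 336h default, e.g. to one week
      OCIS_PERSISTENT_STORE_TTL: "168h"
```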
Thanks @2403905, we will test the TTL parameter and post the results.
@vishalaswanispark Does it work for you? Can I close the issue?
@2403905 Thanks for following up. We have implemented the change and are currently monitoring our containers to assess its impact.
OCIS_PERSISTENT_STORE_TTL: "168h"
We kindly request that you reopen the issue for a couple of weeks to allow for further observation.
Hi team,
After introducing OCIS_PERSISTENT_STORE_TTL: "168h" and letting it run for 9 days, we restarted the container. Below are the observations:
18 March 2025
2025-03-18T12:15:26Z INF Starting restore for stream '$G > KV_activitylog' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF Restored 137,693 messages for stream '$G > KV_activitylog' in 1m47.763s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF Starting restore for stream '$G > KV_eventhistory' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF Restored 44,379 messages for stream '$G > KV_eventhistory' in 202ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:14Z INF Starting restore for stream '$G > KV_postprocessing' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:33Z INF Restored 4,731,429 messages for stream '$G > KV_postprocessing' in 18.906s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:17:33Z INF Starting restore for stream '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-18T12:20:53Z INF Restored 29,859,800 messages for stream '$G > main-queue' in 3m19.851s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
27 March 2025
2025-03-27T11:10:44Z INF Starting restore for stream '$G > KV_activitylog' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF Restored 174,595 messages for stream '$G > KV_activitylog' in 3m53.386s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF Starting restore for stream '$G > KV_eventhistory' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF Restored 45,179 messages for stream '$G > KV_eventhistory' in 277ms line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:37Z INF Starting restore for stream '$G > KV_postprocessing' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:56Z INF Restored 4,731,429 messages for stream '$G > KV_postprocessing' in 18.251s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:14:56Z INF Starting restore for stream '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-03-27T11:15:49Z INF Restored 29,886,168 messages for stream '$G > main-queue' in 52.863s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
Observations
- The message restore time remains approximately 5 minutes in both cases, which affects overall OCIS server startup time.
- The number of messages restored has slightly increased over time.
| Stream | 18 Mar 2025 | 27 Mar 2025 |
|---|---|---|
| KV_activitylog | 137,693 | 174,595 |
| KV_eventhistory | 44,379 | 45,179 |
| KV_postprocessing | 4,731,429 | 4,731,429 |
| main-queue | 29,859,800 | 29,886,168 |
Kindly suggest how we can control these messages and improve the OCIS service startup time.
I can observe the same startup behavior on my OCIS 7.1.1 deployment.
We made the same change, OCIS_PERSISTENT_STORE_TTL: "168h", and observed no change in the message restore time that impacts OCIS service startup.
We were able to improve startup time slightly by moving NATS off HDD SAN storage onto SSD storage, but we feel this is just a stopgap.
@kobergj Could you take a look at the issue?
Hi @kobergj, could you please take a look and help us with this issue? Our customer runs into it every time their OCIS container restarts.
Although we've set a memory limit of 32 GB for the container, when the customer tries to sync a large number of files, the container consumes all available memory, triggering a restart by Docker Swarm. After each restart, there's a delay of around 10 minutes while OCIS restores messages during startup.
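For reference, the limit is configured roughly like this in our stack file (illustrative sketch):

```yaml
services:
  ocis:
    deploy:
      resources:
        limits:
          # hard cap; the container is OOM-killed and restarted once exceeded
          memory: 32G
```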
Below are the recent logs for your reference.
2025-04-07T18:50:17Z INF Starting restore for stream '$G > KV_activitylog' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:35Z INF Restored 219,652 messages for stream '$G > KV_activitylog' in 8m17.54s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:35Z INF Starting restore for stream '$G > KV_eventhistory' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:43Z INF Restored 698,704 messages for stream '$G > KV_eventhistory' in 8.014s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:58:43Z INF Starting restore for stream '$G > KV_postprocessing' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:59:04Z INF Restored 4,849,476 messages for stream '$G > KV_postprocessing' in 20.634s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T18:59:04Z INF Starting restore for stream '$G > main-queue' line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
2025-04-07T19:03:30Z INF Restored 30,577,146 messages for stream '$G > main-queue' in 4m26.053s line=/ocis/services/nats/pkg/logging/nats.go:21 service=nats
Hi @kobergj @2403905, can anyone please have a look and suggest how we can improve the OCIS server startup time in our case?
Thanks. Vishal
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 10 days if no further activity occurs. Thank you for your contributions.
Hi @kobergj @2403905, can anyone please have a look and suggest how we can improve the OCIS server startup time in our case?
Thanks. Vishal
@vishalaswanispark Sorry for the delay. The only idea is to run NATS separately from OCIS. You could try adding another service to the ocis Docker stack. For the binary, I run two servers: first NATS, then the rest:
# start the embedded NATS server on its own
./ocis/bin/ocis nats server

# then start the remaining services, excluding nats
OCIS_EXCLUDE_RUN_SERVICES=nats ./ocis/bin/ocis server
Or you could deploy NATS some other way: on bare metal, or as a pure NATS container, as in https://github.com/owncloud/ocis-charts/blob/main/deployments/ocis-nats/helmfile.yaml#L6
@vishalaswanispark Here is an example of how to extend the Docker Compose deployment (https://github.com/owncloud/ocis/blob/master/deployments/examples/ocis_full/ocis.yml):
services:
  ocis:
    environment:
      # additional configuration
      OCIS_EXCLUDE_RUN_SERVICES: nats
      # enable services that are not started automatically
      OCIS_ADD_RUN_SERVICES: auth-app
      PROXY_ENABLE_APP_AUTH: "true"
      MICRO_REGISTRY_ADDRESS: nats:9233
      OCIS_PERSISTENT_STORE_NODES: nats:9233
      OCIS_CACHE_STORE_NODES: nats:9233
      OCIS_EVENTS_ENDPOINT: nats:9233
      NATS_NATS_HOST: nats
      NATS_NATS_PORT: 9233

  nats:
    image: ${OCIS_DOCKER_IMAGE:-owncloud/ocis}:${OCIS_DOCKER_TAG:-latest}
    # changelog: https://github.com/owncloud/ocis/tree/master/changelog
    # release notes: https://doc.owncloud.com/ocis_release_notes.html
    networks:
      ocis-net:
    entrypoint:
      - /bin/sh
    command: ["-c", "ocis nats server"]
    environment:
      OCIS_LOG_LEVEL: ${LOG_LEVEL:-info}
      NATS_NATS_HOST: nats
      NATS_NATS_PORT: 9233
    volumes:
      - ${OCIS_CONFIG_DIR:-ocis-config}:/etc/ocis
      - ${OCIS_DATA_DIR:-ocis-data}:/var/lib/ocis
    logging:
      driver: ${LOG_DRIVER:-local}
    restart: always
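The idea is that the nats service keeps running (and keeps its JetStream state on its own volume), so the ocis service can be restarted without waiting for message recovery. A rough usage sketch, assuming the Compose file above:

```sh
# bring up the standalone NATS first, then the rest of OCIS
docker compose up -d nats
docker compose up -d ocis
```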