BulkRequestNotFoundException warnings for auto-cleared bulk requests create large logs in bulkDomain
Dear dCache developers,
At KIT we occasionally encounter very large log files of several GB for the bulk domain on our production systems.
When looking into the logs, we see the following warning appearing very frequently:
05 Nov 2025 13:52:29 (bulk) [] checkTerminated, check for cancel on failure 879e460c-64c9-43ad-8098-3a79dd777275: org.dcache.services.bulk.BulkRequestNotFoundException: request 879e460c-64c9-43ad-8098-3a79dd777275 not found.
05 Nov 2025 13:52:31 (bulk) [] checkTerminated, check for cancel on failure a2f74bb4-2506-4c90-abd0-19b42f41fe29: org.dcache.services.bulk.BulkRequestNotFoundException: request a2f74bb4-2506-4c90-abd0-19b42f41fe29 not found.
05 Nov 2025 13:52:31 (bulk) [] checkTerminated, check for cancel on failure 28d5c7bc-5cb4-4500-b1f3-98bf4d0658ad: org.dcache.services.bulk.BulkRequestNotFoundException: request 28d5c7bc-5cb4-4500-b1f3-98bf4d0658ad not found.
05 Nov 2025 13:52:32 (bulk) [] checkTerminated, check for cancel on failure c6c86170-81f4-4d66-a5d6-fa7e91aa93af: org.dcache.services.bulk.BulkRequestNotFoundException: request c6c86170-81f4-4d66-a5d6-fa7e91aa93af not found.
05 Nov 2025 13:52:32 (bulk) [] checkTerminated, check for cancel on failure f1038556-957c-4d85-b575-0f0c86ffa2a9: org.dcache.services.bulk.BulkRequestNotFoundException: request f1038556-957c-4d85-b575-0f0c86ffa2a9 not found.
05 Nov 2025 13:52:33 (bulk) [] checkTerminated, check for cancel on failure da4b17e8-d7ee-4c5f-9c7a-71764c39d585: org.dcache.services.bulk.BulkRequestNotFoundException: request da4b17e8-d7ee-4c5f-9c7a-71764c39d585 not found.
05 Nov 2025 13:52:33 (bulk) [] checkTerminated, check for cancel on failure e5f35938-1f95-40e9-8719-4f0c99a075ec: org.dcache.services.bulk.BulkRequestNotFoundException: request e5f35938-1f95-40e9-8719-4f0c99a075ec not found.
05 Nov 2025 13:52:34 (bulk) [] checkTerminated, check for cancel on failure 19354e02-3681-4b2d-b91a-d306fb330108: org.dcache.services.bulk.BulkRequestNotFoundException: request 19354e02-3681-4b2d-b91a-d306fb330108 not found.
To analyze the situation, I ran my GitHub Copilot setup twice on a locally checked-out dCache codebase, each time with a different model. See the two files attached.
BULK_REQUEST_NOT_FOUND_RACE_CONDITION-by-Anthropic_Claude_Sonnet_4.5.md bulk-request-warning-analysis_by_OpenAI-GPT5-Codex.md
Both models consistently conclude that there is a race condition between the check for a request having terminated and the automatic clearing of bulk requests that have terminated.
They suggest either a hotfix that catches these exceptions or a restructuring of the code to avoid the race condition. As always, AI-based analyses should be taken with care, but the conclusions sound plausible given the load we have.
While this is not a critical issue as such, it may cause problems for us during heavy staging activity, as ATLAS often performs. This gets even trickier on weekends, when we do not pay full attention to the emails/messages about oversized log files. In this particular case, bulkDomain.log came close to filling up our log partition, which is set up with 14 GB of space, despite regular compression and rotation. The log file we rotated by hand was close to 1 GB. A few numbers on the occurrences of the records in that log file:
> zgrep "bulk" /var/log/dcache/bulkDomain.log-20251102_2.gz | wc -l
61156408
> zgrep "checkTerminated" /var/log/dcache/bulkDomain.log-20251102_2.gz | wc -l
30592931
> zgrep "BulkRequestNotFoundException" /var/log/dcache/bulkDomain.log-20251102_2.gz | wc -l
61156244
I must admit, though, that reaching 30 million entries for this warning looks suspicious. It is not clear to me whether all of them were triggered by the race condition described above.
As an additional note from our side: we would like to keep the log level at WARNING at KIT, if possible, to keep track of other warnings that might be useful for debugging problems.
@mksahakyan FYI (we discussed that in the T1 call)
Can you set the log level to ERROR while we are looking at the issue?
Hi @DmitryLitvintsev,
No, not really. That would suppress all the other useful entries in the logs, which we would like to keep.
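For completeness: a possible middle ground would be a per-logger override that raises only the noisy logger's threshold while leaving everything else at WARNING. This is only a sketch against a standard logback configuration; the logger name below is a placeholder guess and would need to be replaced by the fully qualified name of the class actually emitting the "checkTerminated" warning.

```xml
<!-- Sketch only: keep the root logger at WARN, but silence the single
     noisy logger. The name below is a placeholder and must be replaced
     by the class that emits the checkTerminated warning. -->
<logger name="org.dcache.services.bulk" level="error"/>
```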
We will "just" resolve the incidents on each occasion.
OK, I looked at the plan suggested by Anthropic Claude Sonnet 4.5 and it seems reasonable. I can make a patch and an RPM and pass it to you if you are willing to run it.
Thanks, @DmitryLitvintsev!
I'll need to take a closer look, though, at whether we can easily apply the patch in production. On test instances, for sure, but it would be difficult to reproduce the problem there.
I can try, though. We are currently running on 10.2.X, so it would be best to switch our machine running the bulk domain to something similar.
You can deploy the patched RPM on the machine running bulk once it has cleared a test instance.
My analysis is that it is not a race condition but a slight flaw in the logic. IOW, it always produces the error when auto-clear is enabled.
I am inclined to just suppress this for now:
} catch (BulkRequestNotFoundException e) {
    // The request was already auto-cleared after terminating, so there is
    // nothing left to post-process; downgrade the message from WARN to TRACE.
    LOGGER.trace("setStateIfTerminated: {} already cleared; skipping post-processing.", id);
    return true;
}
As a "bug fix". And may be on the trunk, later, we will do something more sophisticated
Here is the patched 10.2 RPM:
https://drive.google.com/file/d/1kb1xoAmYPTkeTy6IH7S2uDifVT9bInOQ/view?usp=sharing
Thanks, @DmitryLitvintsev!
I'll have a look at the patch today.
We've deployed Dmitry's snapshot RPM on all our production dCache instances. We'll report back on whether the situation has improved.