
Processor multi output

Open bertsky opened this issue 2 weeks ago • 14 comments

bertsky avatar Dec 10 '25 16:12 bertsky

I don't understand the failing network-integration-test. The only thing that stands out is this:

ocrd_network_processing_server  | 15:08:52.248 DEBUG pymongo.command - {"message": "Command started", "clientId": {"$oid": "693c30045b7149e3475cde24"}, "command": "{\"listIndexes\": \"DBWorkflowScript\", \"lsid\": {\"id\": {\"$binary\": {\"base64\": \"t2xtJH5rR+qQIFoGN7Ng1A==\", \"subType\": \"04\"}}}, \"$db\": \"ocrd\"}", "commandName": "listIndexes", "databaseName": "ocrd", "requestId": 304089172, "operationId": 304089172, "driverConnectionId": 1, "serverConnectionId": 14, "serverHost": "mongodb-docker-host.ocrd_network_test", "serverPort": 27017}
ocrd_network_processing_server  | 15:08:52.249 DEBUG pymongo.command - {"message": "Command failed", "clientId": {"$oid": "693c30045b7149e3475cde24"}, "durationMS": 0.717, "failure": "{\"code\": 26, \"codeName\": \"NamespaceNotFound\"}", "commandName": "listIndexes", "databaseName": "ocrd", "requestId": 304089172, "operationId": 304089172, "driverConnectionId": 1, "serverConnectionId": 14, "serverHost": "mongodb-docker-host.ocrd_network_test", "serverPort": 27017}

Any ideas, @MehmedGIT perhaps?

bertsky avatar Dec 12 '25 16:12 bertsky

ocrd_network_rabbit_mq          | 2025-12-16 11:48:57.614805+00:00 [info] <0.234.0>  Starting RabbitMQ 3.12.14 on Erlang 25.3.2.15 [jit]
ocrd_network_rabbit_mq          | 2025-12-16 11:48:57.614805+00:00 [info] <0.234.0>  Copyright (c) 2007-2024 Broadcom Inc and/or its subsidiaries
ocrd_network_rabbit_mq          | 2025-12-16 11:48:57.614805+00:00 [info] <0.234.0>  Licensed under the MPL 2.0. Website: https://rabbitmq.com
ocrd_network_rabbit_mq          | 2025-12-16 11:48:57.614881+00:00 [error] <0.234.0> This release series has reached end of life and is no longer supported. Please visit https://rabbitmq.com/versions.html to learn more and upgrade

@MehmedGIT How compatible are we with current versions of RabbitMQ? I have been using 16.0.84.1.3 for a long time and will soon need to update because of CVEs etc.

But that isn't the reason for the test failure – what actually is? I could not determine that quickly from the logs 😕

kba avatar Dec 16 '25 13:12 kba

But that isn't the reason for the test failure – what actually is? I could not determine that quickly from the logs 😕

See above

Happens for

  • motor==3.2.0 and 3.6.1 and 3.7.1
  • beanie==1.29.0 and 1.30.0
  • pydantic==2.10.6 and 2.12.5
  • pymongo==4.4.0 and 4.9.2 and 4.15.5

bertsky avatar Dec 16 '25 16:12 bertsky

@bertsky, it seems all ubuntu-22.04 tests are currently passing. I have manually restarted the CI/CD pipeline for that failing test, and now it runs fine.

All tests of core v3.8.1 (+ the ocrd_network module and integration tests) are also passing on my local machine (Ubuntu 24.04 with Python 3.12.3). I can no longer reproduce the Pydantic and MongoDB issues.

The package versions I have in the environment are: beanie==1.30.0, motor==3.7.1, pydantic==2.12.5, pydantic_core==2.41.5, pymongo==4.15.5

MehmedGIT avatar Dec 16 '25 21:12 MehmedGIT

@MehmedGIT wow, indeed! So is that actually the result of https://github.com/OCR-D/core/pull/1344/commits/53b8e2bbfe6265f4315d4a08b42c67a1be87d7fd?

I am stunned, as I would have expected this to fail in the normal CI already. And in the network test, why don't we see any concrete error message related to the broken exception f-string?

bertsky avatar Dec 16 '25 21:12 bertsky

@MehmedGIT wow, indeed! So is that actually the result of https://github.com/OCR-D/core/commit/53b8e2bbfe6265f4315d4a08b42c67a1be87d7fd?

Not sure, but it seems so. I thought it could be a deadlock somewhere, but I could not identify any change that could cause that. However, the job failure here is not because of a timeout: the job itself failed for whatever reason and returned FAILED rather than UNSET (check here), which is what it would have returned for a timeout. To nail down why it failed, the processing-job log inside the runner VM should have the details; it is located by default under $TMP/ocrd_network_logs/processing_jobs/$JOB_ID. It may be a good idea to print the content of that log file in the test itself, as sketched below.
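A minimal sketch of that idea, assuming one log file named after the job id under the default path above (the helper name and the exact file naming are assumptions, not the actual ocrd_network API):

```python
# Minimal sketch: dump the processing-job log when a job ends up FAILED,
# so the test output shows the real error. The path layout follows the
# default mentioned above; file naming and helper name are assumptions.
import os
from pathlib import Path

def print_job_log_if_failed(job_id: str, job_state: str) -> None:
    if job_state != "FAILED":
        return
    log_dir = Path(os.environ.get("TMP", "/tmp")) / "ocrd_network_logs" / "processing_jobs"
    log_file = log_dir / job_id  # default location per the comment above
    if log_file.exists():
        print(f"--- log for processing job {job_id} ---")
        print(log_file.read_text())
    else:
        print(f"no log file found for job {job_id} under {log_dir}")
```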

why don't we see any concrete error message related to the broken exception f-string?

The logs are redirected to the log file; no exception is raised at the processing-server level.

EDIT: Actually, we don't even need to know where the logs are stored. The client can just request the log file from the processing server and print its content if the job has FAILED, roughly as sketched below. I will create a PR.
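A rough sketch of what that client-side approach could look like (the endpoint path and response format here are hypothetical placeholders, not the actual processing-server API – see #1348 for the real implementation):

```python
# Rough sketch of the EDIT above: ask the processing server for a failed
# job's log and print it. The endpoint path is a hypothetical placeholder,
# not the actual ocrd_network route.
import requests

def print_remote_job_log(ps_url: str, job_id: str) -> None:
    response = requests.get(f"{ps_url}/processor/log/{job_id}")  # hypothetical endpoint
    response.raise_for_status()
    print(f"--- log for processing job {job_id} ---")
    print(response.text)
```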

MehmedGIT avatar Dec 16 '25 22:12 MehmedGIT

@bertsky, https://github.com/OCR-D/core/pull/1348 should help for future reference when integration tests fail.

MehmedGIT avatar Dec 17 '25 02:12 MehmedGIT

However, the job failure here is not because of a timeout: the job itself failed for whatever reason and returned FAILED rather than UNSET (check here), which is what it would have returned for a timeout. To nail down why it failed, the processing-job log inside the runner VM should have the details; it is located by default under $TMP/ocrd_network_logs/processing_jobs/$JOB_ID. It may be a good idea to print the content of that log file in the test itself.

Ee! So I was completely misled by the pymongo messages – they were actually harmless in themselves! I did not notice that we did not even show the actual job logs in the test output.

#1348 should help for future reference when integration tests fail.

Fantastic – thanks @MehmedGIT !

bertsky avatar Dec 17 '25 02:12 bertsky

Ee! So I was completely misled by the pymongo messages – they were actually harmless in themselves!

Yes, they are. The RabbitMQ ones are usually harmless too. I actually find that the output of these services clutters the logs when something fails, at least when running the tests locally. It would be good if they could somehow be disabled while keeping the output of the other, relevant test containers.
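For context, a minimal sketch of why those pymongo messages are harmless: listIndexes against a collection that does not exist yet fails server-side with code 26 (NamespaceNotFound), presumably triggered here while Beanie initializes the DBWorkflowScript model against a fresh database (the connection string is illustrative):

```python
# Minimal sketch: listIndexes on a not-yet-existing collection fails with
# code 26 / NamespaceNotFound, matching the DEBUG lines quoted earlier.
# The connection string is illustrative.
from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient("mongodb://localhost:27017")
try:
    list(client["ocrd"]["DBWorkflowScript"].list_indexes())
except OperationFailure as err:
    print(err.code, err.details.get("codeName"))  # 26 NamespaceNotFound
```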

MehmedGIT avatar Dec 17 '25 02:12 MehmedGIT

Ee! So I was completely misled by the pymongo messages – they were actually harmless in themselves!

Yes, they are.

I have a hard time reproducing this locally. I tried reverting https://github.com/OCR-D/core/commit/53b8e2bbfe6265f4315d4a08b42c67a1be87d7fd and then merging in #1348 to see if I get more useful error logs. But it's the exact same picture: the docker exec core_test pytest step never even runs, because the PS already fails at startup. And the only error it shows is the NamespaceNotFound thing with MongoDB...

bertsky avatar Dec 17 '25 11:12 bertsky

I have a hard time reproducing this locally. I tried reverting 53b8e2b and then merging in #1348 to see if I get more useful error logs. But it's the exact same picture: the docker exec core_test pytest step never even runs, because the PS already fails at startup. And the only error it shows is the NamespaceNotFound thing with MongoDB...

Update: the problem for me was the misconfigured healthcheck.test in tests/network/docker-compose.yml – it probes the PS via the external $OCRD_PS_URL/ instead of http://localhost:8000/, hence I never even got to the pytest part.
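A sketch of the corrected healthcheck as a docker-compose fragment (service name, probe command, and timings are illustrative assumptions; the point is only the localhost URL):

```yaml
# Sketch only: the healthcheck runs inside the PS container, so it must
# probe localhost rather than the external $OCRD_PS_URL. Service name,
# probe command and timings are illustrative assumptions.
services:
  ocrd_network_processing_server:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/"]
      interval: 5s
      retries: 10
```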

bertsky avatar Dec 17 '25 12:12 bertsky

I have a hard time reproducing this locally. I tried reverting 53b8e2b and then merging in #1348 to see if I get more useful error logs. But it's the exact same picture: the docker exec core_test pytest step never even runs, because the PS already fails at startup. And the only error it shows is the NamespaceNotFound thing with MongoDB...

Update: the problem for me was the misconfigured healthcheck.test in tests/network/docker-compose.yml – it probes the PS via the external $OCRD_PS_URL/ instead of http://localhost:8000/, hence I never even got to the pytest part.

~The tests running in a docker container cannot access the PS via http://localhost:8000/. If that does not break things, it will break when the workers started in docker try to send status results via the internal callback mechanism – docker-to-docker communication.~ That statement is wrong.

I remember that we had a difference between workers running locally and workers running in a docker environment. Hence, we had this mechanism to switch between the two, based on whether internal_callback_url was provided in the ps_config.yml (for workers in docker) or not – see the sketch below.
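A minimal sketch of that switching logic, assuming a dict parsed from ps_config.yml (the key name comes from the comment above; the fallback to the external URL is an assumption):

```python
# Minimal sketch of the switching described above: workers in docker report
# job status via internal_callback_url from ps_config.yml; otherwise they
# fall back to the externally visible PS URL. The fallback is an assumption.
def callback_url(ps_config: dict, external_ps_url: str) -> str:
    return ps_config.get("internal_callback_url") or external_ps_url

# e.g. a docker setup vs. a local one:
print(callback_url({"internal_callback_url": "http://ps:8000"}, "http://localhost:8000"))
print(callback_url({}, "http://localhost:8000"))
```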

Maybe @joschrew remembers more since he was testing the workers in the docker environment with his setup.

MehmedGIT avatar Dec 17 '25 12:12 MehmedGIT

Update: the problem for me was the misconfigured healthcheck.test in tests/network/docker-compose.yml – it probes the PS via the external $OCRD_PS_URL/ instead of http://localhost:8000/, hence I never even got to the pytest part.

I have pushed the fix to #1348.

MehmedGIT avatar Dec 17 '25 13:12 MehmedGIT

@kba I added a unit test to directly cover what I previously missed (and what thus ended up breaking the network-integration-test) – along the lines sketched below.
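A rough sketch of that kind of unit test, assuming the earlier breakage was the malformed exception f-string mentioned above (all names here are hypothetical, not the actual test added to core):

```python
# Rough sketch, not the actual test in core: assert that an exception
# message interpolates cleanly, since a broken f-string was the suspected
# culprit earlier in this thread. Names are hypothetical.
import pytest

def fail_job(job_id: str) -> None:
    raise ValueError(f"processing job {job_id} failed")

def test_exception_message_interpolates() -> None:
    with pytest.raises(ValueError, match="processing job job-42 failed"):
        fail_job("job-42")
```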

So all done now.

bertsky avatar Dec 17 '25 14:12 bertsky