Use bigger fixture tree for distributed tests
A follow-up to https://github.com/iterative/studio/pull/10211
Since the default tree fixture is too small, UDFDistributor was smart enough to run a single worker (despite us passing 2 as the UDF parameter in tests).
I don't know if that was always the case or not, but at the moment it means we are only testing an edge case (a single worker running as part of the LocalWorkerProcess, I believe) and don't really launch the other workers (which take a different path).
Introducing this bigger fixture reproduced one of the issues we've seen in production: shutdown getting stuck. I'm looking into this (unless someone has immediate ideas), and I'll fix the other tests (unless I hit some walls).
Thanks @dtulga for pointing me to these tests.
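To make the intent concrete, here is a minimal sketch of what a "bigger tree" fixture could look like. The nested-dict shape, names, file counts, and helper are illustrative assumptions, not the actual fixture added in this PR:

```python
import pytest

# Hypothetical sketch: a larger file tree so the UDF distributor actually
# schedules more than one worker. Names and sizes are illustrative only.
BIG_TREE = {
    f"dir{d}": {f"file{f}.txt": b"x" * 64 for f in range(50)}
    for d in range(20)
}


def _write_tree(root, tree):
    # Recursively materialize the nested dict as directories and files.
    for name, value in tree.items():
        path = root / name
        if isinstance(value, dict):
            path.mkdir()
            _write_tree(path, value)
        else:
            path.write_bytes(value)


@pytest.fixture
def big_tree(tmp_path):
    # 20 dirs x 50 files = 1000 entries: enough rows that splitting the
    # work across several workers is worthwhile, instead of collapsing
    # everything into a single local worker.
    _write_tree(tmp_path, BIG_TREE)
    return tmp_path
```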
> Originally this wasn't only running a local worker
It seems it's actually non-trivial to run the second worker properly, since it requires a running Celery worker, and I don't see a fixture or similar for that. Or were you using a different way of running it? (Don't spend too much time on this, but if you remember / could find the initial way we were running the second worker, that might speed up my research.)
> Introducing this bigger fixture reproduced one of the issues we've seen in production: shutdown getting stuck. I'm looking into this (unless someone has immediate ideas), and I'll fix the other tests (unless I hit some walls).
Re this: I think I found the root cause, and we potentially need to check some other places as well.
It hangs on worker_result_group.join() in shutdown_datachain_workers. Why? Because, apparently, worker_result_group.revoke() is not enough - https://github.com/celery/celery/issues/8888 😢 (this also describes it well: https://stackoverflow.com/questions/39191238/revoke-a-task-from-celery). So even if we put a timeout on the join, etc., if the workers are offline (and in this case there was no Celery node to run worker_result_group in the first place - we don't launch them, and on Studio there was a queue-name discrepancy), we still won't remove the tasks from the queue, and they might run again later or clutter the queue. Not sure if there is a simple workaround... looking more into that.
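A rough sketch of that failure mode, assuming the shutdown path looks roughly like the snippet below. The function and variable names come from the comment above; the app, broker, and task body are purely illustrative, not the actual Studio code:

```python
from celery import Celery, group
from celery.exceptions import TimeoutError as CeleryTimeoutError

# Illustrative app/broker; the real task, queue and worker setup live in Studio.
app = Celery("datachain_workers", broker="amqp://localhost")


@app.task
def run_worker(worker_id: int) -> None:
    ...  # UDF worker body; not relevant to the hang itself


def launch_workers(n: int):
    # Dispatch n worker tasks; if no Celery node consumes this queue,
    # the messages just sit in the broker.
    return group(run_worker.s(i) for i in range(n)).apply_async()


def shutdown_datachain_workers(worker_result_group, timeout: float = 30.0) -> None:
    # revoke() only marks the tasks as revoked for *connected* workers;
    # it does not remove the messages from the broker queue. If no worker
    # is consuming that queue (none launched, or a queue-name mismatch as
    # on Studio), nothing ever acknowledges or revokes them.
    worker_result_group.revoke()

    # Without a timeout this blocks forever, since results for tasks that
    # were never picked up will never arrive.
    try:
        worker_result_group.join(timeout=timeout)
    except CeleryTimeoutError:
        pass  # hang avoided, but the queued tasks are still orphaned
```

So a timeout on join() only avoids the hang; it does not clean up the orphaned messages, which can still run later or clutter the queue once a matching worker comes online.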
Originally, I was using this script: https://github.com/iterative/dvcx-server/blob/b232559d773dcee8cadc9f1ac8730c0856b94ff8/clickhouse-db-adapter/scripts/run_with_distributed_workers.py to run the tests with distributed workers, but this has been changed a few times since then.
Now they are supposed to be run with this fixture: https://github.com/iterative/studio/blob/v2.126.1/backend/clickhouse-db-adapter/tests/conftest.py#L18, but I'm not sure how this is supposed to work with these tests. The fixture was added in this PR: https://github.com/iterative/dvcx-server/pull/332, which also removed the usage of run_with_distributed_workers.py for tests.
And thanks for finding that strange Celery behavior! I didn't know that's how Celery worked, and yes, we don't have a test for the case where there are no workers running (or no workers with the correct queue, at least).
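For what it's worth, Celery ships a testing helper for running a real worker inside a test; a minimal sketch of such a fixture, assuming the Celery app is importable from the test suite (the module path, app name, and fixture name here are hypothetical, not the actual conftest.py fixture linked above):

```python
import pytest
from celery.contrib.testing.worker import start_worker

# Hypothetical import path; substitute the app that registers the worker tasks.
from studio_worker_tasks import app


@pytest.fixture
def distributed_celery_worker():
    # Starts a real worker (in a background thread) for the duration of
    # the test, so dispatched group tasks are actually consumed and
    # join() has results to wait for instead of hanging on an empty pool.
    # perform_ping_check=False avoids requiring celery.contrib.testing.tasks
    # to be registered with the app.
    with start_worker(app, perform_ping_check=False) as worker:
        yield worker
```

This is just Celery's stock testing helper; the existing conftest fixture linked above may already do something equivalent.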
> Originally, I was using this script
Yep, I saw the script... I thought it was more of a local helper. Good to have more context, thanks. Assuming the fixture works, can we drop the script, or do you use it locally as well?
> And thanks for finding that strange Celery behavior! I didn't know that's how Celery worked, and yes, we don't have a test for the case where there are no workers running (or no workers with the correct queue, at least).
Yep, I don't see an easy fix for this so far. I'll keep looking for a while.
I have been using that script for local debugging, yes. And I don't see an obvious fix for this particular kind of Celery issue either, but I'll think of possible solutions as well.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 88.21%. Comparing base (baf5e9f) to head (c100914). Report is 1 commit behind head on main.
Additional details and impacted files
Coverage Diff

| | main | #4 | +/- |
|---|---|---|---|
| Coverage | 88.18% | 88.21% | +0.02% |
| Files | 133 | 133 | |
| Lines | 12166 | 12157 | -9 |
| Branches | 1699 | 1699 | |
| Hits | 10729 | 10724 | -5 |
| Misses | 1018 | 1016 | -2 |
| Partials | 419 | 417 | -2 |
| Flag | Coverage Δ | |
|---|---|---|
| datachain | 88.13% <ø> (+0.02%) :arrow_up: | |
Deploying datachain-documentation with Cloudflare Pages

| | |
|---|---|
| Latest commit | c100914 |
| Status | ✅ Deploy successful! |
| Preview URL | https://25a2678b.datachain-documentation.pages.dev |
| Branch Preview URL | https://fix-distributed-test.datachain-documentation.pages.dev |
If you ever get back to this, can you please move the tests from DatasetQuery to DataChain?
I'd suggest closing this for now and reopening it when you get back to it. It has been open for quite a while now.
Please reopen this when you get back to it. I am closing for now. :)
@skshetry that's fine. There are some important insights here (it would take a while for someone trying to fix this to rediscover them); I hope people will find them.
Sorry for all the force-push noise above: I found a bug in GitHub where it wasn't picking up any of the commits I pushed, and it was closing the pull request with 0 commits.
Extracted the worker fixture into a separate pull request (https://github.com/iterative/datachain/pull/990), since it's a completely unrelated and independently useful change.