
Use bigger fixture tree for distributed tests

Open shcheklein opened this issue 1 year ago • 8 comments

A follow up from https://github.com/iterative/studio/pull/10211

Since the default tree fixture is too small, UDFDistributor was smart enough to run only a single worker (despite us passing 2 as the UDF workers parameter in tests).

I don't know if that was always the case, but at the moment it means we are testing an edge case (a single worker running as part of the LocalWorkerProcess, I believe) and don't really launch other workers (which take a different code path).

Introducing this bigger fixture reproduced one of the issues we saw in production: shutdown getting stuck. I'm looking into this (unless someone has some immediate ideas) and I'll fix the other tests (unless I hit some walls).
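
For context, a minimal sketch of what such a bigger tree fixture could look like (pytest with a tmp_path layout; the names and sizes here are hypothetical, not the actual datachain fixture):

```python
import pytest

@pytest.fixture
def big_tree(tmp_path):
    # Hypothetical sketch: create enough files that the distributor has a
    # reason to split the work across 2+ workers instead of collapsing to one.
    root = tmp_path / "tree"
    for d in range(10):
        subdir = root / f"dir{d}"
        subdir.mkdir(parents=True)
        for f in range(100):
            (subdir / f"file{f}.txt").write_text(f"{d}-{f}")
    return root
```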

Thanks @dtulga for pointing me to these tests.

shcheklein avatar Jul 10 '24 22:07 shcheklein

Originally this wasn't only running a local worker

It seems it's actually non-trivial to run the second worker properly, since it requires a Celery worker running, and I don't see a fixture for that. Or were you using a different way of running it? (Don't spend too much time on this, but if you remember / could find the initial way we were running the second worker, that might speed up my research.)

Introducing this bigger fixture reproduced one of the issues we saw in production: shutdown getting stuck. I'm looking into this (unless someone has some immediate ideas) and I'll fix the other tests (unless I hit some walls).

Re this: I think I found the root cause, and we potentially need to check some other places.

It hangs on worker_result_group.join() in shutdown_datachain_workers. Why? Because, apparently, worker_result_group.revoke() is not enough: https://github.com/celery/celery/issues/8888 😢 (this Stack Overflow answer also describes it well: https://stackoverflow.com/questions/39191238/revoke-a-task-from-celery). So even if we add a timeout, etc., when the workers are offline (and in this case there was no Celery node to run the worker_result_group in the first place: we don't launch them, and on Studio there was a queue-name discrepancy), we still won't remove the task from the queue, and it might run again or clutter the queue. Not sure if there is a simple workaround... looking more into that.
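
To make the failure mode concrete, here is a minimal sketch (the function name is from the code above, but the body is an illustrative reconstruction, not the actual implementation):

```python
def shutdown_datachain_workers(worker_result_group):
    # revoke() only records the tasks as revoked; if no worker is online to
    # consume the queue, the messages are never acknowledged or discarded.
    worker_result_group.revoke(terminate=True)

    # join() then blocks waiting for results that will never arrive. Even
    # with a timeout, the stale messages stay in the queue and may be picked
    # up (and executed) by the next worker that comes online.
    worker_result_group.join()  # <- hangs here
```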

shcheklein avatar Jul 10 '24 23:07 shcheklein

Originally, I was using this script: https://github.com/iterative/dvcx-server/blob/b232559d773dcee8cadc9f1ac8730c0856b94ff8/clickhouse-db-adapter/scripts/run_with_distributed_workers.py to run the tests with distributed workers, but this has been changed a few times since then.

Now they are supposed to be run with this fixture: https://github.com/iterative/studio/blob/v2.126.1/backend/clickhouse-db-adapter/tests/conftest.py#L18, but I'm not sure how it is supposed to work with these tests. The fixture was added in https://github.com/iterative/dvcx-server/pull/332, which also removed the usage of run_with_distributed_workers.py for tests.
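
For anyone landing here later: Celery ships an in-process test worker helper, so a fixture along these lines is plausible (`app` here is an assumed Celery application; the actual Studio fixture may well differ):

```python
import pytest
from celery.contrib.testing.worker import start_worker

@pytest.fixture
def distributed_worker(app):
    # Runs a real (in-process) Celery worker for the duration of a test, so
    # the distributed code path is exercised instead of the local fallback.
    with start_worker(app, perform_ping_check=False) as worker:
        yield worker
```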

And thanks for finding that strange Celery behavior! I didn't know that's how Celery worked, and yes, we don't have a test for the case where there are no workers running (or no workers with the correct queue, at least).

dtulga avatar Jul 11 '24 00:07 dtulga

Originally, I was using this script

Yep, I saw the script... I thought it was more of a local helper. Good to have more context, thanks. Assuming the fixture works, can we drop the script, or do you also use it locally?

And thanks for finding that strange Celery behavior! I didn't know that's how Celery worked, and yes, we don't have a test for the case where there are no workers running (or no workers with the correct queue, at least).

Yep, I don't see an easy fix for this so far. I'll keep looking for a while.
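
One direction I'm considering (a sketch under assumptions, not a verified fix): detect the no-workers case up front and purge the queued messages instead of relying on revoke():

```python
def safe_shutdown(app, worker_result_group, timeout=30):
    # Ping first: if nobody is consuming the queue, revoke() can't help,
    # since the messages will sit there until some worker appears.
    replies = app.control.inspect(timeout=5).ping() or {}
    if not replies:
        # Drop pending messages so they don't run later or clutter the queue.
        app.control.purge()
        return
    worker_result_group.revoke(terminate=True)
    worker_result_group.join(timeout=timeout, propagate=False)
```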

shcheklein avatar Jul 11 '24 00:07 shcheklein

I have been using that script for local debugging, yes. And I don't see an obvious fix for this particular kind of Celery issue either, but I'll think of possible solutions as well.

dtulga avatar Jul 11 '24 01:07 dtulga

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 88.21%. Comparing base (baf5e9f) to head (c100914). Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #4      +/-   ##
==========================================
+ Coverage   88.18%   88.21%   +0.02%     
==========================================
  Files         133      133              
  Lines       12166    12157       -9     
  Branches     1699     1699              
==========================================
- Hits        10729    10724       -5     
+ Misses       1018     1016       -2     
+ Partials      419      417       -2     
Flag        Coverage Δ
datachain   88.13% <ø> (+0.02%) :arrow_up:

codecov[bot] avatar Jul 15 '24 19:07 codecov[bot]

Deploying datachain-documentation with Cloudflare Pages

Latest commit: c100914
Status: ✅  Deploy successful!
Preview URL: https://25a2678b.datachain-documentation.pages.dev
Branch Preview URL: https://fix-distributed-test.datachain-documentation.pages.dev


If you ever get back to this, can you please move the tests from DatasetQuery to DataChain?

mattseddon avatar Sep 12 '24 06:09 mattseddon

I'll suggest closing this for now and reopening when you get back to it. It has been open for quite a while now.

skshetry avatar Sep 25 '24 03:09 skshetry

Please reopen this when you get back to it. I am closing for now. :)

skshetry avatar Mar 11 '25 16:03 skshetry

@skshetry that's fine. There are some important insights here (it would take someone trying to fix this a while to rediscover them); I hope people will find them.

shcheklein avatar Mar 11 '25 16:03 shcheklein

Sorry for all the force-push noise above: I found a bug in GitHub where it wasn't picking up any of the commits I pushed, and it was closing the pull request with 0 commits.

0x2b3bfa0 avatar Mar 23 '25 21:03 0x2b3bfa0

Extracted the worker fixture to a separate pull request (https://github.com/iterative/datachain/pull/990) because it's a completely unrelated and independently useful change.

0x2b3bfa0 avatar Mar 24 '25 00:03 0x2b3bfa0