tango Running a Beaker Executor job leaves loads of uncommitted datasets in the workspace

🐛 Describe the bug

Run the catwalk training job specified here: https://github.com/allenai/catwalk/commit/5ba019204b0ff36c1c4da7feab4515342e9d9ad2

Command line is tango --settings experiments/train_all_the_things/tango.yml run experiments/train_all_the_things/train_all_the_things.jsonnet.

It will run for quite a while. Two jobs will usually fail. I don't know why it's always two.

When the run is complete, it tells you that two jobs failed, but you have no way of finding the logs for the failed jobs. (#387)
When you run it again, it finds all the succeeded jobs, as it should. The failed ones will start up again, but hang indefinitely somewhere in the Beaker workspace.
When you check your workspace, you will see hundreds of uncommitted datasets. Many of these contain actual, successful results. But two of them are the cause for the hang. It's impossible to know which ones to delete so I can finish my run. You can see many of these uncommitted datasets in the ai2/task-complexity workspace.

The actual problem here is the undiagnosable failure of the original jobs, and then the undiagnosable hang (which can be fixed if you remove the right uncommitted datasets). From a user perspective, having all those uncommitted datasets doesn't matter as such, but I suspect that it points to a deeper issue.

Versions

asd

Sep 05 '22 20:09 dirkgr

I did another run like this. This time only one of them failed. The results table points me to this dataset: https://beaker.org/ds/01GC7PCX5M9B357GX6YJFY7C5R/details.

This is clearly incomplete. Its presence will prevent a re-run from succeeding. I know of no way to find the logs that would show me why it failed.

Sep 06 '22 02:09 dirkgr

Ah! I can search experiments by name, which reveals this error message:

2022-09-05T21:18:13.535899802Z {"name": "root", "msg": "[step trained_model_arc_challenge_bert-base-uncased_2147483647] Uncaught exception", "args": {"py/tuple": []}, "levelname": "ERROR", "levelno": 40, "pathname": "/opt/conda/lib/python3.9/site-packages/tango/common/logging.py", "filename": "logging.py", "module": "logging", "exc_info": {"py/tuple": [{"py/type": "beaker.exceptions.BeakerError"}, {"py/reduce": [{"py/type": "beaker.exceptions.BeakerError"}, {"py/tuple": ["appendErr after appendLimiter.Wait (chunkSize=2316455): googleapi: Error 429: The rate of change requests to the object fileheap-public/uploads/400c75/01gc7r8yz8thyxx16sbej4xg28_$folder$ exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded"]}, {}]}, {"tb_frame": {"f_globals": {"__name__": "__main__", "__file__": "/opt/conda/bin/tango"}, "f_code": {"co_filename": "/opt/conda/bin/tango", "co_name": "<module>"}, "f_lineno": 8}, "tb_lineno": 8, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "__call__"}, "f_lineno": 1130}, "tb_lineno": 1130, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "main"}, "f_lineno": 1098}, "tb_lineno": 1055, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "invoke"}, "f_lineno": 1657}, "tb_lineno": 1657, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "invoke"}, "f_lineno": 1404}, "tb_lineno": 1404, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "invoke"}, "f_lineno": 760}, "tb_lineno": 760, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.__main__", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/__main__.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/__main__.py", "co_name": "beaker_executor_run"}, "f_lineno": 393}, "tb_lineno": 393, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.executor", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/executor.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/executor.py", "co_name": "execute_step"}, "f_lineno": 68}, "tb_lineno": 68, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.step", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/step.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/step.py", "co_name": "ensure_result"}, "f_lineno": 615}, "tb_lineno": 615, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.step", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/step.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/step.py", "co_name": "result"}, "f_lineno": 589}, "tb_lineno": 586, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.step", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/step.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/step.py", "co_name": "_run_with_work_dir"}, "f_lineno": 431}, "tb_lineno": 424, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.integrations.beaker.workspace", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py", "co_name": "step_finished"}, "f_lineno": 163}, "tb_lineno": 163, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.integrations.beaker.step_cache", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py", "co_name": "__setitem__"}, "f_lineno": 150}, "tb_lineno": 144, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.integrations.beaker.step_cache", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py", "co_name": "_sync_step_dataset"}, "f_lineno": 75}, "tb_lineno": 72, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.dataset", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py", "co_name": "sync"}, "f_lineno": 468}, "tb_lineno": 464, "tb_next": {"tb_frame": {"f_globals": {"__name__": "concurrent.futures._base", "__file__": "/opt/conda/lib/python3.9/concurrent/futures/_base.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/concurrent/futures/_base.py", "co_name": "result"}, "f_lineno": 451}, "tb_lineno": 439, "tb_next": {"tb_frame": {"f_globals": {"__name__": "concurrent.futures._base", "__file__": "/opt/conda/lib/python3.9/concurrent/futures/_base.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/concurrent/futures/_base.py", "co_name": "__get_result"}, "f_lineno": 394}, "tb_lineno": 391, "tb_next": {"tb_frame": {"f_globals": {"__name__": "concurrent.futures.thread", "__file__": "/opt/conda/lib/python3.9/concurrent/futures/thread.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/concurrent/futures/thread.py", "co_name": "run"}, "f_lineno": 64}, "tb_lineno": 58, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.dataset", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py", "co_name": "_upload_file"}, "f_lineno": 623}, "tb_lineno": 587, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.service_client", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py", "co_name": "request"}, "f_lineno": 98}, "tb_lineno": 98, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.service_client", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py", "co_name": "make_request"}, "f_lineno": 92}, "tb_lineno": 90, "tb_next": null}}}}}}}}}}}}}}}}}}}}}]}, "exc_text": null, "stack_info": null, "lineno": 381, "funcName": "log_exc_info", "created": 1662412693.5311759, "msecs": 531.1758518218994, "relativeCreated": 1685521.075963974, "thread": 140370629374528, "threadName": "MainThread", "processName": "MainProcess", "process": 1, "highlighter": {"py/object": "rich.highlighter.ReprHighlighter"}}

So that explains why it's always two that fail. We're writing stuff to beaker too quickly.

Sep 06 '22 02:09 dirkgr

I will leave all this up like it is in case it helps with debugging.

Sep 06 '22 02:09 dirkgr

I did another run like this. This time only one of them failed. The results table points me to this dataset: https://beaker.org/ds/01GC7PCX5M9B357GX6YJFY7C5R/details.

This is clearly incomplete. Its presence will prevent a re-run from succeeding. I know of no way to find the logs that would show me why it failed.

What happens here? Does it hang? Do you know where it hangs? I'm guessing the 429 error you discovered leaves this dataset in a bad state somehow.

Sep 06 '22 20:09 epwalsh

The job finishes with (in this case) one failure and a few dependent steps not run. The problems with how this goes down are these:

It shouldn't have failed in the first place. The error was a beaker error that happened while writing the results, I think. Details in https://github.com/allenai/tango/issues/386#issuecomment-1237590083.
When it fails, it should give me an easy click to the logs that show the error. Instead get a click to the dataset, which is incomplete. I can only find the logs if I search for the step part of the dataset name in my list of experiments. That's not very discoverable.
When I re-run, it should not get stuck. Currently, the second run will get stuck when it's time to start the step. This might be the 429 error, so it might be already fixed.

Sep 06 '22 20:09 dirkgr

Currently, the second run will get stuck when it's time to start the step.

What do you mean get stuck?

Sep 06 '22 20:09 epwalsh

tango tango copied to clipboard

Running a Beaker Executor job leaves loads of uncommitted datasets in the workspace

🐛 Describe the bug

Versions

tango
tango copied to clipboard