neptune-client

BUG: neptune sync is not fault-tolerant

Open cemde opened this issue 2 years ago • 6 comments

Describe the bug

When an error occurs with a single run during neptune sync, the script stops. Instead, it should skip that run and print the error.

Reproduction

  1. Write a neptune log from inside a Docker container, such that the log files cause permission errors when accessed from outside the container
  2. Try to sync from outside the Docker container

The same failure occurs with other kinds of file corruption as well.

Expected behavior

When neptune encounters a run it can't sync, it should skip it, continue with the next one, and at the end list all runs it couldn't sync.

Traceback

cornelius@pssr2:~/PCJax/logs$ neptune sync -p user/Project
Traceback (most recent call last):
  File "/users-2/cornelius/.conda/envs/pcjax/bin/neptune", line 8, in <module>
    sys.exit(main())
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/commands.py", line 173, in sync
    sync_runner.sync_all_containers(path, project_name)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 242, in sync_all_containers
    self.sync_all_offline_containers(base_path, project_name)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 220, in sync_all_offline_containers
    self.sync_offline_containers(base_path, project_name, offline_dirs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 213, in sync_offline_containers
    registered_containers = self.register_offline_containers(base_path, project, offline_dirs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 191, in register_offline_containers
    self._move_offline_container(
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 177, in _move_offline_container
    (base_path / OFFLINE_DIRECTORY / offline_dir).rename(
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/pathlib.py", line 1234, in rename
    self._accessor.rename(self, target)
PermissionError: [Errno 13] Permission denied: '/users-2/cornelius/PCJax/logs/.neptune/offline/run__ba1c7901-881f-4af6-820e-014e0a698319' -> '/users-2/cornelius/PCJax/logs/.neptune/async/run__9877526b-8f3d-4c95-a813-a66b26e926cd/exec-0-offline'

Neptune Version

neptune-client            0.16.16                  pypi_0    pypi

cemde avatar Feb 14 '23 16:02 cemde

Hey @cemde

Could you try updating to the latest release and let me know if the issue persists?

Blaizzy avatar Feb 16 '23 12:02 Blaizzy

@Blaizzy still exists

cemde avatar Feb 16 '23 13:02 cemde

That's odd.

Has it worked in the past?

Blaizzy avatar Feb 16 '23 14:02 Blaizzy

I never noticed it before, but I also never logged from inside a Docker image. The PermissionError is justified. It should just be caught properly and then logged. In pseudo-Python:

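import traceback
from tqdm import tqdm

objects2sync = [obj1, obj2, ...]  # placeholder: the runs to sync
failed_objs = []
for obj in tqdm(objects2sync):
    try:
        sync_object(obj)  # placeholder: the per-run sync call
    except Exception:
        # Record the failure with its traceback and continue with the next run
        failed_objs.append((obj._id, obj._short_id, traceback.format_exc()))
failed_ids = {obj_id for obj_id, _, _ in failed_objs}
print("Successful:", [obj._id for obj in objects2sync if obj._id not in failed_ids])
print("Failed:", failed_objs)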
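
For illustration, here is a self-contained version of the same idea applied to the step that fails in the traceback (the directory rename). This is only a sketch, not neptune's actual implementation: the function name move_offline_dirs and the directory layout are hypothetical. The failure is simulated with a non-empty directory already sitting at the destination, which makes rename() raise an OSError, so it does not depend on Docker or file permissions.

```python
import tempfile
import traceback
from pathlib import Path


def move_offline_dirs(offline_root: Path, target_root: Path):
    """Move every run directory; skip (rather than abort on) the ones that fail.

    Hypothetical helper for illustration only, not part of neptune.
    """
    moved, failed = [], []
    for run_dir in sorted(p for p in offline_root.iterdir() if p.is_dir()):
        try:
            run_dir.rename(target_root / run_dir.name)
        except OSError:  # PermissionError is a subclass of OSError
            # Keep the name and traceback so they can be reported at the end
            failed.append((run_dir.name, traceback.format_exc()))
            continue
        moved.append(run_dir.name)
    return moved, failed


with tempfile.TemporaryDirectory() as tmp:
    offline = Path(tmp) / "offline"
    target = Path(tmp) / "async"
    offline.mkdir()
    target.mkdir()
    for name in ("run_a", "run_b", "run_c"):
        (offline / name).mkdir()
    # Simulate one unsyncable run: a non-empty directory at the destination
    (target / "run_b" / "blocker").mkdir(parents=True)

    moved, failed = move_offline_dirs(offline, target)

print("Moved:", moved)
print("Failed:", [name for name, _ in failed])
```

The loop catches OSError rather than using a bare except, so that KeyboardInterrupt still stops the sync, while permission problems and similar filesystem errors are collected and reported once all runs have been attempted.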

cemde avatar Feb 16 '23 14:02 cemde

Let me check with the team and come back to you

Blaizzy avatar Feb 16 '23 15:02 Blaizzy

Hey @cemde

I've discussed it with the team, and we decided to send your issue to our product team as a feature request. They will take it from here and explore how to incorporate it into our future plans.

While I don't have an ETA for this feature, I do want to keep you in the loop. You can stay up-to-date with our product roadmap by checking out our portal at https://portal.neptune.ai/tabs/15-planned.

Thanks for sharing your feedback! Really appreciate it :)

Blaizzy avatar Feb 17 '23 15:02 Blaizzy