Upload files into workspaces

Open dirkgr opened this issue 3 years ago • 0 comments

Motivation: Tango's Workspace concept makes it possible to move your workflow from one machine to another, but this breaks down when your code references files on the local file system. If we can upload files into workspaces, we can close this last gap in the Tango job portability story.

Files in a Tango workspace will be stored under two keys: a sha256 of their contents, and their full local path. We will add a special TangoFile type that extends FromParams to represent these files. A TangoFile can be initialized with the sha or with the full local path.

  • If we see a sha, we download the file from the workspace into a local cache if necessary.
  • If we see a local path and a file exists locally at that location, we upload the file if necessary and store it under both a sha and the full local path.
  • If we see a local path and no file exists locally at that location, we try to find it in the workspace. This way, local paths keep working when you move from machine to machine, as long as you ran at least once where the path existed.
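The three cases above can be sketched roughly as follows. This is only an illustration, not the proposed implementation: `InMemoryWorkspace`, `resolve`, and the `cache_dir` parameter are hypothetical names, and the workspace is faked with two dictionaries so the example runs on its own.

```python
import hashlib
import os
from pathlib import Path
from typing import Dict, Optional


class InMemoryWorkspace:
    """Hypothetical stand-in for a workspace's file store (not a Tango API)."""

    def __init__(self) -> None:
        self.by_sha: Dict[str, bytes] = {}      # sha256 -> file contents
        self.sha_by_path: Dict[str, str] = {}   # full local path -> sha256

    def upload(self, full_path: str, data: bytes) -> str:
        sha = hashlib.sha256(data).hexdigest()
        self.by_sha.setdefault(sha, data)       # upload only if necessary
        self.sha_by_path[full_path] = sha       # also store under the path key
        return sha


def resolve(
    ws: InMemoryWorkspace,
    cache_dir: Path,
    sha: Optional[str] = None,
    local_path: Optional[str] = None,
) -> Path:
    """Return a local path to the file, following the three cases above."""
    if sha is None:
        if local_path is None:
            raise ValueError("need either a sha or a local path")
        full_path = os.path.abspath(local_path)
        if os.path.isfile(full_path):
            # Local path exists: upload under both keys and use it directly.
            with open(full_path, "rb") as f:
                ws.upload(full_path, f.read())
            return Path(full_path)
        # Local path is missing: fall back to the workspace's path key.
        sha = ws.sha_by_path.get(full_path)
        if sha is None:
            raise FileNotFoundError(full_path)
    # We have a sha: download into the local cache if necessary.
    cached = cache_dir / sha
    if not cached.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(ws.by_sha[sha])
    return cached
```

Keying the cache by sha means a file that was uploaded from one path and later requested by sha (or by a path that no longer exists) resolves to the same cached copy.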

We treat directories as if they were files. The content of a directory is a list of the shas of its contents, sorted alphabetically (so that the sha of the directory is always the same). That means we have to keep a special flag somewhere that marks a file as a directory, and add some logic to TangoFile that can resolve directories without downloading all the files in them. That's important because directories sometimes contain 100GB of images, and you don't want to wait for all of them to download before your training can start.
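A minimal sketch of the directory hashing described above, under a few assumptions that the proposal does not pin down: the manifest is the sorted shas joined by newlines, subdirectories are hashed recursively the same way, and entry names are not part of the manifest. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def sha_of_file(path: Path) -> str:
    """sha256 of a regular file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def directory_manifest(path: Path) -> List[str]:
    """The 'content' of a directory: the shas of its entries, sorted."""
    return sorted(sha_of_entry(child) for child in path.iterdir())


def sha_of_entry(path: Path) -> str:
    """sha of a file, or of a directory's manifest (assumed newline-joined)."""
    if path.is_dir():
        manifest = "\n".join(directory_manifest(path)).encode("utf-8")
        return hashlib.sha256(manifest).hexdigest()
    return sha_of_file(path)
```

Because the manifest is sorted, the directory's sha does not depend on the order in which the file system lists its entries. The manifest is also small, so a TangoFile could fetch just the manifest and then download individual entries by sha only when they are needed.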

dirkgr, May 17 '22 19:05