`log_artifact`: external and non-DVC tracked files support

Open shcheklein opened this issue 2 years ago • 8 comments

More of a question for now:

  • Should DVC tracking be optional / disabled by default? E.g. if model weights are in Git-lfs? I can still see the model in the registry, but I don't need DVC remote, etc, etc.
  • How should we treat S3 files? dvc.yaml should support them I think. Do we need to create import file .dvc or not?

shcheklein avatar May 01 '23 05:05 shcheklein

Feels like the questions are assuming log_artifact is coupled with the model registry but strictly speaking, it is not.

As of today, it is decoupled from an implementation perspective (log_artifact creates the .dvc file, but it is make_dvcyaml that writes the artifacts section), and I would also like to think that it should not be coupled from a product perspective.

For those scenarios, why would you want to use the log_artifact Python API for registering the model? Is it more convenient than writing the artifacts section in the dvc.yaml or using the UI?

If we still want a Python API, should we make it part of dvc.api? Does it belong in the DVCLive logger?

daavoo avatar May 01 '23 08:05 daavoo

  • Should DVC tracking be optional / disabled by default? E.g. if model weights are in Git-lfs? I can still see the model in the registry, but I don't need DVC remote, etc, etc.

The use case I can think of is the Hugging Face integration. Is that what you have in mind?

Would we also make dvc get work with git-lfs? Do you have a use case where model registry is useful without being able to retrieve the artifact?

  • How should we treat S3 files? dvc.yaml should support them I think. Do we need to create import file .dvc or not?

@daavoo If we import the model, isn't that part of the core functionality of log_artifact? I think this is an interesting idea because it helps introduce a way to manage external data, which is a major source of confusion today.

dberenbaum avatar May 01 '23 14:05 dberenbaum

Would we also make dvc get work with git-lfs?

I don't know yet. In the end, people might also decide on their own how exactly to bring the artifact from a commit.

shcheklein avatar May 01 '23 14:05 shcheklein

I don't know yet. In the end, people might also decide on their own how exactly to bring the artifact from a commit.

Can we come up with a use case where model registry is needed in this scenario?

dberenbaum avatar May 01 '23 14:05 dberenbaum

Can we come up with a use case where model registry is needed in this scenario?

To be honest, I don't see what difference it makes whether it is DVC-tracked or not. All the same scenarios apply, no? Find a specific version of a model (by a tag) and fetch it to deploy. Assign stages, etc, etc. In this case dvc.yaml helps to see them in the MR and to see some additional metadata.

Could you maybe clarify your question, @dberenbaum?

shcheklein avatar May 01 '23 16:05 shcheklein

people might also decide on their own how exactly to bring the artifact from a commit

Find a specific version of a model (by a tag) and fetch it to deploy.

How do you envision this workflow if the artifact is managed by git lfs? What commands would I run in my deploy script?

dberenbaum avatar May 01 '23 17:05 dberenbaum

@dberenbaum I'm not that familiar with Git LFS, but from what I remember you could probably manage it with git pull, or in the case of S3 (e.g. HF does it with Git LFS) even get a link to an artifact. Again, not 100% sure, but I would be surprised if there is a limitation like not being able to fetch a file from a specific revision.

shcheklein avatar May 01 '23 17:05 shcheklein

There are two mechanisms we could use in dvc for this:

  1. Use dvc import-url --no-download. This already exists and allows the user to still have the option to get/pull the data into the repo later, but it only works for external files (I don't think it will work with git-lfs files for example).
  2. We could easily add some option like dvc add --no-cache which would add cache: false to the resulting .dvc file and work with external files. You can't retrieve the files, but it's simpler and closer to what other loggers provide for external files (and probably simpler for cli users looking to track external files).
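
As a rough illustration of option 1, here is a minimal sketch of how a script might assemble that command. `build_import_url_command` is a hypothetical helper (it is not part of DVC or DVCLive) and the S3 URL is a placeholder. The sketch only builds the argument list; actually running it would need DVC installed and a DVC repo as the working directory.

```python
from typing import List, Optional


def build_import_url_command(url: str, out: Optional[str] = None) -> List[str]:
    """Build the argv for `dvc import-url --no-download <url> [out]`.

    Hypothetical helper for illustration only.
    """
    cmd = ["dvc", "import-url", "--no-download", url]
    if out is not None:
        # Optional destination path for the import's .dvc file / data.
        cmd.append(out)
    return cmd


if __name__ == "__main__":
    # Placeholder URL; pass the list to subprocess.run(...) to execute it.
    print(" ".join(build_import_url_command("s3://my-bucket/model.pt", "model.pt")))
```

Option 2 has no equivalent sketch here because `dvc add --no-cache` is only a proposal in this thread.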

Neither of these automatically detects whether the files are version-aware today. It would be great if we could add support for that in dvc since I see it in other loggers, but I can't remember the obstacles to doing it (cc @pmrowla).

  • Should DVC tracking be optional / disabled by default? E.g. if model weights are in Git-lfs? I can still see the model in the registry, but I don't need DVC remote, etc, etc.

Neptune is the only logger I have found that supports tracking local files without uploading them, so I'm not sure it should be a high priority, but it's possible to support it with option 2 above.

How should we expose this functionality in dvclive? Some options:

  1. Use it automatically for external artifacts (Live.log_artifact("s3://...")).
  2. Change Live.log_artifact(cache=False) to use this (we may have to tweak the lightning callback).
  3. Add another arg for it in Live.log_artifact().
  4. Add a separate method like Live.log_url() or Live.log_reference().
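
To make options 1 and 2 more concrete, here is a minimal sketch of how the dispatch could look. Everything in it is an assumption for discussion: `REMOTE_SCHEMES`, `is_external`, `tracking_mode`, and the mode names are hypothetical and do not exist in DVCLive, and the scheme list is illustrative rather than the set DVC actually supports.

```python
from urllib.parse import urlparse

# Assumption: an illustrative, non-exhaustive set of remote URL schemes.
REMOTE_SCHEMES = {"s3", "gs", "azure", "http", "https", "hdfs", "ssh"}


def is_external(path: str) -> bool:
    """Return True if `path` looks like a URL with a known remote scheme.

    Checking against a fixed set avoids treating a Windows drive letter
    (for example "C:\\models\\model.pt") as a URL scheme.
    """
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def tracking_mode(path: str, cache: bool = True) -> str:
    """Pick a hypothetical tracking mode for an artifact path."""
    if is_external(path):
        # Option 1: external artifacts are recorded by reference,
        # e.g. via `dvc import-url --no-download`.
        return "reference"
    # Option 2: cache=False maps to the proposed no-cache tracking.
    return "cached" if cache else "no-cache"
```

With this shape, `Live.log_artifact("s3://...")` would need no new argument, at the cost of making the behavior depend implicitly on the path. Options 3 and 4 trade that implicitness for a larger API surface.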

Some other loggers for comparison (note that mlflow does not support this pattern at all AFAICT):

  • https://docs.wandb.ai/guides/artifacts/track-external-files
  • https://docs.neptune.ai/logging/artifacts/
  • https://www.comet.com/docs/v2/guides/data-management/remote-artifacts/

dberenbaum avatar Aug 03 '23 21:08 dberenbaum