Dataset namespaces

Open dmpetrov opened this issue 7 months ago • 6 comments

Description

Right now, local and global/Studio datasets use the same names, which causes confusion:

  1. dc.read_dataset("mycats") is unclear - it depends on the local state, which may be outdated or conflicting.
  2. The API is cluttered with studio=True/False flags

UPDATE:

The idea is to give every dataset a fully qualified name consisting of namespace, project and dataset name joined with `.`, so the schema would be <namespace>.<project>.<dataset_name>, e.g. dev.my_project.my_ds.
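To make the naming schema concrete, here is a minimal sketch of parsing and formatting such a name. `DatasetRef`, `parse_fqn` and `format_fqn` are hypothetical helpers written for illustration, not part of the datachain API, and the sketch assumes that none of the three parts contains a `.` itself.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    """Hypothetical container for the three parts of a fully qualified name."""
    namespace: str
    project: str
    name: str


def parse_fqn(fqn: str) -> DatasetRef:
    """Split '<namespace>.<project>.<dataset_name>' into its three parts."""
    parts = fqn.split(".")
    # Assumption: exactly three non-empty parts, none containing a dot.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <namespace>.<project>.<dataset_name>, got {fqn!r}"
        )
    return DatasetRef(*parts)


def format_fqn(ref: DatasetRef) -> str:
    """Join the three parts back into a fully qualified name."""
    return ".".join(ref)


ref = parse_fqn("dev.my_project.my_ds")
print(ref.namespace, ref.project, ref.name)
```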

Phase 1

  • [ ] User can create namespace and project with new API, e.g. dc.namespaces.create("dev") and dc.projects.create("chatbot")
  • [ ] User can remove namespace and project with new API, e.g. dc.namespaces.delete("dev") and dc.projects.delete("chatbot")
  • [ ] User should be able to save dataset into created namespace / project in 2 ways:
      - dc.use("dev", "chatbot").from_storage(...).save("text_train_ds")
      - dc.from_storage(...).save("dev.chatbot.text_train_ds")

Questions:

  1. Should we add namespace and project in Settings instead of introducing new method DataChain.use(...)? A: use settings for now
  2. Is there a default namespace and project? Probably yes, so what should we call them? A:
       * local.local -> local
       * users.<user_name> -> Studio
     Users cannot create a new namespace explicitly in the local environment. If a user pulls a dataset from Studio, the namespace / project for that dataset is created implicitly, because the dataset name must stay the same. Can a user delete the default namespace? Probably not.
  3. If user can delete namespace / project, what happens with datasets that were in them - are they moved to some default namespace / project? If there is no default then we need to remove them? A: User is not allowed to delete namespace if datasets are inside of it
  4. Is a user allowed to create a dataset without a fully qualified name (or without using .use()) and, if yes, is the dataset put into the default namespace / project? E.g. dc.from_storage(...).save("my-ds"). Similarly, if the user doesn't specify namespace / project on read, do we try to find the dataset in the default namespace or throw an error, e.g. dc.read_dataset("my-ds")? A: yes, the default namespace is used
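The answers to questions 2 to 4 can be sketched with a toy in-memory catalog. Everything here (`ToyCatalog`, `NamespaceNotEmptyError`, the method names) is hypothetical and only illustrates the rules discussed above: unqualified names fall back to the default `local.local`, a namespace that still holds datasets cannot be deleted, and (tentatively, per "probably not") the default namespace cannot be deleted. For brevity the sketch creates namespaces and projects implicitly on save, which is a simplification rather than the proposed behavior.

```python
DEFAULT_NAMESPACE = "local"
DEFAULT_PROJECT = "local"


class NamespaceNotEmptyError(Exception):
    """Raised when deleting a namespace that still contains datasets."""


class ToyCatalog:
    """Hypothetical in-memory stand-in for a dataset catalog."""

    def __init__(self) -> None:
        # Maps (namespace, project) -> set of dataset names.
        self.projects = {(DEFAULT_NAMESPACE, DEFAULT_PROJECT): set()}

    def qualify(self, name: str) -> tuple:
        """Expand a bare dataset name to (namespace, project, name)."""
        parts = name.split(".")
        if len(parts) == 1 and parts[0]:
            # Question 4: the default namespace / project is used.
            return (DEFAULT_NAMESPACE, DEFAULT_PROJECT, name)
        if len(parts) == 3 and all(parts):
            return tuple(parts)
        raise ValueError(f"cannot interpret dataset name {name!r}")

    def save(self, name: str) -> str:
        """Register a dataset and return its fully qualified name."""
        namespace, project, ds = self.qualify(name)
        # Simplification: namespace / project are created implicitly here.
        self.projects.setdefault((namespace, project), set()).add(ds)
        return f"{namespace}.{project}.{ds}"

    def delete_namespace(self, namespace: str) -> None:
        """Delete a namespace only if it is empty and not the default."""
        if namespace == DEFAULT_NAMESPACE:
            raise ValueError("the default namespace cannot be deleted")
        keys = [key for key in self.projects if key[0] == namespace]
        # Question 3: refuse deletion while datasets are inside.
        if any(self.projects[key] for key in keys):
            raise NamespaceNotEmptyError(namespace)
        for key in keys:
            del self.projects[key]


catalog = ToyCatalog()
print(catalog.save("my-ds"))
print(catalog.save("dev.chatbot.text_train_ds"))
```

With this sketch, `catalog.delete_namespace("dev")` raises `NamespaceNotEmptyError` because `text_train_ds` is still stored under `dev.chatbot`.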

Follow up

  • [ ] Add ability to move dataset from one namespace / project to another
  • [ ] Add ability to rename namespace / project?
  • [ ] Studio & local datasets refactoring (bigger project)

Follow-up questions:

  1. How should we distinguish Studio and local datasets? A: local is a reserved keyword, and any name that does not use local is treated as a Studio dataset, e.g. dev.my_project.my_ds -> Studio dataset, local.local.my_ds -> local dataset. dc.read_dataset("dev.my_project.my_ds").save("dev.my_project.my_ds") (a different name can also be chosen)
  2. Should reading a dataset from Studio automatically cache (save) that dataset locally with the same name / version or not? Should we have an additional flag for this, e.g. dc.read_dataset(..., studio_cache=True)? What if a dataset with the same name / version already exists locally but holds different data (different UUID)? A: we should cache automatically, no additional flag is needed. If the same dataset exists locally then throw an exception?
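A rough sketch of both answers, under stated assumptions: `is_studio_dataset`, `cache_studio_dataset` and `CacheConflictError` are hypothetical names, the local cache is modelled as a plain dict keyed by (fully qualified name, version), and raising on a UUID mismatch follows the tentative "throw exception?" answer rather than a settled decision. The version string and UUIDs in the usage lines are made up for illustration.

```python
LOCAL_NAMESPACE = "local"  # reserved keyword, per the answer to question 1


class CacheConflictError(Exception):
    """Same name / version exists locally but with a different UUID."""


def is_studio_dataset(fqn: str) -> bool:
    """Any namespace other than the reserved 'local' means Studio."""
    namespace = fqn.split(".", 1)[0]
    return namespace != LOCAL_NAMESPACE


def cache_studio_dataset(local_cache: dict, fqn: str, version: str, uuid: str) -> bool:
    """Cache a Studio dataset locally under the same name / version.

    Returns True if a new cache entry was written, False if the identical
    dataset was already cached. Raises CacheConflictError on a UUID mismatch.
    """
    key = (fqn, version)
    existing = local_cache.get(key)
    if existing is None:
        local_cache[key] = uuid
        return True
    if existing != uuid:
        raise CacheConflictError(
            f"{fqn}@{version} exists locally with a different UUID"
        )
    return False


cache = {}
print(is_studio_dataset("dev.my_project.my_ds"))
print(is_studio_dataset("local.local.my_ds"))
print(cache_studio_dataset(cache, "dev.my_project.my_ds", "1.0.0", "uuid-a"))
```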

dmpetrov, May 02 '25 01:05

Just talked with @shcheklein - we had an idea to improve this:

Let’s use / as a prefix for global (Studio) datasets, so we can keep @ for version naming like [email protected], which will be important with upcoming SemVer support (#1076).

So:

  • Global dataset - /mycat
  • Local dataset - mycat
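As an illustration of the convention, a name could be split into scope, dataset name and optional version roughly like this. `parse_dataset_name` is a hypothetical helper, not a datachain function, and the version string used in the usage line is made up.

```python
from typing import Optional, Tuple


def parse_dataset_name(raw: str) -> Tuple[bool, str, Optional[str]]:
    """Return (is_global, name, version) for names like '/mycat' or 'mycat'.

    A leading '/' marks a global (Studio) dataset; an optional '@<version>'
    suffix carries the version.
    """
    is_global = raw.startswith("/")
    body = raw[1:] if is_global else raw
    name, _, version = body.partition("@")
    if not name:
        raise ValueError(f"missing dataset name in {raw!r}")
    # An absent or empty version suffix is reported as None.
    return is_global, name, version or None


print(parse_dataset_name("/mycat"))
print(parse_dataset_name("mycat"))
print(parse_dataset_name("/mycat@1.0.0"))  # hypothetical version string
```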

PS1: Something to keep in mind: code should be reusable in both CLI and Studio. This naming convention seems to satisfy that requirement. This code should work in both CLI and Studio:

ds = dc.read_dataset("/mycats")
ds1 = ds.filter(dc.C("color") == "Red").save("red-cats")  # <-- Local dataset
ds2 = ds1.map(....).save("/my_red_cats_with_bmi_index")

PS2: Versioning is outside the scope of this issue.

dmpetrov, May 02 '25 19:05

Another idea: maybe use studio/mycats instead of just /mycats? This is more verbose but clearer, and similar to the git branch naming convention where we have origin/mycats. Having only / as a prefix looks like a relative vs absolute path distinction to me...

Also, ds = dc.read_dataset("/mycats") run in Studio is basically the same as ds = dc.read_dataset("mycats"), since there a local dataset is the same as a Studio dataset, right?

BTW I would maybe avoid using Global terminology and use only Studio to avoid confusion and having multiple words for the same thing, WDYT?

ilongin, May 04 '25 22:05

@ilongin that's a good idea, but we need to keep in mind that we will need to introduce org/team in the future, like myorg/[email protected], and I'm thinking about an empty org / as the default for the user's team. If we introduce a studio org it won't look good, since it reads as a specific org name.

dmpetrov, May 05 '25 22:05

@dmpetrov just to note, the default user team will be the one written in the config file (by running datachain auth team <team_name>) or set through the env variable DVC_STUDIO_TEAM.
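A small sketch of how the empty org prefix could resolve to the default team. `resolve_team` is a hypothetical helper, `config_team` stands in for whatever value datachain auth team wrote to the config file, and the precedence (env variable over config) is an assumption, not something stated in this thread.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_team(
    raw: str,
    config_team: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], str]:
    """Return (team, dataset) for 'myorg/ds', '/ds' or a bare local 'ds'."""
    env = os.environ if env is None else env
    org, sep, rest = raw.partition("/")
    if not sep:
        # No '/' at all: a local dataset, no team involved.
        return None, raw
    if org:
        # Explicit org / team prefix, e.g. 'myorg/mycat'.
        return org, rest
    # Empty org ('/mycat'): fall back to the user's default team.
    # Assumption: the env variable takes precedence over the config file.
    team = env.get("DVC_STUDIO_TEAM") or config_team
    if team is None:
        raise ValueError("no default team configured")
    return team, rest


print(resolve_team("myorg/mycat", env={}))
print(resolve_team("/mycat", config_team="team_from_config", env={}))
print(resolve_team("/mycat", env={"DVC_STUDIO_TEAM": "team_from_env"}))
```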

ilongin, May 13 '25 13:05

@shcheklein @dmpetrov Question about datachain pull: currently we can set an optional local dataset name / version to which the Studio dataset will be pulled. I'm wondering if we should remove this and put into the "contract" that everything pulled from Studio must have the same fully qualified name locally. datachain pull is basically just a cache of the Studio dataset anyway, and I don't see any reason for users to store it under a different name locally. The option to rename it locally just complicates things, especially now that we will have namespaces / projects.

ilongin, May 28 '25 21:05

Sure, let's keep it simple.

If users need a special local name, they can read the dataset and save it under a local name.

dmpetrov, May 29 '25 17:05