Dataset namespaces
Description
Right now, local and global/Studio datasets use the same names, which causes confusion:
- dc.read_dataset("mycats") is unclear - it depends on the local state, which may be outdated or conflicting.
- The API is cluttered with studio=True/False flags.
UPDATE:
The idea is to give every dataset a fully qualified name made of the namespace, project and dataset name joined with dots (.), so the schema is <namespace>.<project>.<dataset_name>, e.g. dev.my_project.my_ds.
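To make the schema concrete, here is a minimal sketch of splitting a fully qualified name into its three parts. QualifiedName and parse_qualified_name are hypothetical helpers for illustration, not part of the datachain API, and the sketch assumes that none of the three parts may contain a dot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedName:
    # Hypothetical container, not a datachain class.
    namespace: str
    project: str
    dataset: str

    def __str__(self) -> str:
        return f"{self.namespace}.{self.project}.{self.dataset}"


def parse_qualified_name(full_name: str) -> QualifiedName:
    """Split '<namespace>.<project>.<dataset_name>' into its three parts."""
    parts = full_name.split(".")
    # Assumption: exactly three non-empty, dot-free parts are required.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <namespace>.<project>.<dataset_name>, got {full_name!r}"
        )
    return QualifiedName(*parts)


name = parse_qualified_name("dev.my_project.my_ds")
print(name.namespace, name.project, name.dataset)
```

Whether dataset names may themselves contain dots is not settled in this issue; if they can, the split rule would need to change.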
Phase 1
- [ ] User can create namespace and project with new API, e.g. dc.namespaces.create("dev") and dc.projects.create("chatbot")
- [ ] User can remove namespace and project with new API, e.g. dc.namespace.delete("dev") and dc.projects.delete("dev")
- [ ] User should be able to save dataset into created namespace / project in 2 ways:
  - dc.use("dev", "chatbot").from_storage(...).save("text_train_ds")
  - dc.from_storage(...).save("dev.chatbot.text_train_ds")
Questions:
- Should we add namespace and project in Settings instead of introducing a new method DataChain.use(...)?
  A: use settings for now
- Is there a default namespace and project? Probably yes, so what should we call them?
  A:
  * local.local -> local
  * users.<user_name> -> Studio

  A user cannot create a new namespace explicitly in the local env. If they pull a dataset from Studio, the namespace / project for that dataset is created implicitly, as the dataset name must stay the same.
  Can a user delete the default namespace? - probably not
- If a user can delete a namespace / project, what happens to the datasets that were in it - are they moved to some default namespace / project? If there is no default, do we need to remove them?
  A: User is not allowed to delete a namespace if datasets are inside of it
- Is a user allowed to create a dataset without a fully qualified name (or without using .use()) and if yes, does it put the dataset into the default namespace / project? e.g. dc.from_storage(...).save("my-ds"). Similarly, if the user doesn't specify namespace / project on read, do we try to find the dataset in the default namespace or throw an error, e.g. dc.read_dataset("my-ds")?
  A: yes, default namespace is used
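The answers above can be sketched as a small in-memory model. resolve_name and Registry are hypothetical names used only for illustration, and this is not how datachain implements it. The sketch assumes local.local is the default namespace / project, that an explicit fully qualified name wins over any namespace / project passed separately, and that deleting a namespace that still holds datasets raises an error.

```python
DEFAULT_NAMESPACE = "local"  # default from the answer above (local.local)
DEFAULT_PROJECT = "local"


def resolve_name(name, namespace=None, project=None):
    """Return (namespace, project, dataset) for a short or fully qualified name."""
    parts = name.split(".")
    if len(parts) == 3 and all(parts):
        # Fully qualified: assumed to take priority over the keyword arguments.
        return tuple(parts)
    if len(parts) == 1 and name:
        # Short name: fall back to the given or default namespace / project.
        return (namespace or DEFAULT_NAMESPACE, project or DEFAULT_PROJECT, name)
    raise ValueError(f"cannot resolve dataset name {name!r}")


class Registry:
    """Hypothetical in-memory stand-in for the dataset catalog."""

    def __init__(self):
        self.datasets = set()  # set of (namespace, project, dataset) tuples

    def save(self, name, namespace=None, project=None):
        key = resolve_name(name, namespace, project)
        self.datasets.add(key)
        return key

    def delete_namespace(self, namespace):
        # Per the answer above: a namespace with datasets cannot be deleted.
        if any(ns == namespace for ns, _, _ in self.datasets):
            raise RuntimeError(f"namespace {namespace!r} still contains datasets")


registry = Registry()
a = registry.save("text_train_ds", namespace="dev", project="chatbot")
b = registry.save("dev.chatbot.text_train_ds")
print(a == b)                    # both ways of saving name the same dataset
print(registry.save("my-ds"))    # short name lands in the default namespace / project
```

Under these assumptions the two save styles from Phase 1 resolve to the same key, which is the property the proposal relies on.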
Follow up
- [ ] Add ability to move dataset from one namespace / project to another
- [ ] Add ability to rename namespace / project?
- [ ] Studio & local datasets refactoring (bigger project)
Questions of follow up:
- How should we distinguish Studio and local datasets?
  A: local is a reserved keyword: any namespace other than local is treated as Studio, e.g. dev.my_project.my_ds -> Studio dataset, local.local.my_ds -> local dataset. dc.read_dataset(.dev.my_project.my_ds).save(dev.my_project.my_ds) (the user can also choose a different name)
- Should reading a dataset from Studio automatically cache (save) that dataset locally with the same name / version or not? Should we have an additional flag, e.g. dc.read_dataset(..., studio_cache=True), for this? What if a dataset with the same name / version already exists locally but with different data (different UUID)?
  A: we should cache automatically, no additional flag is needed. If the same dataset exists locally then throw an exception?
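A rough sketch of the two rules discussed above: treating local as the reserved namespace, and refusing to overwrite a cached dataset whose UUID differs. is_studio_dataset and cache_locally are hypothetical helpers, the cache is a plain dict, and raising on a UUID mismatch is still an open question in this thread rather than a decision.

```python
LOCAL_NAMESPACE = "local"  # reserved keyword per the answer above


def is_studio_dataset(full_name: str) -> bool:
    """Anything whose namespace is not 'local' is treated as a Studio dataset."""
    namespace = full_name.split(".", 1)[0]
    return namespace != LOCAL_NAMESPACE


def cache_locally(cache: dict, name: str, version: str, uuid: str) -> None:
    """Record a pulled Studio dataset under the same name / version.

    Assumption: a different UUID under the same name / version is an error.
    """
    key = (name, version)
    existing = cache.get(key)
    if existing is not None and existing != uuid:
        raise ValueError(
            f"{name}@{version} already exists locally with a different UUID"
        )
    cache[key] = uuid


print(is_studio_dataset("dev.my_project.my_ds"))   # Studio dataset
print(is_studio_dataset("local.local.my_ds"))      # local dataset

cache = {}
cache_locally(cache, "dev.my_project.my_ds", "1", "uuid-a")
cache_locally(cache, "dev.my_project.my_ds", "1", "uuid-a")  # same data: no-op
```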
Just talked with @shcheklein - we had an idea to improve this:
Let’s use / as a prefix for global (Studio) datasets, so we can keep @ for version naming like [email protected], which will be important with upcoming SemVer support (#1076).
So:
- Global dataset - /mycat
- Local dataset - mycat
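As a sketch of how the / prefix could be interpreted, the helper below maps a dataset reference to a scope and a bare name. parse_dataset_ref and the scope labels "global" and "local" are assumptions made for illustration; versions (the @ part) are left out since versioning is out of scope here.

```python
def parse_dataset_ref(ref: str) -> tuple[str, str]:
    """Return (scope, name): a leading '/' marks a global (Studio) dataset."""
    if ref.startswith("/"):
        scope, name = "global", ref[1:]
    else:
        scope, name = "local", ref
    if not name:
        raise ValueError(f"empty dataset name in {ref!r}")
    return scope, name


print(parse_dataset_ref("/mycat"))  # global (Studio) dataset
print(parse_dataset_ref("mycat"))   # local dataset
```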
PS1: Something to keep in mind: code should be reusable in both the CLI and Studio. This naming convention seems to satisfy that requirement. This code should work in both CLI and Studio:
ds = dc.read_dataset("/mycats")
ds1 = ds.filter(dc.C("color") == "Red").save("red-cats") # <-- Local dataset
ds2 = ds1.map(....).save("/my_red_cats_with_bmi_index")
PS2: Versioning is outside the scope of this issue.
Another idea: maybe use studio/mycats instead of just /mycats ? ... this is more verbose but more clear and similar to git branches naming convention where we have origin/mycats. Having only / as prefix seems like some kind of relative vs absolute path thing to me...
Also, ds = dc.read_dataset("/mycats") run in Studio is basically the same as ds = dc.read_dataset("mycats"), since in Studio the local dataset is the same as the Studio dataset, right?
BTW I would maybe avoid using Global terminology and use only Studio to avoid confusion and having multiple words for the same thing, WDYT?
@ilongin that's a good idea, but we need to keep in mind that we will need to introduce org/team in the future, like myorg/[email protected], and I'm thinking about an empty org (/) as the default for the user's team. If we introduce a studio org it won't look good - it reads as a specific org name.
@dmpetrov just to note, default user team will be the one written in config file (by running datachain auth team <team_name>) or added by env variable DVC_STUDIO_TEAM.
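A small sketch of that lookup, for illustration only. default_team and the "team" config key are hypothetical, and the precedence (environment variable over config file) is an assumption; the comment above only says the team comes from one or the other.

```python
import os


def default_team(config: dict, env=os.environ):
    """Pick the default team: DVC_STUDIO_TEAM if set, else the config value.

    Assumptions: env overrides config, and the config stores the team
    under a "team" key (hypothetical key name).
    """
    return env.get("DVC_STUDIO_TEAM") or config.get("team")


# Explicit env dicts keep the example independent of the real environment.
print(default_team({"team": "my_team"}, env={}))
print(default_team({"team": "my_team"}, env={"DVC_STUDIO_TEAM": "other_team"}))
```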
@shcheklein @dmpetrov Question about datachain pull: currently we can set an optional local dataset name / version to which the Studio dataset will be pulled. I'm wondering if we should remove this and put into the "contract" that everything pulled from Studio has the same fully qualified name locally. datachain pull is basically just a cache of the Studio dataset anyway, and I don't see any reason for users to give it a different name locally; the option to rename it just complicates things, especially now that we will have namespaces / projects.
Sure, let's keep it simple.
If a user needs a special local name, they can read the dataset and save it under a local name.