Remove `name` and `version` properties from `DataChain`
Currently the `DataChain` class has `name` and `version` properties. They are not needed, because we already have the `dataset` property, which points to the underlying dataset if one exists.
Also, we should think about `namespace_name` and `project_name`, which are properties as well. We should probably remove those too and think about exposing settings in general.
namespace_name and project_name
Please combine these two to a single namespace that contains both.
I would avoid doing this atm. All around the code we have this split into two, so if we want a single namespace that contains both, we should do one bigger refactoring to be consistent everywhere.
I think Dmitry's point was to do changes on the public API level. What would be the scope for this?
It should not be a problem to do that, I will do it in this issue. Also, I will create a follow-up to use the `namespace.project.dataset` naming convention everywhere in our codebase where we use a dataset name (internal and external APIs), to avoid `namespace_name` and `project_name` arguments in a lot of functions.
Can you scope it please before you do it?
From my short investigation, in datachain we need to change:
- lib.dc.datasets.read_dataset()
- lib.dc.datasets.delete_dataset()
- lib.dc.datachain.DataChain.settings()
We need to check Studio usage of those and change them as well if needed.
Overall I think it's not a big task, but we do change the public API, and that's the biggest issue here.
How about the env variables that they use now? What else are we missing? Can we carefully grep and think about what exactly we'll break?
What will the new API look like exactly (env vars, settings, read_dataset)? Can you please describe it here?
Can we do it with some deprecation period, to give some time to migrate first and then drop support later?
Please scope it e2e with a proper plan and ETA.
The new API will look the same for now, except that the namespace argument will now cover both namespace and project. In the future, after some deprecation time, it would go from
def read_dataset(
    name: str,
    namespace: Optional[str] = None,
    project: Optional[str] = None,
    ...
)
to
def read_dataset(
    name: str,
    namespace: Optional[str] = None,
    ...
)
These are the options to call this method:
dc.read_dataset("cats")  # default namespace / project are used
dc.read_dataset("cats", namespace="dev.animals")  # "dev" namespace and "animals" project are used
We need to decide if it makes sense to allow something like this:
dc.read_dataset("cats", namespace="dev")  # namespace "dev" and default project (?); not clear, since a default project exists only in the default namespace
dc.read_dataset("cats", namespace=".animals")  # project "animals" in the default namespace; this makes more sense than the example above
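To make the cases above concrete, here is a minimal sketch of how a combined `namespace` string could be split. `split_namespace` is a hypothetical helper, not an existing datachain function. It returns `None` for any part that was not given and leaves the choice of defaults, including the still undecided `namespace="dev"` case, to the caller.

```python
from typing import Optional, Tuple


def split_namespace(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a combined "namespace.project" string into (namespace, project).

    Hypothetical helper for illustration only. A part that is missing is
    returned as None, meaning "not specified, caller decides the default".
    """
    if not value:
        return None, None
    if value.count(".") > 1:
        raise ValueError(f"expected at most one '.' in namespace, got {value!r}")
    # partition() returns empty strings for absent parts; map those to None.
    namespace, _, project = value.partition(".")
    return namespace or None, project or None


print(split_namespace("dev.animals"))  # both parts given
print(split_namespace("dev"))          # namespace only, project left open
print(split_namespace(".animals"))     # project only, namespace left open
```

Returning `None` instead of a default name keeps this step purely syntactic, so the question of what `namespace="dev"` should mean can be decided in one place.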
Regarding env variables, only DATACHAIN_NAMESPACE would be enough.
Everything that changes should be deprecated first and stay backward compatible.
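A rough sketch of what such a backward-compatible step could look like for the `project` argument. `resolve_namespace` is a hypothetical name and this is not the actual datachain implementation: it accepts the old `project` argument, emits a `DeprecationWarning`, and folds the value into the combined `namespace.project` string.

```python
import warnings
from typing import Optional


def resolve_namespace(
    namespace: Optional[str] = None,
    project: Optional[str] = None,
) -> Optional[str]:
    """Fold the legacy `project` argument into a combined namespace string.

    Hypothetical shim for illustration; names and messages are assumptions.
    """
    if project is None:
        # New-style call: nothing to translate.
        return namespace
    if namespace and "." in namespace:
        raise ValueError("project was given both in `namespace` and in `project`")
    warnings.warn(
        "`project` is deprecated; pass namespace='<namespace>.<project>' instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # An empty namespace part yields ".project", i.e. project in the default namespace.
    return f"{namespace or ''}.{project}"
```

Old calls such as `resolve_namespace("dev", "animals")` keep working and yield `"dev.animals"`, while new calls pass the combined string straight through without a warning.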