Data specification
Problem description
pixi's main concerns are currently specifying and building environments and executing known workflows (I am aware of other functionality, like packaging and deployment). This is all in the context of a "project model", where a directory of files + metadata fully describes a unit of work.
I would like to argue that data is a fundamental attribute of a project that ought to be declared at the top level. Indeed, some projects might contain only data, others might produce data, and yet more will expect to run some code operation in the context of particular datasets.
There are already ways to grab data as part of a pixi workflow, but they have the following drawbacks:
- including data files directly in a project directory will of course make for large projects and will not work well in the context of version control and packaging (there are specific technologies for data version control!).
- including download scripts implies that
  - the data is indeed in file form (as opposed to a SQL query, for instance)
  - the workflow requires the whole of each file rather than using range requests and internal partitioning
  - the execution context requires the files on the local filesystem, which is not true for cluster work, e.g., with dask, ray or spark
  - credentials and other storage service args are known to the downloader
- having some function or notebook cell that defines how to grab data. Such snippets are generally fragile and lend themselves to copy&paste, whereas pixi has already shown that declarative definitions are better.
An additional drawback of all of the above methods is that there is no way to describe what the data are, their attributes, how to use them and how they relate to the environments and executable endpoints in the project.
I am advocating for embedding data descriptions directly into the pixi definition (ideally in the main toml file). This would allow for searching through a set of projects to see which depend on or provide a particular type of data, while relying on a particular runtime - the kind of query a data practitioner needs to do all the time. I envisage this being particularly useful in the context of a pixi-enabled project browser/IDE.
To be clear, this is to be a spec (i.e., metadata) only, not data files. That means that the spec can be packaged into any output artefact such as conda packages, and be used by any runtime that then installs the spec.
One possible way to describe data - although there are certainly others - is the intake library, which I maintain. It could be a reasonable fit for the purpose I propose: it describes datasets by type (parquet directory, API call, SQL query, etc.) with various attributes of each (URL, storage properties) alongside a set of readers (how to produce in-memory data containers like dataframes or arrays from a given data spec). The framework is in python, so all of this gets translated into function calls to third-party libraries such as pandas.
You might find the intake approach interesting - either to directly use (intake's default data catalog format is yaml) or to draw inspiration from.
cc @Hofer-Julian
I think you raise an interesting idea!
I'd be very curious to see what you think something like this would look like in a pixi project.
In the simplest case, it could be as simple as including the type of thing that currently lives in an intake catalog.yaml (or similar prescription) directly in the pixi config, or adding a new section "[data]" which holds the catalog paths.
The more important question is how the user workflows would look. It would be simple to add things like pixi datasets list, but of course the point of data artefacts is normally to load them in the context of a batch or interactive python session. Thus, there should really also be a runtime component that can answer the equivalent of "which data are available in the current project" for running python processes.
Other details are all to be worked out:
- how a project can reference another project as the source of its data assets
- how best to package data specs in any artefact that pixi creates, such as conda packages
- how to search through a set of projects to find data meeting some particular criteria
- what to display about a project's data holdings or dependencies when viewed in a CLI or GUI