Proposal for CLI redesign
I have to admit that this started as a request for a CLI option to validate only a selected resource defined in a package (and to perform similar sub-resource-oriented actions), but it turned into a request to redesign the CLI, which is not a small thing.
Please bear with me. Frictionless Data seems to be my dream come true (I had been searching for something like it for years), but on the other hand the CLI has confused me too many times. As I have been creating CLI tools for many years, I tried to describe an alternative that could remove some of the problems I have experienced.
Real world example: validate single resource from a package
I have 24 CSV files, some of them rather long, and want to specify a package. The structure is complex: there are constraints, primary keys and foreign keys.
frictionless describe *.CSV > package.yaml creates one large package descriptor, but to fine-tune the resource definitions I need to validate them repeatedly.
Today it is only possible to validate the whole package, which takes time. It would be very helpful if I could validate only a selected resource defined within the package, e.g. by:
frictionless validate --resource countries package.yaml
Note that, because the current CLI design tries to assume or detect many things, it is not always clear what is going to happen, e.g.:
- what type of object is expected as the source of information for validation
- what object will actually be validated
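To make the request concrete, here is a minimal sketch of what "select one named resource from a package descriptor" could mean. It works on a plain dict and is not how frictionless implements anything; the function name select_resource and the sample descriptor are my own illustration.

```python
def select_resource(package: dict, name: str) -> dict:
    """Return the resource descriptor called `name` from a package descriptor."""
    resources = package.get("resources", [])
    for resource in resources:
        if resource.get("name") == name:
            return resource
    # Be specific about what went wrong and what the valid choices are.
    known = ", ".join(sorted(r.get("name", "<unnamed>") for r in resources))
    raise KeyError(f"resource {name!r} not found; available: {known}")


# Hypothetical descriptor, shaped like the output of `frictionless describe`.
package = {
    "name": "world",
    "resources": [
        {"name": "countries", "path": "COUNTRIES.csv"},
        {"name": "classes", "path": "CLASSES.csv"},
    ],
}

selected = select_resource(package, "countries")
print(selected["path"])
```

A real implementation would probably still need the other resources at hand whenever the selected one declares foreign keys pointing at them.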
Gallery of similar scenarios
- extract a single table from a tabular data package
- extract a single table from an Excel file (table defined on the sheet named "classes")
- validate a single table from an Excel file (table defined on the sheet named "classes")
- extract a single table from an SQL database (table named "classes")
- validate a single table from an SQL database (table named "classes")
Aspects of CLI call
The CLI call must explicitly or implicitly answer a set of questions:
- what type of operation to perform:
  - describe
  - validate
  - extract
  - summary
- definition of rules to follow:
  - data package descriptor
    - explicitly stated
    - inferred
  - resource descriptor
    - explicitly stated
    - inferred
  - schema descriptor
    - explicitly stated
    - inferred
- object to operate on:
  - whole data package (all its resources)
  - specific resource
    - named resource within a descriptor
    - sheet in Excel file
    - table in SQL database
    - etc. (plugin dependent)
- output content modifier
  - e.g. filters on columns, rows, offset, limit...
- output format
  - plain text (incl. tables)
  - JSON
  - YAML
  - CSV
Personally, I was surprised a few times:
- the CSV files to operate on can be specified explicitly or indirectly by a resource or package descriptor - nice
- frictionless describe *.csv creates a data package descriptor with inferred schemas for all the CSV files - nice
- frictionless describe *.resource.yaml creates a data package consisting of the *.resource.yaml files themselves and does not take into account the resources defined in them - bad
- frictionless extract package.yaml complained about unexpected structures in the descriptor (which was not tabular). It would be better to expect very specific input and complain that the declared profile is not the one expected
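The last point could be sketched as an explicit profile check performed before any real work starts. This is only an illustration: require_profile is a made-up helper, and I am assuming that a descriptor without a "profile" key should be treated as a plain "data-package".

```python
def require_profile(descriptor: dict, expected: str) -> None:
    """Fail early, and specifically, when the declared profile is not the expected one."""
    # Assumption: a missing "profile" key means the generic "data-package" profile.
    declared = descriptor.get("profile", "data-package")
    if declared != expected:
        raise ValueError(
            f"expected a descriptor with profile {expected!r}, "
            f"but this one declares {declared!r}"
        )


# Hypothetical descriptors for illustration.
tabular = {"profile": "tabular-data-package", "resources": []}
generic = {"resources": []}

require_profile(tabular, "tabular-data-package")  # passes silently

try:
    require_profile(generic, "tabular-data-package")
except ValueError as error:
    print(error)
```

The point is that the user gets one message naming the real problem instead of a list of complaints about individual unexpected structures.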
I understand the intention to provide an "easy to use and intuitive tool", but in fact auto-detection can bring confusion, which in the end makes things more complicated and less predictable.
Technical options for CLI
The click library is currently used. click allows nested sub-commands (with no real limit on nesting depth). This concept should be expressive enough to provide all the required input information. Another advantage of more specific (sub)commands is that they can be stricter about the provided input and report problems more precisely, pointing directly at the thing to fix.
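A rough sketch of the nesting idea follows. To keep it runnable without installing anything, it uses the standard-library argparse as a stand-in rather than click; click groups can be nested in a comparable way. The command tree shown is only a fragment of the proposal, not an existing frictionless interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a three-level command tree: verb -> noun -> source."""
    parser = argparse.ArgumentParser(prog="frictionless")
    verbs = parser.add_subparsers(dest="verb", required=True)

    validate = verbs.add_parser("validate")
    nouns = validate.add_subparsers(dest="noun", required=True)

    resource = nouns.add_parser("resource")
    sources = resource.add_subparsers(dest="source", required=True)

    # Each leaf command declares exactly the arguments it needs.
    from_csv = sources.add_parser("from-csv")
    from_csv.add_argument("path")

    from_excel = sources.add_parser("from-excel")
    from_excel.add_argument("path")
    from_excel.add_argument("sheet")
    return parser


args = build_parser().parse_args(
    ["validate", "resource", "from-excel", "data.xls", "countries"]
)
print(args.verb, args.noun, args.source, args.path, args.sheet)
```

Because every leaf has its own signature, leaving out the sheet name for from-excel is rejected by that leaf's parser with a usage message for exactly that command.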
Another option is similar to the current frictionless transform, where the pipeline can be very specific about what is to be done. However, I am afraid this approach would be less user-friendly, as it requires learning how to define the pipeline.
The last option is to use some (sub)resource addressing scheme, similar to how pytest specifies which test to run, e.g. pytest test_mod.py::test_func. A similar approach could be used to specify a tabular resource defined within a tabular data package.
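The addressing idea can be sketched in a few lines. The "::" separator is borrowed from pytest; the Address type and parse_address function are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class Address(NamedTuple):
    source: str              # e.g. path to a package descriptor
    resource: Optional[str]  # name of the resource inside it, if any


def parse_address(text: str) -> Address:
    """Split 'package.yaml::countries' into its source and resource parts."""
    source, separator, resource = text.partition("::")
    if not source:
        raise ValueError(f"missing source in address {text!r}")
    if separator and not resource:
        raise ValueError(f"missing resource name after '::' in {text!r}")
    return Address(source, resource or None)


print(parse_address("package.yaml::countries"))
print(parse_address("package.yaml"))
```

With such a scheme, frictionless validate package.yaml::countries would address the countries resource, while a bare package.yaml would still mean the whole package.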
Some CLI examples
Here are some examples of an alternative CLI design. It builds on:
- "verb-noun" design
- (click) sub-commands
- rule: do not pass values for always-required arguments via options; pass them as positional arguments
- rule: if some (positional) arguments can occur a variable number of times, they must go last
- keep a separate CLI signature for each specific case. This allows the CLI to be very specific and keeps it free of options or arguments that are not relevant
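The two rules about arguments can be illustrated on a single leaf command. Again this is a sketch with the standard-library argparse rather than click, and the --format option is a hypothetical example of something truly optional.

```python
import argparse

# One leaf command with its own, narrow signature.
parser = argparse.ArgumentParser(prog="frictionless describe package from-excel")
# Always required, therefore positional.
parser.add_argument("path", help="Excel workbook to read")
# Variable number of occurrences, therefore last among the positionals.
parser.add_argument("sheets", nargs="*", help="sheet names; all sheets when omitted")
# Hypothetical optional setting, therefore an option with a default.
parser.add_argument("--format", choices=["yaml", "json"], default="yaml")

selected_sheets = parser.parse_args(["data.xlsx", "countries", "classes"])
whole_workbook = parser.parse_args(["data.xlsx"])
print(selected_sheets.path, selected_sheets.sheets)
print(whole_workbook.path, whole_workbook.sheets)
```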
The infer variants would become really format-specific:
- describe:
  - schema:
      frictionless describe schema from-csv COUNTRIES.csv
      frictionless describe schema from-excel data.xlsx countries
      frictionless describe schema from-sql sqlite:///data.db countries
  - resource:
      frictionless describe resource from-csv COUNTRIES.csv
      frictionless describe resource from-csv COUNTRIES.csv --schema countries.schema.yaml
      frictionless describe resource from-excel data.xlsx countries
      frictionless describe resource from-sql sqlite:///data.db countries
  - package:
      frictionless describe package from-csv *.csv
      frictionless describe package from-resources *.resource.yaml
      frictionless describe package from-excel data.xlsx
      frictionless describe package from-excel data.xlsx countries
      frictionless describe package from-excel data.xlsx countries classes
      frictionless describe package from-sql sqlite:///data.db
      frictionless describe package from-sql sqlite:///data.db countries
      frictionless describe package from-sql sqlite:///data.db countries classes
- validate:
  - descriptor:
      frictionless validate descriptor schema countries.schema.yaml
      frictionless validate descriptor resource countries.resource.yaml
      frictionless validate descriptor package package.yaml
  - resource:
      frictionless validate resource from-descriptor countries.resource.yaml
      frictionless validate resource from-csv COUNTRIES.csv
      frictionless validate resource from-excel data.xls countries
      frictionless validate resource from-sql sqlite:///data.db countries
  - package:
      frictionless validate package from-csv *.csv
      frictionless validate package from-excel data.xls
- extract:
  - package:
      frictionless extract package from-descriptor package.yaml
      frictionless extract package from-excel data.xls
      frictionless extract package from-sql sqlite:///data.db
      frictionless extract package from-sql sqlite:///data.db countries
      frictionless extract package from-sql sqlite:///data.db countries classes
  - resource:
      frictionless extract resource from-package package.yaml countries
      frictionless extract resource from-descriptor countries.resource.yaml
      frictionless extract resource from-csv COUNTRIES.csv
      frictionless extract resource from-excel data.xls countries
      frictionless extract resource from-sql sqlite:///data.db countries
What to do next
The proposal above is definitely not complete (it is missing api, summary and transform), but it should allow a first evaluation of whether the idea seems reasonable.
If you agree with it, I could contribute a stub click command implementation to prove that it would be very instructive to users.
Hi @vlcinsky,
Thanks for a great and detailed issue! I don't think we can introduce such a breaking change for the whole CLI, so might an alternative CLI runner package be an option?
Do you mean a separate Python package (installed independently from frictionless-py) or an alternative CLI within the existing package?
You are right, the change is extensive. In the long term I could imagine that we start with an alternative CLI within frictionless-py, keep it marked as experimental or beta for a while until it matures, and finally deprecate the existing CLI to keep things more manageable.
Yes, I think an additional CLI runner package, similar to JS projects like webpack-cli. It might be a good first step to test the idea.
It is possible today to validate only the whole package what takes time. It would be very helpful, if I could validate only selected resource defined within the package, e.g. by:
frictionless validate --resource countries package.yaml
Just wanted to point out that after https://github.com/frictionlessdata/framework/pull/1112 it's possible to validate a single resource from a data package with
frictionless validate --json --resource-name foo datapackage.json
This will correctly identify any validation errors coming from foreign key constraints.