Proposal for CLI redesign

I have to admit that this started as a request for a CLI option to validate only a selected resource defined in a package (and to perform similar sub-resource oriented actions), but it turned into a request to redesign the CLI, which is not a small thing.

Please bear with me. Frictionless Data seems to be my dream come true (I had been searching for something like it for years), but on the other hand I have been confused by the CLI too many times. As I have been creating CLI tools for many years, I tried to describe an alternative that could remove some of the problems I have experienced.

Real world example: validate single resource from a package

I have 24 CSV files, some of them rather long, and want to define a package for them. The structure is complex: there are constraints, primary keys and foreign keys.

Running frictionless describe *.CSV > package.yaml creates one large package descriptor, but to fine-tune the resource definitions I need to validate them repeatedly.

Today it is only possible to validate the whole package, which takes time. It would be very helpful if I could validate only a selected resource defined within the package, e.g. by: frictionless validate --resource countries package.yaml

Note that because the current CLI design tries to assume or detect many things, it is not always clear what is going to happen, e.g.:

what type of object is expected as the source of information for validation
what object will actually be validated

Gallery of similar scenarios

extract single table from tabular data package
extract single table from Excel file (table defined on sheet named "classes")
validate single table from Excel file (table defined on sheet named "classes")
extract single table from SQL database (table named "classes")
validate single table from SQL database (table named "classes")

Aspects of CLI call

Each CLI call must explicitly or implicitly answer a set of questions:

what type of operation to perform:
- describe
- validate
- extract
- summary
definition of rules to follow
- data package descriptor
  - explicitly stated
  - inferred
- resource descriptor
  - explicitly stated
  - inferred
- schema descriptor
  - explicitly stated
  - inferred
object to operate on:
- whole data package (all its resources)
- specific resource
  - named resource within a descriptor
  - sheet in Excel file
  - table in SQL database
  - etc (plugin dependent)
output content modifier
- e.g. filters on columns, rows, offset, limit...
output format
- plain text (incl. tables)
- JSON
- YAML
- CSV
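The questions above can be written down as a small data model, which makes it easier to see what a single CLI call has to pin down. This is only a sketch: the names CliCall, Operation, RuleSource and OutputFormat are hypothetical and do not exist in frictionless, and the output content modifiers (filters, offset, limit) are left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Operation(Enum):
    DESCRIBE = "describe"
    VALIDATE = "validate"
    EXTRACT = "extract"
    SUMMARY = "summary"


class RuleSource(Enum):
    EXPLICIT = "explicit"  # descriptor explicitly stated by the user
    INFERRED = "inferred"  # descriptor inferred from the data


class OutputFormat(Enum):
    TEXT = "text"
    JSON = "json"
    YAML = "yaml"
    CSV = "csv"


@dataclass(frozen=True)
class CliCall:
    """Hypothetical record of everything one CLI call has to decide."""

    operation: Operation
    rules: RuleSource
    source: str                     # path or URL of the descriptor or data
    resource: Optional[str] = None  # None means the whole data package
    output_format: OutputFormat = OutputFormat.TEXT

    def targets_whole_package(self) -> bool:
        return self.resource is None


# "validate only the resource named countries from package.yaml"
call = CliCall(Operation.VALIDATE, RuleSource.EXPLICIT, "package.yaml", resource="countries")
```

With the current CLI, several of these fields are filled in by auto-detection; the proposal is to have the user state them through the command name and its arguments.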

Personally I got surprised a few times:

the CSV files to operate on can be specified explicitly or indirectly by resource or package descriptor - nice
frictionless describe *.csv creates a data package descriptor with inferred schemas for all the CSV files - nice
frictionless describe *.resource.yaml creates a data package consisting of the *.resource.yaml files themselves and does not take into account the resources defined in them - bad
frictionless extract package.yaml complained about unexpected structures in the descriptor (which was not tabular). It would be better to expect a very specific input and complain that the declared profile is not the one expected

I understand the intention to provide an "easy to use and intuitive tool", but in fact auto-detection can bring confusion, which ultimately makes things more complicated and less predictable.
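The "expect a very specific input and complain" behaviour could look roughly like the following. It is a minimal sketch that assumes the descriptor has already been loaded into a dict and declares its profile under a "profile" key; require_profile and UnexpectedProfileError are hypothetical names, not part of frictionless.

```python
class UnexpectedProfileError(ValueError):
    """Raised when a descriptor declares a profile other than the expected one."""


def require_profile(descriptor: dict, expected: str) -> None:
    """Fail early, with a specific message, if the declared profile is not the expected one."""
    declared = descriptor.get("profile")
    if declared != expected:
        raise UnexpectedProfileError(
            f"expected profile {expected!r}, but the descriptor declares {declared!r}"
        )


# A command that only handles tabular packages would call this first:
require_profile({"profile": "tabular-data-package", "resources": []}, "tabular-data-package")
```

A command that runs such a check up front can name the actual problem (the wrong kind of descriptor) instead of reporting unexpected structures deeper in the document.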

Technical options for CLI

The click library, which is currently used, allows nested sub-commands (with no real limit on nesting depth). This concept should be expressive enough to provide all the required input information. Another advantage of more specific (sub)commands is that they can be stricter about the input they accept and can report problems more specifically, pointing directly at the thing to fix.

Another option is similar to the current frictionless transform, where the pipeline can be very specific about what shall be done. However, I am afraid this approach would be less user-friendly, as it requires learning how to define the pipeline.

The last option is to use a (sub)resource addressing scheme, similar to how pytest specifies which test to run, e.g. pytest test_mod.py::test_func. A similar approach could be used to specify a tabular resource defined within a tabular data package.
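To make the pytest-like addressing idea concrete, here is a minimal sketch of how an address such as package.yaml::countries could be split into a source and a resource name. The "::" separator is borrowed from pytest; Address and parse_address are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class Address(NamedTuple):
    source: str              # descriptor path, data file or database URL
    resource: Optional[str]  # None means "the whole source"


def parse_address(text: str) -> Address:
    """Split 'source::resource' on the last '::'; plain 'source' is also accepted."""
    source, sep, resource = text.rpartition("::")
    if not sep:
        return Address(text, None)
    if not source or not resource:
        raise ValueError(f"malformed address: {text!r}")
    return Address(source, resource)


print(parse_address("package.yaml::countries"))
print(parse_address("sqlite://data.db::classes"))
print(parse_address("package.yaml"))
```

Splitting on the last "::" keeps URL-like sources with "://" in them intact, since "://" does not contain the separator.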

Some CLI examples

Here are some examples of an alternative CLI design. It builds on:

"verb-noun" design
(click) sub-commands
rule: pass values for always-required arguments as positional arguments, not via options.
rule: positional arguments that can occur a variable number of times must come last.
keep a separate CLI signature for each specific case. This allows each command to be very specific and not be cluttered with options or arguments that are not relevant.

The infer variants would become really format specific:

describe:
- schema:
  - frictionless describe schema from-csv COUNTRIES.csv
  - frictionless describe schema from-excel data.xlsx countries
  - frictionless describe schema from-sql sqlite://data.db countries
- resource:
  - frictionless describe resource from-csv COUNTRIES.csv
  - frictionless describe resource from-csv COUNTRIES.csv --schema countries.schema.yaml
  - frictionless describe resource from-excel data.xlsx countries
  - frictionless describe resource from-sql sqlite://data.db countries
- package:
  - frictionless describe package from-csv *.csv
  - frictionless describe package from-resources *.resource.yaml
  - frictionless describe package from-excel data.xlsx
  - frictionless describe package from-excel data.xlsx countries
  - frictionless describe package from-excel data.xlsx countries classes
  - frictionless describe package from-sql sqlite://data.db
  - frictionless describe package from-sql sqlite://data.db countries
  - frictionless describe package from-sql sqlite://data.db countries classes
validate:
- descriptor:
  - frictionless validate descriptor schema countries.schema.yaml
  - frictionless validate descriptor resource countries.resource.yaml
  - frictionless validate descriptor package package.yaml
- resource:
  - frictionless validate resource from-descriptor countries.resource.yaml
  - frictionless validate resource from-csv COUNTRIES.csv
  - frictionless validate resource from-excel data.xls countries
  - frictionless validate resource from-sql sqlite://data.db countries
- package:
  - frictionless validate package from-csv *.csv
  - frictionless validate package from-excel data.xls
extract:
- package:
  - frictionless extract package from-descriptor package.yaml
  - frictionless extract package from-excel data.xls
  - frictionless extract package from-sql sqlite://data.db
  - frictionless extract package from-sql sqlite://data.db countries
  - frictionless extract package from-sql sqlite://data.db countries classes
- resource:
  - frictionless extract resource from-package package.yaml countries
  - frictionless extract resource from-descriptor countries.resource.yaml
  - frictionless extract resource from-csv COUNTRIES.csv
  - frictionless extract resource from-excel data.xls countries
  - frictionless extract resource from-sql sqlite:///data.db countries
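A few of the commands listed above can be sketched as nested sub-commands. To keep the example free of third-party dependencies, it uses argparse sub-parsers from the standard library rather than click; click groups would map onto the same verb / noun / source nesting. Only three of the proposed commands are wired up, the parser only parses and performs no real work, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="frictionless")
    verbs = parser.add_subparsers(dest="verb", required=True)

    # frictionless describe ...
    describe = verbs.add_parser("describe")
    nouns = describe.add_subparsers(dest="noun", required=True)

    # frictionless describe package from-csv PATH [PATH ...]
    # frictionless describe package from-excel PATH [SHEET ...]
    package = nouns.add_parser("package")
    package_sources = package.add_subparsers(dest="source", required=True)
    from_csv = package_sources.add_parser("from-csv")
    from_csv.add_argument("paths", nargs="+")    # variable-length argument goes last
    from_excel = package_sources.add_parser("from-excel")
    from_excel.add_argument("path")              # always required, so positional
    from_excel.add_argument("sheets", nargs="*") # optional list of sheets, last

    # frictionless describe resource from-csv PATH [--schema SCHEMA]
    resource = nouns.add_parser("resource")
    resource_sources = resource.add_subparsers(dest="source", required=True)
    resource_csv = resource_sources.add_parser("from-csv")
    resource_csv.add_argument("path")
    resource_csv.add_argument("--schema")        # truly optional, so an option
    return parser


args = build_parser().parse_args(
    ["describe", "package", "from-excel", "data.xlsx", "countries", "classes"]
)
print(args.verb, args.noun, args.source, args.path, args.sheets)
```

Because every leaf command has its own signature, an invalid combination (for example passing --schema to describe package from-csv) is rejected by the parser itself, with a usage message for that specific command.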

What to do next

The proposal above is definitely not complete (api, summary and transform are missing), but it should allow a first evaluation of whether the proposal seems reasonable.

If you agree with it, I could contribute a stub click command implementation to show that it would be very instructive to users.

Jul 05 '22 23:07 vlcinsky

Hi @vlcinsky,

Thanks for a great and detailed issue! I don't think we can introduce such a breaking change for the whole CLI, so an alternative CLI runner package might be an option?

Jul 11 '22 13:07 roll

Do you mean a separate Python package (installed independently from frictionless-py) or an alternative CLI within the existing package?

You are right, the change is extensive. In the long term, I could imagine that we start with an alternative CLI within frictionless-py, keep it marked as experimental or beta for a while until it matures, and finally deprecate the existing CLI to keep things manageable.

Jul 12 '22 07:07 vlcinsky

Yes, I mean an additional CLI runner package, similar to JS projects like webpack-cli. It might be a good first step to test the idea.

Jul 13 '22 15:07 roll

It is possible today to validate only the whole package what takes time. It would be very helpful, if I could validate only selected resource defined within the package, e.g. by: frictionless validate --resource countries package.yaml

Just wanted to point out that after https://github.com/frictionlessdata/framework/pull/1112 it's possible to validate a single resource from a data package with

frictionless validate --json --resource-name foo datapackage.json

This will correctly identify any validation errors coming from foreign key constraints.

Feb 23 '23 22:02 fjuniorr
