High level API redesign

Open manycoding opened this issue 5 years ago • 2 comments

I want to move away from schema, mostly because schema-based features are now all mixed together between Arche, Matt and standard schema. I am thinking about fastai-like parameters (https://docs.fast.ai/tabular.data.html#TabularList):

a = Arche(data, cat_names=["size"], cont_names=["price"], uniques=["id", ("url", "title", "price")])

Then duplicates will use uniques, i.e. check that all id values are unique and that all rows have a unique url and title. Categories will use cat_names. cont_names is just an example, but it could be used to identify numerical data and then plot stats such as deviation, percentiles and so on.
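A minimal sketch of how a `uniques` argument might drive the duplicates check. It is only an illustration, not Arche's implementation: `find_duplicates` is a hypothetical helper, the data is a plain list of dicts rather than a DataFrame, and treating a tuple as a composite key over all of its columns is an assumption.

```python
from collections import Counter


def find_duplicates(rows, uniques):
    """Map each uniqueness key to the values that occur more than once.

    A string key means a single column; a tuple key is assumed to mean
    that the combination of those columns must be unique.
    """
    duplicates = {}
    for key in uniques:
        columns = (key,) if isinstance(key, str) else tuple(key)
        # Count how often each value (or combination of values) appears.
        counts = Counter(tuple(row[column] for column in columns) for row in rows)
        repeated = [value for value, count in counts.items() if count > 1]
        if repeated:
            duplicates[key] = repeated
    return duplicates


rows = [
    {"id": 1, "url": "a", "title": "x", "price": 1.0},
    {"id": 2, "url": "b", "title": "y", "price": 2.0},
    {"id": 2, "url": "b", "title": "y", "price": 3.0},
]
duplicates = find_duplicates(rows, ["id", ("url", "title", "price")])
```

Here only `id` is reported, because the last two rows share an id but differ in price, so the composite key stays unique.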

Thoughts? @ejulio @raphapassini @victor-torres @alexander-matsievsky

manycoding, Jun 25 '19 22:06

This is a good idea. It would probably be easier to write some validations and checks this way than with jsonschema :smile:. Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names. If cat_names is a list of categories, I'd go with categories, and if they are columns in the df, then category_columns. The same goes for the other configurations.

Another idea is that data shouldn't go with Arche. I'd prefer to instantiate Arche as a check template and then feed any data through its methods. This would be a good fit for multi-job checks.

# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(my configs here)

a.report_all(job1_data)
a.report_all(job2_data)
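To make the template idea concrete, here is a rough sketch under stated assumptions: `CheckTemplate` is a hypothetical stand-in for a data-free Arche, its two parameters and the shape of the returned report are invented for illustration, and the data is a plain list of dicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckTemplate:
    """Hypothetical data-free check template: it holds configuration only."""

    category_columns: tuple = ()
    uniques: tuple = ()

    def report_all(self, rows):
        """Run every configured check against `rows` and return a summary."""
        # Distinct values seen in each category column.
        categories = {
            column: sorted({row[column] for row in rows})
            for column in self.category_columns
        }
        # Values that appear more than once in columns expected to be unique.
        duplicated = {}
        for column in self.uniques:
            values = [row[column] for row in rows]
            duplicated[column] = sorted({v for v in values if values.count(v) > 1})
        return {"categories": categories, "duplicated": duplicated}


template = CheckTemplate(category_columns=("size",), uniques=("id",))
job1_data = [{"id": 1, "size": "S"}, {"id": 1, "size": "M"}]
job2_data = [{"id": 7, "size": "L"}, {"id": 8, "size": "L"}]
report1 = template.report_all(job1_data)
report2 = template.report_all(job2_data)
```

Because the template is frozen and never stores data, the same instance can be reused across any number of jobs.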

ejulio, Jul 01 '19 14:07

@ejulio

Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names.

I kind of started to like these abbreviations after getting familiar with fastai. The learning curve is the same, since you have to check the docstrings anyway, but shorter names make the code more compact.

Another idea is that data shouldn't go with Arche.

I suggested something similar in #69

source_items = Items.from_something(start, count)
target_items = Items.from_something(start, count)

# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(schema, categories, continuous, uniques)

a.report(source_items)
a.report(target_items)

a.compare(source_items, target_items)

manycoding, Jul 01 '19 15:07