arche
High level API redesign
I want to move away from schema, mostly because schema-based features are now all mixed together between Arche, Matt and standard schema. I am thinking about fastai-like parameters (https://docs.fast.ai/tabular.data.html#TabularList):
a = Arche(data, cat_names=["size"], cont_names=["price"], uniques=["id", ("url", "title", "price")])
So then duplicates will use uniques, i.e. check that all id values are unique and that all rows have a unique url and title.
Categories will use cat_names.
cont_names is just an example, but it can be used to determine numerical data, and then plot some stats like deviation, percentiles and such.
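To make the proposal concrete, here is a minimal sketch of what checks driven by uniques, cat_names and cont_names could do. It is not Arche code: the helper names (find_duplicates, category_counts, continuous_stats) and the sample rows are hypothetical, and it works on plain lists of dicts with the standard library only.

```python
from collections import Counter
from statistics import quantiles, stdev


def find_duplicates(rows, uniques):
    """Return {unique_key: [duplicated value tuples]} (hypothetical helper).

    Each entry of `uniques` is a field name or a tuple of field names
    that must be unique in combination.
    """
    report = {}
    for key in uniques:
        fields = (key,) if isinstance(key, str) else tuple(key)
        counts = Counter(tuple(row.get(f) for f in fields) for row in rows)
        dupes = [value for value, n in counts.items() if n > 1]
        if dupes:
            report[key] = dupes
    return report


def category_counts(rows, cat_names):
    """Count how often each value occurs in every categorical field."""
    return {name: dict(Counter(row.get(name) for row in rows)) for name in cat_names}


def continuous_stats(rows, cont_names):
    """Sample deviation and quartiles per numerical field (needs 2+ values)."""
    stats = {}
    for name in cont_names:
        values = [row[name] for row in rows if row.get(name) is not None]
        stats[name] = {"stdev": stdev(values), "quartiles": quantiles(values, n=4)}
    return stats


# Made-up sample data, only for illustration.
rows = [
    {"id": 1, "url": "a", "title": "A", "price": 10.0, "size": "S"},
    {"id": 2, "url": "b", "title": "B", "price": 20.0, "size": "M"},
    {"id": 2, "url": "b", "title": "B", "price": 20.0, "size": "M"},
    {"id": 3, "url": "c", "title": "C", "price": 30.0, "size": "S"},
]

duplicates = find_duplicates(rows, uniques=["id", ("url", "title", "price")])
categories = category_counts(rows, cat_names=["size"])
stats = continuous_stats(rows, cont_names=["price"])
```

With this shape, a single column and a combination of columns go through the same code path, which is why uniques can mix strings and tuples.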
Thoughts? @ejulio @raphapassini @victor-torres @alexander-matsievsky
This is a good idea.
It would probably be easier to write some validations and checks this way than with jsonschema :smile:.
Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names.
If cat_names is a list of categories, I'd go with categories, and if they are columns in the df then category_columns.
The same goes for the other configurations.
Another idea is that data shouldn't go with Arche.
I'd prefer to instantiate Arche as a check template and then feed any data through methods.
This would be a good fit for multi-job checks.
# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(...)  # my configs here
a.report_all(job1_data)
a.report_all(job2_data)
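A rough sketch of that "check template" idea, assuming the configuration lives on the object and the data only arrives through method calls. CheckTemplate is a hypothetical stand-in, not the real Arche class, and the job data is made up.

```python
from collections import Counter


class CheckTemplate:
    """Hypothetical stand-in for the proposed Arche(configs) object."""

    def __init__(self, uniques=(), category_columns=()):
        # Normalize single names to one-element tuples so both forms work.
        self.uniques = [(u,) if isinstance(u, str) else tuple(u) for u in uniques]
        self.category_columns = list(category_columns)

    def report_all(self, rows):
        """Run every configured check against one dataset and return a dict."""
        duplicates = {}
        for fields in self.uniques:
            counts = Counter(tuple(row.get(f) for f in fields) for row in rows)
            # Number of surplus rows, i.e. rows beyond the first occurrence.
            duplicates[fields] = sum(n - 1 for n in counts.values() if n > 1)
        categories = {
            column: dict(Counter(row.get(column) for row in rows))
            for column in self.category_columns
        }
        return {"duplicates": duplicates, "categories": categories}


# The template is configured once and reused for several jobs.
template = CheckTemplate(uniques=["id"], category_columns=["size"])
job1_data = [{"id": 1, "size": "S"}, {"id": 1, "size": "M"}]
job2_data = [{"id": 1, "size": "S"}, {"id": 2, "size": "S"}]
job1_report = template.report_all(job1_data)
job2_report = template.report_all(job2_data)
```

Because the object holds no data, the same instance can be applied to any number of jobs without state leaking between reports.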
@ejulio
Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names.
I kind of started to like these abbreviations after getting familiar with fastai. The learning curve is the same since you have to check docstrings anyway, but with shorter names the code is smaller.
Another idea is that data shouldn't go with Arche.
I suggested something similar in #69
source_items = Items.from_something(start, count)
target_items = Items.from_something(start, count)
# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(schema, categories, continuous, uniques)
a.report(source_items)
a.report(target_items)
a.compare(source_items, target_items)
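The thread does not say what compare should report, so the following is only one possible reading: a key-based diff between two sets of items. The function name, the key parameter and the result keys are all assumptions made for illustration.

```python
def compare(source_rows, target_rows, key="id"):
    """Hypothetical compare: which keys disappeared and which are new."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    return {
        "missing_in_target": sorted(source_keys - target_keys),
        "new_in_target": sorted(target_keys - source_keys),
    }


# Made-up items standing in for source_items and target_items.
source_rows = [{"id": 1}, {"id": 2}]
target_rows = [{"id": 2}, {"id": 3}]
diff = compare(source_rows, target_rows)
```

Keeping compare as a method that takes both datasets fits the template approach, since the configured object still never owns any data.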