
Analyze scraped data

Results 27 arche issues

Fixes #144. I created this spider to improve the examples in the documentation. Not sure if we have to merge it or just keep it here in an open PR for future...

Current DQR https://arche.readthedocs.io/en/latest/nbs/DQR.html is:
* scores based on schema validation and some rules/stats https://github.com/scrapinghub/arche/blob/master/src/arche/quality_estimation_algorithm.py
* table of job stats
* some rules summary
* coverage graph (same as in the...

Type: Question

Refactor compare_was_now, compare_prices_for_same_urls, compare_names_for_same_urls, compare_prices_for_same_names

compare_was_now:
* compares numeric value between two job fields
* Outputs >,

Type: Feature

Wouldn't it be nice to have samples each time you launch?

Type: Feature
good first issue

As per https://github.com/scrapinghub/arche/pull/167#issuecomment-533210953
```
res = difference(left_df, right_df, key)
res.show()
>>>55 items hasn't changed
800 items were added
200 items changed
res.additional_stats.added
>>>key column_x ... column_f ...
res.additional_stats.changed
key column_x_left column_x_right...
```

Type: Feature
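
The proposal above can be illustrated with plain pandas. This is only a sketch under stated assumptions, not Arche's implementation: `difference` here is a hypothetical stand-in that returns a dict of DataFrames rather than an object with `show()` or `additional_stats`, and it assumes `key` is a single column present in both frames.

```python
import pandas as pd


def difference(left_df, right_df, key):
    """Hypothetical sketch: split items into unchanged, changed, added, removed by key."""
    # Columns other than the key that exist on both sides and can be compared.
    shared = [c for c in left_df.columns if c in right_df.columns and c != key]
    merged = left_df.merge(
        right_df, on=key, how="outer", suffixes=("_left", "_right"), indicator=True
    )
    both = merged[merged["_merge"] == "both"]
    changed_mask = pd.Series(False, index=both.index)
    for col in shared:
        left, right = both[f"{col}_left"], both[f"{col}_right"]
        # Treat two missing values as equal so they do not count as a change.
        same = left.eq(right) | (left.isna() & right.isna())
        changed_mask |= ~same
    return {
        "unchanged": both[~changed_mask],
        "changed": both[changed_mask],
        "added": merged[merged["_merge"] == "right_only"],
        "removed": merged[merged["_merge"] == "left_only"],
    }


# Made-up example data, for illustration only.
left_df = pd.DataFrame({"key": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
right_df = pd.DataFrame({"key": [2, 3, 4], "price": [20.0, 35.0, 40.0]})
result = difference(left_df, right_df, "key")
for name, frame in result.items():
    print(f"{len(frame)} items {name}")
```

The changed rows keep both `price_left` and `price_right`, which matches the `column_x_left` / `column_x_right` layout suggested in the snippet.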

I want to go away from schema, mostly because now schema-based features are all mixed together between Arche, Matt and standard schema. I am thinking about fastai-like parameters https://docs.fast.ai/tabular.data.html#TabularList: `a...

Type: Question
Type: API

#157 We need to compare values between items found by key.
```
compare(source_df, target_df, fields=["price", "name"], key=["id"], num_threshold=0.1)
>>>Compare: FAILED
think about what to display
```

Type: Feature
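
One way such a comparison could work is sketched below with pandas. The function is hypothetical and not part of Arche; in particular, treating `num_threshold` as a maximum relative difference for numeric fields is an assumption, since the issue does not define it. It also assumes every name in `fields` exists in both frames.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def compare(source_df, target_df, fields, key, num_threshold=0.1):
    """Hypothetical sketch: return, per field, the rows whose values disagree."""
    # Inner join: only items present on both sides are compared.
    merged = source_df.merge(target_df, on=key, suffixes=("_source", "_target"))
    failures = {}
    for field in fields:
        source, target = merged[f"{field}_source"], merged[f"{field}_target"]
        if is_numeric_dtype(source) and is_numeric_dtype(target):
            # Assumption: the threshold is relative to the source value.
            # A zero source value yields inf or NaN and is not handled specially here.
            mismatch = (source - target).abs() / source.abs() > num_threshold
        else:
            mismatch = source != target
        columns = list(key) + [f"{field}_source", f"{field}_target"]
        failures[field] = merged.loc[mismatch, columns]
    return failures


# Made-up example data, for illustration only.
source_df = pd.DataFrame(
    {"id": [1, 2, 3], "price": [100.0, 50.0, 10.0], "name": ["a", "b", "c"]}
)
target_df = pd.DataFrame(
    {"id": [1, 2, 3], "price": [105.0, 60.0, 10.0], "name": ["a", "b", "x"]}
)
failures = compare(source_df, target_df, fields=["price", "name"], key=["id"])
for field, rows in failures.items():
    print(field, "FAILED" if len(rows) else "PASSED", len(rows))
```

Returning the mismatching rows per field leaves the display question open: a caller can print a pass/fail summary or show the rows themselves.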

```
ghat = Gatf(source='364692/1/17',
            schema='schemas/Global Strategies/amazon_product.json',
            target='364692/1/14',
            # schema=schema
            )
```
The price fields here differ by 15%, and yet there is no error in the log. Perhaps we don't really want to...

Type: Bug

https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants @ejulio suggested that we could use something like the dataset stats from kaggle.com. There is a lot of data there, but perhaps we can use something in particular.

Type: Question

There is a bug, maybe this one: https://github.com/tqdm/tqdm/issues/485, which prevents using tqdm_notebook in JupyterHub, Lab or Notebook. At the moment the output is blank. The easiest way seemed...

Type: Bug
good first issue