arche issues

Results 27 arche issues

Adding demo spider

Fixes #144. I created this spider to improve the examples in the documentation. Not sure if we should merge it or just keep it here in an open PR for future...

andersonberg

Deprecate/rewrite Data Quality Report

Current DQR https://arche.readthedocs.io/en/latest/nbs/DQR.html is:
* scores based on schema validation and some rules/stats https://github.com/scrapinghub/arche/blob/master/src/arche/quality_estimation_algorithm.py
* table of job stats
* some rules summary
* coverage graph (same as in the...

manycoding

Type: Question

Price rules rehaul

Refactor compare_was_now, compare_prices_for_same_urls, compare_names_for_same_urls, compare_prices_for_same_names.

compare_was_now:
* compares numeric value between two job fields
* Outputs >,

manycoding

Type: Feature
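
To make the `compare_was_now` idea above concrete, here is a minimal sketch with pandas. It is not arche's implementation: the function name `summarize_was_now`, the column names `was_price` and `price`, and the shape of the returned summary are all assumptions made for illustration.

```python
import pandas as pd


def summarize_was_now(df, was_field="was_price", now_field="price"):
    """Hypothetical helper: compare two numeric fields of the same job."""
    # Coerce to numbers so that strings or missing values become NaN.
    was = pd.to_numeric(df[was_field], errors="coerce")
    now = pd.to_numeric(df[now_field], errors="coerce")
    # Only rows where both values are present can be compared.
    both = was.notna() & now.notna()
    return {
        "compared": int(both.sum()),
        "now_greater_than_was": int((now[both] > was[both]).sum()),
        "now_equal_to_was": int((now[both] == was[both]).sum()),
    }


items = pd.DataFrame(
    {"was_price": [10.0, 5.0, None, 7.0], "price": [8.0, 6.0, 3.0, 7.0]}
)
summary = summarize_was_now(items)
print(summary)
```

Counting both "greater" and "equal" cases separately keeps the two suspicious situations (a discounted price above the original, and a discount that changes nothing) distinguishable in the report.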

output df and raw data sample

Wouldn't it be nice to have samples each time you launch?

manycoding

Type: Feature

good first issue

Compare items between job

As per https://github.com/scrapinghub/arche/pull/167#issuecomment-533210953
```
res = difference(left_df, right_df, key)
res.show()
>>>55 items hasn't changed
800 items were added
200 items changed
res.additional_stats.added
>>>key column_x ... column_f ...
res.additional_stats.changed
key column_x_left column_x_right...
```

manycoding

Type: Feature
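
The proposal in "Compare items between job" could be approached roughly as follows. This is only a sketch under stated assumptions: keys are unique within each job, the function is named `difference_sketch` to make clear it is not the proposed `difference` API, and it returns a plain dict instead of an object with `show()` and `additional_stats`.

```python
import pandas as pd


def difference_sketch(left_df, right_df, key):
    """Hypothetical helper: split items into unchanged, added, removed, changed."""
    left = left_df.set_index(key)
    right = right_df.set_index(key)

    # Keys present on only one side.
    added = right.loc[right.index.difference(left.index)]
    removed = left.loc[left.index.difference(right.index)]

    # Keys present on both sides, compared on the columns both jobs share.
    common = left.index.intersection(right.index)
    shared_cols = left.columns.intersection(right.columns)
    l = left.loc[common, shared_cols]
    r = right.loc[common, shared_cols]

    # A row is unchanged when every value is equal or missing on both sides.
    same = ((l == r) | (l.isna() & r.isna())).all(axis=1)
    changed = l[~same].join(r[~same], lsuffix="_left", rsuffix="_right")

    return {
        "unchanged": int(same.sum()),
        "added": added,
        "removed": removed,
        "changed": changed,
    }


left_df = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
right_df = pd.DataFrame({"id": [2, 3, 4], "price": [20.0, 35.0, 40.0]})
res = difference_sketch(left_df, right_df, "id")
print(res["unchanged"], "items unchanged")
print(res["changed"])
```

The `_left` and `_right` suffixes mirror the `column_x_left column_x_right` layout mentioned in the issue text.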

High level API redesign

I want to move away from schema, mostly because schema-based features are now all mixed together between Arche, Matt and standard schema. I am thinking about fastai-like parameters https://docs.fast.ai/tabular.data.html#TabularList: `a...

manycoding

Type: Question

Type: API

Compare field values between two jobs

#157 We need to compare values between items found by key.
```
compare(source_df, target_df, fields=["price", "name"], key=["id"], num_threshold = 0.1)
>>>Compare: FAILED
think about what to display
```

manycoding

Type: Feature
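
One possible reading of the `compare(...)` call in "Compare field values between two jobs" is sketched below. The issue does not define `num_threshold`; this sketch assumes it is a maximum relative difference for numeric fields, and that non-numeric fields must match exactly. The name `compare_fields` and the returned dict of mismatching rows are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def compare_fields(source_df, target_df, fields, key, num_threshold=0.1):
    """Hypothetical helper: report rows whose field values differ between jobs."""
    # Match items by key; shared non-key columns get _source/_target suffixes.
    merged = source_df.merge(target_df, on=key, suffixes=("_source", "_target"))
    report = {}
    for field in fields:
        s = merged[f"{field}_source"]
        t = merged[f"{field}_target"]
        if is_numeric_dtype(s) and is_numeric_dtype(t):
            # Assumed meaning of num_threshold: relative difference vs. target.
            # A zero target yields inf or NaN; NaN never counts as a mismatch.
            mismatch = ((s - t).abs() / t.abs()) > num_threshold
        else:
            mismatch = s != t
        columns = key + [f"{field}_source", f"{field}_target"]
        report[field] = merged.loc[mismatch, columns]
    return report


source_df = pd.DataFrame(
    {"id": [1, 2, 3], "price": [100.0, 50.0, 10.0], "name": ["a", "b", "c"]}
)
target_df = pd.DataFrame(
    {"id": [1, 2, 3], "price": [105.0, 80.0, 10.0], "name": ["a", "b", "x"]}
)
report = compare_fields(
    source_df, target_df, fields=["price", "name"], key=["id"], num_threshold=0.1
)
for field, rows in report.items():
    print(field, "FAILED" if len(rows) else "PASSED")
```

Returning the mismatching rows per field leaves the "what to display" question open while still giving a pass/fail signal.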

Field counts doesn't trigger error for jobs with lots of NaN

```
ghat = Gatf(
    source='364692/1/17',
    schema='schemas/Global Strategies/amazon_product.json',
    target='364692/1/14',
    # schema=schema
)
```
The price fields here differ by 15%, yet no error appears in the log. Perhaps we don't really want to...

manycoding

Type: Bug
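
For the field counts bug above, one way to surface such a gap is to compare per-field coverage (the share of non-null values) between the two jobs. The sketch below is not how arche computes field counts; the function name `coverage_difference` and the absolute `threshold` parameter are assumptions for illustration.

```python
import pandas as pd


def coverage_difference(source_df, target_df, threshold=0.1):
    """Hypothetical check: fields whose non-null share differs too much."""
    # Fraction of non-null values per column, so job size does not matter.
    source_cov = source_df.notna().mean()
    target_cov = target_df.notna().mean()
    # Fields present in only one job produce NaN here and are dropped;
    # a real rule would probably report them separately.
    diff = (source_cov - target_cov).abs().dropna()
    return diff[diff > threshold]


source_df = pd.DataFrame(
    {"price": [1.0, None, None, 4.0], "name": ["a", "b", "c", "d"]}
)
target_df = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 4.0], "name": ["a", "b", "c", "d"]}
)
flagged = coverage_difference(source_df, target_df)
print(flagged)
```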

Dataset kaggle-like stats

https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants @ejulio suggested that we could use something like the dataset stats from kaggle.com. There is a lot of data there, but perhaps we can use some particular parts of it.

manycoding

Type: Question

Ugly progress bar if using Pool while downloading items

There is a bug, maybe this one - https://github.com/tqdm/tqdm/issues/485, which prevents using tqdm_notebook in JupyterHub, Lab or Notebook. At the moment the output is blank. The easiest way seemed...

manycoding

Type: Bug

good first issue
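
A common pattern for the progress bar issue above is to keep a single plain-text `tqdm` bar in the parent and advance it as pool results arrive. The sketch below is an assumption, not the fix chosen in the issue: `fetch_chunk` is a hypothetical stand-in for downloading items, a thread pool replaces a process pool because downloads are I/O-bound, and whether this sidesteps the linked tqdm bug has not been verified.

```python
from multiprocessing.pool import ThreadPool

from tqdm import tqdm


def fetch_chunk(start):
    """Hypothetical stand-in for downloading one chunk of items."""
    return list(range(start, start + 3))


starts = [0, 3, 6, 9]
items = []
# One bar owned by the parent; workers never touch it.
with ThreadPool(2) as pool, tqdm(total=len(starts), desc="chunks") as bar:
    # imap yields results in input order as they become available.
    for chunk in pool.imap(fetch_chunk, starts):
        items.extend(chunk)
        bar.update(1)

print(len(items), "items downloaded")
```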
