Valeriy Mukhtarulin issues

Results: 42 issues by Valeriy Mukhtarulin

Compare field values between two jobs

#157 We need to compare values between items found by key.
```
compare(source_df, target_df, fields=["price", "name"], key=["id"], num_threshold=0.1)
>>>Compare: FAILED think about what to display
```
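A minimal sketch of what such a key-based comparison could look like with pandas. The helper name `compare_fields`, the reading of `num_threshold` as the maximum allowed share of mismatched rows per field, and the sample data are all assumptions for illustration, not the library's actual API.

```python
import pandas as pd


def compare_fields(source_df, target_df, fields, key, num_threshold=0.1):
    """Hypothetical helper: share of key-matched rows whose values differ, per field."""
    # Inner join keeps only items present in both frames under the same key.
    merged = source_df.merge(target_df, on=key, suffixes=("_source", "_target"))
    mismatch_ratio = {}
    for field in fields:
        left = merged[f"{field}_source"]
        right = merged[f"{field}_target"]
        # Values differ unless they are equal or both missing (NaN).
        differs = ~((left == right) | (left.isna() & right.isna()))
        mismatch_ratio[field] = float(differs.mean()) if len(merged) else 0.0
    # Assumption: a field fails when its mismatch share exceeds num_threshold.
    failed = {f: r for f, r in mismatch_ratio.items() if r > num_threshold}
    return mismatch_ratio, failed


# Made-up sample data: one of four prices differs between the two jobs.
source_df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "price": [10.0, 20.0, 30.0, 40.0], "name": ["a", "b", "c", "d"]}
)
target_df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "price": [10.0, 25.0, 30.0, 40.0], "name": ["a", "b", "c", "d"]}
)
ratios, failed = compare_fields(
    source_df, target_df, fields=["price", "name"], key=["id"], num_threshold=0.1
)
```

Returning the per-field ratios alongside the failed subset leaves the question of what to display open: the caller can report either one.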

Type: Feature

Field counts don't trigger an error for jobs with lots of NaN

```
ghat = Gatf(
    source='364692/1/17',
    schema='schemas/Global Strategies/amazon_product.json',
    target='364692/1/14',
    # schema=schema
)
```
The price fields here differ by 15%, and yet there is no error in the log. Perhaps we don't really want to...

Type: Bug

Dataset kaggle-like stats

https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants @ejulio suggested that we can use something like the dataset stats from kaggle.com. There is a lot of data, but perhaps we can use something specific.

Type: Question

Ugly progress bar if using Pool while downloading items

There is a bug, possibly this one (https://github.com/tqdm/tqdm/issues/485), which prevents using `tqdm_notebook` in JupyterHub, Lab or Notebook. At the moment the output is blank. The easiest way seemed...

Type: Bug

good first issue

Allow setting parameters like threshold

A reloadable config or a way to pass arguments is needed, so anybody can configure `report_all()` in one place. E.g. we have `threshold` for coverage diff, which defaults to 0.2, so the config might...
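One possible shape for this, as a rough sketch only: a small immutable config object whose defaults can be overridden per call. The names `ReportConfig`, `DEFAULT_CONFIG` and `coverage_diff_failed` are hypothetical; only the `threshold` default of 0.2 comes from the issue text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReportConfig:
    # Default taken from the issue text; the class itself is hypothetical.
    threshold: float = 0.2


DEFAULT_CONFIG = ReportConfig()


def coverage_diff_failed(source_coverage, target_coverage, config=DEFAULT_CONFIG):
    """Hypothetical rule: fail when coverage differs by more than the threshold."""
    return abs(source_coverage - target_coverage) > config.threshold


# Override a single parameter without touching the shared defaults.
strict = replace(DEFAULT_CONFIG, threshold=0.05)
default_result = coverage_diff_failed(0.9, 0.8)
strict_result = coverage_diff_failed(0.9, 0.8, strict)
```

Keeping the config frozen means a stricter variant is always a new object, so one report cannot silently change the thresholds used by another.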

Type: Feature

good first issue

Add environment packages

Similar to https://github.com/fastai/fastai/blob/master/setup.py The goal is to have an easy-to-set-up environment, since environments are not dependencies by nature. E.g. the library should run in Jupyter, but Jupyter is not a...

good first issue

Type: Docs

Make json schema and python regex the same

There is some difference between schemas in files and `dict`. In particular, every `\` in files must be double-escaped, meaning we have this: `"^https?://www\\.realtor\\.ca/propertyDetails\\.aspx\\?PropertyId=[0-9]+$"`. While Python `jsons` can eat...
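To illustrate the escaping difference, here is a small sketch using only the standard `json` and `re` modules. It shows that the double-escaped pattern stored in a JSON file decodes to the same regex as a Python raw string with single backslashes. The surrounding `{"pattern": ...}` object and the sample URL are made up for the example.

```python
import json
import re

# The pattern as it would appear inside a JSON schema file (backslashes doubled).
json_text = r'{"pattern": "^https?://www\\.realtor\\.ca/propertyDetails\\.aspx\\?PropertyId=[0-9]+$"}'
pattern_from_file = json.loads(json_text)["pattern"]

# The same pattern written directly in Python as a raw string.
pattern_in_python = r"^https?://www\.realtor\.ca/propertyDetails\.aspx\?PropertyId=[0-9]+$"

# After JSON decoding, both spellings are the identical regex.
same = pattern_from_file == pattern_in_python

# Hypothetical URL used only to exercise the pattern.
sample_url = "https://www.realtor.ca/propertyDetails.aspx?PropertyId=123"
matches = re.match(pattern_from_file, sample_url) is not None
```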

Type: Feature

good first issue

Better results API

To allow report customization, results can have a better API. The [current one](https://github.com/scrapinghub/arche/blob/master/src/arche/rules/result.py) looks like: ``` arche.report.results.get("JSON Schema Validation") Result( name='JSON Schema Validation', messages={ :[ Message(summary='34021 items were checked, 3...

Type: Feature

Type: Question

Is modin worth it

https://github.com/modin-project/modin They claim a lot; let's see what we get with actual data. I feel like the only thing that really makes a difference (100x) is numpy and...

Type: Question

Type: Performance

Schema methods should fail better if schema is not provided

Even for me it takes a few seconds to figure out why it just doesn't work. I see it as either a flaw in the design - e.g. it should feel like you...

good first issue