revscoring New unpack_observations

Cannot be merged without wikimedia/editquality#237.

Will solve Phabricator Wikimedia T300730

Feb 02 '22 18:02 Simonmaignan

We have a utility https://github.com/wikimedia/revscoring/blob/master/revscoring/utilities/dump_cache.py

It looks like this does the opposite of this function. I wonder if it would be a good idea to implement this method as part of a utility like that.

Either way, I think we need some comments to explain what is going on in the function.

Feb 03 '22 17:02 halfak

Thanks @halfak for mentioning this utility. I agree that this function could be moved to a separate file like the dump_cache.py.

I am still struggling to understand how all the classes (Dependent, Feature, Extractor, Context...) and their functions (solve, expand, extract...) are linked together and how there are used and could be used in my case. I'll keep digging into and playing with the revscoring code to deepen my understanding of it, but if you have any documentation that could help me (I am already going through the official documentation, which doesn't contain the dump_cache.py btw), I'd be glad.

I'll improve my comments to match the project documentation of course.

Feb 04 '22 13:02 Simonmaignan

A good place to start re. Features and Extractors is https://github.com/wikimedia/revscoring/blob/master/ipython/feature_engineering.ipynb

Good point that dump_cache.py isn't documented. When I find some time, I'll look through the utilities and make a suggestion for how we might repurpose your code as an inverse utility and get some documentation up for dump_cache.

Feb 04 '22 17:02 halfak

Thanks @halfak . I already went through all the Jupytier notebook in revscoring and editquality. It helped a bit, but I think I need more time and practice 😉.

I tried to execute the run function from dump_cache.py using observations created from the reverted_detection_demo.ipynb notebook from editquality. To summarize, here is the notebook simplified code snippet (only the interesting part is given) I executed locally:

training_features_reverted = []
for rev_id, reverted in train_set[:20]:
    feature_values = list(api_extractor.extract(rev_id, features))
    observation = {"rev_id": rev_id, "cache": feature_values, "reverted": reverted}
    training_features_reverted.append(observation)

from revscoring.utilities import dump_cache
dump_cache.run(training_features_reverted, sys.stdout, features, 'reverted', None, [], True)

But I get this error:

Traceback (most recent call last):
  File "[/Users/simonmaignan/workspace/wikimedia/ml/revscoring/revscoring/utilities/dump_cache.py]()", line 80, in run
    feature_values = list(solve(features, cache=ob['cache']))
  File "[/Users/simonmaignan/workspace/wikimedia/ml/revscoring/revscoring/dependencies/functions.py]()", line 269, in _solve_many
    profile=profile)
  File "[/Users/simonmaignan/workspace/wikimedia/ml/revscoring/revscoring/dependencies/functions.py]()", line 200, in _solve
    return cache[dependent], cache, history
TypeError: list indices must be integers or slices, not Feature

From what I understood, the solve function called inside the dump_cache run function requires that its cache argument be a dict (see doc). However, in my case, the given cache retrieved from observation['cache'] (computed in feature_values in the snippet above) is a list of values (int, float and bool).

I'm probably mixing 2 data structures which are called cache.

@halfak , can you explain me how to properly create the cache dict key inside an observation?

Feb 04 '22 18:02 Simonmaignan

@halfak , how should we proceed with this PR? Do you want to have a look at it? Or should I keep digging to create a more generic unpack_tsv utility using some similar functionalities as in dump_cache.py?

Thanks for your support in advance.

Mar 08 '22 17:03 Simonmaignan

revscoring
revscoring copied to clipboard

New unpack_observations_tsv function to

revscoring revscoring copied to clipboard

New unpack_observations_tsv function to

revscoring
revscoring copied to clipboard