revscoring
revscoring copied to clipboard
New unpack_observations_tsv function to
We have a utility https://github.com/wikimedia/revscoring/blob/master/revscoring/utilities/dump_cache.py
It looks like this does the opposite of this function. I wonder if it would be a good idea to implement this method as part of a utility like that.
Either way, I think we need some comments to explain what is going on in the function.
Thanks @halfak for mentioning this utility.
I agree that this function could be moved to a separate file like the dump_cache.py
.
I am still struggling to understand how all the classes (Dependent
, Feature
, Extractor
, Context
...) and their functions (solve
, expand
, extract
...) are linked together and how there are used and could be used in my case.
I'll keep digging into and playing with the revscoring
code to deepen my understanding of it, but if you have any documentation that could help me (I am already going through the official documentation, which doesn't contain the dump_cache.py
btw), I'd be glad.
I'll improve my comments to match the project documentation of course.
A good place to start re. Features and Extractors is https://github.com/wikimedia/revscoring/blob/master/ipython/feature_engineering.ipynb
Good point that dump_cache.py
isn't documented. When I find some time, I'll look through the utilities and make a suggestion for how we might repurpose your code as an inverse utility and get some documentation up for dump_cache.
Thanks @halfak . I already went through all the Jupytier notebook in revscoring
and editquality
.
It helped a bit, but I think I need more time and practice 😉.
I tried to execute the run function from dump_cache.py
using observations created from the reverted_detection_demo.ipynb notebook from editquality
.
To summarize, here is the notebook simplified code snippet (only the interesting part is given) I executed locally:
training_features_reverted = []
for rev_id, reverted in train_set[:20]:
feature_values = list(api_extractor.extract(rev_id, features))
observation = {"rev_id": rev_id, "cache": feature_values, "reverted": reverted}
training_features_reverted.append(observation)
from revscoring.utilities import dump_cache
dump_cache.run(training_features_reverted, sys.stdout, features, 'reverted', None, [], True)
But I get this error:
Traceback (most recent call last):
File "[/Users/simonmaignan/workspace/wikimedia/ml/revscoring/revscoring/utilities/dump_cache.py]()", line 80, in run
feature_values = list(solve(features, cache=ob['cache']))
File "[/Users/simonmaignan/workspace/wikimedia/ml/revscoring/revscoring/dependencies/functions.py]()", line 269, in _solve_many
profile=profile)
File "[/Users/simonmaignan/workspace/wikimedia/ml/revscoring/revscoring/dependencies/functions.py]()", line 200, in _solve
return cache[dependent], cache, history
TypeError: list indices must be integers or slices, not Feature
From what I understood, the solve function called inside the dump_cache
run
function requires that its cache
argument be a dict
(see doc).
However, in my case, the given cache
retrieved from observation['cache']
(computed in feature_values
in the snippet above) is a list
of values (int
, float
and bool
).
I'm probably mixing 2 data structures which are called cache
.
@halfak , can you explain me how to properly create the cache
dict key inside an observation?
@halfak , how should we proceed with this PR?
Do you want to have a look at it? Or should I keep digging to create a more generic unpack_tsv
utility using some similar functionalities as in dump_cache.py
?
Thanks for your support in advance.