revscoring
Example on README did not work
Example on README
import mwapi
from revscoring import ScorerModel
from revscoring.extractors.api.extractor import Extractor

with open("models/enwiki.damaging.linear_svc.model") as f:
    scorer_model = ScorerModel.load(f)

extractor = Extractor(mwapi.Session(host="https://en.wikipedia.org", user_agent="revscoring demo"))
feature_values = list(extractor.extract(123456789, scorer_model.features))
print(scorer_model.score(feature_values))
# {'prediction': True, 'probability': {False: 0.4694409344514984, True: 0.5305590655485017}}
Error message
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-a72266a26745> in <module>()
      3 from revscoring.extractors.api.extractor import Extractor
      4
----> 5 with open("models/enwiki.damaging.linear_svc.model") as f:
      6     scorer_model = ScorerModel.load(f)
      7

FileNotFoundError: [Errno 2] No such file or directory: 'models/enwiki.damaging.linear_svc.model'
I am wondering 🤔 what should be done to get the example working.
cc: @halfak @geohacker
I found a similar example in the examples folder, which did not work either.
$ python examples/scoring.py
Traceback (most recent call last):
  File "examples/scoring.py", line 5, in <module>
    with open("models/enwiki.damaging.linear_svc.model") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/enwiki.damaging.linear_svc.model'
cc: @batpad
Good questions! You're right, this example doesn't work. We'd need to rebuild a model and keep it in sync with the repository in order for this to continue to work as intended. This is hard because pickle is our main serializer and it's pretty stupid. I don't think it would make sense to require every merged PR to update the model.
I see two good options here:
1. Drop the file loading example. Loading serialized files is inherently problematic.
2. Have the example create the serialized file and load it.
For (1), this might make sense because this repository doesn't store up-to-date models. For (2), we'll have the problem that you can't really train a model in < 10 lines of code in any useful way.
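To make option (2) concrete, here is a minimal sketch of an example that writes its own serialized file before loading it, so the load step cannot hit `FileNotFoundError`. It uses only the standard library. The `toy_model` dict, the `score` helper, and the file name are hypothetical stand-ins rather than revscoring APIs, and nothing here is trained on real data.

```python
import os
import pickle
import tempfile

# Hypothetical toy "model": a threshold on a single feature value.
# This is NOT a revscoring model and is not trained on anything.
toy_model = {"name": "toy.damaging", "threshold": 0.5}


def score(model, feature_value):
    """Return a prediction dict shaped loosely like the README output."""
    return {"prediction": feature_value >= model["threshold"]}


with tempfile.TemporaryDirectory() as model_dir:
    model_path = os.path.join(model_dir, "toy.damaging.model")

    # Step 1: the example creates the serialized file itself.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Step 2: loading now succeeds because the file was just written.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

result = score(loaded_model, 0.7)
print(result)
```

This only shows the create-then-load shape. As noted above, a model small enough to build inline is unlikely to be useful for real scoring.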
See https://github.com/wiki-ai/editquality for an example of a repository that does store models that are synced to a version of this library.
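One way to picture "models synced to a library version" is to store a version tag next to the pickled model and check it on load. The sketch below is an assumption for illustration only: it is not how editquality or revscoring actually track versions, and `LIBRARY_VERSION`, `dump_model`, and `load_model` are hypothetical names.

```python
import pickle

# Hypothetical version string; a real project would read its own version.
LIBRARY_VERSION = "2.0.0"


def dump_model(model, version=LIBRARY_VERSION):
    """Serialize the model together with the version that produced it."""
    return pickle.dumps({"library_version": version, "model": model})


def load_model(payload, expected_version=LIBRARY_VERSION):
    """Deserialize the model, refusing payloads from another version."""
    record = pickle.loads(payload)
    if record["library_version"] != expected_version:
        raise RuntimeError(
            "model was built with version %s, but version %s is installed"
            % (record["library_version"], expected_version)
        )
    return record["model"]


# A payload written by the current version loads fine.
current = load_model(dump_model({"threshold": 0.5}))

# A payload written by an older version is rejected with a clear message.
try:
    load_model(dump_model({"threshold": 0.5}, version="1.0.0"))
    stale_error = None
except RuntimeError as error:
    stale_error = str(error)

print(current)
print(stale_error)
```

A check like this turns a confusing unpickling failure into an explicit message, but it does not remove the need to rebuild models when the library changes.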
> See https://github.com/wiki-ai/editquality for an example of repository that does store models that are sync'd to a version of this library.
I know that a repository has trained models when the git clone takes more time than it should. 😬
We'd like to use git-lfs for this, but our internal infra doesn't support it :(
> For (2), we'll have the problem that you can't really train a model in < 10 lines of code in any useful way.
Is there an example longer than 10 lines available?
This is essentially the same as https://phabricator.wikimedia.org/T250635. See also https://github.com/wikimedia/revscoring/pull/486.
#486 is merged, so there is no need to mention it. However, I would like @bkowshik to try out the new example before continuing. If the new example works, this issue should be closed. Edit: The model file needs to be created, as it is not included in the repository. Please see this