revscoring
Extracting features from XML dump
Hi,
Thank you for your work on this package. For research purposes, I would like to extract features (and eventually reproduce the classification) from the entire XML dump of the French Wikipedia (20181101, for instance). Of course, this can hardly be done with API queries.
Is there a way to extract features while parsing an XML dump, for instance with mediawiki-utilities? :) I imagine it could be done by changing this line in the example code:
extractor = Extractor(mwapi.Session(host="https://en.wikipedia.org",
                                    user_agent="revscoring demo"))
but not being a Python star (more of an R guy!), I'm quite confused. Could you show me a small example of how to parse, for instance, the first 5 revisions of a small dump file?
Thank you again for this work.
I'm currently also looking into this (and the source code), and there seems to be an OfflineExtractor which can use Datasources, according to a test in test_offline_extractor(). However, I'm still not quite sure A) whether this can be done using revscoring and B) how to do it.
It would be great if someone could clarify this, @halfak maybe? Even just knowing whether it is possible would be great.
Pulling revisions using the API works great; however, it is kinda slow when retrieving lots of data (and probably also quite taxing on the Wikimedia servers), so an offline solution that can target dumps would be great.
thanks!
Hey! Sorry for missing this. Didn't see the notification come by.
So the way to gather input data for a model depends on the model. E.g. the https://github.com/wikimedia/editquality models require a lot of different bits of information about the article, the user making the edit, etc. I'll leave that complicated case for further questioning if someone is interested in that direction.
A simpler case is the model in https://github.com/wikimedia/articlequality. These models only extract features from the text of the article. Assuming you've loaded the model into memory, you can use the following code to experiment with providing it data directly.
from revscoring.dependencies import solve
from revscoring.datasources.revision_oriented import revision
score_doc = model.score(list(solve(model.features, cache={revision.text: "Some text to score"})))
The trick here is to run the solve() function and provide it a cache value that includes all of the dependencies necessary for feature extraction.
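If you want to inspect the individual feature values that go into the model, you can pair the solved values with the feature definitions. This is a minimal sketch assuming a model has already been loaded as above; the sample text is just a placeholder.
from revscoring.dependencies import solve
from revscoring.datasources.revision_oriented import revision

# Solve the model's features from the cached text, then line each value up with its feature
feature_values = list(solve(model.features, cache={revision.text: "Some text to score"}))
for feature, value in zip(model.features, feature_values):
    print(feature, value)
score_doc = model.score(feature_values)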
If I were processing all of the scores in an XML dump, I'd make use of the mwxml library and do something like this:
import mwxml
from revscoring import Model
from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented as ro

model = Model.load(open(<path to model>))
dump_file_paths = <list of XML dump file paths>

def process_dump(dump):
    for page in dump:
        for revision in page:
            # Solve the model's features from the revision text, then score
            score_doc = model.score(list(solve(model.features, cache={ro.revision.text: revision.text})))
            yield page, revision, score_doc

for page, revision, score in mwxml.map(process_dump, dump_file_paths):
    print(page.id, revision.id, score)
Hi @halfak
Thanks a lot for the code, I got the articlequality classifier to run for the English Wikipedia! Just one comment for anyone else adapting the code: process_dump(dump) should be process_dump(dump, path) instead for it to run correctly.
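For anyone following along, a sketch of what that correction looks like in context (mwxml.map() calls the processing function with both the dump object and the path of the file it came from); everything else in the snippet above stays the same:
def process_dump(dump, path):
    for page in dump:
        for revision in page:
            score_doc = model.score(list(solve(model.features, cache={ro.revision.text: revision.text})))
            yield page, revision, score_doc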
I am also quite interested in getting the editquality model to work. Would you be so kind as to also give me a few pointers on where to look for information regarding this? Unfortunately, I didn't manage to find anything in the editquality or revscoring documentation.
thanks, thorsten
Running the editquality model from the XML dumps is much more difficult because it requires features that exist outside of the page and revisions in the dump. E.g. you need to know what user groups a user belongs to and how long ago they registered their account. It would be possible to run a hybrid approach where you use the XML dumps to gather the text of revisions and the API to gather user information. If that sounds useful, I can write up a gist for how I'd approach doing something like that.
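To give a rough idea of what that hybrid approach could look like: revision text comes from the dump as above, while user information is fetched from the MediaWiki API and injected into the solve() cache. This is only a sketch; the datasource names (ro.revision.user.info.groups, ro.revision.user.info.registration), the placeholder username, and the exact set of values an editquality model needs are assumptions, and additional dependencies (e.g. the parent revision's text) would likely also be required.
import mwapi
from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented as ro

session = mwapi.Session("https://fr.wikipedia.org", user_agent="revscoring hybrid demo")

def user_info(username):
    # One API request per user; in practice you would batch and cache these lookups
    doc = session.get(action="query", list="users", ususers=username,
                      usprop="groups|registration")
    return doc["query"]["users"][0]

info = user_info("SomeUser")  # placeholder username
cache = {
    ro.revision.text: "Text of the revision, taken from the dump",
    # Assumed cache keys -- check which datasources the model actually depends on
    ro.revision.user.info.groups: info.get("groups", []),
    ro.revision.user.info.registration: info.get("registration"),
}
feature_values = list(solve(model.features, cache=cache))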
I have worked with revscoring feature retrieval before and would definitely be interested in such a gist, if you have time for that!
On a separate note, do you think this process would still be more efficient than directly querying ORES, say for the damaging or goodfaith models? I'm just trying to figure out the most efficient way to process a large number of revisions.
thanks
If you're looking to get ~5-10 million scores or less, querying ORES directly is pretty reasonable and it should take 24-48 hours using the oresapi utility.
If you need more scores than that, then I think we'd need to get clever. How many do you need?
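For reference, basic oresapi usage looks roughly like this (a sketch based on the library's README; the wiki, models, and revision IDs are placeholders):
from oresapi import Session

ores_session = Session("https://ores.wikimedia.org",
                       user_agent="research script (your contact info)")

rev_ids = [123456, 123457, 123458]  # placeholder revision IDs
for rev_id, result in zip(rev_ids, ores_session.score("frwiki", ["damaging", "goodfaith"], rev_ids)):
    print(rev_id, result)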
hi,
We are probably going to need more queries than that, but I think we are going to give it a try using oresapi.
Right now, I tried using oresapi and it indeed performs quite well when running without errors. However, I quite frequently get the following runtime error for multiple batches of revisions: RuntimeError: {'code': 'too_many_requests', 'message': 'A limited number of parallel connections per IP is allowed.'}
I use the standard session object configuration, which is retries=5, batch_size=50, parallel_requests=4. The error message also occurs when using fewer parallel_requests (e.g. 2).
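For context, a sketch of how that configuration is passed to the session constructor (the keyword names follow the values mentioned above; host and user agent are placeholders):
from oresapi import Session

ores_session = Session("https://ores.wikimedia.org",
                       user_agent="research script (your contact info)",
                       retries=5, batch_size=50, parallel_requests=4)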
Is there anything configuration-wise I could do to prevent this?
Ping @chrisalbon. I'm not sure what is going on, but the defaults should work and have been used by many people before. I don't have time to run a test now, but maybe someone on the team can check it out.
Thanks all, I'll dig in.
thanks for the heads up
I'm wondering whether this could be because some previous requests (which were canceled and/or crashed on my machine) were still running on the server?
I tried running it again today and this time it finished without any error messages.
Yes, that's most likely the case. We have controls to prevent too many open connections. If you try to run oresapi twice in parallel, you can expect to get this type of error.