revscoring
Extracting features from XML dump
Hi,
Thank you for your work on this package. For research purposes, I would like to extract features (and eventually reproduce the classification) from the entire XML dump of the French Wikipedia (20181101, for instance). Of course, this can hardly be done with API queries.
Is there a way to extract features while parsing an XML dump, for instance with mediawiki-utilities? :) I imagine it could be done by changing this line in the example code:
extractor = Extractor(mwapi.Session(host="https://en.wikipedia.org",
                                    user_agent="revscoring demo"))
but not being a Python star (more of an R guy!), I'm quite confused. Could you show me a small example of how to parse, for instance, the first 5 revisions of a small dump file?
Thank you again for this work.
I'm currently also looking into this (and the source code), and there seems to be an OfflineExtractor which can use Datasources, according to a test in test_offline_extractor(). However, I'm still not quite sure A) whether this can be done using revscoring and B) how to do it.
It would be great if someone could clarify this, @halfak maybe? Even just knowing whether it is possible would be great.
Pulling revisions using the API works great; however, it is kinda slow when retrieving lots of data (and probably also quite taxing on the Wikimedia servers), so an offline solution that can target dumps would be great.
thanks!
Hey! Sorry for missing this. Didn't see the notification come by.
So the way to gather input data for a model depends on the model. E.g. the https://github.com/wikimedia/editquality models require a lot of different bits of information about the article, the user making the edit, etc. I'll leave that complicated case for further questioning if someone is interested in that direction.
A simpler case is the model in https://github.com/wikimedia/articlequality. These models only extract features from the text of the article. Assuming you've loaded the model into memory, you can use the following code to experiment with providing it data directly.
from revscoring.dependencies import solve
from revscoring.datasources.revision_oriented import revision
score_doc = model.score(list(solve(model.features, cache={revision.text: "Some text to score"})))
The trick here is to run the solve() function and provide it a cache value that includes all of the dependencies necessary for feature extraction.
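If you want to inspect the individual feature values that go into the model, you can pair the solved values with the feature definitions. This is a minimal sketch assuming a model has already been loaded as above; the sample text is just a placeholder.
from revscoring.dependencies import solve
from revscoring.datasources.revision_oriented import revision

# Solve the model's features from the cached text, then line each value up with its feature
feature_values = list(solve(model.features, cache={revision.text: "Some text to score"}))
for feature, value in zip(model.features, feature_values):
    print(feature, value)
score_doc = model.score(feature_values)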
If I were processing all of the scores in an XML dump, I'd make use of the mwxml library and do something like this:
import mwxml
from revscoring import Model
from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented as ro

model = Model.load(open(<path to model>))
dump_file_paths = <list of XML dump file paths>

def process_dump(dump):
    for page in dump:
        for revision in page:
            # Solve the model's features from the revision text, then score
            score_doc = model.score(list(solve(model.features, cache={ro.revision.text: revision.text})))
            yield page, revision, score_doc

for page, revision, score in mwxml.map(process_dump, dump_file_paths):
    print(page.id, revision.id, score)
Hi @halfak
Thanks a lot for the code, I got the articlequality classifier to run for the English Wikipedia! Just one comment for anyone else adapting the code: process_dump(dump) should be process_dump(dump, path) instead for it to run correctly.
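For anyone following along, a sketch of what that correction looks like in context (mwxml.map() calls the processing function with both the dump object and the path of the file it came from); everything else in the snippet above stays the same:
def process_dump(dump, path):
    for page in dump:
        for revision in page:
            score_doc = model.score(list(solve(model.features, cache={ro.revision.text: revision.text})))
            yield page, revision, score_doc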
I am also quite interested in getting the editquality model to work. Would you be so kind as to also give me a few pointers on where to look for information regarding this? Unfortunately, I didn't manage to find anything in the editquality or revscoring documentation.
thanks, thorsten
Running the editquality model from the XML dumps is much more difficult because it requires features that exist outside of the page and revisions in the dump. E.g. you need to know what user groups a user belongs to and how long ago they registered their account. It would be possible to run a hybrid approach where you use the XML dumps to gather the text of revisions and the API to gather user information. If that sounds useful, I can write up a gist for how I'd approach doing something like that.
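To give a rough idea of what that hybrid approach could look like: revision text comes from the dump as above, while user information is fetched from the MediaWiki API and injected into the solve() cache. This is only a sketch; the datasource names (ro.revision.user.info.groups, ro.revision.user.info.registration), the placeholder username, and the exact set of values an editquality model needs are assumptions, and additional dependencies (e.g. the parent revision's text) would likely also be required.
import mwapi
from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented as ro

session = mwapi.Session("https://fr.wikipedia.org", user_agent="revscoring hybrid demo")

def user_info(username):
    # One API request per user; in practice you would batch and cache these lookups
    doc = session.get(action="query", list="users", ususers=username,
                      usprop="groups|registration")
    return doc["query"]["users"][0]

info = user_info("SomeUser")  # placeholder username
cache = {
    ro.revision.text: "Text of the revision, taken from the dump",
    # Assumed cache keys -- check which datasources the model actually depends on
    ro.revision.user.info.groups: info.get("groups", []),
    ro.revision.user.info.registration: info.get("registration"),
}
feature_values = list(solve(model.features, cache=cache))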
I have worked with revscoring feature retrieval before and would definitely be interested in such a gist, if you have time for that!
On a separate note, do you think this process would still be more efficient than directly querying ORES, say for the damaging or goodfaith models? I'm just trying to figure out the most efficient way to process a large number of revisions.
thanks
If you're looking to get ~5-10 million scores or less, querying ORES directly is pretty reasonable and it should take 24-48 hours using the oresapi utility.
If you need more scores than that, then I think we'd need to get clever. How many do you need?
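For reference, basic oresapi usage looks roughly like this (a sketch based on the library's README; the wiki, models, and revision IDs are placeholders):
from oresapi import Session

ores_session = Session("https://ores.wikimedia.org",
                       user_agent="research script (your contact info)")

rev_ids = [123456, 123457, 123458]  # placeholder revision IDs
for rev_id, result in zip(rev_ids, ores_session.score("frwiki", ["damaging", "goodfaith"], rev_ids)):
    print(rev_id, result)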
hi,
We are probably going to need more queries than that, but I think we are going to give it a try using oresapi.
Right now, I tried using oresapi and it indeed performs quite well when running without errors. However, I quite frequently get the following runtime error for multiple batches of revisions: RuntimeError: {'code': 'too_many_requests', 'message': 'A limited number of parallel connections per IP is allowed.'}
I use the standard session object configuration, which is retries=5, batch_size=50, parallel_requests=4. The error message also occurs when using fewer parallel_requests (e.g. 2).
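For context, a sketch of how that configuration is passed to the session constructor (the keyword names follow the values mentioned above; host and user agent are placeholders):
from oresapi import Session

ores_session = Session("https://ores.wikimedia.org",
                       user_agent="research script (your contact info)",
                       retries=5, batch_size=50, parallel_requests=4)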
Is there anything configuration-wise I could do to prevent this?
Ping @chrisalbon. I'm not sure what is going on, but the defaults should work and have been used by many people before. I don't have time to run a test now, but maybe someone on the team can check it out.
Thanks all, I'll dig in.
thanks for the heads up
I'm wondering whether this could be because some previous requests (which were canceled and/or crashed on my machine) were still running on the server?
I tried running it again today and this time it finished without any error messages.
Yes, that's most likely the case. We have controls to prevent too many open connections. If you try to run oresapi twice in parallel, you can expect to get this type of error.