Support <ref group="note">

Open fractalien opened this issue 6 years ago • 6 comments

In the London article (EN), there is a <ref group="note">...</ref> section, and the parser keeps both its starting tag and its content.

fractalien avatar Feb 28 '18 23:02 fractalien

My understanding is that for this to work without breaking anything, we'll have to drop the XML after parsing the citations. But that seems complicated, since that is done inside the section-processing step.

dmh-cs avatar Aug 06 '18 21:08 dmh-cs

yeah, that's correct. we're getting burned by order-of-operations stuff all-over the place. @fractalien i can't find it, am I missing something?

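const wtf = require('wtf_wikipedia');

// fetch the English 'London' article and search its plaintext for leftover ref markup
wtf.fetch('London', 'en', function(err, doc) {
  console.log(doc.plaintext().match('"ref"'));
});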

cheers

spencermountain avatar Aug 07 '18 16:08 spencermountain

closing until this can be reproduced

spencermountain avatar Sep 19 '18 14:09 spencermountain

A workaround for this could be to replace the REF XML tags with a kind of text token such as ___REF_GROUP_note____. The parser will regard the token as ordinary text, so it survives even into the plain-text output.

<ref group="note">

Even in the final plain text, the tokens can then be replaced by citations such as "[1]" without any need to alter kill_xml(). Of course it is a hack compared to generating an AST tree node for the data that we want to preserve. On the other hand, tokens of this type will not conflict with any other parsing step. We could replace the underscore with another character to wrap the ref data we want to preserve, as long as it does not create a conflict with the existing syntax of the wiki source.

That is, if Spencer is OK with implementing such a workaround and preserving the current order of parsing and killing XML. Cheers

niebert avatar Sep 20 '18 14:09 niebert
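
The token workaround described above can be sketched roughly as follows. This is only an illustration under assumptions: `protectRefs` and `restoreRefs` are hypothetical helpers, not part of wtf_wikipedia, and the token format adds a numeric index to the ___REF_GROUP_note____ idea so that each note can be found again after parsing.

```javascript
// Hypothetical helpers; not part of wtf_wikipedia's API.
// Step 1: before parsing, replace each <ref group="...">...</ref> with a plain-text token.
function protectRefs(wikitext) {
  const notes = [];
  const protectedText = wikitext.replace(
    /<ref\s+group\s*=\s*"([^"]+)"\s*>([\s\S]*?)<\/ref>/gi,
    (match, group, body) => {
      notes.push({ group, body });
      // Assumed token format: group name plus the note's index.
      return `___REF_GROUP_${group}_${notes.length - 1}___`;
    }
  );
  return { protectedText, notes };
}

// Step 2: after parsing, swap each surviving token for a numbered marker such as "[1]".
// Assumes group names contain no underscores; unknown tokens are left untouched.
function restoreRefs(plaintext, notes) {
  return plaintext.replace(/___REF_GROUP_([^_]+)_(\d+)___/g, (match, group, index) =>
    notes[Number(index)] ? `[${Number(index) + 1}]` : match
  );
}

const source = 'London is the capital.<ref group="note">See also Greater London.</ref> It is large.';
const { protectedText, notes } = protectRefs(source);
const output = restoreRefs(protectedText, notes);
console.log(output);
```

In a real pipeline the parser would run between the two steps. Here the protected string is passed straight through to show that the token contains no XML for kill_xml() to strip.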

ah, sorry I misunderstood this issue. I didn't know wikipedia had a special thing for references-as-notes.

yeah, niebert's strategy for doing this a-priori on the string would work. You could also pre-match them, and store offsets somehow. I have been shy about doing these, as we're throwing-around, and changing wikitext all-over the place.

There's no problem parsing these notes, storing them in doc.references, and rendering them somehow. Happy to do it.

spencermountain avatar Sep 21 '18 17:09 spencermountain
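
The alternative mentioned above, pre-matching the notes and storing their offsets, might look something like this. It is a sketch only: `collectNoteRefs` is a hypothetical name, and the result shape is an assumption rather than how wtf_wikipedia stores `doc.references`.

```javascript
// Hypothetical helper; not part of wtf_wikipedia's API.
// Finds every <ref group="note">...</ref> and records its text and character offsets.
function collectNoteRefs(wikitext) {
  const pattern = /<ref\s+group\s*=\s*"note"\s*>([\s\S]*?)<\/ref>/gi;
  const found = [];
  for (const m of wikitext.matchAll(pattern)) {
    found.push({
      text: m[1].trim(),             // note content without the tags
      start: m.index,                // offset of the opening tag
      end: m.index + m[0].length     // offset just past the closing tag
    });
  }
  return found;
}

const sample = 'A<ref group="note">first</ref>B<ref group="note">second</ref>';
const collected = collectNoteRefs(sample);
console.log(collected);
```

The offsets are only valid for the exact string that was scanned, so any later rewriting of the wikitext invalidates them. That is the order-of-operations concern raised in this thread.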

I'm sorry for the late reply – only now got to work on my project again. The tag is still there, but it's cleanly absent from the sentences' text now. Thanks for whichever other measure fixed it!

fractalien avatar Feb 11 '19 20:02 fractalien