mediawiki-parser
Document what the text post-processor is for
Looking at the post-processor in text.py, I don't fully understand what its purpose is. Is it designed to produce nice, human-readable output (in which case, why are tags with attributes preserved?) or just to strip wiki markup?
I am looking for a transformation that yields text only, mostly for processing with NLTK later on. Something like ** for bold text would be acceptable.
Thanks for giving the library a spin! The text renderer is meant to output a human-readable textual representation. If it's spitting out tags and attrs, then that's a bug, and I'd be happy to take patches against it.
If you want to customize the output, you can use raw.py instead, giving you a raw AST to play with.
Incidentally, at some unspecified point in the future, I'm going to finish Parsimonious (https://github.com/erikrose/parsimonious/) and port the MW grammar to that, at which time I'll start ignoring this.
Yes, the text post-processor is designed to produce nice, human-readable output.
As for tags, the HTML post-processor has two kinds of tags: allowed and disallowed. By default, all tags are disallowed, in which case they are treated as "normal" text. That's why "" is rendered as "": by default, it is not a tag. "Allowed" tags are interpreted, provided they are implemented (like <p>, <br/>...), and then no longer appear in the output.
In the text post-processor, you can't currently define which tags are allowed or disallowed. They are all treated as text, except <p> and <br />, which are interpreted as a new paragraph and a line break.
I think we can produce better output with the text renderer. I spent some time looking at how we could adapt the HTML renderer for this purpose. It would take quite a while and I don't have the time right now, but please feel free to propose improvements if you want to.
In the end, I felt inspired. I have proposed a first version of a new text post-processor based on the HTML one. Please feel free to test it and propose improvements.