Document what the text post-processor is for

Open cburgmer opened this issue 12 years ago • 3 comments

Looking at the post-processor under text.py I don't fully understand what its purpose is. Is it designed to produce nice and human readable output (but then why are tags with attributes preserved?) or just to strip off wiki markup?

I am looking for a transformation that yields text only, mostly for later processing with NLTK. Keeping a lightweight marker such as ** for bold text would be acceptable.

cburgmer avatar Apr 17 '12 21:04 cburgmer

Thanks for giving the library a spin! The text renderer is meant to output a human-readable textual representation. If it's spitting out tags and attrs, then that's a bug, and I'd be happy to take patches against it.

If you want to customize the output, you can use raw.py instead, giving you a raw AST to play with.

Incidentally, at some unspecified point in the future, I'm going to finish Parsimonious (https://github.com/erikrose/parsimonious/) and port the MW grammar to that, at which time I'll start ignoring this.

erikrose avatar Apr 17 '12 21:04 erikrose

Yes, the text post-processor is designed to produce nice and human readable output.

As for tags: the HTML post-processor distinguishes two kinds of tags, allowed and disallowed. By default, all tags are disallowed, and disallowed tags are treated as "normal" text. That's why "" is rendered as "": by default, it is not a tag. "Allowed" tags are interpreted, provided they are implemented (like <p>, <br/>...). In that case, they no longer appear in the output.
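The allowed/disallowed distinction can be illustrated with a minimal sketch. This is not the library's code: `render_tags`, `allowed_tags`, and the regular expression are hypothetical names chosen for illustration, and the sketch assumes a disallowed tag is simply escaped so that it shows up as literal text, while an allowed tag is passed through as markup.

```python
import html
import re

# Hypothetical helper, NOT part of mediawiki-parser: it only illustrates
# the "allowed vs. disallowed" idea described above.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def render_tags(source, allowed_tags=()):
    """Keep allowed tags as markup; escape every other tag as plain text."""
    allowed = {name.lower() for name in allowed_tags}
    out = []
    pos = 0
    for match in TAG_RE.finditer(source):
        # Text between tags is always escaped.
        out.append(html.escape(source[pos:match.start()]))
        if match.group(1).lower() in allowed:
            out.append(match.group(0))               # interpreted as a tag
        else:
            out.append(html.escape(match.group(0)))  # shown as literal text
        pos = match.end()
    out.append(html.escape(source[pos:]))
    return "".join(out)
```

With the default empty `allowed_tags`, every tag comes out escaped; passing `["b"]` would let `<b>` through untouched.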

In the text post-processor, you can't currently define which tags are allowed or disallowed. They are all treated as text, except <p> and <br /> which will be interpreted as new paragraph and line break.
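As a rough sketch of the text behaviour just described (again an illustration under assumptions, not the code in text.py; `render_text` and both patterns are hypothetical), only `<p>` and `<br />` are interpreted and every other tag is left in place as text, attributes included:

```python
import re

# Hypothetical sketch, NOT the actual text post-processor.
P_RE = re.compile(r"</?p\s*>", re.IGNORECASE)
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)


def render_text(source):
    """Interpret <p> and <br />; leave every other tag as literal text."""
    text = BR_RE.sub("\n", source)          # line break
    text = P_RE.sub("\n\n", text)           # paragraph boundary
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse stacked boundaries
    return text.strip("\n")


sample = '<p>one<br />two</p><p>three <span class="x">four</span></p>'
print(render_text(sample))
```

The `<span class="x">` in the sample survives unchanged, which matches the observation in the original question that tags with attributes are preserved.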

I think we can get better output from the text renderer. I spent some time looking at how the HTML renderer could be adapted for this purpose. It would take quite a while and I don't have the time right now, but please feel free to propose improvements if you want to.

peter17 avatar Apr 26 '12 15:04 peter17

In the end, I felt inspired. I have proposed a first version of a new text post-processor based on the HTML one. Please feel free to test it and propose improvements.

peter17 avatar Apr 26 '12 21:04 peter17