
Line breaks processed incorrectly

Open: loewis opened this issue 15 years ago • 1 comment

Apparently, bundestagger fails to join lines correctly. Looking at

http://www.bundestagger.de/17/sitzung/12/

I see phrases like "auch für uns alle eineneue Ära begonnen" (missing space between eine and neue), and "dass wir an dieserneuen Vertragsgrundlage" (missing space between dieser and neuen). OTOH, superfluous hyphenation is not removed, e.g. in "auf die gro-ßen politischen".
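For illustration, here is a minimal sketch of the line-joining behaviour described above. It is not Bundestagger's actual code; the function name `join_lines` and the heuristic (a trailing hyphen followed by a lowercase letter marks a word split across lines) are assumptions made for this example.

```python
def join_lines(lines):
    """Join extracted text lines into a single string.

    Assumption: a line ending in "-" followed by a line that starts with a
    lowercase letter is one word hyphenated across the line break.
    """
    result = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not result:
            result = line
        elif result.endswith("-") and line[0].islower():
            # Hyphenated word: drop the hyphen and glue the halves together.
            result = result[:-1] + line
        elif result.endswith("-"):
            # Probably a real compound hyphen: keep it, add no space.
            result = result + line
        else:
            # Ordinary line break: it stands for exactly one space.
            result = result + " " + line
    return result


print(join_lines(["auch für uns alle eine", "neue Ära begonnen"]))
print(join_lines(["auf die gro-", "ßen politischen"]))
```

The heuristic is known to be imperfect: a line ending in a suspended hyphen, as in "Haupt- und Nebensatz", would be glued incorrectly.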

loewis, Dec 24 '10 00:12

These problems are of course artifacts of the PDF extraction (which is broken at the moment). Some of them may be fixed by a better extraction method: using the XML export feature of a commercial PDF solution apparently solves the hyphenation errors. However, even this commercial software still produces many missing spaces, because PDF is a somewhat lossy format. I also tried fixing missing spaces with a Python spellcheck library, but that turned out not to be foolproof enough. Bundestagger needs a new PDF parser anyway, and if that one works better, I'm going to try to fix older parliament protocols as well. This task (together with a redesign) has been lying around for some time now.
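To show why a dictionary-based repair of missing spaces is fragile, here is a rough sketch. It does not use the spellcheck library mentioned above; the function `split_fused` and the hand-made `vocabulary` set are hypothetical stand-ins for a real word list.

```python
def split_fused(token, vocabulary):
    """Split a fused token into two known words, if that is unambiguous.

    `vocabulary` is an assumed set of lowercase known words. The token is
    returned unchanged when it is already known, or when zero or several
    possible splits exist.
    """
    if token.lower() in vocabulary:
        return token
    candidates = [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i].lower() in vocabulary and token[i:].lower() in vocabulary
    ]
    if len(candidates) == 1:
        return " ".join(candidates[0])
    return token  # ambiguous or unknown: leave it for manual correction


vocabulary = {"eine", "neue", "dieser", "neuen"}
print(split_fused("eineneue", vocabulary))
print(split_fused("dieserneuen", vocabulary))
```

With a realistic German word list, many legitimate compounds would also split into two known words, so a sketch like this cannot tell a missing space from a valid compound.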

Another approach is lobbying for an additional protocol format from bundestag.de. There have been talks with Dr. Maika Jachmann to convince her that publishing even a .doc file would make life easier. Apparently no luck so far.

When you are logged in at Bundestagger (an OpenID suffices), you can suggest fixes for the text.

stefanw, Dec 24 '10 09:12