Line breaks processed incorrectly
Apparently, Bundestagger fails to join lines correctly. Looking at
http://www.bundestagger.de/17/sitzung/12/
I see phrases like "auch für uns alle eineneue Ära begonnen" (missing space between eine and neue), and "dass wir an dieserneuen Vertragsgrundlage" (missing space between dieser and neuen). OTOH, superfluous hyphenation is not removed, e.g. in "auf die gro-ßen politischen".
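To make the expected behaviour concrete, here is a minimal sketch of the line joining described above. It assumes the extraction step yields one string per printed line, which may not match what Bundestagger's parser actually produces; `join_lines` and `CONJUNCTIONS` are hypothetical names, not part of the Bundestagger code.

```python
import re

# Words that commonly follow a suspended hyphen in German ("Ein- und Ausfuhr").
# This short set is an assumption for illustration, not an exhaustive list.
CONJUNCTIONS = {"und", "oder"}


def join_lines(lines):
    """Join extracted text lines into a single string.

    A trailing hyphen after a lowercase letter is treated as a line-break
    hyphen when the next line starts with a lowercase letter that is not
    part of a conjunction: the hyphen is dropped and the halves are fused.
    In every other case the lines are joined with exactly one space.
    """
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not out:
            out = line
            continue
        first_word = line.split()[0]
        if (
            re.search(r"[a-zäöüß]-$", out)
            and re.match(r"[a-zäöüß]", line)
            and first_word not in CONJUNCTIONS
        ):
            out = out[:-1] + line  # remove the hyphen, fuse the word
        else:
            out = out + " " + line  # normal line break becomes a space
    return out
```

With this heuristic, `join_lines(["auf die gro-", "ßen politischen"])` gives "auf die großen politischen", and `join_lines(["alle eine", "neue Ära"])` keeps the space between "eine" and "neue". It is only a heuristic: a real hyphenated compound that happens to break at the end of a line would still be fused incorrectly.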
These problems are, of course, artifacts of the PDF extraction (which is broken at the moment). A better extraction method may fix some of them: using the XML export feature of a commercial PDF solution apparently solves the hyphenation errors. However, even this commercial software still produces many missing spaces, because PDF is a somewhat lossy format. I also tried fixing missing spaces with a Python spellcheck library, but that turned out not to be reliable enough. Bundestagger needs a new PDF parser anyway, and if that one works better, I'm going to try to fix older parliament protocols as well. This task (together with a redesign) has been lying around for some time now.
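For illustration, a dictionary-based repair of missing spaces could look roughly like the sketch below. This is not the spellcheck library mentioned above and not Bundestagger code: `split_fused`, `repair_text` and the `vocabulary` set are hypothetical, and a real word list would have to come from somewhere else. It also ignores punctuation attached to tokens.

```python
def split_fused(token, vocabulary):
    """Split a token into two known words, or return it unchanged.

    The split is applied only when the token itself is unknown and exactly
    one split position yields two words from the vocabulary. Ambiguous or
    unknown tokens are left alone.
    """
    if token.lower() in vocabulary:
        return token
    candidates = [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i].lower() in vocabulary and token[i:].lower() in vocabulary
    ]
    if len(candidates) == 1:
        return " ".join(candidates[0])
    return token  # no split or several possible splits: do not guess


def repair_text(text, vocabulary):
    """Apply split_fused to every whitespace-separated token of the text."""
    return " ".join(split_fused(tok, vocabulary) for tok in text.split())
```

Given a vocabulary containing "eine" and "neue", `split_fused("eineneue", vocabulary)` returns "eine neue". The weakness shows up quickly with German compounds: a fused token that is itself a valid word is never split, and a token such as "wachstube" can be read as "wach" + "stube" or "wachs" + "tube", so the function has to leave it untouched.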
Another approach is lobbying for an additional protocol format from bundestag.de. There have been talks with Dr. Maika Jachmann to convince her that publishing even a .doc file would make life easier. Apparently no luck so far.
When you are logged in at Bundestagger (an OpenID suffices), you can suggest fixes for the text.