delb-py
Serializations for mixed content documents
so, i'm mostly done with what started out in #54. given how long this sat on my desk and in my drawers, and the several moments when i was under the impression that what i sought wasn't sanely doable, i'm very happy to finally be at this point.
imo it is sufficient to review these changes by studying and criticising:
- the related documentation chapter
- the related test module (don't hesitate to demand clarifications!)
- the related integration test (which is supposed to do the same as this test, but with fewer options and arranged in an efficient manner)
my guess is that the latter one works, as on each code iteration it kept surfacing errors, which i then fixed, from the 360k something documents with a total volume of ~4.1GB. to be explicit: all these documents were parsed, one non-altered and two whitespace-altering variants were produced, these were each reparsed (where the latter two received whitespace normalization as per the TEI recommendation) and finally successfully compared against the originating documents.
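to illustrate the procedure just described, here's a minimal sketch of such a round-trip check. it is not the integration test from this branch and it doesn't use delb at all: it assumes the standard library's `xml.etree.ElementTree` as a stand-in, and the sample document as well as the names `collapse`, `signature`, `rewrap` and `roundtrips` are hypothetical. the whitespace collapsing is also only a rough approximation of the normalization the TEI recommends.

```python
import re
import xml.etree.ElementTree as ET

# hypothetical sample with mixed content; note the adjacent <hi> elements
SOURCE = "<p>A <hi>mixed</hi> content <hi>para</hi><hi>graph</hi>  with\ttext.</p>"


def collapse(text):
    """Collapse each whitespace run to a single space (None counts as empty)."""
    return re.sub(r"\s+", " ", text or "")


def signature(element, normalize=lambda text: text or ""):
    """Nested structure describing an element's tag, attributes, text and tail."""
    return (
        element.tag,
        sorted(element.attrib.items()),
        normalize(element.text),
        [signature(child, normalize) for child in element],
        normalize(element.tail),
    )


def rewrap(element, replacement="\n    "):
    """Whitespace-altering variant that only touches existing whitespace runs."""
    for node in element.iter():
        if node.text:
            node.text = re.sub(r"\s+", replacement, node.text)
        if node.tail:
            node.tail = re.sub(r"\s+", replacement, node.tail)
    return element


def roundtrips(source):
    """Parse, serialize two variants, reparse and compare with the original."""
    original = ET.fromstring(source)
    unaltered = ET.tostring(original, encoding="unicode")
    altered = ET.tostring(rewrap(ET.fromstring(source)), encoding="unicode")
    return (
        # the unaltered variant must reproduce all text nodes verbatim
        signature(ET.fromstring(unaltered)) == signature(original)
        # the altered variant must be equal after whitespace normalization
        and signature(ET.fromstring(altered), collapse)
        == signature(original, collapse)
    )
```

calling `roundtrips(SOURCE)` runs the whole cycle for one document. the sketch only stays correct because `rewrap` never adds whitespace where there was none; a blind indentation (e.g. `ET.indent`) would put whitespace between the two adjacent `<hi>` elements and thereby change the text.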
(just two unimportant insights from the process: if one had tried to achieve that based on lxml's data model, they'd certainly have gone nuts, and the := operator can be a super powerful tool for concise expressions; what was all the fuss about?)
anyway, don't look too closely at the implementation. its architecture is fundamentally wrong (we really need an event-based writer and some state-machine-ish connectors) and inefficient.
but the current structure allows targeted debugging, and that's what i did at length. i would consider this a kind of breakthrough (showing what is possible) and the establishment of a distinguishing feature among libraries that operate on the basic level. in that regard, you can pitch me other suitable libraries (regardless of their language) to include in the comparison.
hence i'd say the implementation is good enough to move on.
i promise not to force-push to this branch. but i may consolidate and merge it locally at the end.
please contact me directly if you'd like an in-person discussion.