pydocx
Consider using lxml.objectify or equivalent and validate XML against schema definition
- pydocx doesn't currently perform XML validation, even though the schema definitions for wordml are available.
  XML Schema validation: http://lxml.de/validation.html
  "Pure python" alternative to lxml: http://pyxb.sourceforge.net/
  The schema files are all available here: http://www.ecma-international.org/publications/standards/Ecma-376.htm. Part 1 has a file called OfficeOpenXML-XMLSchema-Strict.zip, which contains all of the relevant and necessary XML schema definition files.
- pydocx strips XML namespaces, which can introduce conflicts (tags that have the same name but live in different namespaces become indistinguishable).
- pydocx is slowly building its own XML parser, which probably isn't what pydocx should be focusing on.
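As a rough sketch of what the validation point could look like with lxml, the snippet below checks a document against an XSD using `lxml.etree.XMLSchema`. The inline `TOY_XSD` schema and the `validate_xml` helper are hypothetical and only for illustration; a real check would load the wordml schema files from the ECMA-376 Strict archive instead. The import is guarded on the assumption that lxml stays an optional dependency.

```python
try:
    from lxml import etree
except ImportError:  # assumption: lxml is optional and may be missing
    etree = None

# Toy schema for illustration only; this is NOT the real ECMA-376 schema.
TOY_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="p" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def validate_xml(xml_bytes, xsd_bytes=TOY_XSD):
    """Return (is_valid, error_messages) for xml_bytes checked against xsd_bytes."""
    if etree is None:
        raise RuntimeError("lxml is required for schema validation")
    schema = etree.XMLSchema(etree.XML(xsd_bytes))
    doc = etree.fromstring(xml_bytes)
    is_valid = schema.validate(doc)
    # error_log holds the problems found by the most recent validate() call
    errors = [entry.message for entry in schema.error_log]
    return is_valid, errors
```

Calling `validate_xml(b"<document><q>hi</q></document>")` should report the document as invalid, because `<q>` is not allowed by the toy schema.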
We're already moving in the direction of mapping XML to python objects. lxml provides a stable API for this: http://lxml.de/objectify.html
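To give a feel for that API, here is a minimal sketch. The XML is a simplified, namespace-free stand-in for a document body (real wordml is namespaced), and `paragraph_texts` is a hypothetical helper, not something pydocx provides.

```python
try:
    from lxml import objectify
except ImportError:  # assumption: lxml is optional and may be missing
    objectify = None

# Simplified sample; real documents use namespaced tags such as w:p.
SAMPLE = b"""<document>
  <body>
    <p><r><t>Hello</t></r></p>
    <p><r><t>World</t></r></p>
  </body>
</document>"""


def paragraph_texts(xml_bytes):
    """Collect the text of every <p>/<r>/<t> using attribute-style access."""
    if objectify is None:
        raise RuntimeError("lxml is required for objectify")
    root = objectify.fromstring(xml_bytes)
    # root.body.p is the first <p>; iterating over it yields all <p> siblings.
    return [p.r.t.text for p in root.body.p]
```

Child elements are reached as plain attributes (`root.body.p`), which is the kind of XML-to-object mapping pydocx would otherwise have to build by hand.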
Not having to require lxml as a dependency would be nice, but I don't think that should be the only reason we dismiss it. Alternatively, perhaps we can find a pure-python implementation of objectify and then detect whether to use that or the lxml version. Then consumers of pydocx can decide whether they care more about performance or a fast installation.
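The detection step could be sketched roughly as follows. Note the substitution: this falls back to the standard library's `xml.etree.ElementTree`, which shares much of the `lxml.etree` API but is not an objectify replacement; a pure-python objectify would slot into the same pattern. `USING_LXML` and `parse_xml` are hypothetical names.

```python
try:
    from lxml import etree  # C-backed and faster, but needs to be installed
    USING_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree  # standard library fallback
    USING_LXML = False


def parse_xml(xml_bytes):
    """Parse bytes into an element using whichever backend was importable."""
    return etree.fromstring(xml_bytes)


root = parse_xml(b"<a><b>x</b></a>")
child_text = root.find("b").text  # works the same with either backend
```

Consumers who install lxml get the faster backend automatically; everyone else keeps a dependency-free install.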
https://github.com/pabigot/pyxb/issues/30
http://www.davekuhlman.org/generateDS.html
https://github.com/plutext/docx4j
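On the namespace point raised in the list above, a small standard-library sketch shows how a conflict can arise. The `strip_namespace` helper is hypothetical and is not pydocx's actual stripping code; the two namespace URIs are the WordprocessingML and DrawingML main namespaces, both of which define a `p` element.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

SAMPLE = (
    f'<root xmlns:w="{W_NS}" xmlns:a="{A_NS}">'
    "<w:p/><a:p/>"
    "</root>"
)


def strip_namespace(tag):
    """Drop the '{uri}' prefix that ElementTree puts on qualified tag names."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(SAMPLE)
qualified = [child.tag for child in root]               # two distinct names
stripped = [strip_namespace(tag) for tag in qualified]  # both collapse to "p"
```

Once the prefix is gone, code that dispatches on the tag name can no longer tell the two kinds of paragraph apart.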