
Consider using lxml.objectify or equivalent and validate XML against schema definition

Open kylegibson opened this issue 10 years ago • 2 comments

  • pydocx doesn't currently perform XML validation, even though the schema definitions for WordML are publicly available.

    XML Schema validation: http://lxml.de/validation.html

    "Pure python" alternative to lxml: http://pyxb.sourceforge.net/

    The schema files are all available here: http://www.ecma-international.org/publications/standards/Ecma-376.htm. Part 1 has a file called OfficeOpenXML-XMLSchema-Strict.zip which contains all of the relevant and necessary XML schema definition files.

  • pydocx strips XML namespaces, which can introduce conflicts between tags that share a name but belong to different namespaces.

  • pydocx is slowly building its own XML parser, which probably isn't what pydocx should be focusing on.

    We're already moving in the direction of mapping XML to python objects. lxml provides a stable API for this: http://lxml.de/objectify.html
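
To make the namespace concern concrete, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`. The XML fragment and the `local_name` helper are made up for illustration (this is not pydocx code, and the fragment is not a schema-valid document); the two namespace URIs are the WordprocessingML and DrawingML ones, both of which define a `t` element.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Hand-written fragment for illustration only; not a schema-valid document.
SAMPLE = (
    f'<root xmlns:w="{W_NS}" xmlns:a="{A_NS}">'
    "<w:t>body text</w:t>"
    "<a:t>text inside a drawing</a:t>"
    "</root>"
)


def local_name(tag):
    """Hypothetical helper: drop the '{namespace}' prefix from a tag."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(SAMPLE)

# ElementTree keeps tags fully qualified, so the two elements stay distinct.
qualified = [child.tag for child in root]

# After stripping namespaces, both elements collapse to the same name.
stripped = [local_name(tag) for tag in qualified]

# A namespace-aware lookup returns only the WordprocessingML element.
w_only = root.findall(f"{{{W_NS}}}t")
```

Once the namespace is gone, a handler keyed on `t` can no longer tell body text from drawing text, which is the kind of conflict described above.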

Not having to require lxml as a dependency would be nice, but I don't think that should be the only reason we dismiss it. Alternatively, perhaps we can find a pure-python implementation of objectify and then detect whether to use that or the lxml version. Then consumers of pydocx can decide whether they care more about performance or a fast installation.
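
A rough sketch of the detection idea, under some assumptions: it falls back to the standard library's `xml.etree.ElementTree` rather than a pure-python objectify (no specific one has been picked), and `parse_document`, `HAVE_LXML`, and `xsd_path` are hypothetical names, not existing pydocx API. Schema validation only runs when lxml is installed, since the stdlib parser cannot validate.

```python
try:
    # Preferred: lxml is fast and supports XML Schema validation.
    from lxml import etree
    HAVE_LXML = True
except ImportError:
    # Fallback: pure stdlib, quick to install, but no validation.
    import xml.etree.ElementTree as etree
    HAVE_LXML = False


def parse_document(xml_bytes, xsd_path=None):
    """Parse XML and, when lxml is available, validate it against an XSD.

    xsd_path is assumed to point at a schema file on disk, for example
    one extracted from OfficeOpenXML-XMLSchema-Strict.zip.
    """
    root = etree.fromstring(xml_bytes)
    if xsd_path is not None and HAVE_LXML:
        schema = etree.XMLSchema(etree.parse(xsd_path))
        # Raises lxml.etree.DocumentInvalid if the document does not conform.
        schema.assertValid(root)
    return root


# Works with either backend when no schema is given.
root = parse_document(b"<a><b>x</b></a>")
```

With this shape, consumers who install lxml get validation and speed, and everyone else still gets a working parser.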

kylegibson avatar Oct 13 '14 18:10 kylegibson

https://github.com/pabigot/pyxb/issues/30

kylegibson avatar Mar 12 '15 21:03 kylegibson

http://www.davekuhlman.org/generateDS.html https://github.com/plutext/docx4j

kylegibson avatar Mar 13 '15 02:03 kylegibson