OOXMLParser Failed Extraction
Hello everyone!
I have started using Extractous to extract text and metadata from Microsoft Office files (doc, docx, xls, xlsx, pptx), but I have run into several issues with the OOXMLParser, mostly caused by classes and resources missing from the Tika native build.
The errors are numerous and varied, but they all originate in the OOXMLParser. I cannot provide the original documents that trigger the errors, but I tried to create some that reproduce the different issues.
Note: To obtain the stack trace of the errors, I compressed the documents into a .zip file and processed them with Extractous. The stack trace of the issue is then present within the metadata.
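The wrapping step described in the note can be sketched with the Python standard library alone. The helper name `wrap_in_zip` and the file names are hypothetical; passing the resulting archive to Extractous and reading the stack trace back from the metadata is not shown here.

```python
import zipfile
from pathlib import Path


def wrap_in_zip(document: Path, archive: Path) -> Path:
    """Store a single document inside a new .zip archive and return the archive path."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname keeps only the file name, so local directories do not end up in the archive.
        zf.write(document, arcname=document.name)
    return archive
```

For example, `wrap_in_zip(Path("issue2.xlsx"), Path("issue2.zip"))` produces an archive that can be processed in place of the original file.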
- Missing compiled schemas:
- Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/chartspace36e0doctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.chartspace36e0doctype) - code 0
- Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/comments4c11doctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.comments4c11doctype) - code 0
- Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/drawing324ddoctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.drawing324ddoctype) - code 0
- Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/visiodocumentd431doctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.visiodocumentd431doctype) - code 0
- Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/cttblprexbasee7eetype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.cttblprexbasee7eetype) - code 0
I cannot provide the documents for these, but I resolved the errors locally by adding the schema resources to reachability-metadata.json:
https://github.com/yobix-ai/extractous/blob/main/extractous-core/tika-native/src/main/resources/META-INF/ai.yobix/tika-2.9.2-linux/reachability-metadata.json#L5317
{ "glob": "org/apache/poi/schemas/ooxml/system/ooxml/**" },
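To check whether a glob like the one above covers every schema file named in the errors, something along these lines may help. It is a minimal sketch under two assumptions: that reachability-metadata.json keeps its resource entries in a top-level `resources` list of objects with a `glob` key, and that a simplified glob reading (`**` spans directories, `*` stays inside one path segment) is close enough to what GraalVM does. The function names are hypothetical.

```python
import json
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simplified glob: '**' spans directories, '*' stays within one segment."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def missing_schema_resources(log_text: str) -> list[str]:
    """Collect the .xsb paths named in 'Could not locate compiled schema resource' messages."""
    pattern = r"Could not locate compiled schema resource (\S+\.xsb)"
    return sorted(set(re.findall(pattern, log_text)))


def uncovered(resources: list[str], metadata_json: str) -> list[str]:
    """Return the resources that no 'glob' entry under 'resources' matches (assumed layout)."""
    entries = json.loads(metadata_json).get("resources", [])
    globs = [glob_to_regex(e["glob"]) for e in entries if "glob" in e]
    return [r for r in resources if not any(g.fullmatch(r) for g in globs)]
```

Feeding the collected stack traces to `missing_schema_resources` and the metadata file contents to `uncovered` should return an empty list once the glob is in place.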
- issue2.xlsx, issue2_v2.docx (same issue)
org.graalvm.nativeimage.builder/com.oracle.svm.core.JavaMemoryUtil.copyObjectArrayForwardWithStoreCheck(JavaMemoryUtil.java:495)
at org.graalvm.nativeimage.builder/com.oracle.svm.core.graal.jdk.SubstrateArraycopySnippets.doArraycopy(SubstrateArraycopySnippets.java:113)
at [email protected]/java.util.Arrays.copyOf(Arrays.java:3516)
at [email protected]/java.util.ArrayList.toArray(ArrayList.java:401)
at org.apache.xmlbeans.impl.values.XmlObjectBase.getXmlObjectArray(XmlObjectBase.java:3203)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTDxfsImpl.getDxfArray(CTDxfsImpl.java:54)
at org.apache.poi.xssf.model.StylesTable.readFrom(StylesTable.java:268)
at org.apache.poi.xssf.model.StylesTable.<init>(StylesTable.java:159)
at org.apache.poi.xssf.eventusermodel.XSSFReader.getStylesTable(XSSFReader.java:179)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:144)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 13 more
- Missing org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties. I cannot provide the file, but the issue can be fixed by adding the missing class to reachability-metadata.json.
Exception in thread "main": org.graalvm.nativeimage.MissingReflectionRegistrationError org.graalvm.nativeimage.MissingReflectionRegistrationError: The program tried to reflectively instantiate the array class
org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties[]
without it being registered for runtime reflection. Add org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties[] to the reflection metadata to solve this problem. Note: Add "unsafeAllocated" to the array class registration to enable runtime instantiation. See https://www.graalvm.org/latest/reference-manual/native-image/metadata/#reflection for help.
at org.graalvm.nativeimage.builder/com.oracle.svm.core.reflect.MissingReflectionRegistrationUtils.errorForArray(MissingReflectionRegistrationUtils.java:121)
at org.graalvm.nativeimage.builder/com.oracle.svm.core.graal.snippets.SubstrateAllocationSnippets.arrayHubErrorStub(SubstrateAllocationSnippets.java:364)
at org.apache.xmlbeans.impl.values.XmlObjectBase._typedArray(XmlObjectBase.java:442)
at org.apache.xmlbeans.impl.values.XmlObjectBase.selectPath(XmlObjectBase.java:482)
at org.apache.xmlbeans.impl.values.XmlObjectBase.selectPath(XmlObjectBase.java:448)
at org.apache.poi.xssf.model.ParagraphPropertyFetcher.fetch(ParagraphPropertyFetcher.java:57)
at org.apache.poi.xssf.usermodel.XSSFTextParagraph.fetchParagraphProperty(XSSFTextParagraph.java:860)
at org.apache.poi.xssf.usermodel.XSSFTextParagraph.isBullet(XSSFTextParagraph.java:728)
at org.apache.poi.xssf.usermodel.XSSFSimpleShape.getText(XSSFSimpleShape.java:202)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processShapes(XSSFExcelExtractorDecorator.java:272)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:189)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at ai.yobix.TikaNativeMain.parseToStringWithConfig(TikaNativeMain.java:192)
at ai.yobix.TikaNativeMain.parseFileToString(TikaNativeMain.java:87)
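Since the error text itself names the type to register, the missing types can be pulled out of a batch of collected stack traces automatically. The following is a rough sketch that relies only on the "Add ... to the reflection metadata" wording quoted above; the function name is hypothetical, and the exact JSON shape of the resulting metadata entry is deliberately left out because it should be checked against the GraalVM documentation.

```python
import re

# Matches the hint GraalVM prints in the error: "Add <type> to the reflection metadata".
_HINT = re.compile(r"Add (\S+) to the reflection metadata")


def missing_reflection_types(log_text: str) -> list[str]:
    """Return the distinct type names the error hints ask to register, in first-seen order."""
    seen: dict[str, None] = {}
    for name in _HINT.findall(log_text):
        seen.setdefault(name, None)
    return list(seen)
```

Running this over all collected traces gives one list of types to add, instead of fixing them one failed document at a time.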
Caused by: org.apache.poi.ooxml.POIXMLException: java.lang.ClassCastException: Stack trace is imprecise, the top frames are missing and/or have wrong line numbers. To get precise stack traces, build the image with option -H:-ReduceImplicitExceptionStackTraceInformation
at org.apache.poi.xslf.usermodel.XMLSlideShow.<init>(XMLSlideShow.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.tryXSLF(OOXMLExtractorFactory.java:324)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:199)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 13 more
Caused by: java.lang.ClassCastException: Stack trace is imprecise, the top frames are missing and/or have wrong line numbers. To get precise stack traces, build the image with option -H:-ReduceImplicitExceptionStackTraceInformation
at org.apache.poi.xslf.usermodel.XSLFDiagramDrawing.readPackagePart(XSLFDiagramDrawing.java:47)
at org.apache.poi.xslf.usermodel.XSLFDiagramDrawing.<init>(XSLFDiagramDrawing.java:43)
at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:662)
at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:679)
at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
at org.apache.poi.xslf.usermodel.XMLSlideShow.<init>(XMLSlideShow.java:125)
... 17 more
at org.apache.poi.xssf.model.CommentsTable.readFrom(CommentsTable.java:86)
at org.apache.poi.xssf.model.CommentsTable.<init>(CommentsTable.java:80)
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.parseComments(XSSFReader.java:436)
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetComments(XSSFReader.java:425)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:161)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 13 more
- issue6.docx (Similar to https://github.com/yobix-ai/extractous/issues/40)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:469)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:297)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:230)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:146)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:272)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:206)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... -H:-ReduceImplicitExceptionStackTraceInformation
at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:63)
at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:40)
at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:221)
Thank you for your work on this repository. I hope you can fix these issues and release a new version of Extractous as soon as possible.
Hi, I ran into errors similar to these. I solved them by updating to the latest Tika version for testing. Maybe you can give it a try:
https://gitee.com/mrlijing/extractous
[patch.crates-io]
extractous = { git = "https://gitee.com/mrlijing/extractous" }
Hi! I used the override, but for me almost all of the .docx files still did not work.