OOXMLParser Failed Extraction

Open s4zuk3 opened this issue 11 months ago • 2 comments

Hello everyone!

I have started using Extractous to extract text and metadata from Microsoft Office files (doc, docx, xls, xlsx, pptx), but I have run into several issues with the OOXMLParser, mostly related to classes missing from Tika Native.

The errors are numerous and varied, but they all originate in the OOXMLParser. I cannot provide the original documents that trigger the errors, but I tried to create some that reproduce the different issues.

Note: To obtain the stack trace of the errors, I compressed the documents into a .zip file and processed them with Extractous. The stack trace of the issue is then present within the metadata.


  1. Missing compiled schemas:
  • Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/chartspace36e0doctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.chartspace36e0doctype) - code 0
  • Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/comments4c11doctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.comments4c11doctype) - code 0
  • Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/drawing324ddoctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.drawing324ddoctype) - code 0
  • Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/visiodocumentd431doctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.visiodocumentd431doctype) - code 0
  • Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/cttblprexbasee7eetype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.cttblprexbasee7eetype) - code 0

I cannot provide the documents for these, but I resolved the errors locally by adding these resources to reachability-metadata.json:

https://github.com/yobix-ai/extractous/blob/main/extractous-core/tika-native/src/main/resources/META-INF/ai.yobix/tika-2.9.2-linux/reachability-metadata.json#L5317

{
  "glob": "org/apache/poi/schemas/ooxml/system/ooxml/**"
},

  2. issue2.xlsx, issue2_v2.docx (same issue)

org.graalvm.nativeimage.builder/com.oracle.svm.core.JavaMemoryUtil.copyObjectArrayForwardWithStoreCheck(JavaMemoryUtil.java:495)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.graal.jdk.SubstrateArraycopySnippets.doArraycopy(SubstrateArraycopySnippets.java:113)
    at [email protected]/java.util.Arrays.copyOf(Arrays.java:3516)
    at [email protected]/java.util.ArrayList.toArray(ArrayList.java:401)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.getXmlObjectArray(XmlObjectBase.java:3203)
    at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTDxfsImpl.getDxfArray(CTDxfsImpl.java:54)
    at org.apache.poi.xssf.model.StylesTable.readFrom(StylesTable.java:268)
    at org.apache.poi.xssf.model.StylesTable.<init>(StylesTable.java:159)
    at org.apache.poi.xssf.eventusermodel.XSSFReader.getStylesTable(XSSFReader.java:179)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:144)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 13 more


  3. Missing org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties. I cannot provide the file, but the issue can be fixed by adding the missing class to reachability-metadata.json.

Exception in thread "main": org.graalvm.nativeimage.MissingReflectionRegistrationError
org.graalvm.nativeimage.MissingReflectionRegistrationError: The program tried to reflectively instantiate the array class

org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties[]

without it being registered for runtime reflection. Add org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties[] to the reflection metadata to solve this problem. Note: Add "unsafeAllocated" to the array class registration to enable runtime instantiation. See https://www.graalvm.org/latest/reference-manual/native-image/metadata/#reflection for help.
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.reflect.MissingReflectionRegistrationUtils.errorForArray(MissingReflectionRegistrationUtils.java:121)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.graal.snippets.SubstrateAllocationSnippets.arrayHubErrorStub(SubstrateAllocationSnippets.java:364)
    at org.apache.xmlbeans.impl.values.XmlObjectBase._typedArray(XmlObjectBase.java:442)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.selectPath(XmlObjectBase.java:482)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.selectPath(XmlObjectBase.java:448)
    at org.apache.poi.xssf.model.ParagraphPropertyFetcher.fetch(ParagraphPropertyFetcher.java:57)
    at org.apache.poi.xssf.usermodel.XSSFTextParagraph.fetchParagraphProperty(XSSFTextParagraph.java:860)
    at org.apache.poi.xssf.usermodel.XSSFTextParagraph.isBullet(XSSFTextParagraph.java:728)
    at org.apache.poi.xssf.usermodel.XSSFSimpleShape.getText(XSSFSimpleShape.java:202)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processShapes(XSSFExcelExtractorDecorator.java:272)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:189)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at ai.yobix.TikaNativeMain.parseToStringWithConfig(TikaNativeMain.java:192)
    at ai.yobix.TikaNativeMain.parseFileToString(TikaNativeMain.java:87)
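
For reference, here is a rough sketch of how the two local fixes (the schema resources and the array class) might sit together in reachability-metadata.json. It assumes the file follows the layout described in the GraalVM documentation, with top-level "reflection" and "resources" arrays; the "unsafeAllocated" flag is taken from the hint in the error message above. I have not verified this against the full Extractous file, and all of its other existing entries are omitted here.

```json
{
  "reflection": [
    {
      "type": "org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties[]",
      "unsafeAllocated": true
    }
  ],
  "resources": [
    {
      "glob": "org/apache/poi/schemas/ooxml/system/ooxml/**"
    }
  ]
}
```

In practice these entries would be merged into the arrays that already exist in the file rather than replacing them, and the native library rebuilt afterwards.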


  4. issue4.pptx

Caused by: org.apache.poi.ooxml.POIXMLException: java.lang.ClassCastException: Stack trace is imprecise, the top frames are missing and/or have wrong line numbers. To get precise stack traces, build the image with option -H:-ReduceImplicitExceptionStackTraceInformation
    at org.apache.poi.xslf.usermodel.XMLSlideShow.<init>(XMLSlideShow.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.tryXSLF(OOXMLExtractorFactory.java:324)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:199)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 13 more
Caused by: java.lang.ClassCastException: Stack trace is imprecise, the top frames are missing and/or have wrong line numbers. To get precise stack traces, build the image with option -H:-ReduceImplicitExceptionStackTraceInformation
    at org.apache.poi.xslf.usermodel.XSLFDiagramDrawing.readPackagePart(XSLFDiagramDrawing.java:47)
    at org.apache.poi.xslf.usermodel.XSLFDiagramDrawing.<init>(XSLFDiagramDrawing.java:43)
    at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
    at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:662)
    at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:679)
    at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
    at org.apache.poi.xslf.usermodel.XMLSlideShow.<init>(XMLSlideShow.java:125)
    ... 17 more


  5. issue5.xlsx

    at org.apache.poi.xssf.model.CommentsTable.readFrom(CommentsTable.java:86)
    at org.apache.poi.xssf.model.CommentsTable.<init>(CommentsTable.java:80)
    at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.parseComments(XSSFReader.java:436)
    at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetComments(XSSFReader.java:425)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:161)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 13 more


  6. issue6.docx (Similar to https://github.com/yobix-ai/extractous/issues/40)

    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:469)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:297)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:230)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:146)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)

    at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:272)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:206)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    xceptionStackTraceInformation

    at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:63)
    at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:40)
    at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:221)


Thank you for your work on this repository. I hope you can fix these issues and release a new version of Extractous as soon as possible.

s4zuk3 avatar Jan 08 '25 15:01 s4zuk3

Hi, I ran into some errors similar to this. I solved them by updating to the latest Tika version in a fork for testing. Maybe you can give it a try:

https://gitee.com/mrlijing/extractous

[patch.crates-io]
extractous = { git = "https://gitee.com/mrlijing/extractous" }

lijingrs avatar Feb 23 '25 00:02 lijingrs

[patch.crates-io]
extractous = { git = "https://gitee.com/mrlijing/extractous" }

Hi! I used this override, but for me almost all ".docx" files still didn't work.

RuofengX avatar Jun 26 '25 04:06 RuofengX