
Support loading of legacy Microsoft file formats (e.g. `.doc`, `.ppt`, `.xls`)

Open sjrl opened this issue 10 months ago • 1 comments

Is your feature request related to a problem? Please describe.

The converter components in Haystack (e.g. DOCXToDocument, XLSXToDocument, etc.) don't support the legacy Microsoft Office file types (e.g. .doc, .xls, .ppt). This is because the underlying libraries we use in Haystack only support the modern Microsoft Office file types.

After some online research, I was unable to find other Python libraries with permissive licenses that support converting these older formats.

Describe the solution you'd like

Instead, a common recommendation for handling legacy files is to convert them to the modern formats (e.g. .doc to .docx) using the LibreOffice command-line tool (more info here). For example,

soffice --headless --convert-to docx test.doc

So I think creating a new converter component that converts the legacy format into the modern one would be great! We could also return the converted output as a ByteStream, which might let us avoid writing temporary files for downstream components. Perhaps it would make sense to make this behavior controllable via input parameters.
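To make the idea concrete, here is a minimal sketch of what the conversion step could look like. It is not an existing Haystack component: the function names, the `LEGACY_TO_MODERN` mapping, and the timeout default are hypothetical and chosen only for illustration. It assumes `soffice` is on the PATH, that the target format is given as a plain extension (no filter suffix), and it uses LibreOffice's `--outdir` option to direct output into a temporary directory.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List

# Hypothetical mapping from legacy extensions to their modern counterparts.
LEGACY_TO_MODERN = {".doc": "docx", ".xls": "xlsx", ".ppt": "pptx"}


def build_soffice_command(source: Path, target_format: str, out_dir: Path) -> List[str]:
    """Build the argument list for a headless LibreOffice conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        str(out_dir),
        str(source),
    ]


def convert_legacy_file(source: Path, target_format: str = "", timeout: float = 120.0) -> bytes:
    """Convert `source` with LibreOffice and return the converted file as bytes.

    If `target_format` is empty, it is looked up from the legacy extension.
    """
    if not target_format:
        suffix = source.suffix.lower()
        if suffix not in LEGACY_TO_MODERN:
            raise ValueError(f"No default target format for '{suffix}'")
        target_format = LEGACY_TO_MODERN[suffix]

    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH; is LibreOffice installed?")

    # LibreOffice writes to disk, so a temporary directory is still needed
    # internally; callers only ever see the returned bytes.
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        subprocess.run(
            build_soffice_command(source, target_format, out_dir),
            check=True,
            capture_output=True,
            timeout=timeout,
        )
        converted = out_dir / f"{source.stem}.{target_format}"
        # Guard against a run that exits cleanly without writing the file.
        if not converted.exists():
            raise RuntimeError(f"LibreOffice did not produce {converted.name}")
        return converted.read_bytes()
```

A component built around this could wrap the returned bytes in a ByteStream and hand them to DOCXToDocument or XLSXToDocument, so temporary files stay an internal detail of the conversion step. Passing `target_format="pdf"` would cover the PDF use case in the same way.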

As a side effect, this component would also allow converting Microsoft file types (and others) into formats such as PDF, which may be helpful in scenarios such as running OCR or getting more reliable page detection for .docx files.

Describe alternatives you've considered

  • It appears that Tika could also cover some of these cases. See parser docs here. However, it's not 100% clear to me whether that's true, and I think it would be nice to let users leverage our other converters without needing to deploy Tika.
  • Technically, Unstructured IO also supports these legacy formats. See docs here. However, I say "technically" because their strategy is also to use LibreOffice to convert the legacy formats to the modern ones and then leverage their other converters. So I believe it would be better for us to also natively support the LibreOffice conversion.

sjrl avatar Feb 03 '25 09:02 sjrl

@julian-risch To provide some additional context: we will have an alternative solution implemented for deepset Cloud, so this is not needed to unblock client work. However, I think this would still be immensely useful for Haystack users and could be something we leverage in deepset Cloud in the future.

sjrl avatar Feb 10 '25 12:02 sjrl