Unstructured Inflexible with Client Inconsistencies

Open AndryHTC opened this issue 1 year ago • 3 comments

Describe the bug

The Unstructured.io HTTP API does not accept files with the multipart/related content type. This issue occurs when uploading .mht files through a web browser, which sends files as multipart/related. However, using tools like Insomnia to send the file as application/octet-stream works correctly, even with .mht files. This inconsistency in file type handling leads to confusion and potential errors in file uploads.

To Reproduce

  1. Attempt to upload a .mht file to Unstructured.io API using a browser. Notice that the browser sets the content type to multipart/related.
  2. Observe the error: {"detail":"File type multipart/related is not supported."}
  3. Now, upload the same .mht file using Insomnia, which sets the content type to application/octet-stream.
  4. The file uploads successfully without any error.
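The two uploads in the steps above differ only in the Content-Type attached to the file part of the form. A minimal sketch of that difference, using only the standard library: the endpoint URL, the "files" field name, and the helper names build_upload and send are assumptions for illustration (the thread does not state them), and the sample bytes are a stand-in rather than a real .mht file.

```python
import uuid
import urllib.request

# Assumed URL of a locally running unstructured-api; adjust to your deployment.
API_URL = "http://localhost:8000/general/v0/general"


def build_upload(filename: str, data: bytes, part_content_type: str):
    """Build a multipart/form-data body with one file part in the assumed "files" field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f"Content-Type: {part_content_type}\r\n"
        "\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return head + data + tail, headers


def send(body: bytes, headers: dict) -> bytes:
    """POST the body to API_URL and return the raw response. Not called in this sketch."""
    request = urllib.request.Request(API_URL, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read()


sample = b"MIME-Version: 1.0\r\n\r\nplaceholder"  # stand-in bytes, not a real .mht

# File part labelled the way the browser labelled it (rejected in the report).
browser_body, browser_headers = build_upload("page.mht", sample, "multipart/related")
# File part labelled the way Insomnia labelled it (accepted in the report).
insomnia_body, insomnia_headers = build_upload("page.mht", sample, "application/octet-stream")
```

Calling send(browser_body, browser_headers) against a running server would exercise the first case. Note that urllib raises urllib.error.HTTPError on an error status, so the JSON error from step 2 would have to be read from the exception.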

Expected behavior

The API should consistently handle .mht file uploads regardless of the client used or the content type set by the client. If multipart/related is not supported, there should be clear documentation or error messages guiding users on the appropriate file types and handling methods.

Additional context

The main concern here is the inconsistency in how different clients handle file types and how Unstructured.io responds to these variations. It raises the question of whether it's best practice to avoid passing the file type altogether and if there should be a more standardized method for file uploads to ensure compatibility and ease of use.

AndryHTC avatar Jan 30 '24 12:01 AndryHTC

Thanks for flagging this - we'll take a look and get back to you soon!

awalker4 avatar Feb 06 '24 15:02 awalker4

Hey @AndryHTC we're working on getting this prioritized. Wondering if you have an example mht file we could use to reproduce? Thanks!

amanda103 avatar Feb 07 '24 18:02 amanda103

@amanda103 unfortunately, all the .mht files come from our users and we cannot expose their data. I cannot reproduce an opaque .mht file at short notice.

AndryHTC avatar Feb 12 '24 11:02 AndryHTC

Hi @AndryHTC, we've made lots of improvements to filetype detection in the library, and the API is now using this logic as well. I tried downloading our main github page as a .mhtml and I'm able to partition it via unstructured-api. Let me know if you're still seeing issues!

Note that we've also added content_type as a formdata parameter to the API if you still need to override our filetype detection.

awalker4 avatar Aug 06 '24 21:08 awalker4
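For readers who want to try the content_type override mentioned in the last comment, here is a rough sketch of a form body that carries it next to the file. Only the content_type parameter name comes from the thread. The "files" field name, the helper build_form, and the value "message/rfc822" are assumptions for illustration; the thread does not say which values the API accepts, so check the API documentation before relying on one.

```python
import uuid


def build_form(fields: dict, filename: str, data: bytes):
    """Encode text fields plus one file (assumed field name "files") as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # Plain text form field, e.g. the content_type override.
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # File part, labelled generically so the override field is what carries the type.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + data
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return b"".join(parts), headers


# "message/rfc822" is an illustrative value only, not one confirmed by the thread.
body, headers = build_form({"content_type": "message/rfc822"}, "page.mht", b"placeholder bytes")
```

The resulting body and headers can be posted with any HTTP client. Keeping the override in its own form field means it still applies when the client (a browser, for instance) chooses the file part's Content-Type on its own.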