bug/file extensions supersede `content_type` when partitioning via API

Open MthwRobinson opened this issue 1 year ago • 6 comments

Describe the bug When using partition_via_api, the file extension for file_filename supersedes the content_type that the user passes in.

To Reproduce The following results in a 400 from the API

import requests
from langchain.document_loaders import UnstructuredAPIFileIOLoader

file_url = "https://s3.eu-north-1.amazonaws.com/clickable.so/f7edd61e-cef6-42cb-9608-5cd2e56479b8"
file_name = "test-document.csv"

file_response = requests.get(file_url)
loader = UnstructuredAPIFileIOLoader(
    file=file_response.content, file_filename=file_name, content_type="text/plain"
)
docs = loader.load()

But the following does not:

The following produces an error:

file_url = "https://s3.eu-north-1.amazonaws.com/clickable.so/f7edd61e-cef6-42cb-9608-5cd2e56479b8"
file_name = "test-document.csv"

file_response = requests.get(file_url)
loader = UnstructuredAPIFileIOLoader(
    file=file_response.content, file_filename=file_name, content_type="text/plain"
)
docs = loader.load()

Expected behavior Both should successfully process the file as plain text.

Desktop (please complete the following information):

  • unstructured==0.6.3
  • Python version 3.8

Additional context

  • Initially reported by a langchain user here.

MthwRobinson avatar May 09 '23 16:05 MthwRobinson

close this?

SimranJha2408 avatar Dec 18 '23 05:12 SimranJha2408

Closing because over 180 days old. Please reopen with a comment if the issue is still relevant.

orlandounstructured avatar Feb 08 '24 19:02 orlandounstructured

Reopened by eng request; on current sprint

orlandounstructured avatar Feb 08 '24 21:02 orlandounstructured

@MthwRobinson

Ok, so I started looking into the issue and I'm not sure if it's still relevant. I'm testing on version 0.12.5, and although partition_via_api accepts a content_type argument, it is not used anywhere inside the function.

Using the given example, I wasn't able to reproduce the issue. Also, both of the given examples are identical, yet one is supposed to return a 400 and the other an error (???). I suppose one of them shouldn't pass the content_type argument. I checked that variant and it also worked.

Then I checked by calling partition_via_api directly, and it gives the same results as the previous tests.

As for the unused content_type, is this expected behavior? I noticed that the similar function partition_multiple_via_api takes a content_types argument, and there the argument is used.

mpolomdeepsense avatar Feb 28 '24 15:02 mpolomdeepsense

As for the unused content_type, is this expected behavior? I noticed that the similar function partition_multiple_via_api takes a content_types argument, and there the argument is used.

No, it should always be possible to explicitly specify the file(s) content_type.

cragwolfe avatar Mar 04 '24 01:03 cragwolfe

@cragwolfe

No, it should always be possible to explicitly specify the file(s) content_type.

In this case, I think the best approach would be to create a separate issue for adding content_type handling inside the partition_via_api function. The definition of done (DoD) for that issue would also include checking whether the bug described in this issue still exists.
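
To make the requested precedence concrete, here is a minimal sketch of the rule, assuming nothing about how partition_via_api is actually implemented. The helper name resolve_content_type and the application/octet-stream fallback are hypothetical, chosen only for illustration: an explicitly passed content_type always wins, and the file extension is consulted only when none is given.

```python
import mimetypes
from typing import Optional


def resolve_content_type(filename: str, content_type: Optional[str] = None) -> str:
    """Hypothetical helper (not part of unstructured): choose the MIME type for an upload."""
    # An explicit content_type from the caller takes precedence over the extension.
    if content_type:
        return content_type
    # Otherwise fall back to guessing from the file extension.
    guessed, _encoding = mimetypes.guess_type(filename)
    # Assumed default when the extension is missing or unknown.
    return guessed or "application/octet-stream"
```

With this rule, resolve_content_type("test-document.csv", content_type="text/plain") returns "text/plain" regardless of the .csv extension, which is the behavior the original report expected.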

mpolomdeepsense avatar Mar 04 '24 15:03 mpolomdeepsense