unstructured
bug/file extensions supersede `content_type` when partitioning via API
Describe the bug
When using `partition_via_api`, the file extension for `file_filename` supersedes the `content_type` that the user passes in.
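To make the expected precedence concrete, here is a minimal sketch, assuming the content type is otherwise guessed from the file extension. `resolve_content_type` is a hypothetical helper for illustration only and is not part of `unstructured`.

```python
import mimetypes
from typing import Optional


def resolve_content_type(file_filename: str, content_type: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit content_type wins over the extension."""
    if content_type:
        # The caller stated the type explicitly, so the extension is ignored.
        return content_type
    # Fall back to guessing from the file extension only when nothing was passed.
    guessed, _encoding = mimetypes.guess_type(file_filename)
    return guessed or "application/octet-stream"


# A .csv filename combined with an explicit "text/plain" resolves to "text/plain".
print(resolve_content_type("test-document.csv", content_type="text/plain"))
```

With this ordering, the extension is only consulted when the caller passes no `content_type` at all.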
To Reproduce
The following results in a 400 from the API:
file_url = "https://s3.eu-north-1.amazonaws.com/clickable.so/f7edd61e-cef6-42cb-9608-5cd2e56479b8"
file_name = "test-document.csv"
file_response = requests.get(file_url)
loader = UnstructuredAPIFileIOLoader(
    file=file_response.content, file_filename=file_name, content_type="text/plain"
)
docs = loader.load()
But the following does not:
The following produces an error:
file_url = "https://s3.eu-north-1.amazonaws.com/clickable.so/f7edd61e-cef6-42cb-9608-5cd2e56479b8"
file_name = "test-document.csv"
file_response = requests.get(file_url)
loader = UnstructuredAPIFileIOLoader(
    file=file_response.content, file_filename=file_name, content_type="text/plain"
)
docs = loader.load()
Expected behavior
Both should successfully process the file as plain text.
Desktop (please complete the following information):
- unstructured==0.6.3
- Python version 3.8
Additional context
- Initially reported by a langchain user here.
close this?
Closing because over 180 days old. Please reopen with a comment if the issue is still relevant.
Reopened by eng request; on current sprint
@MthwRobinson
Ok, so I started looking into the issue and I'm not sure if it's still relevant. The version I'm testing on is 0.12.5, and although `partition_via_api` accepts a `content_type` argument, it is not used anywhere inside the function.
Using the given example, I wasn't able to reproduce the issue. Also, both of the given examples are identical, yet one is supposed to return a 400 and the other an error (???). I suppose one of them shouldn't pass the `content_type` argument. I checked that variant and it also worked.
Then I checked by calling `partition_via_api` directly, and it gives the same results as the previous tests.
As for the unused `content_type`, is this expected behavior? I noticed that the similar function `partition_multiple_via_api` takes a `content_types` argument, and there the argument is actually used.
> As for the unused `content_type`, is this expected behavior? I noticed that the similar function `partition_multiple_via_api` takes a `content_types` argument, and there the argument is actually used.
No, it should always be possible to explicitly specify the file(s) content_type.
@cragwolfe
> No, it should always be possible to explicitly specify the file(s) content_type.
In this case I think the best approach would be to create a separate issue for adding `content_type` handling inside the `partition_via_api` function. The DoD would also require checking whether the bug described in this issue still exists.
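For reference, a rough sketch of what such handling could look like, assuming the request is sent as a multipart upload in the `(filename, fileobj, content_type)` tuple form that `requests` accepts for its `files=` argument. `build_files_payload` and the `"files"` field name are assumptions for illustration, not the actual `partition_via_api` internals.

```python
import io
from typing import IO, Any, Dict, Optional


def build_files_payload(
    file: IO[bytes],
    file_filename: str,
    content_type: Optional[str] = None,
    field_name: str = "files",  # assumed form-field name
) -> Dict[str, Any]:
    """Hypothetical helper: build a multipart payload for requests.post(files=...)."""
    if content_type:
        # A 3-tuple sets the part's Content-Type header explicitly.
        return {field_name: (file_filename, file, content_type)}
    # A 2-tuple leaves the content type unspecified in the payload.
    return {field_name: (file_filename, file)}


payload = build_files_payload(
    io.BytesIO(b"a,b\n1,2\n"), "test-document.csv", content_type="text/plain"
)
print(payload["files"][0], payload["files"][2])
```

The resulting dict could then be passed as `files=payload` to `requests.post`, so a user-supplied `content_type` travels with the upload instead of being dropped.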