unstructured
bug/file extensions supersede `content_type` when partitioning via API
Describe the bug
When using `partition_via_api`, the file extension for `file_filename` supersedes the `content_type` that the user passes in.
To Reproduce
The following results in a 400 from the API:
file_url = "https://s3.eu-north-1.amazonaws.com/clickable.so/f7edd61e-cef6-42cb-9608-5cd2e56479b8"
file_name = "test-document.csv"
file_response = requests.get(file_url)
loader = UnstructuredAPIFileIOLoader(
file=file_response.content, file_filename=file_name, content_type="text/plain"
)
docs = loader.load()
But the following does not:
The following produces an error:
file_url = "https://s3.eu-north-1.amazonaws.com/clickable.so/f7edd61e-cef6-42cb-9608-5cd2e56479b8"
file_name = "test-document.csv"
file_response = requests.get(file_url)
loader = UnstructuredAPIFileIOLoader(
file=file_response.content, file_filename=file_name, content_type="text/plain"
)
docs = loader.load()
Expected behavior
Both should successfully process the file as plain text.
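The expected precedence can be sketched as follows. This is a minimal illustration, not the library's implementation: `resolve_content_type` is a hypothetical helper, and the fallback uses the standard-library `mimetypes` module, which may differ from how `unstructured` or the API actually detects file types.

```python
import mimetypes
from typing import Optional


def resolve_content_type(
    filename: Optional[str], content_type: Optional[str] = None
) -> Optional[str]:
    """Hypothetical helper: prefer an explicit content type over the extension."""
    if content_type:
        # The caller stated the type explicitly, so the extension is ignored.
        return content_type
    if filename:
        # Guess from the extension only when no content type was passed in.
        guessed, _encoding = mimetypes.guess_type(filename)
        return guessed
    return None


# With the values from the report, the explicit type should win over ".csv".
resolved = resolve_content_type("test-document.csv", content_type="text/plain")
```

Under this ordering, the `.csv` extension would only matter when the caller leaves `content_type` unset.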
Desktop (please complete the following information):
- `unstructured==0.6.3`
- Python version 3.8
Additional context
- Initially reported by a `langchain` user here.
close this?
Closing because over 180 days old. Please reopen with a comment if the issue is still relevant.
Reopened by eng request; on current sprint
@MthwRobinson
Ok, so I started looking into the issue and I'm not sure if it's still relevant. The version I'm testing on is `0.12.5`, and although `partition_via_api` accepts a `content_type` argument, it is not used anywhere inside the function.
Using the given example, I wasn't able to reproduce the issue. Also, both of the given examples are the same, yet one is supposed to return a 400 and the other an error (???). I suppose one of them shouldn't use the `content_type` argument. I checked that and it was also working.
Then I checked by calling `partition_via_api` directly, and it gives the same results as the previous tests.
As for the unused `content_type`, is this expected behavior? I noticed that the similar function `partition_multiple_via_api` takes a `content_types` argument, and there the argument is used.
> As for the unused content_type is this expected behavior? I noticed in similar function partition_multiple_via_api it takes content_types argument and in there this argument is used.
No, it should always be possible to explicitly specify the file(s) content_type.
@cragwolfe
> No, it should always be possible to explicitly specify the file(s) content_type.
In this case I think the best idea would be to create a separate issue for adding `content_type` handling inside the `partition_via_api` function. The DoD would also require checking whether the bug described in this issue still exists.
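As a rough sketch of what such handling might look like, the content type could be attached to the multipart entry that is uploaded. Everything here is an assumption for illustration: `build_files_field` is a hypothetical helper, the `"files"` field name and the `application/octet-stream` fallback are guesses rather than confirmed API behavior, and the actual change in `partition_via_api` may look different.

```python
import io
from typing import IO, Optional, Tuple

# One multipart entry in the shape accepted by the `files=` argument of
# requests: (field_name, (filename, file_object, content_type)).
FilesField = Tuple[str, Tuple[str, IO[bytes], str]]


def build_files_field(
    file: IO[bytes], filename: str, content_type: Optional[str] = None
) -> FilesField:
    """Hypothetical helper: build a multipart entry that carries content_type."""
    # Assumption: fall back to a generic binary type when none is given.
    part_type = content_type or "application/octet-stream"
    # Assumption: the API reads uploads from a form field named "files".
    return ("files", (filename, file, part_type))


# Values taken from the report; the bytes are placeholder file contents.
field = build_files_field(
    io.BytesIO(b"a,b\n1,2\n"), "test-document.csv", content_type="text/plain"
)
```

A list such as `[field]` could then be passed as `files=` to `requests.post`, so the explicit type travels with the upload instead of being dropped.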
Closing this since the unstructured Python client is now the best-supported way to partition through the API.