langchain
Can only load text as a text file, not as string input
I want to be able to pass pure string text, not as a text file. When I attempt to do so with long documents I get the error about the file name being too long:
Traceback (most recent call last):
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 436, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
return await dependant.call(**values)
File "/home/faizi/Projects/docu-query/langchain/main.py", line 50, in query
response = query_document(query, text)
File "/home/faizi/Projects/docu-query/langchain/__langchain__.py", line 13, in query_document
index = VectorstoreIndexCreator().from_loaders([loader])
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/langchain/indexes/vectorstore.py", line 69, in from_loaders
docs.extend(loader.load())
File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/langchain/document_loaders/text.py", line 17, in load
with open(self.file_path, encoding=self.encoding) as f:
OSError: [Errno 36] File name too long:
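The root cause is that TextLoader treats its argument as a *path*, so passing the document text itself makes the OS try to open a file whose name is the whole document. A stdlib-only sketch of why that fails up front (the 255-byte NAME_MAX limit is a Linux assumption; the exact limit varies by filesystem):

```python
import errno

# Any "filename" longer than NAME_MAX (255 bytes on most Linux filesystems)
# is rejected by the kernel before the file is even looked up.
caught_errno = None
try:
    open("a" * 300)  # stands in for passing document text as a path
except OSError as e:
    caught_errno = e.errno

# On Linux this is errno 36 (ENAMETOOLONG), the error in the traceback above.
```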
The way I've been able to get it to work has been like so:
# assumes: import tempfile
# and:     from langchain.document_loaders import TextLoader
#          from langchain.indexes import VectorstoreIndexCreator
# get the document from supabase where userName = userName
document = supabase \
    .table('Documents') \
    .select('document') \
    .eq('userName', userName) \
    .execute()
text = document.data[0]['document']
# write the text to a temporary file
temp = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8')
temp.write(text)
temp.seek(0)
# query the document
loader = TextLoader(temp.name)
index = VectorstoreIndexCreator().from_loaders([loader])
response = index.query(query)
# delete the temporary file
temp.close()
There must be a more straightforward way. Am I missing something here?
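Until strings are supported directly, the temp-file detour can at least be made safer: with a context manager the file is flushed before use and removed even if the query raises. A minimal sketch (the langchain calls are shown only in comments; `with_text_as_tempfile` is a hypothetical helper name, and reopening a NamedTemporaryFile by path assumes POSIX):

```python
import tempfile

def with_text_as_tempfile(text, fn):
    # Write the string to a NamedTemporaryFile and hand its path to fn.
    # The context manager guarantees the file is deleted even if fn raises.
    with tempfile.NamedTemporaryFile(mode="w+t", encoding="utf-8",
                                     suffix=".txt") as temp:
        temp.write(text)
        temp.flush()  # make the content visible to readers opening by path
        return fn(temp.name)

# With the langchain snippet above, usage would look roughly like:
# response = with_text_as_tempfile(
#     text,
#     lambda path: VectorstoreIndexCreator()
#         .from_loaders([TextLoader(path)])
#         .query(query),
# )
```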
I think this is all a bit of a mess. First of all, I don't think the carrier of the document should be conflated with the content. For example, UnstructuredHTMLLoader derives from UnstructuredFileLoader. This doesn't make sense, because a file is not the only source of HTML: it can be a file, a socket (e.g. via HTTP), a file IO object, a builder object, whatever. From examining the code, it looks as if the reason (at least for the HTML case) is to use the partitioning facilities in the unstructured library. The basic TextLoader case looks to have similar issues.
You can probably solve your problem by using UnstructuredFileIOLoader instead, and wrapping with an io object. That way you at least don't need to write a temp file.
It does look to me as if the DocumentLoader bits need some refactoring and cleaning up. I'm just really getting familiar with the codebase, and maybe a bit later I'll use it as a deep-dive exercise and prepare a PR. No promises, though 😝
I was able to load HTML strings by writing my own loader class, as follows. You should be able to do something similar for your case. I still think DocumentLoader needs some TLC.
from typing import Any, List, Optional

from langchain.document_loaders.unstructured import UnstructuredBaseLoader


class UnstructuredHtmlStringLoader(UnstructuredBaseLoader):
    """
    Uses unstructured to load a string.

    The source of the string, for metadata purposes, can be passed
    in by the caller.
    """

    def __init__(
        self, content: str, source: Optional[str] = None, mode: str = "single",
        **unstructured_kwargs: Any
    ):
        self.content = content
        self.source = source
        super().__init__(mode=mode, **unstructured_kwargs)

    def _get_elements(self) -> List:
        from unstructured.partition.html import partition_html

        return partition_html(text=self.content, **self.unstructured_kwargs)

    def _get_metadata(self) -> dict:
        return {"source": self.source} if self.source else {}
You can use this:
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema.document import Document


def get_text_chunks_langchain(text):
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    docs = [Document(page_content=x) for x in text_splitter.split_text(text)]
    return docs
Edit: I edited the code as @justege suggested
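To make the chunk_size / chunk_overlap parameters concrete, here is a stdlib-only sketch of the same fixed-width-with-overlap idea (the real CharacterTextSplitter also prefers to break on separators, so its output will differ; `chunk_with_overlap` is just an illustrative name):

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=100):
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    # Requires chunk_overlap < chunk_size.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```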
You can use this:
from langchain.text_splitter import CharacterTextSplitter


def get_text_chunks_langchain(text):
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    docs = text_splitter.split_text(text)
    return docs
summary_chain = load_summarize_chain(llm, chain_type='refine')
res = summary_chain.run(docs)
Running the chain then fails with AttributeError: 'str' object has no attribute 'page_content'.
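That AttributeError happens because split_text returns plain strings, while the summarize chain expects Document objects and reads .page_content from each one. A stdlib sketch of the mismatch (this Document is a minimal stand-in for langchain's, and join_docs is a hypothetical analogue of what the chain does first):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Minimal stand-in for langchain's Document: the chain only needs
    # page_content (and optionally metadata) on every item it is given.
    page_content: str
    metadata: dict = field(default_factory=dict)

def join_docs(docs):
    # Rough analogue of the chain's first step: read page_content from each doc.
    return "\n".join(d.page_content for d in docs)

texts = ["first chunk", "second chunk"]       # bare strings: no page_content
docs = [Document(page_content=t) for t in texts]  # wrapping fixes the error
```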
Finally I solved this problem with UnstructuredFileIOLoader, but I still think it's stupid. The content is not only obtained through files; acquiring the file content should be handled by the user's own code and should not be coupled into the loader.
I have found a solution for this in the typescript library, so you can probably do something similar in Python, too.
I used the TextLoader class and converted my text string into a blob, and the TextLoader accepts the blob type as an input argument.
This lets me parse the raw text without having to create a temporary file and loading it.
Check this out guys: https://blog.streamlit.io/langchain-tutorial-3-build-a-text-summarization-app/
I solved it with this:
def get_text_chunks_langchain(text):
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    texts = text_splitter.split_text(text)
    docs = [Document(page_content=t) for t in texts]
    return docs
Just going to add, since I ran into this, that you can potentially do:
from langchain.document_loaders.unstructured import UnstructuredFileIOLoader
from io import StringIO
loader = UnstructuredFileIOLoader(StringIO(text), mode="elements")
That should work...
BUT... it adds unstructured as a dependency, which then has sentence_transformers as a dependency, which in turn has pytorch and a bunch of nvidia libraries as dependencies. If you aren't already using those, it adds nearly 2GB of dependencies to your project, which seems a bit ridiculous if all you want is to load in strings without having to go through the filesystem.
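As an aside, the reason StringIO works in that snippet is that a file-IO loader only needs a readable file-like object, not a path. A quick stdlib check of that contract (the langchain call is shown only as a comment):

```python
from io import StringIO

text = "Some document text held entirely in memory."
buf = StringIO(text)

# StringIO satisfies the file-like interface (read/seek) that file-IO
# loaders consume, so no temp file ever touches the filesystem.
assert buf.read() == text
buf.seek(0)  # rewind so a subsequent reader sees the whole text

# The loader from the comment above would then receive `buf`:
# loader = UnstructuredFileIOLoader(buf, mode="elements")
```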
Hi, @FayZ676! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you were experiencing an error when trying to pass pure string text instead of a text file. You mentioned that you found a workaround by writing the text to a temporary file, but you were looking for a more straightforward solution.
In the comments, uogbuji suggested using UnstructuredFileIOLoader and wrapping it with an io object to avoid writing a temporary file. lingwndr and YourAverageTechBro also provided alternative code snippets to handle the issue.
Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!