Can only load text as a text file, not as string input

Open FayZ676 opened this issue 1 year ago • 3 comments

I want to be able to pass plain string text directly rather than a path to a text file. When I attempt to do so with long documents, I get an error about the file name being too long:

Traceback (most recent call last):
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 436, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/faizi/Projects/docu-query/langchain/main.py", line 50, in query
    response = query_document(query, text)
  File "/home/faizi/Projects/docu-query/langchain/__langchain__.py", line 13, in query_document
    index = VectorstoreIndexCreator().from_loaders([loader])
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/langchain/indexes/vectorstore.py", line 69, in from_loaders
    docs.extend(loader.load())
  File "/home/faizi/miniconda3/envs/langchain/lib/python3.10/site-packages/langchain/document_loaders/text.py", line 17, in load
    with open(self.file_path, encoding=self.encoding) as f:
OSError: [Errno 36] File name too long:
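
For context, the last frame of the traceback is `open(self.file_path, encoding=self.encoding)`, so the loader treats whatever string it receives as a path. A minimal stdlib-only sketch of that failure mode follows. `PathOnlyLoader` and `try_load` are hypothetical names written for illustration, not LangChain's actual classes, and the exact errno depends on the platform (36 is ENAMETOOLONG on Linux).

```python
import errno


class PathOnlyLoader:
    """Hypothetical stand-in that, like the traceback above, opens its argument as a path."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> str:
        with open(self.file_path, encoding=self.encoding) as f:
            return f.read()


def try_load(text: str):
    """Return the OSError raised when document text is passed where a path is expected."""
    try:
        PathOnlyLoader(text).load()
    except OSError as exc:
        return exc
    return None


long_text = "lorem ipsum " * 1000  # 12,000 characters of document content, not a path
error = try_load(long_text)
print(type(error).__name__, error.errno == errno.ENAMETOOLONG)
```

Short strings would probably fail differently (file not found), which may be why the problem only shows up with long documents.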

The way I've been able to get it to work has been like so:

    # get document from supabase where userName = userName
    document = supabase \
        .table('Documents') \
        .select('document') \
        .eq('userName', userName) \
        .execute()
    text = document.data[0]['document']

    # write text to a temporary file
    temp = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8')
    temp.write(text)
    temp.seek(0)

    # query the document
    loader = TextLoader(temp.name)
    index = VectorstoreIndexCreator().from_loaders([loader])
    response = index.query(query)

    # delete the temporary file
    temp.close()
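
The temp-file workaround above can be sketched a little more defensively with a context manager, so the file is removed even if the query raises. This is an illustration only: `read_via_temp_file` is a hypothetical helper, plain `open()` stands in for a path-based loader, and reopening the file by name while it is still open works on POSIX systems but not on Windows.

```python
import tempfile


def read_via_temp_file(text: str) -> str:
    """Write text to a temporary file, read it back by path, and clean up automatically."""
    with tempfile.NamedTemporaryFile(mode="w+t", encoding="utf-8", suffix=".txt") as temp:
        temp.write(text)
        temp.flush()  # make sure the content is on disk before another reader opens the path
        # A path-based loader would be given temp.name here; plain open() stands in for it.
        with open(temp.name, encoding="utf-8") as f:
            return f.read()
    # Leaving the with-block closes and deletes the temporary file.


round_tripped = read_via_temp_file("document text from the database")
```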

There must be a more straightforward way. Am I missing something here?

FayZ676 avatar Apr 25 '23 22:04 FayZ676

I think this is all a bit of a mess. First of all, I don't think the carrier of the document should be conflated with the content. For example, UnstructuredHTMLLoader derives from UnstructuredFileLoader. This doesn't make sense, because a file is not the only source of HTML. It can be a file, a socket (e.g. via HTTP), a file-like IO object, a builder object, whatever. From examining the code, it looks as if the reason (at least for the HTML case) is to use the partitioning facilities in the unstructured library. The basic TextLoader case looks to have similar issues.

You can probably solve your problem by using UnstructuredFileIOLoader instead, and wrapping with an io object. That way you at least don't need to write a temp file.
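
The idea of wrapping the string in an io object can be sketched with the standard library alone: `io.StringIO` gives a string the same `read()` interface as an open text file, so a loader that accepts a file object rather than a path never has to touch the filesystem. `load_from_file_object` below is a hypothetical helper for illustration, not part of LangChain.

```python
import io
from typing import TextIO


def load_from_file_object(file_obj: TextIO) -> str:
    """Hypothetical loader that reads from any file-like object instead of a path."""
    return file_obj.read()


text = "A long document held in memory as a plain string."
content = load_from_file_object(io.StringIO(text))
```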

It does look to me as if the DocumentLoader bits need some refactoring and cleaning up. I'm only just getting familiar with the codebase, and maybe a bit later I'll use this as a deep-dive exercise and prepare a PR. No promises, though 😝

uogbuji avatar May 22 '23 03:05 uogbuji

I was able to load HTML strings directly by writing my own loader class, as follows. You should be able to do something similar for your case. I still think DocumentLoader needs some TLC.

from typing import Any, List, Optional

from langchain.document_loaders.unstructured import UnstructuredBaseLoader

class UnstructuredHtmlStringLoader(UnstructuredBaseLoader):
    '''
    Uses unstructured to load HTML from a string.
    The source of the string, for metadata purposes, can be passed in by the caller.
    '''

    def __init__(
        self, content: str, source: Optional[str] = None, mode: str = "single",
        **unstructured_kwargs: Any
    ):
        self.content = content
        self.source = source
        super().__init__(mode=mode, **unstructured_kwargs)

    def _get_elements(self) -> List:
        # Imported lazily so unstructured is only required when the loader is used
        from unstructured.partition.html import partition_html

        return partition_html(text=self.content, **self.unstructured_kwargs)

    def _get_metadata(self) -> dict:
        return {"source": self.source} if self.source else {}

uogbuji avatar May 22 '23 03:05 uogbuji

You can use this:

from langchain.text_splitter import CharacterTextSplitter
from langchain.schema.document import Document

def get_text_chunks_langchain(text):
   text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
   docs = [Document(page_content=x) for x in text_splitter.split_text(text)]
   return docs
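
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified, dependency-free sketch of fixed-size chunking with overlap. It is an illustration under stated assumptions only: `CharacterTextSplitter` splits on a separator first and then merges the pieces, so its output will generally differ from this. `Chunk` and `chunk_text` are hypothetical names, not LangChain APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a document object carrying text and metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[Chunk]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(page_content=text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
```

The overlap means the tail of each chunk is repeated at the head of the next, which helps keep sentences that straddle a boundary retrievable from at least one chunk.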

Edit: I edited the code as @justege suggested

dhpour avatar May 27 '23 11:05 dhpour

You can use this:

from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter

def get_text_chunks_langchain(text):
   text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
   docs = text_splitter.split_text(text)  # returns plain strings, not Document objects
   return docs

docs = get_text_chunks_langchain(text)
summary_chain = load_summarize_chain(llm, chain_type='refine')
res = summary_chain.run(docs)

This fails with AttributeError: 'str' object has no attribute 'page_content'.
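
Judging by the error message, the chain reads a `page_content` attribute on each item it receives, while `split_text` yields plain strings. A dependency-free sketch of that mismatch and of the wrapping fix follows; `Doc` and `join_contents` are hypothetical stand-ins, not LangChain's `Document` or chain code.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for an object exposing page_content."""
    page_content: str


def join_contents(docs) -> str:
    """Mimics a consumer that expects document objects, not bare strings."""
    return "\n".join(d.page_content for d in docs)


pieces = ["first chunk", "second chunk"]  # the kind of list a text splitter's split_text yields

try:
    join_contents(pieces)
    failure = None
except AttributeError as exc:
    failure = str(exc)  # bare strings have no page_content attribute

# Wrapping each string in a document-like object satisfies the consumer.
wrapped = [Doc(page_content=p) for p in pieces]
joined = join_contents(wrapped)
```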

In the end I solved this problem by using UnstructuredFileIOLoader, but I still think the design is stupid. Content does not only come from files; acquiring the content should be left to the user's own code and should not be coupled to the loader.

MaxwellEdisons avatar Jul 21 '23 08:07 MaxwellEdisons

I have found a solution for this in the TypeScript library, so you can probably do something similar in Python, too.

I used the TextLoader class and converted my text string into a blob; the TextLoader accepts the blob type as an input argument.

This lets me parse the raw text without having to create a temporary file and load it.

YourAverageTechBro avatar Aug 06 '23 18:08 YourAverageTechBro

Check this out guys: https://blog.streamlit.io/langchain-tutorial-3-build-a-text-summarization-app/

I solved it with this:

def get_text_chunks_langchain(text):
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    texts = text_splitter.split_text(text)
    docs = [Document(page_content=t) for t in texts]
    return docs

justege avatar Aug 13 '23 09:08 justege

Just going to add, since I ran into this, that you can potentially do:

from langchain.document_loaders.unstructured import UnstructuredFileIOLoader
from io import StringIO

loader = UnstructuredFileIOLoader(StringIO(text), mode="elements")

That should work...

BUT... it adds unstructured as a dependency, which then has sentence_transformers as a dependency. That has pytorch and a bunch of nvidia libraries as a dependency. If you aren't already using those, it adds nearly 2GB of dependencies to your project, which seems a bit ridiculous if all you want is to load in strings without having to go through the filesystem.

thraxil avatar Aug 24 '23 12:08 thraxil

Hi, @FayZ676! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were experiencing an error when trying to pass pure string text instead of a text file. You mentioned that you found a workaround by writing the text to a temporary file, but you were looking for a more straightforward solution.

In the comments, uogbuji suggested using UnstructuredFileIOLoader and wrapping it with an io object to avoid writing a temporary file. lingwndr and YourAverageTechBro also provided alternative code snippets to handle the issue.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

dosubot[bot] avatar Nov 24 '23 16:11 dosubot[bot]