HuggingfaceDatasetLoader escapes strings instead of returning them raw

Open skaltenp opened this issue 3 months ago • 1 comments

Checked other resources

  • [X] I added a very descriptive title to this issue.
  • [X] I searched the LangChain documentation with the integrated search.
  • [X] I used the GitHub search to find a similar question and didn't find it.
  • [X] I am sure that this is a bug in LangChain rather than my code.
  • [X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

The following code passes every value, including plain strings, through json.dumps:

class HuggingFaceDatasetLoader(BaseLoader):
    ...
    def parse_obj(self, page_content: Union[str, object]) -> str:
        # Always true: every Python value, including str, is an instance of object.
        if isinstance(page_content, object):
            return json.dumps(page_content)
        return page_content

This leads to double-escaped characters in the returned strings, for example "\n" is converted to "\\n". A short fix (not the best one, but it works) is to check for strings first:

class HuggingFaceDatasetLoader(BaseLoader):
    ...
    def parse_obj(self, page_content: Union[str, object]) -> str:
        # Return strings unchanged; serialize everything else.
        if isinstance(page_content, str):
            return page_content
        return json.dumps(page_content)
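
To compare the two variants in isolation, here is a minimal sketch that copies both method bodies into standalone functions. The names parse_obj_current and parse_obj_fixed and the sample string are made up for illustration; they are not part of langchain_community.

```python
import json
from typing import Union


def parse_obj_current(page_content: Union[str, object]) -> str:
    # Mirrors the check quoted above: it is always true, so even
    # plain strings are serialized with json.dumps.
    if isinstance(page_content, object):
        return json.dumps(page_content)
    return page_content


def parse_obj_fixed(page_content: Union[str, object]) -> str:
    # Mirrors the proposed fix: strings are returned untouched,
    # everything else is still serialized.
    if isinstance(page_content, str):
        return page_content
    return json.dumps(page_content)


# Hypothetical markdown sample containing real newline characters.
markdown = "# Title\n\nSome *text*"

# repr() makes the added quotes and the escaped backslashes visible.
print(repr(parse_obj_current(markdown)))
print(repr(parse_obj_fixed(markdown)))
print(parse_obj_fixed({"a": 1}))
```

With the current check, the string comes back wrapped in double quotes with each newline replaced by a backslash followed by "n". With the string check first, it comes back unchanged, while non-string values are still serialized.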

Error Message and Stack Trace (if applicable)

No error message

Description

I am trying to load HuggingFace datasets that contain markdown-formatted strings. The loaded strings contain double-escaped characters, for example "\n" becomes "\\n", because the original code calls json.dumps() on them. This is probably because every value of type "str" is also an instance of "object", so the isinstance(page_content, object) check always succeeds.
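
The suspected cause can be checked without the loader. The sketch below uses made-up sample values. The json.loads call at the end is only a possible caller-side workaround and an assumption on my part, not something the loader does; it is only meaningful when the original value was a string.

```python
import json

# Made-up sample values of different types.
samples = ["line one\nline two", 42, {"k": "v"}, None]

# Every Python value is an instance of object, so this check cannot
# tell strings apart from anything else.
always_object = all(isinstance(s, object) for s in samples)

original = "line one\nline two"

# json.dumps adds surrounding quotes and turns the newline character
# into the two characters backslash and "n".
dumped = json.dumps(original)

# Possible workaround on the caller's side (assumption): decoding the
# JSON string literal gives back the original text.
restored = json.loads(dumped)

print(always_object)
print(repr(dumped))
print(restored == original)
```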

System Info

System Information

OS: Linux
OS Version: #115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024
Python Version: 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0]

Package Information

langchain_core: 0.3.17
langchain: 0.3.7
langchain_community: 0.3.7
langsmith: 0.1.142
langchain_huggingface: 0.1.2
langchain_ollama: 0.2.0
langchain_text_splitters: 0.3.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.0
httpx-sse: 0.4.0
huggingface-hub: 0.26.2
jsonpatch: 1.33
numpy: 1.26.4
ollama: 0.3.3
orjson: 3.10.11
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.1
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
sentence-transformers: 3.3.0
SQLAlchemy: 2.0.35
tenacity: 9.0.0
tokenizers: 0.20.3
transformers: 4.46.2
typing-extensions: 4.11.0

skaltenp • Nov 12 '24 21:11