HuggingfaceDatasetLoader escapes strings instead of returning them raw
Checked other resources
- [X] I added a very descriptive title to this issue.
- [X] I searched the LangChain documentation with the integrated search.
- [X] I used the GitHub search to find a similar question and didn't find it.
- [X] I am sure that this is a bug in LangChain rather than my code.
- [X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
The current implementation passes every value, including plain strings, through json.dumps:
    class HuggingFaceDatasetLoader(BaseLoader):
        ...
        def parse_obj(self, page_content: Union[str, object]) -> str:
            if isinstance(page_content, object):
                return json.dumps(page_content)
            return page_content
This double-escapes characters in the strings, for example "\n" becomes "\\n". A quick fix (not the best one, but it works) is to check for strings first:
    class HuggingFaceDatasetLoader(BaseLoader):
        ...
        def parse_obj(self, page_content: Union[str, object]) -> str:
            if isinstance(page_content, str):
                return page_content
            return json.dumps(page_content)
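To reproduce the difference without loading a dataset, here is a minimal sketch that copies the two checks into standalone functions. The names parse_obj_original and parse_obj_patched and the sample string are made up for illustration and are not part of LangChain:

```python
import json
from typing import Union


def parse_obj_original(page_content: Union[str, object]) -> str:
    # Mirrors the check quoted above: every Python value is an `object`,
    # so this branch is always taken and the final return is unreachable.
    if isinstance(page_content, object):
        return json.dumps(page_content)
    return page_content


def parse_obj_patched(page_content: Union[str, object]) -> str:
    # Strings are returned unchanged; everything else is serialized.
    if isinstance(page_content, str):
        return page_content
    return json.dumps(page_content)


# Hypothetical Markdown cell, standing in for a dataset column value.
markdown = "# Title\n\nSome text"

original = parse_obj_original(markdown)
patched = parse_obj_patched(markdown)

print(repr(original))  # JSON-encoded: quoted, newlines escaped
print(repr(patched))   # identical to the input string
```

Non-string values such as dicts or lists still go through json.dumps in the patched version, so only the handling of str changes.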
Error Message and Stack Trace (if applicable)
No error message
Description
I am trying to load HuggingFace datasets that contain Markdown-formatted strings. The strings come back double-escaped, e.g. "\n" becomes "\\n", because of the json.dumps() call in the original code. The likely cause is that every value of type "str" is also an instance of object, so the isinstance(page_content, object) check is always true.
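The suspected cause can be checked in isolation. The snippet below is only a sketch with made-up sample values: it confirms that isinstance(value, object) holds for strings and non-strings alike, and that json.dumps makes a string with one newline longer (two surrounding quotes plus one extra backslash):

```python
import json

# Arbitrary sample values; any Python value would behave the same way.
samples = ["line one\nline two", 42, None, ["a", "b"], {"k": "v"}]

# Every value, string or not, is an instance of `object`.
all_objects = all(isinstance(value, object) for value in samples)
print(all_objects)

# json.dumps wraps the string in quotes and escapes the newline.
dumped = json.dumps(samples[0])
print(len(samples[0]), len(dumped))
```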
System Info
System Information
OS: Linux
OS Version: #115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024
Python Version: 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0]
Package Information
langchain_core: 0.3.17
langchain: 0.3.7
langchain_community: 0.3.7
langsmith: 0.1.142
langchain_huggingface: 0.1.2
langchain_ollama: 0.2.0
langchain_text_splitters: 0.3.2
Optional packages not installed
langgraph
langserve
Other Dependencies
aiohttp: 3.10.10
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.0
httpx-sse: 0.4.0
huggingface-hub: 0.26.2
jsonpatch: 1.33
numpy: 1.26.4
ollama: 0.3.3
orjson: 3.10.11
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.1
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
sentence-transformers: 3.3.0
SQLAlchemy: 2.0.35
tenacity: 9.0.0
tokenizers: 0.20.3
transformers: 4.46.2
typing-extensions: 4.11.0