langchainjs
langchainjs copied to clipboard
feat: add ability to select columns from csv to use as metadata
Feat: add functionality to the CSVLoader to be able to set columns as metadata.
I added the docs and implement the function in the CSVLoader class. The usage below would explain how it works.
Usage, extracting a single column with metadata
Example CSV file:
hadith_id,chapter_no,hadith_no,chapter,text_ar,text_en,source
91,3,91,Knowledge - كتاب العلم,"حدثنا عبد الله بن محمد... ثم أدها إليه ".","Narrated Zaid bin Khalid Al-Juhani:... for the wolf.",Sahih Bukhari
92,3,92,Knowledge - كتاب العلم,"حدثنا محمد بن العلاء... إلى الله عز وجل.","Narrated Abu Musa:... (Our offending you).",Sahih Bukhari
93,3,93,Knowledge - كتاب العلم,"حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.","Narrated Anas bin Malik:... the Prophet became silent.",Sahih Bukhari
Example code:
import { CSVLoader } from "langchain/document_loaders";
const loader = new CSVLoader(
"all_hadiths_clean.csv",
"text_ar",
["text_en", "source", "hadith_id", "chapter_no", "hadith_no", "chapter"]
);
const docs = await loader.load();
/*
[
Document {
pageContent: 'حدثنا عبد الله بن محمد... ثم أدها إليه ".',
metadata: {
text_en: ' Narrated Zaid bin Khalid Al-Juhani:... for the wolf."',
source: 'Sahih Bukhari',
hadith_id: '91',
chapter_no: '3',
hadith_no: ' 91 ',
chapter: 'Knowledge - كتاب العلم',
line: 91
}
},
Document {
pageContent: 'حدثنا محمد بن العلاء... إلى الله عز وجل.',
metadata: {
text_en: ' Narrated Abu Musa:... (Our offending you).',
source: 'Sahih Bukhari',
hadith_id: '92',
chapter_no: '3',
hadith_no: ' 92 ',
chapter: 'Knowledge - كتاب العلم',
line: 92
}
},
Document {
pageContent: 'حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.',
metadata: {
text_en: ' Narrated Anas bin Malik:... the Prophet became silent.',
source: 'Sahih Bukhari',
hadith_id: '93',
chapter_no: '3',
hadith_no: ' 93 ',
chapter: 'Knowledge - كتاب العلم',
line: 93
}
}
]
*/
Tested on my local machine:
The latest updates on your projects. Learn more about Vercel for Git ↗︎
Name | Status | Preview | Updated (UTC) |
---|---|---|---|
langchainjs-docs | ✅ Ready (Inspect) | Visit Preview | Apr 10, 2023 9:52am |
this functionality seems great to me
thanks! next I am figuring out how to handle if the text is to large to be embedded, need to split the rows. I think this is a problem across all document loader/types. wonder if theres already a solution for this
this functionality seems great to me
thanks! next I am figuring out how to handle if the text is to large to be embedded, need to split the rows. I think this is a problem across all document loader/types. wonder if theres already a solution for this
so the pipeline is generally:
- load documents
- split documents (with the text splitters)
- embed text
so i think its more the responsibility of the text splitter to split documents if needed. does that make sense?
this functionality seems great to me
thanks! next I am figuring out how to handle if the text is to large to be embedded, need to split the rows. I think this is a problem across all document loader/types. wonder if theres already a solution for this
so the pipeline is generally:
- load documents
- split documents (with the text splitters)
- embed text
so i think its more the responsibility of the text splitter to split documents if needed. does that make sense?
sorry I am still going through the codebase and dont fully grasp it. do you mean the current implementation should split the rows into multiple documents or that is the goal? by documents I mean the Document class, because i think you meant document in general sense. Also should I fix anything with my current code?
this functionality seems great to me
thanks! next I am figuring out how to handle if the text is to large to be embedded, need to split the rows. I think this is a problem across all document loader/types. wonder if theres already a solution for this
so the pipeline is generally:
- load documents
- split documents (with the text splitters)
- embed text
so i think its more the responsibility of the text splitter to split documents if needed. does that make sense?
I got it now. instead of using loader.load(), I should use loader.loadAndSplit(). so now it works perfect.
I got it now. instead of using loader.load(), I should use loader.loadAndSplit(). so now it works perfect.
for csv in particular. should we add a metadata signifying a row has been splitted, putting chunk number, e.g. if a row splitted into 3, theres a metadata of chunk: 1/3 , chunk: 2/3 and chunk 3/3 ?
also currently theres no option to modify the loadAndSplit function to customize chunkSize and chunkOverlap. should I override the loadAndSplit function in the CSVLoader class?
@nfcampos I would like a review, used for my use cases and worked for me! If you approve I will try to find a way to add the same feature for python and other document loaders!