
feat: add ability to select columns from csv to use as metadata

Open k11q opened this issue 1 year ago • 7 comments

Feat: add functionality to the CSVLoader to set selected columns as metadata.

I added the docs and implemented the function in the CSVLoader class. The usage below shows how it works.

Usage: extracting a single column as page content, with selected columns as metadata

Example CSV file:

hadith_id,chapter_no,hadith_no,chapter,text_ar,text_en,source
91,3,91,Knowledge - كتاب العلم,"حدثنا عبد الله بن محمد... ثم أدها إليه ".","Narrated Zaid bin Khalid Al-Juhani:... for the wolf.",Sahih Bukhari
92,3,92,Knowledge - كتاب العلم,"حدثنا محمد بن العلاء... إلى الله عز وجل.","Narrated Abu Musa:... (Our offending you).",Sahih Bukhari
93,3,93,Knowledge - كتاب العلم,"حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.","Narrated Anas bin Malik:... the Prophet became silent.",Sahih Bukhari

Example code:

import { CSVLoader } from "langchain/document_loaders";
const loader = new CSVLoader(
  "all_hadiths_clean.csv",
  "text_ar",
  ["text_en", "source", "hadith_id", "chapter_no", "hadith_no", "chapter"]
);
const docs = await loader.load();
/*
[
  Document {
    pageContent: 'حدثنا عبد الله بن محمد... ثم أدها إليه ".',
    metadata: {
      text_en: ' Narrated Zaid bin Khalid Al-Juhani:... for the wolf."',
      source: 'Sahih Bukhari',
      hadith_id: '91',
      chapter_no: '3',
      hadith_no: ' 91 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 91
    }
  },
  Document {
    pageContent: 'حدثنا محمد بن العلاء... إلى الله عز وجل.',
    metadata: {
      text_en: ' Narrated Abu Musa:... (Our offending you).',
      source: 'Sahih Bukhari',
      hadith_id: '92',
      chapter_no: '3',
      hadith_no: ' 92 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 92
    }
  },
  Document {
    pageContent: 'حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.',
    metadata: {
      text_en: ' Narrated Anas bin Malik:... the Prophet became silent.',
      source: 'Sahih Bukhari',
      hadith_id: '93',
      chapter_no: '3',
      hadith_no: ' 93 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 93
    }
  }
]
*/
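To illustrate the idea behind the feature, here is a minimal self-contained sketch of the column-selection logic (not the actual CSVLoader implementation; `rowsToDocs`, `Row`, and `Doc` are hypothetical names for this example):

```typescript
type Row = Record<string, string>;

interface Doc {
  pageContent: string;
  metadata: Record<string, string | number>;
}

// One column becomes pageContent; the listed columns become metadata.
function rowsToDocs(
  rows: Row[],
  contentColumn: string,
  metadataColumns: string[]
): Doc[] {
  return rows.map((row, i) => ({
    pageContent: row[contentColumn] ?? "",
    metadata: {
      ...Object.fromEntries(metadataColumns.map((col) => [col, row[col] ?? ""])),
      // simplified stand-in for the `line` field in the output above
      line: i + 1,
    },
  }));
}
```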

Tested on my local machine (screenshot attached).

k11q avatar Apr 07 '23 11:04 k11q

The latest updates on your projects. Learn more about Vercel for Git ↗︎

langchainjs-docs — ✅ Ready — Updated (UTC): Apr 10, 2023 9:52am

vercel[bot] avatar Apr 07 '23 11:04 vercel[bot]

this functionality seems great to me

thanks! Next I am figuring out how to handle text that is too large to be embedded; the rows need to be split. I think this is a problem across all document loaders/types. I wonder if there's already a solution for this.

k11q avatar Apr 07 '23 14:04 k11q

so the pipeline is generally:

  • load documents
  • split documents (with the text splitters)
  • embed text

so I think it's more the responsibility of the text splitter to split documents if needed. Does that make sense?
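The three steps above can be sketched with a self-contained toy example. This is an illustration only: `splitDocs` is a naive character splitter standing in for langchain's text splitters, not real library code.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Naive fixed-size character splitter with overlap, illustrating what a
// text splitter does between loading and embedding.
function splitDocs(docs: Doc[], chunkSize: number, chunkOverlap: number): Doc[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const out: Doc[] = [];
  for (const doc of docs) {
    const text = doc.pageContent;
    for (let start = 0; start < text.length; start += chunkSize - chunkOverlap) {
      out.push({
        pageContent: text.slice(start, start + chunkSize),
        metadata: { ...doc.metadata }, // each chunk keeps the row's metadata
      });
      if (start + chunkSize >= text.length) break;
    }
  }
  return out;
}
```

Each chunk can then be embedded independently, since every one fits within the embedding model's size limit.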

hwchase17 avatar Apr 07 '23 14:04 hwchase17

sorry, I am still going through the codebase and don't fully grasp it. Do you mean the current implementation should split the rows into multiple documents, or that that is the goal? By documents I mean the Document class, because I think you meant document in a general sense. Also, should I fix anything with my current code?

k11q avatar Apr 07 '23 14:04 k11q

I got it now. Instead of using loader.load(), I should use loader.loadAndSplit(). Now it works perfectly.

k11q avatar Apr 08 '23 11:04 k11q

For CSV in particular, should we add metadata signifying that a row has been split, including the chunk number? E.g. if a row is split into 3, the chunks get metadata of chunk: 1/3, chunk: 2/3, and chunk: 3/3.

Also, there's currently no option in loadAndSplit to customize chunkSize and chunkOverlap. Should I override the loadAndSplit function in the CSVLoader class?
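The chunk-numbering idea above could look like this self-contained sketch (illustrative only, not part of the PR; `tagChunks` is a hypothetical helper, and chunks from the same row are identified here by a shared `line` metadata field):

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Tag each chunk with its position out of the total number of chunks
// produced from the same source row, e.g. chunk: "2/3".
function tagChunks(chunks: Doc[]): Doc[] {
  const totals = new Map<unknown, number>();
  for (const c of chunks) {
    totals.set(c.metadata.line, (totals.get(c.metadata.line) ?? 0) + 1);
  }
  const seen = new Map<unknown, number>();
  return chunks.map((c) => {
    const i = (seen.get(c.metadata.line) ?? 0) + 1;
    seen.set(c.metadata.line, i);
    return {
      ...c,
      metadata: { ...c.metadata, chunk: `${i}/${totals.get(c.metadata.line)}` },
    };
  });
}
```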

k11q avatar Apr 08 '23 11:04 k11q

@nfcampos I would like a review; I used this for my own use cases and it worked for me! If you approve, I will try to find a way to add the same feature for Python and other document loaders!

k11q avatar Apr 11 '23 11:04 k11q