private-gpt

How to read csv with text data in multiple cells

Open haris525 opened this issue 2 years ago • 5 comments

Hi guys, good morning. How would I go about reading text data that is contained in multiple cells of a CSV file? I updated my ingest.py file with the code below, and it has now been running for 10+ hours straight.

Here is my updated code:

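import glob
import os
from typing import List

# Import paths as used by langchain at the time of this issue.
from langchain.docstore.document import Document
from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader

def load_single_document(file_path: str) -> List[Document]:
    # Loads a single document from a file path, picking the loader by extension
    if file_path.endswith(".txt"):
        loader = TextLoader(file_path, encoding="utf8")
    elif file_path.endswith(".pdf"):
        loader = PDFMinerLoader(file_path)
    elif file_path.endswith(".csv"):
        # CSVLoader returns one Document per row of the file
        loader = CSVLoader(file_path, encoding="utf8")
    else:
        # Without this branch, `loader` would be unbound for other extensions
        raise ValueError(f"Unsupported file type: {file_path}")
    return loader.load()

def load_documents(source_dir: str) -> List[Document]:
    # Loads all documents from the source documents directory, recursively
    txt_files = glob.glob(os.path.join(source_dir, "**/*.txt"), recursive=True)
    pdf_files = glob.glob(os.path.join(source_dir, "**/*.pdf"), recursive=True)
    csv_files = glob.glob(os.path.join(source_dir, "**/*.csv"), recursive=True)
    all_files = txt_files + pdf_files + csv_files
    documents = [load_single_document(file_path) for file_path in all_files]
    # Flatten the per-file lists into a single list of documents
    return [doc for sublist in documents for doc in sublist]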

Any feedback on how to make this faster is appreciated! Thank you.

haris525, May 16 '23 13:05

What do you mean by multi-celled csv?

watrgoat, May 16 '23 16:05

......Python311\Lib\site-packages\chromadb\api\types.py", line 75, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
                  ~~~~~~^^^
IndexError: list index out of range

I got this error, but why? My data has 500+ cells and 8 to 10 columns, and I think that is where it blows up.
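The traceback shows `target[0]` failing on an empty list, so one possible cause (an assumption, not something confirmed in this thread) is that an empty list of texts or embeddings reached Chroma, for example because no rows were loaded or every chunk was blank. A minimal sketch of a guard to run before embedding; `Doc` and `non_empty_documents` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    # Hypothetical stand-in for langchain's Document, for illustration only.
    page_content: str
    metadata: dict = field(default_factory=dict)


def non_empty_documents(docs: List[Doc]) -> List[Doc]:
    """Drop documents with blank text and fail early if nothing is left."""
    kept = [d for d in docs if d.page_content and d.page_content.strip()]
    if not kept:
        # Raising here gives a clearer message than an IndexError deep in the vector store.
        raise ValueError("No non-empty documents to embed; check the source files.")
    return kept
```

If this raises on your data, the problem is in loading or splitting rather than in the vector store itself.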

Pylyric61, May 16 '23 20:05

What do you mean by multi-celled csv?

Hello, I mean text contained in multiple cells of a CSV file, like this:

index - comments
1 - this is a comment
2 - this is another comment
3 - this is a comment separated by a comma, here is another one
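One thing worth checking with data like row 3: a comment that contains a comma only stays in a single cell if that cell is quoted in the file. A small sketch using Python's standard `csv` module (the sample strings and the `parse_csv` helper are made up for illustration):

```python
import csv
import io
from typing import List


def parse_csv(text: str) -> List[List[str]]:
    """Parse CSV text into a list of rows, each a list of cell strings."""
    return list(csv.reader(io.StringIO(text)))


# Row 3 quoted: the comma stays inside the comment cell.
quoted = (
    "index,comments\n"
    "1,this is a comment\n"
    '3,"this is a comment separated by a comma, here is another one"\n'
)

# Row 3 unquoted: the comma splits the comment into two cells.
unquoted = (
    "index,comments\n"
    "1,this is a comment\n"
    "3,this is a comment separated by a comma, here is another one\n"
)

print(len(parse_csv(quoted)[-1]))    # 2 cells: index and the full comment
print(len(parse_csv(unquoted)[-1]))  # 3 cells: the comment was split at the comma
```

Files exported from a spreadsheet normally quote such cells already; hand-written files often do not.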

haris525, May 16 '23 20:05

Have you tried threading?
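A minimal sketch of what that could look like, assuming file loading is the slow step (that is an assumption: a 10+ hour run may be dominated by embedding rather than loading, and threads help mainly with I/O-bound work). `load_documents_parallel` and its parameters are hypothetical names; `load_one` stands for a per-file loader such as `load_single_document` above:

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def load_documents_parallel(
    source_dir: str,
    load_one: Callable[[str], list],
    extensions: Sequence[str] = (".txt", ".pdf", ".csv"),
    max_workers: int = 4,
) -> List:
    """Load every matching file under source_dir using a thread pool."""
    all_files = [
        path
        for ext in extensions
        for path in glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order; each call returns a list of documents.
        per_file = list(pool.map(load_one, all_files))
    # Flatten the per-file lists into one list.
    return [doc for docs in per_file for doc in docs]
```

Usage would be something like `load_documents_parallel("source_documents", load_single_document)`, where the directory name is only an example.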

Thrilok28021996, May 17 '23 15:05

I think there is a problem with the CSV file itself. Try the most recent version, because the ingest file has been updated. In my own testing I was able to read in a CSV file with multiple cells. It just outputs a list with one entry per row, which looks like this:

[page_content='fName: John\nlName: Doe\nAddr: 120 jefferson st.\nCity: Riverside\nState: NJ\nZip: 8075' metadata={'source': 'addresses.csv', 'row': 0},
page_content='fName: Jack\nlName: McGinnis\nAddr: 220 hobo Av.\nCity: Phila\nState: PA\nZip: 9119' metadata={'source': 'addresses.csv', 'row': 1},
page_content='fName: John "Da Man"\nlName: Repici\nAddr: 120 Jefferson St.\nCity: Riverside\nState: NJ\nZip: 8075' metadata={'source': 'addresses.csv', 'row': 2},
page_content='fName: Stephen\nlName: Tyler\nAddr: 7452 Terrace "At the Plaza" road\nCity: SomeTown\nState: SD\nZip: 91234' metadata={'source': 'addresses.csv', 'row': 3},
page_content='fName: \nlName: Blankman\nAddr: \nCity: SomeTown\nState: SD\nZip: 298' metadata={'source': 'addresses.csv', 'row': 4},
page_content='fName: Joan "the bone", Anne\nlName: Jet\nAddr: 9th, at Terrace plc\nCity: Desert City\nState: CO\nZip: 123' metadata={'source': 'addresses.csv', 'row': 5}]

Each item in the list is of type <class 'langchain.schema.Document'>.
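To make the shape of that output concrete, here is a rough standard-library approximation of the row-to-document mapping. It is not langchain's actual implementation, and it returns plain dicts rather than Document objects; `rows_to_documents` is a hypothetical name:

```python
import csv
import io
from typing import Dict, List


def rows_to_documents(csv_text: str, source: str) -> List[Dict]:
    """Turn each CSV row into a dict with "key: value" lines as its content."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per cell, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs


sample = 'fName,lName\nJohn,Doe\n"Joan ""the bone"", Anne",Jet\n'
for doc in rows_to_documents(sample, "addresses.csv"):
    print(doc)
```

Every row becomes its own entry, so a file with many rows yields many small documents to embed.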

watrgoat, May 17 '23 16:05