llama_index
Update csv parser
This PR updates the `CSVParser` with a `concatenate` keyword parameter, allowing the creation of a separate `Document` for each row. The `concatenate` option is set to `True` by default to preserve current behavior. This required updating the return signature of the `BaseParser.parse_file` method to return a union of either `str` or `List[str]`. It also required me to update some of the logic in the `SimpleDirectoryReader.load_data` method.
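To make the two modes concrete, here is a rough sketch of the behavior described above. It is not the code from this PR: `parse_csv_text` is a hypothetical stand-in that works on an in-memory string rather than a file path, and joining cells with `", "` is an assumption made for illustration.

```python
import csv
import io
from typing import List, Union


def parse_csv_text(text: str, concatenate: bool = True) -> Union[str, List[str]]:
    """Hypothetical stand-in for a CSV parser with a `concatenate` switch."""
    # Render each CSV row as one line of comma-separated cells.
    rows = [", ".join(row) for row in csv.reader(io.StringIO(text))]
    if concatenate:
        # Default: one string for the whole file (the pre-existing behavior).
        return "\n".join(rows)
    # New option: one string per row, so each row can become its own Document.
    return rows


sample = "name,age\nalice,30\nbob,25\n"
joined = parse_csv_text(sample)
per_row = parse_csv_text(sample, concatenate=False)
```

With `concatenate=False` the caller receives a list, which is why the return type has to widen from `str` to a union.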
IMO, `parse_file` should always return a `List[str]` for any reader. We can then always call `extend()`, passing a single-element list when there is only one string, rather than type-checking `data` to decide between `extend()` and `append()`.
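A small sketch of the trade-off, using hypothetical helper names (`collect_with_type_check`, `collect_always_list`) that are not part of llama_index. The first function mirrors the type check a union return type forces on the caller; the second shows the simpler loop that becomes possible if every parser returns a list.

```python
from typing import List, Union


def collect_with_type_check(results: List[Union[str, List[str]]]) -> List[str]:
    """Caller-side logic needed when a parser may return str or List[str]."""
    docs: List[str] = []
    for data in results:
        if isinstance(data, list):
            docs.extend(data)
        else:
            # Calling extend() on a str would add it character by character.
            docs.append(data)
    return docs


def collect_always_list(results: List[List[str]]) -> List[str]:
    """Caller-side logic when every parser returns List[str]."""
    docs: List[str] = []
    for data in results:
        docs.extend(data)
    return docs


mixed = ["whole file text", ["row 1", "row 2"]]
uniform = [["whole file text"], ["row 1", "row 2"]]
flat_mixed = collect_with_type_check(mixed)
flat_uniform = collect_always_list(uniform)
```

Both calls produce the same flat list; the second version simply has no branch to get wrong.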
Example use:
customExtractor = { '.csv': CSVParser(concatenate=False) }
data = SimpleDirectoryReader("data", file_extractor=customExtractor).load_data()
Now that I think about it, it might also be worth renaming `concatenate` to something else to accommodate returning documents of columns in the future.