llama_index
Update csv parser
This PR adds a concatenate keyword parameter to CSVParser. Setting it to False creates a separate Document for each row. It defaults to True to preserve the current behavior. To support this, the return type of BaseParser.parse_file is now a union of str and List[str], and I also had to update some of the logic in SimpleDirectoryReader.load_data.
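To make the two return shapes concrete, here is a minimal standalone sketch of the idea. It is not the code in this PR: parse_csv_text is a hypothetical helper that takes CSV text directly, and joining cells with ", " is an assumption made for illustration.

```python
import csv
import io
from typing import List, Union


def parse_csv_text(text: str, concatenate: bool = True) -> Union[str, List[str]]:
    """Hypothetical stand-in for a CSV parser with a `concatenate` flag."""
    # Render every CSV row as one line of text (cell separator is an assumption).
    rows = [", ".join(row) for row in csv.reader(io.StringIO(text))]
    if concatenate:
        # Default: one string for the whole file, i.e. a single document.
        return "\n".join(rows)
    # Otherwise: one string per row, i.e. one document per row.
    return rows


sample = "name,age\nada,36\n"
whole_file = parse_csv_text(sample)                   # a single str
per_row = parse_csv_text(sample, concatenate=False)   # a list of str
```

With concatenate left at its default the caller gets a single string, so existing callers see no change; only callers that opt in receive a list.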
IMO parse_file should always return a List[str] for every reader. The caller could then always call extend(), passing a single-element list when there is only one string, instead of type-checking the data to decide between extend() and append().
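The difference on the caller's side can be sketched roughly as follows. Both function names are hypothetical and plain strings stand in for parsed file contents; this is not the actual load_data code.

```python
from typing import List, Union


def collect_with_type_check(results: List[Union[str, List[str]]]) -> List[str]:
    """Caller logic when a parser may return either str or List[str]."""
    docs: List[str] = []
    for data in results:
        if isinstance(data, list):
            docs.extend(data)   # many strings from one file
        else:
            docs.append(data)   # extend() on a str would add it character by character
    return docs


def collect_lists(results: List[List[str]]) -> List[str]:
    """Caller logic when every parser returns List[str]."""
    docs: List[str] = []
    for data in results:
        docs.extend(data)       # no type check needed
    return docs


mixed = collect_with_type_check(["whole file", ["row 1", "row 2"]])
uniform = collect_lists([["whole file"], ["row 1", "row 2"]])
```

Both calls produce the same flat list; the second version simply moves the wrapping of a single string into the parser.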
Example use:
customExtractor = { '.csv': CSVParser(concatenate=False) }
data = SimpleDirectoryReader("data", file_extractor=customExtractor).load_data()
Now that I think about it, it might also be worth renaming concatenate to something more general, so that the parser could return one document per column in the future.