
Update csv parser

Open aliyeysides opened this issue 2 years ago • 1 comments

This PR adds a concatenate keyword parameter to CSVParser. Setting it to False creates a separate Document for each row; it defaults to True to preserve the current behavior. This required changing the return type of BaseParser.parse_file to a union of str and List[str], and updating some of the logic in SimpleDirectoryReader.load_data to handle both cases.
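To illustrate the two modes, here is a minimal standalone sketch using only the standard library. The function name parse_csv_text and the ", " and newline joining are assumptions made for illustration; this is not the PR's actual CSVParser code.

```python
import csv
import io
from typing import List, Union


def parse_csv_text(text: str, concatenate: bool = True) -> Union[str, List[str]]:
    """Hypothetical stand-in for a CSV parser's parse step.

    concatenate=True  -> one string holding every row (single document).
    concatenate=False -> one string per row (one document per row).
    """
    # Join the cells of each row into a single line of text.
    rows = [", ".join(row) for row in csv.reader(io.StringIO(text))]
    if concatenate:
        return "\n".join(rows)
    return rows
```

With concatenate=False, the caller receives a list and can build one Document per element.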

IMO parse_file should always return a List[str] for any reader. A parser that produces a single string can return it as a one-element list, so the caller can always call extend() instead of type-checking the data to decide between extend() and append().
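The difference on the caller's side can be sketched as follows. Both helper functions are hypothetical and only model the aggregation loop; they are not the actual SimpleDirectoryReader.load_data code.

```python
from typing import Callable, List, Union


def collect_with_union(parsers: List[Callable[[], Union[str, List[str]]]]) -> List[str]:
    """Caller logic when a parser may return either str or List[str]."""
    docs: List[str] = []
    for parse in parsers:
        data = parse()
        # The type check is required: extend() on a bare str would
        # add its characters one at a time.
        if isinstance(data, list):
            docs.extend(data)
        else:
            docs.append(data)
    return docs


def collect_with_lists(parsers: List[Callable[[], List[str]]]) -> List[str]:
    """Caller logic when every parser returns List[str]."""
    docs: List[str] = []
    for parse in parsers:
        docs.extend(parse())  # no branching needed
    return docs
```

A parser with a single result simply returns a one-element list, so the second version covers both cases without inspecting the type.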

Example use:

customExtractor = { '.csv': CSVParser(concatenate=False) }
data = SimpleDirectoryReader("data", file_extractor=customExtractor).load_data()

aliyeysides avatar Jan 25 '23 20:01 aliyeysides

Now that I think about it, it might also be worth renaming concatenate to something more general, so the option can accommodate returning one document per column in the future.

aliyeysides avatar Jan 25 '23 21:01 aliyeysides