SimpleDirectoryReader taking too much time to load the file.

Open tv-ankur opened this issue 1 year ago • 3 comments

I am trying to load 78 PDFs, each about 1 MB in size. It has been running for 24 hours on AWS and it is still processing the files:

Screenshot 2023-03-12 at 6 11 49 PM

Here is the code:

# Read the PDFs from the folder
from llama_index import download_loader

SimpleDirectoryReader = download_loader("SimpleDirectoryReader")

loader = SimpleDirectoryReader('./data', file_extractor={
  ".pdf": "UnstructuredReader",
  ".html": "UnstructuredReader",
  ".eml": "UnstructuredReader",
  ".pptx": "PptxReader"
})
documents = loader.load_data()

Why is it taking so long? Can anyone suggest an improvement?

tv-ankur, Mar 12 '23 12:03

Why are you using UnstructuredReader for PDFs, for example? It is better to use PDFReader = download_loader("PDFReader").
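A minimal sketch of what that could look like for a whole folder, assuming the hub's PDFReader exposes `load_data(file=...)` for a single path. The helper names `load_pdfs_one_by_one` and `pdf_reader_load_one` are hypothetical and only serve the illustration; the llama_index import is deferred so the folder-walking part can be tried without the library installed.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def load_pdfs_one_by_one(folder: str, load_one: Callable[[Path], Iterable]) -> List:
    """Call load_one on every .pdf under folder and collect the results."""
    documents: List = []
    # Sorted so the processing order is deterministic across runs.
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        documents.extend(load_one(pdf_path))
    return documents


def pdf_reader_load_one() -> Callable[[Path], Iterable]:
    """Build a per-file loader from PDFReader (needs llama_index and network access)."""
    # Imported lazily: this line is an assumption about the installed package.
    from llama_index import download_loader

    PDFReader = download_loader("PDFReader")
    reader = PDFReader()
    # Assumption: PDFReader.load_data accepts a single file path via `file=`.
    return lambda path: reader.load_data(file=path)
```

With llama_index installed, usage would be along the lines of `documents = load_pdfs_one_by_one("./data", pdf_reader_load_one())`.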

Terranic, Mar 13 '23 13:03

Does it support a folder, or do I need to process the files one by one?

And what is the difference between:

SimpleDirectoryReader = download_loader("SimpleDirectoryReader")

documents = SimpleDirectoryReader('content').load_data()

tv-ankur, Mar 13 '23 16:03

@tv-ankur If you don't specify a file_extractor, SimpleDirectoryReader already comes with a PDF loader out of the box (it will download PDFReader under the hood).
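As a rough sketch of that default path, assuming `SimpleDirectoryReader` can be imported directly from `llama_index` and takes the folder as its first argument: the wrapper `load_folder` and its `reader_cls` parameter are hypothetical, added only so the call pattern can be exercised with a stand-in class.

```python
from typing import Any, List, Optional


def load_folder(folder: str, reader_cls: Optional[Any] = None) -> List[Any]:
    """Load every supported file in folder using the reader's default extractors."""
    if reader_cls is None:
        # Imported lazily so the sketch can be read without llama_index installed.
        from llama_index import SimpleDirectoryReader

        reader_cls = SimpleDirectoryReader
    # No file_extractor is passed, so the reader picks its own loader per extension.
    return reader_cls(folder).load_data()
```

With llama_index installed, `documents = load_folder("./data")` would be the whole call.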

Closing this issue since it's more question/discussion-based. For more questions, please join Discord! https://discord.gg/dGcwcsnxhU

jerryjliu, Mar 16 '23 18:03