llama_index
SimpleDirectoryReader taking too much time to load the file.
I am trying to load 78 PDFs, each about 1 MB in size. It has been running for 24 hours on AWS and is still processing the files:
[Screenshot 2023-03-12 at 6 11 49 PM]
Here is the code:
from llama_index import download_loader

# read the files from the folder
SimpleDirectoryReader = download_loader("SimpleDirectoryReader")
loader = SimpleDirectoryReader('./data', file_extractor={
    ".pdf": "UnstructuredReader",
    ".html": "UnstructuredReader",
    ".eml": "UnstructuredReader",
    ".pptx": "PptxReader"
})
documents = loader.load_data()
Why is it taking so long? Can anyone suggest an improvement?
Why are you using UnstructuredReader for PDFs?
It is better to use PDFReader:
PDFReader = download_loader("PDFReader")
Does it support a folder, or do I need to process the files one by one?
And what is the difference between that and the following?
SimpleDirectoryReader = download_loader("SimpleDirectoryReader")
documents = SimpleDirectoryReader('content').load_data()
@tv-ankur if you don't specify a file_extractor, we also have a PDF loader out of the box with SimpleDirectoryReader (it will download PDFReader under the hood)
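For anyone who wants to see which file is slow before switching readers, here is a minimal sketch that loads a folder one file at a time and records how long each file takes. It is not part of llama_index: `load_folder_timed` and its `load_one` parameter are hypothetical names, and `load_one` stands in for whichever per-file loader you use (for example a small wrapper around your PDF reader). The stand-in loader in the demo only returns the file name.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Tuple


def load_folder_timed(
    folder: str,
    load_one: Callable[[Path], list],  # hypothetical: your per-file loader
    suffix: str = ".pdf",
) -> Tuple[list, Dict[str, float]]:
    """Load every file with the given suffix, one at a time, timing each call."""
    documents: list = []
    timings: Dict[str, float] = {}
    # sorted() keeps the order stable so repeated runs are comparable
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        start = time.perf_counter()
        documents.extend(load_one(path))
        timings[path.name] = time.perf_counter() - start
    return documents, timings


if __name__ == "__main__":
    # Stand-in loader (assumption): replace with a call into your real reader.
    def fake_loader(path: Path) -> list:
        return [path.name]

    docs, timings = load_folder_timed("./data", fake_loader)
    # Print the slowest files first
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {seconds:.3f}s")
```

If one or two files dominate the timings, the problem is likely those files or the reader used for them rather than the directory loader itself.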
closing this issue since it's more question/discussion-based. for more q's please join discord! https://discord.gg/dGcwcsnxhU