
Import PDF files from a dir

Open spectramaster opened this issue 1 year ago • 2 comments

@freedmand

Good job! Semantra runs smoothly on my Linux PC!

I think command options like these:

semantra [dir]
semantra [dir1] [dir2] [....]

which would import one or more directories containing many PDF files, would be useful and helpful.

spectramaster avatar Apr 26 '23 01:04 spectramaster

Agreed! This seems useful. I'm thinking the behavior that makes sense would be to recursively include .txt and .pdf files when you specify a directory. Do you also think that makes sense?
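To make the proposed behavior concrete, here is a minimal sketch of recursively collecting .txt and .pdf files from the paths given on the command line. It is not Semantra's actual implementation: the names `collect_files` and `SUPPORTED_SUFFIXES` are hypothetical, and the choices to match suffixes case-insensitively and to skip duplicates are assumptions.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: only these two file types are picked up from a directory.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def collect_files(paths: Iterable[str]) -> List[Path]:
    """Expand directories recursively; pass explicitly named files through as-is."""
    found: List[Path] = []
    seen = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # rglob("*") walks the whole tree below the directory.
            candidates = sorted(
                p for p in path.rglob("*")
                if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
            )
        else:
            # A file named explicitly is kept regardless of its suffix.
            candidates = [path]
        for candidate in candidates:
            key = candidate.resolve()
            if key not in seen:  # avoid importing the same file twice
                seen.add(key)
                found.append(candidate)
    return found
```

With something like this, `semantra [dir1] [dir2]` could pass its arguments through `collect_files` before processing, so existing single-file invocations would keep working unchanged.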

freedmand avatar Apr 26 '23 04:04 freedmand

Of course! Importing many files of various types, including .txt and .pdf, from a directory would greatly improve the experience of using semantra.

I think the "Unstructured" package, which is available through Langchain and can parse different file types including .txt and .pdf, may be a good technical solution.

https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/unstructured_file.html

https://github.com/Unstructured-IO/unstructured

spectramaster avatar Apr 27 '23 04:04 spectramaster