private-gpt
DJVU support?
Is your feature request related to a problem? Please describe. Are .djvu files a thing? I assume PDFs are indexed by dumping the text from them. For DjVu there's djvutxt (from DjVuLibre). Could DjVu be added to the supported formats, and would it be hard to implement?
Describe the solution you'd like DjVu files supported alongside other document types.
Describe alternatives you've considered Mass converting to PDF, but there are no good tools for that, and .djvu files are smaller in any case.
It is not hard to add text-based formats. However, it appears that DjVu specializes in scanned documents, which may contain only images with no OCR'ed text. Adding automatic OCR for those cases would be much more involved, unless the djvutxt tool handles that part. In the easy case, you import the to-text converter for the document type (perhaps djvutxt), add the converter's name and the file suffix to the table in ingest.py, and if you are fortunate it might just work. I had not heard of this format until today...
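A rough sketch of the "easy case" described above, in case it helps. It assumes the DjVuLibre `djvutxt` command is installed and prints the document's text layer to stdout when given only an input file. The names `run_converter`, `djvu_to_text`, `CONVERTERS` and `extract_text` are made up for illustration; they are not the actual table or loader classes in ingest.py, so the real integration would need adapting.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Dict, Sequence


def run_converter(command: Sequence[str], path: Path) -> str:
    """Run an external to-text converter on `path` and return its stdout."""
    result = subprocess.run(
        [*command, str(path)],
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout


def djvu_to_text(path: Path) -> str:
    """Extract text from a DjVu file via djvutxt (assumed to be on PATH)."""
    if shutil.which("djvutxt") is None:
        raise RuntimeError("djvutxt not found; install DjVuLibre first")
    text = run_converter(["djvutxt"], path)
    if not text.strip():
        # Scanned pages with no text layer come back empty: these need OCR.
        raise ValueError(f"{path} has no extractable text; OCR would be needed")
    return text


# Hypothetical suffix -> converter table, standing in for the one in ingest.py.
CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".djvu": djvu_to_text,
}


def extract_text(path: Path) -> str:
    """Pick a converter by file suffix and return the extracted text."""
    suffix = path.suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return CONVERTERS[suffix](path)
```

Calling `extract_text(Path("book.djvu"))` would then return the text to be chunked and indexed. The empty-output check is there so that image-only scans fail loudly instead of being indexed as empty documents.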
Well, I've also written a script to batch convert DjVu to PDF, in case anyone is even using DjVu anymore. I think this can be closed.
The script - https://github.com/installgentoo/djvu2pdf