DJVU support?

Open installgentoo opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.

Are .djvu files a thing? I assume PDFs are indexed by dumping the text out of them. For DjVu there's djvutext. Could DjVu be added to the supported formats, and would it be hard to implement?

Describe the solution you'd like

DjVu files supported alongside the other document types.

Describe alternatives you've considered

Mass conversion, but there are no good tools for that, and .djvu gives smaller file sizes in any case.

installgentoo, May 30 '23 15:05

It is not hard to add text-based formats. However, it appears that DJVU specializes in scanned documents, which might contain only images with no OCR'ed text. Adding automatic OCR for those cases would be much more involved, unless the djvutext tool handles that part. In the easy case, you import the to-text converter for the document type (perhaps djvutext), add the converter's name and the file suffix to the table in ingest.py, and if you are fortunate it might just work. I had not heard of this format until today...

johnbrisbin, May 30 '23 22:05
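
For anyone who wants to try the approach described in the comment above, here is a minimal sketch of what "a converter plus a suffix table" could look like. It is not the project's actual ingest.py code: `djvu_to_text`, `CONVERTERS` and `to_text` are hypothetical names made up for illustration, and the real table in ingest.py may be shaped differently. It assumes DjVuLibre is installed and its text extractor is on the PATH; that tool is named `djvutxt` (called djvutext in this thread). It only reads an existing text layer and does no OCR, so scanned files without one are reported rather than silently ingested as empty.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict


def djvu_to_text(path: str, run: Callable = subprocess.run) -> str:
    """Return the text layer of a DjVu file via DjVuLibre's `djvutxt`.

    `run` is injectable only so the function can be exercised without
    the external tool installed; by default it is subprocess.run.
    """
    # With no output file argument, djvutxt writes the text to stdout.
    result = run(["djvutxt", path], capture_output=True, text=True, check=True)
    text = result.stdout
    if not text.strip():
        # Image-only scan: there is nothing to index until it is OCR'ed.
        raise ValueError(f"{path}: no text layer found; the file would need OCR first")
    return text


# Hypothetical suffix -> converter table, standing in for the one in ingest.py.
CONVERTERS: Dict[str, Callable[[str], str]] = {
    ".djvu": djvu_to_text,
}


def to_text(path: str) -> str:
    """Pick a converter by file suffix and return the extracted text."""
    suffix = Path(path).suffix.lower()
    try:
        converter = CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return converter(path)
```

Calling `to_text("book.djvu")` would then return the extracted text, or raise `ValueError` if the file has no text layer or the suffix is not in the table.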

Well, I've also written a script to batch-convert DjVu to PDF, if anyone is even using DjVu anymore. I think this can be closed.

The script - https://github.com/installgentoo/djvu2pdf

installgentoo, May 31 '23 16:05