llama2_aided_tesseract
Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections, complete with options for text validation and hallucination filtering.
Hi, your code uses the Llama 2 chat model as an offline LLM. However, I would like to use alternative offline LLMs such as Hugging Face's DistilBERT, RoBERTa, or ALBERT. Do you have any suggestion for...
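Worth noting: DistilBERT, RoBERTa, and ALBERT are encoder-only masked language models, not generative chat models, so they cannot directly replace Llama 2's free-form rewrite step. They could, however, score or fill in individual suspicious tokens. A minimal sketch of that idea, with the fill-mask model stubbed out as a plain callable (in practice this would be e.g. a Hugging Face fill-mask pipeline; the helpers here are hypothetical):

```python
from typing import Callable, List

def correct_tokens(words: List[str],
                   is_suspicious: Callable[[str], bool],
                   fill_mask: Callable[[List[str], int], str]) -> List[str]:
    """Replace each suspicious word with the masked-LM's top suggestion.

    `fill_mask(context_words, index)` stands in for a real fill-mask
    model call; here it is any callable that returns one word.
    """
    out = list(words)
    for i, word in enumerate(words):
        if is_suspicious(word):
            out[i] = fill_mask(out, i)
    return out

# Toy stand-ins: flag words mixing digits and letters (a common OCR
# confusion), and "fix" them via a lookup table instead of a real model.
suspect = lambda w: any(c.isdigit() for c in w) and any(c.isalpha() for c in w)
lookup = {"he1lo": "hello", "w0rld": "world"}
fix = lambda ctx, i: lookup.get(ctx[i], ctx[i])

print(correct_tokens(["he1lo", "w0rld", "again"], suspect, fix))
# → ['hello', 'world', 'again']
```

Swapping the lookup for a real fill-mask pipeline would keep the surrounding logic unchanged, which is the main appeal of this token-level approach.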
Is there any plan to restructure the code behind a uniform interface, so it can use either Llama 2 or an API model (gpt-3.5-turbo, gpt-4), making this PDF-to-text tool usable on any hardware? https://github.com/Dicklesworthstone/llama2_aided_tesseract/blob/5719a9aede6b0666f6f08d239cac7b1550298b79/tesseract_with_llama2_corrections.py#L180 https://github.com/Dicklesworthstone/llama2_aided_tesseract/blob/5719a9aede6b0666f6f08d239cac7b1550298b79/tesseract_with_llama2_corrections.py#L122 https://github.com/Dicklesworthstone/llama2_aided_tesseract/blob/5719a9aede6b0666f6f08d239cac7b1550298b79/tesseract_with_llama2_corrections.py#L173
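One way to get that uniformity is a small backend abstraction so the correction logic never touches a specific LLM library. A sketch under the assumption that each backend exposes a single `complete` method (the `EchoBackend` is a hypothetical test double; real subclasses would wrap `llama_cpp.Llama` or the OpenAI client, whose exact call signatures depend on your installed versions):

```python
from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """One interface for local (llama.cpp) and API (OpenAI) backends."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        ...

class EchoBackend(LLMBackend):
    """Dummy backend for testing; it just upper-cases the prompt."""

    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        return prompt.upper()

def correct_ocr_text(text: str, backend: LLMBackend) -> str:
    # The prompt wording here is illustrative, not the repo's actual prompt.
    prompt = f"Fix OCR errors in the following text:\n{text}"
    return backend.complete(prompt)

print(correct_ocr_text("teh cat", EchoBackend()))
```

With this shape, switching between local inference and gpt-3.5-turbo/gpt-4 becomes a constructor choice rather than a code change in the OCR pipeline.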
For these `doc format conversion` and `text summarization` tasks, I think one key feature is to include all or some of the images/charts/tables from the original document, as those elements often...
The provided [tesseract_with_llama2_corrections.py] code uses the Llama 2 chat GGML q3_K_S .bin model, but huggingface.co recommends using GGUF instead, saying GGML is deprecated....
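Since llama.cpp dropped GGML support in favor of GGUF, a small helper could prefer a GGUF file when both formats sit in the model directory, keeping the legacy `.bin` as a fallback. A minimal sketch (the directory layout and filenames are assumptions, not the repo's actual setup):

```python
from pathlib import Path
from typing import Optional

def pick_model_file(model_dir: str) -> Optional[Path]:
    """Return the preferred model file in `model_dir`.

    Prefers a .gguf file (current llama.cpp format) over a legacy
    GGML .bin file; returns None if neither is present.
    """
    directory = Path(model_dir)
    gguf_files = sorted(directory.glob("*.gguf"))
    if gguf_files:
        return gguf_files[0]
    legacy_files = sorted(directory.glob("*.bin"))
    return legacy_files[0] if legacy_files else None
```

Existing GGML files can also be converted once with llama.cpp's conversion script (the script name has varied across llama.cpp versions, so check the copy in your checkout), after which only the `.gguf` branch above is ever taken.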