llama2_aided_tesseract

Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections, complete with options for text validation and hallucination filtering.
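To make the "hallucination filtering" idea concrete, here is a minimal sketch, not the project's actual implementation. It assumes a simple rule: if the LLM's corrected text diverges too far from the raw OCR text, the correction is discarded and the OCR text is kept. The function name `filter_hallucination` and the `min_similarity` threshold are hypothetical, chosen for illustration only.

```python
from difflib import SequenceMatcher


def filter_hallucination(ocr_text: str, corrected_text: str,
                         min_similarity: float = 0.6) -> str:
    """Keep the LLM correction only if it stays close to the OCR text.

    Hypothetical helper: the threshold is an assumption, not a value
    taken from the repository.
    """
    # ratio() returns a similarity score in [0, 1] based on matching characters.
    similarity = SequenceMatcher(None, ocr_text, corrected_text).ratio()
    # A low score suggests the model invented content rather than fixing typos.
    return corrected_text if similarity >= min_similarity else ocr_text


if __name__ == "__main__":
    raw = "Tbe quick brovvn fox"
    print(filter_hallucination(raw, "The quick brown fox"))
    print(filter_hallucination(
        raw, "Completely unrelated paragraph about something else entirely."))
```

A character-level ratio is crude; a real filter might compare at the sentence level or use embeddings, but the accept-or-fall-back structure would likely look similar.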

Results 4 llama2_aided_tesseract issues

Hi, your code uses the Llama 2 chat offline LLM, but I would like to use alternative offline models such as Hugging Face's DistilBERT, RoBERTa, or ALBERT. Do you have any suggestion for...

Is there any plan to restructure the code so that it works uniformly with either Llama 2 or an API model (e.g. gpt-3.5-turbo, gpt-4), so that this PDF-to-text tool can be used on any hardware?
https://github.com/Dicklesworthstone/llama2_aided_tesseract/blob/5719a9aede6b0666f6f08d239cac7b1550298b79/tesseract_with_llama2_corrections.py#L180
https://github.com/Dicklesworthstone/llama2_aided_tesseract/blob/5719a9aede6b0666f6f08d239cac7b1550298b79/tesseract_with_llama2_corrections.py#L122
https://github.com/Dicklesworthstone/llama2_aided_tesseract/blob/5719a9aede6b0666f6f08d239cac7b1550298b79/tesseract_with_llama2_corrections.py#L173
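One way the restructuring asked about above could look, as a rough sketch only: the correction step depends on a small backend interface, and a local Llama 2 model or a hosted API is plugged in behind it. All names here (`CorrectionBackend`, `CallableBackend`, `correct_ocr_text`) and the prompt wording are hypothetical and do not come from the repository; the stub backend stands in for a real model call.

```python
from typing import Callable, Protocol


class CorrectionBackend(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str: ...


class CallableBackend:
    """Adapter that wraps any prompt -> text function as a backend.

    A local Llama 2 call or a hosted API call could be passed in here;
    neither is implemented in this sketch.
    """

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn

    def complete(self, prompt: str) -> str:
        return self._fn(prompt)


def correct_ocr_text(raw_text: str, backend: CorrectionBackend) -> str:
    """Build the correction prompt and delegate to whichever backend is given."""
    prompt = f"Fix OCR errors in the following text:\n\n{raw_text}"
    return backend.complete(prompt).strip()


# Stub backend for demonstration: returns the text portion of the prompt unchanged.
echo_backend = CallableBackend(lambda prompt: prompt.split("\n\n", 1)[1])

if __name__ == "__main__":
    print(correct_ocr_text("Tbe cat sat on tbe mat", echo_backend))
```

With this shape, switching hardware or providers means swapping the function handed to `CallableBackend`, while the OCR and prompt-building code stays the same.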

For tasks such as `doc format conversion` and `text summarization`, I think a key feature is to include all or some of the images/charts/tables from the original doc, as those elements often...

The [tesseract_with_llama2_corrections.py] code you provided is set up with the Llama 2 chat ggml q3 k_s.bin model, but huggingface.co recommends using GGUF instead, saying that GGML is deprecated....