uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering copied to clipboard
LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
Changed model output from json to Context class Updated GroupOp by removing preprocess_fn as value dictionary of all nodes are Context class now
### 🐛 Describe the bug If I manually set the `chunk_size` in `RecursiveSplitter` to 50, it would remove all blank spaces. Running this input: ``` One of the most important...
### 🚀 The feature, motivation and pitch Currently, there is no way to customize the chunk size variable for `RecursiveSplitter`. The [default](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/bfcb306a7eb5aaacd893ef04a8f4c209fb45990b/uniflow/op/extract/split/recursive_character_splitter.py#L21) is 1024 characters. If we are able to...
1. need `add_generation_prompt` for Gemma to generate good response 2. limit the response size to make sure the response is efficient 3. no `response_start_key` is added since Gemma will not...
Unit Test For Load
### 🐛 Describe the bug In extract_pdf_nougat_qa.ipynb, https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/main/example/extract/extract_pdf_nougat_qa.ipynb , running this block, it gets 'NoneType' object error: ``` data = [ {"filename": input_file}, ] config = ExtractPDFConfig( model_config=NougatModelConfig( model_name =...
A new PDF flow to allow S3 download and processing.