uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering issues

Results 18 uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering issues

Sort by recently updated

Encapsulated model output

Changed model output from json to Context class Updated GroupOp by removing preprocess_fn as value dictionary of all nodes are Context class now

CallmeNafiy

Bug: RecursiveSplitter removes all spaces

### 🐛 Describe the bug If I manually set the `chunk_size` in `RecursiveSplitter` to 50, it would remove all blank spaces. Running this input: ``` One of the most important...

vicshi06

Request: Customized Chunk Size for RecursiveSplitter

### 🚀 The feature, motivation and pitch Currently, there is no way to customize the chunk size variable for `RecursiveSplitter`. The [default](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/bfcb306a7eb5aaacd893ef04a8f4c209fb45990b/uniflow/op/extract/split/recursive_character_splitter.py#L21) is 1024 characters. If we are able to...

vicshi06

good first issue

Gemma support new

1. need `add_generation_prompt` for Gemma to generate good response 2. limit the response size to make sure the response is efficient 3. no `response_start_key` is added since Gemma will not...

ZHIHANCHEN03

update transform and rater

EdTeng1

unittest_load

Unit Test For Load

Real3Lee

In extract_pdf_nougat_qa.ipynb, ExtractClient(config) gets 'NoneType' object error

### 🐛 Describe the bug In extract_pdf_nougat_qa.ipynb, https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/main/example/extract/extract_pdf_nougat_qa.ipynb , running this block, it gets 'NoneType' object error: ``` data = [ {"filename": input_file}, ] config = ExtractPDFConfig( model_config=NougatModelConfig( model_name =...

larryyin

uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering copied to clipboard

Metadata

Encapsulated model output

Bug: RecursiveSplitter removes all spaces

Request: Customized Chunk Size for RecursiveSplitter

Gemma support new

update transform and rater

unittest_load

In extract_pdf_nougat_qa.ipynb, ExtractClient(config) gets 'NoneType' object error

feat: Add ExtractS3PDFFlow

Update .py files from main branch

resolved typos, awaiting for approval

← Metadata

Owner

Metadata

uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering copied to clipboard

Metadata

← Metadata

Owner

Metadata

uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering copied to clipboard