open-parse
🚀 Roadmap
Description
This is a tentative roadmap; I will update it as things evolve.
Roadmap
High Priority:
- [x] Implement unitable
- [x] Enable OCR support
- [ ] Different embedding providers [in-progress]
- [ ] Better table detection
- [x] LlamaIndex integration
 
Long Term:
- [ ] Create a Docker image with FastAPI for non-Python users
- [ ] Add support for ImageElements
- [ ] More automated eval suite
- [ ] Better OCR provider
- [ ] Speed up parsing. Due to the way we construct TextSpan, this can be quite slow, especially on documents with tons of tables
- [ ] Add `embed_text` property, useful on tables where embedding the contents performs poorly
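
To make the `embed_text` item concrete, here is a rough sketch of the idea: a node keeps its raw contents for display but can expose a different string for embedding. `TableNode` and `embed_text_override` are hypothetical names made up for illustration, not open-parse's actual classes or fields, and the final design may look quite different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableNode:
    """Hypothetical stand-in for a parsed node; not open-parse's real class."""

    text: str  # raw contents, e.g. a markdown table
    embed_text_override: Optional[str] = None  # optional text to embed instead

    @property
    def embed_text(self) -> str:
        # Use the override when one is set, otherwise fall back to the raw text.
        return self.embed_text_override or self.text


table = TableNode(
    text="| year | revenue |\n|---|---|\n| 2022 | 10 |\n| 2023 | 14 |",
    embed_text_override="Table of annual revenue for 2022 and 2023.",
)
plain = TableNode(text="Some paragraph text.")

print(table.embed_text)
print(plain.embed_text)
```

With something like this, an embedding step would read `node.embed_text` everywhere, so tables could carry a short description while ordinary text nodes behave as before.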
Hey @Filimoa do you plan to add support for unitable anytime soon? Seems like the doc mentions it but the notebook does not have an example for it. Thanks for creating this project.
As soon as the pre-trained weights are released I'll be adding it. I talked with ShengYun earlier this week, and it sounds like they'll be released ASAP.
@Filimoa Looks like pretrained weights are available now! :)
In progress! Should be merged in by the end of the week.
Just merged. Try it out; it requires downloading weights, which you can read about here. We need to find a better model for table detection, but this performs incredibly well otherwise.
Hey @Filimoa! Really great project!! Have you thought about using open-source models for the semantic processing? You can find even better embedding models here: https://huggingface.co/spaces/mteb/leaderboard This one is especially promising (only 0.67GB & better than text-embedding-3-large): https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 There are also ONNX models, which run pretty fast on CPUs.
Added to the roadmap! Will ship very soon @Ulipenitz
Would be great to support Azure OpenAI as well.
Hey @Filimoa! Have you tried PaddleOCR? In my experience, it performs well for layout analysis and table recognition.