refactor: document processing toolkit and excel toolkit
This PR refactors the document processing toolkit and the Excel toolkit.
Links #1925
thanks @JINO-ROHIT, please run pre-commit and add an example
sure!
@zjrwtx done!
not sure why the pre-commit fails here, locally it passes
thanks very much for the review, will make the changes.
@a7m-1st will help review the PR
sure @a7m-1st, can you let me know if the fixes are alright?
In general the csv, xls, xlsx flow should be the same as before (except that you added header only) ✅, thanks for patching the issues pointed out by raywhoelse. Docx processing worked for me, but handling possible async events is probably a must ⚠️.
One thing I realized, make sure to add tests for:
There may be some security issues here: whether the file is malicious, whether its size exceeds the limit, and whether its type is supported.
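To make the size and type checks concrete, here is a minimal sketch of what a pre-processing guard could look like. It is not the toolkit's actual code: `validate_document`, `MAX_FILE_SIZE_BYTES`, and `ALLOWED_EXTENSIONS` are hypothetical names, the 10 MB limit is an arbitrary assumption, and detecting a malicious file is out of scope for a check this simple.

```python
from pathlib import Path

# Hypothetical limits, chosen for illustration only.
MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024
ALLOWED_EXTENSIONS = {".csv", ".xls", ".xlsx", ".docx", ".pdf"}


def validate_document(path: str, max_size: int = MAX_FILE_SIZE_BYTES) -> Path:
    """Return the resolved path if the file passes basic checks, else raise."""
    file_path = Path(path).resolve()

    # Reject directories and missing paths before touching the content.
    if not file_path.is_file():
        raise FileNotFoundError(f"Not a regular file: {file_path}")

    # Extension check only; it does not inspect the file's real content.
    suffix = file_path.suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")

    # Size check against the configured limit.
    size = file_path.stat().st_size
    if size > max_size:
        raise ValueError(f"File is {size} bytes, limit is {max_size}")

    return file_path
```

Tests could then cover one accepted file plus one rejection per branch (missing file, unsupported extension, oversized file).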
Other than that, I will inform you once I am done asap. Thanks.
cool @a7m-1st can you check the updated fix and lmk what else needs to be changed?
@JINO-ROHIT I have been testing your tool and found some small bugs, such as the dummy Excel file in examples/document_toolkit getting locked by a process on Windows. Did you try running example/document_toolkit with Groq or Gemini? In the example, Gemini Flash 2.0 says "I can't open links" even though it was told to use document_toolkit, given:
Analyze the file at "https://pdfobject.com/pdf/sample.pdf" and tell me what data it contains.
Honestly, as long as you switch to the Firecrawl & Chunkr API, everything should be good. We just need to recheck the example and add relevant tests. I am free tonight; I will push the small changes I mentioned by then.
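On the Windows file lock: I have not confirmed the cause, but one common source is a dummy file whose handle is still open when another reader tries to use it. A minimal sketch of the pattern that avoids it, using a CSV and the standard library only (`write_dummy_csv` is a hypothetical helper, not something in the example today):

```python
import csv
import os
import tempfile


def write_dummy_csv(rows):
    """Write rows to a temp CSV and return its path with the handle closed."""
    # delete=False keeps the file after the `with` block; the handle itself
    # is closed on exit, so nothing holds the file open afterwards.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", delete=False
    ) as tmp:
        csv.writer(tmp).writerows(rows)
        return tmp.name


path = write_dummy_csv([["name", "score"], ["a", 1]])
try:
    # Stand-in for the toolkit call that would read the file.
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
finally:
    # Clean up explicitly, since delete=False was used above.
    os.remove(path)
```

The same idea should carry over to the Excel dummy file: close the writer or workbook before handing the path to the toolkit, and delete the file in a `finally` block.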
ahh i would recommend using openai, i only use openai, especially when tools are involved