refactor: document processing toolkit and excel toolkit

Open JINO-ROHIT opened this issue 9 months ago • 11 comments

This PR refactors document processing toolkit and excel toolkit.

Links #1925

JINO-ROHIT avatar Mar 26 '25 07:03 JINO-ROHIT

thanks @JINO-ROHIT, please try to run pre-commit and add an example

zjrwtx avatar Mar 26 '25 11:03 zjrwtx

sure!

JINO-ROHIT avatar Mar 26 '25 11:03 JINO-ROHIT

@zjrwtx done!

JINO-ROHIT avatar Mar 26 '25 11:03 JINO-ROHIT

not sure why the pre-commit fails here, locally it passes

JINO-ROHIT avatar Mar 26 '25 12:03 JINO-ROHIT

thanks very much for the review, will make the changes.

JINO-ROHIT avatar Mar 27 '25 06:03 JINO-ROHIT

@a7m-1st will help review the PR

Wendong-Fan avatar Mar 27 '25 17:03 Wendong-Fan

sure @a7m-1st, can you let me know if the fixes are alright?

JINO-ROHIT avatar Mar 27 '25 18:03 JINO-ROHIT

In general the csv, xls, xlsx flow should be the same as before (except that you added header only) ✅, thanks for patching the issues pointed out by raywhoelse. Docx processing worked for me, but perhaps handling possible async events is a must ⚠️.

One thing I realized, make sure to add tests for:

There may be some security issues here: whether the file is malicious, whether the file size exceeds the limit, and whether the file type is supported.
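For illustration only, here is a minimal sketch of what the size and file-type checks could look like, along with the kind of assertions a test might make. The function name `validate_document`, the 10 MB limit, and the extension list are assumptions made for this example, not code from the PR. Note that this does not detect malicious content; that would need a separate scanning step.

```python
from pathlib import Path

# Hypothetical limits for this sketch; the PR does not specify them.
MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".csv", ".xls", ".xlsx", ".docx", ".pdf"}


def validate_document(path: str, max_size: int = MAX_FILE_SIZE_BYTES) -> Path:
    """Return the resolved path if it passes basic checks, else raise ValueError."""
    file_path = Path(path).resolve()

    # Reject directories, missing paths, and other non-regular files.
    if not file_path.is_file():
        raise ValueError(f"Not a regular file: {file_path}")

    # Check the extension case-insensitively against the allow-list.
    suffix = file_path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")

    # Check the size on disk before any parser opens the file.
    size = file_path.stat().st_size
    if size > max_size:
        raise ValueError(f"File too large: {size} bytes (limit {max_size})")

    return file_path
```

A test could then create small temporary files and assert that a supported file passes while an oversized file or an unsupported extension raises `ValueError`.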

Other than that, I will inform you once I am done asap. Thanks.

a7m-1st avatar Mar 29 '25 20:03 a7m-1st

cool @a7m-1st can you check the updated fix and lmk what else needs to be changed?

JINO-ROHIT avatar Apr 01 '25 08:04 JINO-ROHIT

cool @a7m-1st can you check the updated fix and lmk what else needs to be changed?

@JINO-ROHIT I have been testing your tool and found some small bugs, such as the dummy Excel file in examples/document_toolkit getting locked by a process on Windows. Can I ask if you tried to run example/document_toolkit with groq or gemini? In the example, Gemini Flash 2.0 says "I Can't open links" even though it was told to use document_toolkit, given:

Analyze the file at "https://pdfobject.com/pdf/sample.pdf" and tell me what data it contains.

Honestly as long as you update to using the Firecrawl & Chunkr API, everything should be good. Just need to recheck example and add relevant tests. I am free tonight; I will push the small changes I mentioned by then.

a7m-1st avatar Apr 03 '25 09:04 a7m-1st

ahh i would recommend using openai, i use openai only, especially when tools are involved

JINO-ROHIT avatar Apr 03 '25 10:04 JINO-ROHIT