How to use a local vision model to replace GPT-4 Turbo?
I am interested in this project; I have tried it out and it works very well. However, it seems to consume a lot of GPT tokens because of the screenshot processing. I would like to replace GPT with a local vision model, but I cannot find where to make the change. Where is GPT vision used in the source code?
I was perusing the codebase looking for the same answer. As far as I can tell, the call to GPT-4 Vision (or whatever other model you specify) happens here: https://github.com/Skyvern-AI/skyvern/blob/31e1470c6ff745124af2201d8e3996c17702d4fc/skyvern/forge/sdk/api/llm/api_handler_factory.py#L98. Notice how the screenshot data is sent along with an optional text prompt. That said, I don't know whether swapping in your own local vision model will be straightforward.
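For context, Skyvern routes its model calls through LiteLLM, so the request uses the OpenAI-style multimodal message format. Here is a minimal, self-contained sketch of what sending a screenshot plus a text prompt through LiteLLM looks like; the model name and prompt are placeholders for illustration, not Skyvern's actual values:

```python
import base64

import litellm


def ask_vision_model(screenshot_path: str, prompt: str) -> str:
    """Send a screenshot and a text prompt to a vision-capable model via LiteLLM."""
    with open(screenshot_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")

    response = litellm.completion(
        model="gpt-4-vision-preview",  # placeholder; Skyvern resolves the model from its LLM config
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64_image}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content
```

Since LiteLLM abstracts the provider, replacing the model should in principle come down to changing the `model` string (and possibly `api_base`) rather than touching the call site itself.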
I went through the source code and found that I had misunderstood the role of the vision model. In this project, the vision model does not segment or locate page elements (that is done by a JS script); it only checks whether anything has gone wrong.
https://github.com/Skyvern-AI/skyvern/pull/251/files
New models can be added following the approach in the PR above - you could try it out with hosted Ollama models once https://github.com/Skyvern-AI/skyvern/issues/242 is implemented; see the sketch below.
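Because the calls go through LiteLLM, one speculative path (untested, and only viable once that issue lands) would be to point LiteLLM at a locally hosted Ollama vision model. The model name `ollama/llava` and the local endpoint below are illustrative assumptions, not something Skyvern supports out of the box today:

```python
import base64

import litellm

# Hypothetical setup: an Ollama server running locally with a vision-capable
# model pulled (llava is used here purely as an example).
with open("screenshot.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = litellm.completion(
    model="ollama/llava",               # example model name, not a Skyvern default
    api_base="http://localhost:11434",  # default Ollama endpoint
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Does anything on this page look broken?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64_image}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```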