How to use a local vision model to replace GPT-4 Turbo?
I am interested in this project; I have tried it out and it works very well. However, it seems to consume a lot of GPT tokens because of the screenshot processing. I would like to replace GPT with a local vision model, but I cannot find where to make the change. Where is GPT vision used in the source code?
I was perusing the codebase looking for the same answer. As far as I can tell, the call to GPT-4 Vision (or whatever other model you specify) happens here: https://github.com/Skyvern-AI/skyvern/blob/31e1470c6ff745124af2201d8e3996c17702d4fc/skyvern/forge/sdk/api/llm/api_handler_factory.py#L98. Notice how the screenshot data is sent along with an optional text prompt. That said, I don't know whether swapping in your own local vision model will be straightforward.
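For context, Skyvern routes its model calls through LiteLLM, so the request uses the OpenAI-style multimodal message format. Here is a minimal, self-contained sketch of what sending a screenshot plus a text prompt through LiteLLM looks like; the model name and prompt are placeholders for illustration, not Skyvern's actual values:

```python
import base64

import litellm


def ask_vision_model(screenshot_path: str, prompt: str) -> str:
    """Send a screenshot and a text prompt to a vision-capable model via LiteLLM."""
    with open(screenshot_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")

    response = litellm.completion(
        model="gpt-4-vision-preview",  # placeholder; Skyvern resolves the model from its LLM config
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64_image}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content
```

Since LiteLLM abstracts the provider, replacing the model should in principle come down to changing the `model` string (and possibly `api_base`) rather than touching the call site itself.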
I went through the source code and found that I had misunderstood the role of the vision model. In this project, the vision model does not segment or locate page elements (that is done by a JS script); it only checks whether anything has gone wrong.
https://github.com/Skyvern-AI/skyvern/pull/251/files
New models can be added following the approach in the PR above - you could try it out with hosted Ollama models once https://github.com/Skyvern-AI/skyvern/issues/242 is implemented; see the sketch below.
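Because the calls go through LiteLLM, one speculative path (untested, and only viable once that issue lands) would be to point LiteLLM at a locally hosted Ollama vision model. The model name `ollama/llava` and the local endpoint below are illustrative assumptions, not something Skyvern supports out of the box today:

```python
import base64

import litellm

# Hypothetical setup: an Ollama server running locally with a vision-capable
# model pulled (llava is used here purely as an example).
with open("screenshot.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = litellm.completion(
    model="ollama/llava",               # example model name, not a Skyvern default
    api_base="http://localhost:11434",  # default Ollama endpoint
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Does anything on this page look broken?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64_image}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```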