Using llama3.2-vision:11b for the app agent

ms-cleblanc opened this issue 1 year ago

I'm using GPT for my host agent, and its response has all of the components I would expect:

DEBUG: Json string before loading: {
    "Observation": "I observe that the Google Chrome application is available from the control item list, with the title of 'New Tab - Google Chrome'.",
    "Thought": "The user request can be solely completed on the Google Chrome application. I need to open the Google Chrome application and click on the + icon in the top bar to open a new tab.",
    "CurrentSubtask": "Open a new tab in Google Chrome by clicking on the + icon in the top bar.",
    "Message": ["(1) Locate the + icon in the top bar of Google Chrome.", "(2) Click on the + icon to open a new tab."],
    "ControlLabel": "4",
    "ControlText": "New Tab - Google Chrome",
    "Status": "CONTINUE",
    "Plan": [],
    "Questions": [],
    "Comment": "I plan to open a new tab in Google Chrome by clicking on the + icon in the top bar."
}

However, when I use Ollama as my app agent, the responses are not in the format UFO expects: there are no Observations or Thoughts, or even Plans. I do get a decent response from the Llama model:

DEBUG: Json string before loading: { "id": 3, "title": "Open a new tab in Google Chrome by clicking on the + icon in the top bar.", "steps": [ { "stepNumber": 1, "description": "Locate the + icon in the top bar of Google Chrome." }, { "stepNumber": 2, "description": "Click on the + icon to open a new tab." } ], "image": "annotated screenshot" }

What could I be doing wrong? How does the AppAgent know to provide Thoughts and Observations?

ms-cleblanc · Dec 04 '24

This is probably because the model you are using is not strong enough. We feed the same prompts to all models; if a model fails to follow the instructions, it may generate output in a format we do not expect.
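For illustration, UFO parses the reply as JSON and expects a fixed set of keys (the ones visible in your GPT log above). Roughly this kind of check is what a free-form reply fails — a hypothetical sketch, not UFO's actual parsing code:

```python
import json

# Keys the agent expects in the model's reply, taken from the GPT log
# above. Hypothetical sketch -- the exact required set differs per agent.
REQUIRED_KEYS = {"Observation", "Thought", "Status", "Plan", "Comment"}

def parse_agent_response(raw: str) -> dict:
    """Parse the model output; fail loudly if the schema was ignored."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(
            "Model did not follow the output-format instructions; "
            f"missing keys: {sorted(missing)}"
        )
    return data
```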

vyokky · Dec 05 '24

Thanks for your help! I upgraded my VM and ran the 90b model, but hit the same issue. The context window is 128K, just like GPT's, so I wonder why it's ignoring the prompt. I think Ollama wants the image as a filename rather than as bytes in the context window. Do you think that change might help?

DEBUG: Json string before loading: {"control_text": "Customer Service workspace", "control_type": "TabItem", "label": "13"}

ms-cleblanc · Dec 05 '24

Do we have to set the config for both the host and the app agent?

Kartik-dot-png · Jan 05 '25

How did you set up Ollama? Can you share the endpoint and the other settings you used in the config file? I tried using server/api/generate as the endpoint and got a Max retry error.

Kartik-dot-png · Jan 07 '25

I was running Llama locally, but I think it should be just /api/generate, maybe? Sorry, that environment is gone now, so I'm going from memory.
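From memory, the relevant section of my config.yaml looked roughly like the sketch below; treat the field names and values as approximate and check them against the UFO docs (in particular, I'm not certain whether API_BASE should be the server root or include /api/generate). The HOST_AGENT section takes an identical block if you want Ollama for both agents.

```yaml
# Rough sketch from memory -- verify against the UFO documentation.
APP_AGENT: {
  VISUAL_MODE: True,
  API_TYPE: "Ollama",
  API_BASE: "http://localhost:11434",   # Ollama's default address
  API_MODEL: "llama3.2-vision:11b",
}
```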

ms-cleblanc · Jan 07 '25