
Parse Any HTML-esque Style Tags

Open WaelKarkoub opened this issue 1 year ago • 9 comments

Why are these changes needed?

This PR introduces an enhancement to the way LMM agents communicate with each other. Currently, agents use HTML-style tags embedded in text messages to share image information (e.g., Hi, take a look at this image <img http://example.com/something.jpg>). Expanding the tagging system to audio and video could be beneficial as LMM agents' capabilities expand.

The key improvements in this PR include:

  • Support for Multiple Tag Types: The new implementation can handle an unlimited number of different tag types.
  • Attribute Parsing: The new implementation can parse attributes within tags. This allows us to include additional information within each tag.

Some examples:

  • <audio prompt="Whisper" text="Hello autogen" task="generate">, the output would be {"tag": "audio", "content": {"prompt": "Whisper", "text": "Hello autogen", "task": "generate"}}.
  • <video https://example.com prompt="change all red cars to yellow" task="modify">, the output would be {"tag": "video", "content": {"src": "https://example.com", "prompt": "change all red cars to yellow", "task": "modify"}}.
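The behavior in these examples can be sketched with a small regex-based parser. This is an illustrative sketch, not the PR's actual implementation; treating a bare token (like the URL) as a src attribute is an assumption based on the examples:

```python
import re

# Matches <tag body>, where body may mix bare tokens (e.g. a URL) and
# key="value" attributes. A simplified sketch, not the PR's actual regex.
TAG_RE = re.compile(r"<(?P<tag>\w+)\s+(?P<body>[^>]+)>")
ATTR_RE = re.compile(r"""(\w+)="([^"]*)"|(\w+)='([^']*)'|(\S+)""")

def parse_tags(text):
    results = []
    for m in TAG_RE.finditer(text):
        content = {}
        for a in ATTR_RE.finditer(m.group("body")):
            k1, v1, k2, v2, bare = a.groups()
            if k1 is not None:
                content[k1] = v1          # double-quoted attribute
            elif k2 is not None:
                content[k2] = v2          # single-quoted attribute
            elif bare is not None:
                content["src"] = bare     # bare token assumed to be the source
        results.append({"tag": m.group("tag"), "content": content})
    return results

print(parse_tags('Hi, take a look at this image <img http://example.com/something.jpg>'))
# → [{'tag': 'img', 'content': {'src': 'http://example.com/something.jpg'}}]
```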

NOTE: As you may have noticed, the implementation is complex because I don't know what I'm doing when it comes to regex.

I also considered fully supporting HTML, which would simplify the implementation; however, I prioritized backward compatibility instead.

Related issue number

Checks

  • [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
  • [x] I've added tests (if relevant) corresponding to the changes introduced in this PR.
  • [x] I've made sure all auto checks have passed.

WaelKarkoub avatar Mar 18 '24 02:03 WaelKarkoub

Codecov Report

Attention: Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

Project coverage is 48.92%. Comparing base (59a7790) to head (8e42f59).

Files Patch % Lines
autogen/agentchat/utils.py 87.93% 3 Missing and 4 partials :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2046       +/-   ##
===========================================
+ Coverage   36.73%   48.92%   +12.18%     
===========================================
  Files          69       69               
  Lines        7134     7190       +56     
  Branches     1557     1700      +143     
===========================================
+ Hits         2621     3518      +897     
+ Misses       4280     3371      -909     
- Partials      233      301       +68     
Flag Coverage Δ
unittests 48.78% <89.06%> (+12.06%) :arrow_up:

Flags with carried forward coverage won't be shown.


codecov-commenter avatar Mar 18 '24 02:03 codecov-commenter

@WaelKarkoub I really like this PR. One comment about the variable name: "content" has a special meaning in the OpenAI API. Can we use a different name for the HTML tag attributes, such as "parameter", "attr", etc.?

For instance,

    {
        "message": "Can you describe what's in this image <img http://example.com/image.png width='100'> and this image <img http://hello.com/image=.png>?",
        "expected": [
            {"tag": "img", "content": {"src": "http://example.com/image.png", "width": "100"}},
            {"tag": "img", "content": {"src": "http://hello.com/image=.png"}},
        ],
    },

It can be

... # skip details here
            {"tag": "img", "attr": {"src": "http://hello.com/image=.png"}},

BeibinLi avatar Mar 21 '24 19:03 BeibinLi

I'm often passing html, or markdown with html tags. How would this integrate? How will it know when tags should and shouldn't be expanded?

afourney avatar Mar 21 '24 19:03 afourney

Any suggestions for how to "escape" these tags when the message contains actual HTML content? An intuitive way would be for user_proxy to send a list rather than a str as the message input, so the parsing would not be triggered and no code change would be needed. However, users might not follow that coding convention.
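The list-vs-str convention could look like the following sketch (the field names in the structured message follow the OpenAI-style multimodal content format and are assumptions here, not the PR's code):

```python
import re

TAG_RE = re.compile(r"<(\w+)[^>]*>")

def extract_tag_names(message):
    """Parse HTML-esque tags only when the message is a plain string;
    list-form (structured) content is passed through untouched, so real
    HTML inside it is never expanded. Sketch of the convention above."""
    if not isinstance(message, str):
        return []  # structured content: skip tag parsing entirely
    return [m.group(1) for m in TAG_RE.finditer(message)]

# A string message triggers parsing; a structured list does not.
structured = [
    {"type": "text", "text": "Describe this image"},
    {"type": "image_url", "image_url": {"url": "http://example.com/image.png"}},
]
```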

BeibinLi avatar Mar 21 '24 21:03 BeibinLi

@afourney good questions. Currently, the PR does not distinguish between HTML, markdown, or any other tag types; it is up to the dev to decide whether a tag should be expanded. For instance, in https://github.com/microsoft/autogen/pull/2098, a tag is expanded only if its name is audio and it has two attributes: file_path/text_file and task. If these attributes are present, the agent considers it a valid tag. Although HTML has an audio tag, the difference in attributes makes confusion unlikely (though still possible).
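The validation rule described here might be sketched as follows (a hypothetical helper, not the code from #2098; only the attribute names file_path/text_file and task come from the comment):

```python
def is_expandable_audio_tag(tag: str, attrs: dict) -> bool:
    # Expand only when the tag is named "audio" AND carries the expected
    # attributes; a plain HTML <audio src=...> tag fails this check, which
    # is what keeps ordinary HTML from being expanded by accident.
    if tag != "audio":
        return False
    has_source = "file_path" in attrs or "text_file" in attrs
    return has_source and "task" in attrs
```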

As @BeibinLi mentioned, the user can input a list instead of a string for the messages. However, how would text-based LLM agents communicate with multimodal agents? For example, in https://github.com/microsoft/autogen/pull/2098, any text-based agent can ask for an audio file to be transcribed by other agents that have the capability to do so. This can be expanded to other modalities: images, videos, etc.

I am open to hearing other suggestions, as I am not set on this particular solution. My main goal was to explore how agents with different modalities can interact with each other, and I discovered that using text made the most sense.

WaelKarkoub avatar Mar 21 '24 23:03 WaelKarkoub

I think this could break some web surfer scenarios. We'll have to test.

afourney avatar Mar 22 '24 23:03 afourney

@afourney What do you think about using Jinja-style tags? So instead of <img file.png>, we could do {% tag='img' file_path='file.png' %} or {{ tag='img' file_path='file.png' }}
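A Jinja-style syntax would be straightforward to parse without colliding with real HTML, since {% ... %} never appears in HTML markup. A rough sketch (illustrative only, not a proposed implementation):

```python
import re

# Sketch of parsing the proposed {% tag='img' file_path='file.png' %} syntax.
JINJA_TAG_RE = re.compile(r"\{%\s*(.*?)\s*%\}")
PAIR_RE = re.compile(r"(\w+)='([^']*)'")

def parse_jinja_tags(text):
    out = []
    for m in JINJA_TAG_RE.finditer(text):
        attrs = dict(PAIR_RE.findall(m.group(1)))
        out.append({"tag": attrs.pop("tag", None), "attr": attrs})
    return out

print(parse_jinja_tags("see {% tag='img' file_path='file.png' %}"))
# → [{'tag': 'img', 'attr': {'file_path': 'file.png'}}]
```

This sidesteps the HTML-collision problem, though a message that itself contains Jinja template source would still need escaping.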

WaelKarkoub avatar Mar 23 '24 13:03 WaelKarkoub

@WaelKarkoub It seems like some tests failed.

BeibinLi avatar Mar 23 '24 17:03 BeibinLi

@BeibinLi, I’ve made some updates. The tests have been fixed and now the function returns the full re.Match object, along with the tag and attr. This means users can access more information about the match, such as the exact position of the match, and the match subgroups. Let me know what you think.
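Returning the full re.Match lets callers recover positional information about each tag, e.g. (hypothetical result shape, for illustration only):

```python
import re

TAG_RE = re.compile(r"<(\w+)[^>]*>")

def find_tags(text):
    # Keep the underlying re.Match so callers can use m.start()/m.end(),
    # m.group(), and subgroups in addition to the parsed tag name.
    return [{"tag": m.group(1), "match": m} for m in TAG_RE.finditer(text)]

results = find_tags("hello <img a.png> world")
m = results[0]["match"]
print(results[0]["tag"], m.start(), m.end(), m.group(0))
# → img 6 17 <img a.png>
```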

WaelKarkoub avatar Mar 23 '24 18:03 WaelKarkoub