semantic-kernel Function inputs and escaping special characters with Unicode

Discussed in https://github.com/microsoft/semantic-kernel/discussions/7308

Originally posted by glorious-beard July 16, 2024

If I set a kernel argument to content containing special characters (HTML tags, for example) and look at the logger output from the kernel when it's invoking the function, I notice that the JSON object escapes all of the special characters.

For example, if I set "input" to "<p>Version 1.2</p>....", the function argument looks like:

{"input":"\u003Cp\u003EVersion 1.2\u003C/p\u003E..."}
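As a quick check of what that logged payload represents, here is a minimal sketch using only Python's standard `json` module. It does not call Semantic Kernel; the variable names are illustrative. It shows that `\u003C` and `\u003E` are ordinary JSON string escapes for "<" and ">", so any JSON parser recovers the original HTML, and that the 5 extra characters per escape exist only in the serialized text.

```python
import json

# The argument payload exactly as it appeared in the log output.
# A raw string keeps the backslashes so json.loads sees the \uXXXX escapes.
logged = r'{"input":"\u003Cp\u003EVersion 1.2\u003C/p\u003E..."}'

# Parsing the JSON turns each \uXXXX escape back into a single character.
decoded = json.loads(logged)["input"]
print(decoded)

# Re-serialize without HTML-style escaping to compare the text lengths.
plain = json.dumps({"input": decoded}, separators=(",", ":"))
extra_chars = len(logged) - len(plain)
print(extra_chars)  # 4 escapes x 5 extra characters each
```

Whether those extra characters ever reach the model depends on what string is actually sent, which is what the questions below ask.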

Two questions:

Does escaping "<" and ">", which adds 5 characters each, incur extra token cost?
Does the function call unescape these characters before it is sent to the LLM endpoint?

Jul 19 '24 15:07 sophialagerkranspandey

I'm pretty sure that the characters are only encoded so we can print the log statement (so it shouldn't impact your logic), but adding folks to verify.

Jul 19 '24 15:07 madsbolaris

In Python, I can confirm that they are unescaped before being sent to the model. This happens within the from_element method for chat and within the _invoke_internal method for text, so the escaping does not add extra tokens (although tokenization on the model side might). @sophialagerkranspandey @glorious-beard
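The escape-then-unescape round trip described above can be sketched with Python's standard `html` module. This is an assumption for illustration only: it is not the code inside from_element or _invoke_internal, and the variable names are made up. It only shows that content encoded for safety during prompt rendering can be restored to the exact original string before it is sent on.

```python
import html

# Untrusted argument value containing HTML tags.
user_input = "<p>Version 1.2</p>..."

# Encode the tag characters so the value cannot be mistaken for markup
# while the prompt is being assembled.
escaped = html.escape(user_input)
print(escaped)

# Decode just before the text leaves for the model endpoint.
restored = html.unescape(escaped)
print(restored == user_input)
```

If the restored string is what gets sent, the model receives the same characters the caller supplied, so the temporary encoding adds nothing to the request.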

Jul 22 '24 08:07 eavanvalkenburg

We have protection against prompt injection attacks, which encodes potentially dangerous tags. If you trust the content, you can change this behaviour. Take a look at this sample to see the available options: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/ChatPrompts/SafeChatPrompts.cs

Jul 24 '24 12:07 markwallace-microsoft

Closing this issue since it's handled in both C# and Python

Jul 26 '24 15:07 madsbolaris