Function inputs and escaping special characters with Unicode
Discussed in https://github.com/microsoft/semantic-kernel/discussions/7308
Originally posted by glorious-beard on July 16, 2024:
If I set a kernel argument to content containing special characters (HTML tags, for example) and look at the logger output from the kernel when it invokes the function, I notice that the JSON object escapes all of the special characters.
For example, if I set "input" to "<p>Version 1.2</p>....", the function argument looks like:
{"input":"\u003Cp\u003EVersion 1.2\u003C/p\u003E..."}
Two questions:
- Do the extra characters in escaping "<" and ">" with 5 additional characters incur extra token cost?
- Does the function call unescape these characters before it is sent to the LLM endpoint?
I'm pretty sure that the characters are only encoded so we can print the log statement (so it shouldn't impact your logic), but adding folks to verify.
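To illustrate why the escaping in the log is only a representation detail, here is a minimal sketch using Python's standard json module (not Semantic Kernel code). It assumes the logged text is ordinary JSON, so any JSON parser decodes \u003C and \u003E back to "<" and ">". The variable names are made up for illustration.

```python
import json

# The argument exactly as it appears in the log output (raw string keeps the backslashes).
logged = r'{"input":"\u003Cp\u003EVersion 1.2\u003C/p\u003E..."}'

# Parsing the JSON restores the original characters.
decoded = json.loads(logged)["input"]
print(decoded)

# Compare the escaped log text with a compact, unescaped serialization of the same value.
plain = json.dumps({"input": decoded}, separators=(",", ":"))
extra_chars = len(logged) - len(plain)
print(extra_chars)  # four escaped characters, five extra characters each
```

The extra characters exist only in the serialized log text. Whether they reach the model depends on what the function actually sends, which the replies below address.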
In Python, I can confirm that they are unescaped before being sent to the model. This happens within the from_element method for chat and within the _invoke_internal method for text, so the escaping does not add extra tokens (although tokenization on the model side might). @sophialagerkranspandey @glorious-beard
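As a rough illustration of the encode-then-unescape round trip described here, the sketch below uses only Python's standard html module. It is not the Semantic Kernel implementation, and it assumes the encoding in question is HTML-entity style. It only shows that such a round trip returns the original text, so the string that would be sent on is no longer than the input.

```python
import html

raw = "<p>Version 1.2</p>..."

# Encode the tags so they cannot be interpreted as markup inside a prompt template.
encoded = html.escape(raw)
print(encoded)

# Unescape before sending: the original text comes back unchanged.
restored = html.unescape(encoded)
print(restored == raw)
```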
We have protection against prompt injection attacks that encodes potentially dangerous tags. If you trust the content, you can change this behaviour; take a look at this sample to see the available options: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/ChatPrompts/SafeChatPrompts.cs
Closing this issue since it's handled in both C# and Python