[Bug-Candidate]: Source mapping indexes exceed source length when special characters are present in Solidity files
Describe the issue:
We are encountering an issue with Slither’s source mapping when analyzing Solidity files that include certain special Unicode characters. In our use case, the source mapping index returned for some functions exceeds the length of the source file. For example, we observed that for a file with a total length of 54,863 characters, Slither reported an internal function with a start index of 55,053.
After investigation, we suspect that the presence of characters such as:
+.*•´.*:˚.°*.˚
≈
½
may be causing encoding or processing issues within Slither (or its underlying CryticCompile component), leading to miscalculation of character positions.
Steps to Reproduce:
- Create a Solidity file (e.g., Test.sol) that includes a library or contract containing these special characters in comments or string literals.
- Run Slither (via CryticCompile) on the file.
- Observe that the source mapping for at least one function returns a start index greater than the total file length.
Code example to reproduce the issue:
(https://github.com/Vectorized/solady/blob/main/src/utils/FixedPointMathLib.sol)
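For example, the mismatch can be observed with a short script along these lines (a simplified sketch; the file path is a placeholder for the reproduction file):

```python
# Sketch: flag functions whose source mapping starts beyond the end of the file.
from slither import Slither

path = "FixedPointMathLib.sol"  # placeholder path
with open(path, encoding="utf-8") as f:
    source_len = len(f.read())

slither = Slither(path)
for contract in slither.contracts:
    for function in contract.functions_declared:
        sm = function.source_mapping
        if sm.start > source_len:
            print(f"{function.canonical_name}: start {sm.start} > file length {source_len}")
```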
Version:
0.11.0
Relevant log output:
Hi @nivcertora! Thanks for the report. If you have some time, could you check if the changes in https://github.com/crytic/slither/pull/2662 improve the situation?
From my understanding, the issue stems from the fact that the offsets are in bytes, not "characters" in the string sense, which causes differences when you have multi-byte characters in your source code.
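A quick way to see the mismatch, in plain Python and independent of Slither:

```python
# Multi-byte characters make byte offsets and character offsets diverge.
s = "// ≈ half is ½\nfunction foo() {}"

print(len(s))                  # 32 characters
print(len(s.encode("utf-8")))  # 35 bytes: "≈" is 3 bytes and "½" is 2 bytes in UTF-8

# solc reports offsets into the UTF-8 bytes; indexing the Python string
# (per character) with such a value points past the intended location.
print(s.encode("utf-8").index(b"function"))  # 18 (byte offset)
print(s.index("function"))                   # 15 (character offset)
```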
Thanks for the quick response. Is there an easy way to install the version with the changes?
You should be able to install it from the PR branch with
pip install https://github.com/crytic/slither/archive/refs/heads/fix-unicode-src-mappings.zip
Did you notice if there's a specific detector that's reporting misaligned source maps? Or are you writing a Python script that checks the Source objects directly?
I wrote a Python script, and it looks like there is still an offset issue. Here, I compare the offsets between a regex-based lookup and the source mapping.
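A simplified sketch of that kind of comparison (the regex and path are illustrative):

```python
# Sketch: compare the character offset of each function keyword (via regex)
# with the byte-based offset reported by source_mapping.start.
import re
from slither import Slither

path = "FixedPointMathLib.sol"  # placeholder path
with open(path, encoding="utf-8") as f:
    source = f.read()

slither = Slither(path)
for contract in slither.contracts:
    for function in contract.functions_declared:
        match = re.search(rf"function\s+{re.escape(function.name)}\b", source)
        if match is None:
            continue
        # match.start() counts characters; source_mapping.start counts bytes,
        # so the two drift apart after every multi-byte character in the file.
        print(function.name, match.start(), function.source_mapping.start)
```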
Could you use function.source_mapping.content? It handles the encoding and translates the mapping values into a character position in the source code. If you're trying to map manually, be aware that index values such as source_mapping.start are byte offsets, not character offsets, so you'd need to do something like this.
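For instance, something along these lines (a sketch only; sm stands for a function.source_mapping and source for the file content read as text):

```python
# Translate the byte-based mapping into character offsets by decoding
# the UTF-8 prefix up to each byte position.
source_bytes = source.encode("utf-8")

char_start = len(source_bytes[: sm.start].decode("utf-8"))
char_end = len(source_bytes[: sm.start + sm.length].decode("utf-8"))

snippet = source[char_start:char_end]  # should line up with sm.content
```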