capa
flirt: consider using rizin/sigdb signatures
https://github.com/rizinorg/sigdb
https://github.com/rizinorg/sigdb-source
We can also use this for floss.
@williballenthin ,
I have looked into adding .sig files. The paths to the sig files are loaded by viv_utils.flirt.load_flirt_signature. We can make the sigs folder a submodule linked to https://github.com/rizinorg/sigdb and use the path to that folder as the default when running capa, if the user does not specify one.
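The default-path idea above could look roughly like the sketch below. It is only an illustration: `resolve_signature_paths` and `DEFAULT_SIGS_DIR` are hypothetical names, not part of capa, and the set of accepted file suffixes is an assumption.

```python
from pathlib import Path
from typing import List, Optional

# Hypothetical default: a "sigs" directory (for example a git submodule)
# next to where capa is run. The real location is an assumption here.
DEFAULT_SIGS_DIR = Path("sigs")

# Assumed signature file suffixes; adjust to whatever the loader accepts.
SIG_SUFFIXES = (".sig", ".pat")


def resolve_signature_paths(
    user_path: Optional[str], default_dir: Path = DEFAULT_SIGS_DIR
) -> List[Path]:
    """Return signature files from the user's path, else from the default dir."""
    root = Path(user_path) if user_path else default_dir
    if root.is_file():
        return [root]
    if not root.is_dir():
        return []
    # Walk the directory recursively and keep only signature files,
    # sorted so the load order is stable between runs.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.name.endswith(SIG_SUFFIXES)
    )
```

Sorting the paths keeps the load order deterministic, which matters if the order in which signature files are compiled can change which name a function receives.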
that sounds good.
prior to doing this, would you investigate the coverage provided by the rizin sig files versus what we currently distribute? you can match FLIRT signatures against a collection of PE files, perhaps around 100, and see how often each symbol matches. then, we should be able to decide if we want to commit to rizin.
see how often each symbol matches.
Do you mean identifying how often library functions match (e.g. strcpy)? I think a script to check function coverage can be built by making some modifications to scripts/match-function-id.py. As you said, we can pick around 100 random PEs from tests/data, find the number of matches for each symbol, and compare the stats with the current sigs?
thoughts @williballenthin
yes, exactly. you can use that script to see which functions are matched given a signature file.
i'm optimistic that both sets cover about the same functions, but i'm really not sure.
feel free to present the data in any way that makes sense to you, showing the trade offs between the two signature sets.
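One way to tally how often each symbol matches is sketched below. It assumes the matched names per sample have already been collected (for example from the output of a modified scripts/match-function-id.py); `tally_symbol_matches` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_symbol_matches(matches_by_file: Dict[str, Iterable[str]]) -> Counter:
    """Count in how many files each matched symbol name appears."""
    counts: Counter = Counter()
    for names in matches_by_file.values():
        # Use a set so a symbol is counted at most once per file.
        counts.update(set(names))
    return counts


# Made-up example data: sample name -> symbol names matched in that sample.
example = {
    "sample1.exe_": ["_exit", "_memcpy", "_exit"],
    "sample2.exe_": ["_exit"],
}
for name, n in tally_symbol_matches(example).most_common():
    print(f"{name}\t{n}")
```

Running the same tally once per signature set gives two tables that can be compared directly.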
i can also provide a collection of random files or you can use the files in capa-testfiles. the second idea might be a bit easier since they're already available.
will go ahead with capa-testfiles. @williballenthin If possible, wouldn't it be better to directly check which source of sigs covers more symbols?
@williballenthin
In the screenshot above there are three terminals. Each shows a txt file generated by the match-function-id.py script with some modifications; the file lists the functions found and how many times each was found. Each terminal also shows the order in which the sigs are compiled. Currently I am using the capa/sigs files. In the left and right terminals the compile order is reversed, and the function found on line 27 differs between the two cases:
_exit and _Curl_hash_clean. Is this because the same function has different names in different sigs? The middle terminal shows the results when using only 3_flare_common_libs.sig. The sample is PMA 12-02.exe.
So, should this later be treated as a single function, or counted as two different symbols for coverage?
Also, is the symbol '?' in the first line a valid function?
what i'm most interested in is whether the rizin database matches approximately the same number of functions (or more) as the siglib databases. it's reassuring to see about the same names in the results above! i don't think it's important to investigate every difference - just the approximate total counts and coverage.
i think "?" is possibly used to indicate there were matches but it's not clear which one (ambiguous match).
i don't quite understand your question, though. did the above explanation help? or if you need a different response, can you rephrase the question?
Matches for functions in PMA files are low when the rizin/sigdb files are used, but rizin/sigdb shows more matches for files other than the PMA ones. Should I include the PMA files in my tests?
yeah please include all the testfiles, if possible
@williballenthin
running
python3 scripts/match-function-id.py tests/data/6cc148363200798a12091b97a17181a1.exe_ --signature sigs/1_flare_msvc_rtf_32_64.sig
gives an error at function 0x1401d4e60: 'NoneType' object is not subscriptable. Can you think of what's causing the error here?
Traceback (most recent call last):
  File "/Users/ayush.goel/Documents/GitHub/capa/scripts/match-function-id.py", line 134, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/ayush.goel/Documents/GitHub/capa/scripts/match-function-id.py", line 126, in main
    name = viv_utils.flirt.match_function_flirt_signatures(analyzer.matcher, vw, function)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/viv_utils/flirt.py", line 188, in match_function_flirt_signatures
    loc_va = vw.getLocation(ref_va)[vivisect.const.L_VA]
             ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable
looks like getLocation is returning None when we expect it to be a tuple. this needs a fix in viv-utils. you're welcome to propose this or i can do it tomorrow. in the meantime, maybe update the script to catch such an exception?
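A minimal sketch of the interim workaround, catching the exception per function so one bad function does not abort the whole run. Here `match_one` is a stand-in callable (in the script it would wrap the viv_utils.flirt.match_function_flirt_signatures call from the traceback); `match_all` and `fake_match` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def match_all(
    functions: Iterable[int],
    match_one: Callable[[int], Optional[str]],
) -> Tuple[Dict[int, str], List[Tuple[int, str]]]:
    """Match every function, recording failures instead of crashing."""
    names: Dict[int, str] = {}
    failures: List[Tuple[int, str]] = []
    for va in functions:
        try:
            name = match_one(va)
        except TypeError as e:
            # e.g. "'NoneType' object is not subscriptable" raised by the matcher
            failures.append((va, str(e)))
            continue
        if name:
            names[va] = name
    return names, failures


def fake_match(va: int) -> Optional[str]:
    # Stand-in matcher for illustration: fails on one address like the report above.
    if va == 0x1401D4E60:
        raise TypeError("'NoneType' object is not subscriptable")
    return "_exit" if va == 0x401000 else None


names, failures = match_all([0x401000, 0x401100, 0x1401D4E60], fake_match)
for va, err in failures:
    print(f"skipped 0x{va:x}: {err}")
```

Keeping the list of failures makes it possible to report how many functions were skipped, so the coverage numbers are not silently skewed.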
looks like getLocation is returning None when we expect it to be a tuple. this needs a fix in viv-utils.
Done in PR https://github.com/williballenthin/viv-utils/pull/116
- With capa/sigs, a total of 8282 unique symbols were found across all PEs.
- With rizin/sigdb, a total of 5781 unique symbols were found across all PEs.
- Across all symbol matches in all PEs, only 1099 functions had the same name under both the capa and rizin sigs. The txt files below contain the unique functions found with each set: allFounds_capa_sigs.txt allFounds_rizin_sigs.txt
- Of the 219 .exe_ files in capa-testfiles that were processed, rizin gave more matches than capa for 101 PEs. For the remaining 118 PEs, capa gave better results.
The graph below shows the number of unique function matches on all files using the capa and rizin sigs.
The image below shows the difference in matches for all files.
The .xls below contains the results behind the graphs above.
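For reference, summary figures of this kind (unique symbols per set, shared names, per-file comparison) can be derived from per-sample match sets roughly as follows. This is a sketch with made-up data, not the script that produced the numbers above; `compare_signature_sets` is a hypothetical name, and counting ties separately is an assumption.

```python
from typing import Dict, Set


def compare_signature_sets(
    capa: Dict[str, Set[str]], rizin: Dict[str, Set[str]]
) -> Dict[str, int]:
    """Summarize coverage of two signature sets over the same samples."""
    capa_all = set().union(*capa.values())
    rizin_all = set().union(*rizin.values())
    summary = {
        "capa_unique": len(capa_all),
        "rizin_unique": len(rizin_all),
        "shared_names": len(capa_all & rizin_all),
        "rizin_wins": 0,
        "capa_wins": 0,
        "ties": 0,
    }
    # Per-sample comparison: which set matched more functions in each file.
    for sample in sorted(set(capa) | set(rizin)):
        c = len(capa.get(sample, set()))
        r = len(rizin.get(sample, set()))
        if r > c:
            summary["rizin_wins"] += 1
        elif c > r:
            summary["capa_wins"] += 1
        else:
            summary["ties"] += 1
    return summary


# Made-up example data: sample name -> set of matched symbol names.
capa_matches = {"a.exe_": {"_exit", "_memcpy", "_strlen"}, "b.exe_": {"_exit"}}
rizin_matches = {"a.exe_": {"_exit"}, "b.exe_": {"_exit", "_memset"}}
print(compare_signature_sets(capa_matches, rizin_matches))
```

Note that comparing by name undercounts overlap when the two sets name the same function differently, as with _exit and _Curl_hash_clean earlier in this thread.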
@williballenthin Based on the above results, it would be better to stick with the current capa sigs. What are your thoughts?
Interesting results. Do you have insight into which files rizin handles better than the capa signatures? Maybe we can leverage a subset of the rizin rules?
The Excel file attached above shows which files rizin handles better. Currently, I don't know of a way to classify the PE files used for testing. Do you have any suggestions for how I can classify the files?
detect it easy could shed some light on compiler/linker versions
Sharing the compiler and linker results found using Detect It Easy. In the Excel file below, empty cells mean DIEC returned no results. updatedMatches.xlsx
At first look, I wasn't able to detect any patterns in the files that rizin handles better; I will continue looking into it. @mr-tz @williballenthin Could you share your views if you find any patterns in the results in the above Excel file?
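One way to look for such patterns is to group the per-file results by the linker string reported by Detect It Easy and count which signature set matched more in each group. The sketch below assumes rows of (linker, capa match count, rizin match count); `wins_by_linker` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[Optional[str], int, int]  # (linker, capa matches, rizin matches)


def wins_by_linker(rows: Iterable[Row]) -> Dict[str, Dict[str, int]]:
    """Count, per linker string, how often each signature set matched more."""
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"rizin": 0, "capa": 0, "tie": 0}
    )
    for linker, capa_count, rizin_count in rows:
        # Empty cells (no DIE result) are grouped under a placeholder key.
        key = linker or "(no DIE result)"
        if rizin_count > capa_count:
            summary[key]["rizin"] += 1
        elif capa_count > rizin_count:
            summary[key]["capa"] += 1
        else:
            summary[key]["tie"] += 1
    return dict(summary)


# Made-up rows for illustration only.
rows = [("linker A", 10, 12), ("linker A", 5, 5), ("", 7, 3), (None, 1, 2)]
for linker, counts in wins_by_linker(rows).items():
    print(linker, counts)
```

Groups with only a handful of files should be read with caution, since one or two samples can flip the result.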
Cool, thanks for the research here. Seems like rizin does slightly better on
Microsoft Linker(14.11, Visual Studio 2017 15.3*)[Console32,console]
Microsoft Linker(14.0, Visual Studio 2015 14.0*)[Console64,console]
Microsoft Linker(14.0, Visual Studio 2015 14.0*)[GUI32]
Microsoft Linker(14.26, Visual Studio 2019 16.6*)[Console32,console]
Microsoft Linker(14.0, Visual Studio 2015 14.0*)[GUI64]
Microsoft Linker(14.0, Visual Studio 2015 14.0*)[Console32,console]
Microsoft Linker(14.32, Visual Studio 2022 17.2*)[GUI64]
But overall, our signatures seem to do pretty well and I don't see a reason to change based on this.
@mr-tz, thanks for reviewing the results. Let me know if you have any further thoughts or if there's anything else regarding this.
@williballenthin, what do you think of these results, can we close this issue (for now)?
yeah, i agree, let's keep using the signatures that we have; however, if they become noticeably out of date and/or rizin introduces other relevant signatures, let's consider whether it becomes worthwhile to switch over.
@Aayush-Goel-04 thank you very much for taking the time to do this data exploration. although it didn't lead to a merged PR, this is a better outcome - no additional work. i appreciated the way you collected data and presented the results, sharing raw information when we asked for it. 🙇🏼‍♂️