Yara cannot scan directories and files with non-ANSI characters in the name
YARA's code uses the char (8-bit) character type and string functions, as well as the ANSI APIs on Windows, to list directories and open files. As a result, an "error scanning ...: could not open file" message is printed whenever a file name contains one or more non-ANSI characters. Directories with such names are affected in the same way. Switching to a different code page does not solve the problem either. A single non-ANSI character can therefore make a file, or an entire directory, unavailable for scanning.

This is a serious security risk. Malware only needs a single non-ANSI character in a file name to bypass scanning. Since many non-ANSI characters print (almost) identically to existing ANSI characters, it is also easy to hide the reason why a file or directory could not be scanned.
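To make the mechanism concrete, here is a minimal sketch in Python (not YARA's actual code). It assumes the ANSI code page is Windows-1252 and mimics what happens when a Unicode file name is squeezed into 8-bit characters; the helper name `to_ansi_name` is made up for this illustration.

```python
# Assumption: the system ANSI code page is Windows-1252 ("cp1252").
ANSI_CODE_PAGE = "cp1252"


def to_ansi_name(name, code_page=ANSI_CODE_PAGE):
    """Round-trip a Unicode file name through an 8-bit code page.

    Characters that have no mapping in the code page come back as '?',
    so the resulting name no longer identifies the original file.
    """
    return name.encode(code_page, errors="replace").decode(code_page)


latin = "malware.exe"          # Latin letters only
cyrillic = "m\u0430lware.exe"  # second letter is Cyrillic U+0430

print(to_ansi_name(latin))     # malware.exe
print(to_ansi_name(cyrillic))  # m?lware.exe
print(latin == cyrillic)       # False: the two names only look alike
```

A Cyrillic code page such as cp1251 would map this particular character, but it would then fail on other scripts (Greek, for example), which is why changing the code page only moves the problem.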
This affects the Windows version only. Here is an illustration of the issue. I have a simple directory called Malware. It holds a YARA rule file, malware.yar, that contains:

    rule no_error_scanning
    {
        strings:
            $a = "malware"
        condition:
            $a
    }
The directory structure is:
    Directory of C:\Users\Malware

    08-10-20  17:34    <DIR>          .
    08-10-20  17:34    <DIR>          ..
    08-10-20  17:19    <DIR>          cаnnot_scan_this_dir
    08-10-20  17:46                15 malware.exe
    08-10-20  17:16                76 malware.yar
    08-10-20  17:01                15 mаlware.exe
    08-10-20  17:18    <DIR>          this_dir_is_scanned
                   3 File(s)            106 bytes

    Directory of C:\Users\Malware\cаnnot_scan_this_dir

    08-10-20  17:19    <DIR>          .
    08-10-20  17:19    <DIR>          ..
    08-10-20  17:46                15 other_malware.exe
                   1 File(s)             15 bytes

    Directory of C:\Users\Malware\this_dir_is_scanned

    08-10-20  17:18    <DIR>          .
    08-10-20  17:18    <DIR>          ..
    08-10-20  17:46                15 some_malware.exe
                   1 File(s)             15 bytes
There is a test file called malware.exe, which is just a plain text file with the following contents:
This is malware
Its timestamp is 17:46. There is also another file that appears to have the same name, but with timestamp 17:01. Its contents are identical, but the name contains a Cyrillic letter 'a' (Unicode U+0430) instead of a Latin alphabet 'a'.
There is a directory "this_dir_is_scanned" that contains an identical copy of the test file, named some_malware.exe, with timestamp 17:46. The directory "cаnnot_scan_this_dir" likewise contains a copy (other_malware.exe). However, that directory name again contains a Cyrillic letter 'a'.
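Because the two names are visually indistinguishable, it can help to list the code points explicitly. The following is a small sketch using Python's standard `unicodedata` module; the function name `describe_non_ascii` is hypothetical and only serves this illustration.

```python
import unicodedata


def describe_non_ascii(name):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(name)
        if ord(ch) > 0x7F
    ]


print(describe_non_ascii("malware.exe"))
# []
print(describe_non_ascii("m\u0430lware.exe"))
# [(1, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```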
Here is the result of the scan, yara version 4.02:
    yara -r malware.yar .
    error scanning .\m?lware.exe: could not open file
    no_error_scanning .\malware.exe
    no_error_scanning .\malware.yar
    no_error_scanning .\this_dir_is_scanned\some_malware.exe

The directory cаnnot_scan_this_dir is silently skipped: it is not scanned, and no error is reported for it. The "could not open file" error refers to the file whose name contains the Cyrillic letter 'a'.
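Until this is fixed, one way to at least notice what was skipped is to walk the tree separately and flag every name that cannot be represented in the ANSI code page. This is a rough sketch of such an audit, not part of YARA; it assumes code page 1252, and `find_unrepresentable` is a made-up helper name.

```python
import os


def is_representable(name, code_page="cp1252"):
    """True if every character of name exists in the given 8-bit code page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def find_unrepresentable(root, code_page="cp1252"):
    """Return paths under root whose file or directory name would be
    damaged by a conversion to code_page (assumed ANSI code page)."""
    flagged = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_representable(name, code_page):
                flagged.append(os.path.join(dirpath, name))
    return sorted(flagged)


if __name__ == "__main__":
    # Report every entry below the current directory that an
    # ANSI-based scan would likely miss.
    for path in find_unrepresentable("."):
        print("not representable in cp1252:", path)
```

On Windows, Python's `mbcs` codec refers to the active ANSI code page and could be passed as `code_page` instead of the hard-coded assumption.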