startbootstrap
startbootstrap copied to clipboard
Add parser for PowerShell transcript files
One line description of pull request
This PR adds a parser for PowerShell transcript files.
Description:
If enabled (e.g. via Start-Transcript) PowerShell transcript files record all or part of a PowerShell session to a text file. This PR adds a parser for these transcript files. Unfortunately the format can vary depending on system language and configuration. I tried to write this parser as general as possible and tested it with transcripts from different systems.
Notes:
All contributions to Plaso undergo code review. This makes sure that the code has appropriate test coverage and conforms to the Plaso style guide.
One of the maintainers will examine your code, and may request changes. Check off the items below in order, and then a maintainer will review your code.
Checklist:
- [ ] Automated checks (Travis, Codecov, Codefactor )pass
- [x] No new new dependencies are required or l2tdevtools has been updated
- [x] Reviewer assigned
Codecov Report
Patch coverage: 89.92
% and project coverage change: +0.01
:tada:
Comparison is base (
b035163
) 85.88% compared to head (3225147
) 85.89%.
:exclamation: Current head 3225147 differs from pull request most recent head 39ca2c1. Consider uploading reports for the commit 39ca2c1 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main #4168 +/- ##
==========================================
+ Coverage 85.88% 85.89% +0.01%
==========================================
Files 411 412 +1
Lines 35097 35226 +129
==========================================
+ Hits 30143 30259 +116
- Misses 4954 4967 +13
Impacted Files | Coverage Δ | |
---|---|---|
...laso/parsers/text_plugins/powershell_transcript.py | 89.92% <89.92%> (ø) |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Hi @jleaniz - thanks for your fast review! I took care of your style guide comments and hopefully fixed all occurrences. If you happen to find any more: don't hesitate to give me a hint.
Regarding the parser interface:
I experimented with the PyparsingMultiLineTextParser
in the beginning but was not able to make it work. My problem was that the structure of the transcript files tends to vary quite a bit. E.g. the detection of "is this line still a command or not" seemed very complicated with the TextParsers since these lines can contain basically everything - but that may also be due to me being inexperienced with writing Plaso Parsers...
In the end, I felt that it was easier to map the lines to their respective field when being able to "view" the whole file at once which is why I chose to use the FileObjectParser
.
Thanks for addressing my comments. Regarding the choice of parser interface, I think that's a fair reason to use FileObjectParser
.
@jleaniz thanks again for your review! I implemented a few of your comments and elaborated on the other ones directly under your comment.
@Onager thanks for your review. As I commented a while ago, I tried to implement this parser initially with the multilineparser but couldn't quite make it work. I will take a look at the multilineparser once again and see if I can find a way to get it working this time. Thanks also for your gist - if that's fine I would just reach out in case of further questions. I will also happily provide more sample data for testing.
Unfortunately, all of this might take some time - so if you prefer that, I could open another PR once I am done.
Apologies for missing the earlier discussion about using pyparsing - I'll take a look and see if I can spot any solutions. Definitely feel free to reach out directly if you have questions.
Hi @Onager and @jleaniz, I found some time to look at this again and managed to rewrite the parser based on pyparsingmultilineparser. Please take a look when you have some time and let me know what you think.
A few preliminary remarks:
- Unfortunately, it proved pretty hard to write a pyparsing grammar that would parse all lines correctly by itself. You can see an example for this in the new
powershell_transcript_ger.txt
test file: While line 35 is actually atranscript_line
, it looks like ametadata_line
for the parser. For this reason I had to add some logic to "re-classify" lines in some cases (see lines 309ff. in the parser). - Per line 300ff. in the parser, the last line of a transcript will currently not be parsed correctly, if this transcript is not closed cleanly (which it will be most of the time). To do this correctly, the parser would need to know, when it got the last line of a text file. I have not found a way to accomplish this - maybe one of you has an idea?
I'm looking forward to your review!
Hi @sydp ,
I just implemented most of the changes you suggested.
The __del__
is indeed no longer needed after I switched to instance variables and the since the file is now parsed with the right encoding, special characters are shown correctly now - thanks for the pointing me in the right direction!
@Onager @FabFaeb what is the status of this PR?
@joachimmetz This is waiting for me, I started reviewing this the other day but ran into some format issues
@Onager this PR needs to be brought up to date as well.
@FabFaeb PTAL I've updated the code to work with HEAD and various clean up. Please check if the parser still works with your understanding of the format. I also left a couple of questions / comment.
@Onager PTAL from a code review perspective and seeing you appear to have looked at the format a bit
Hi @joachimmetz,
thanks for taking a look at this!
I just had a few minutes to check your changes and it seems like the parser/plugin does not work correctly anymore. While it still produces the expected amount of events, the events all seem to be duplicates of the last entry in the transcript. E.g. for the provided powershell_transcript_ger.txt
example, the ...ping 127.0.0.1 -n 5...
event/entry is created two times.
Unfortunately, I don't have too much time to look deeper into this right now. Maybe you have a quick idea about this issue? If I have some time at my hands in the coming weeks, I'll also try to debug this further and report back here.
I'll answer to your inline comments in a minute.
the events all seem to be duplicates of the last entry in the transcript.
let me take a look at that, thx for flagging. Might be related to the use of copy previously.
Unfortunately, I don't have too much time to look deeper into this right now.
No hurry, coming weeks is fine.
re-added the use of copy.deepcopy of the event_data object
With your last commit, the events are created correctly again!
I found that you removed my TODO
comment regarding the last entry in a transcript that was not exited cleanly
https://github.com/log2timeline/plaso/blob/15de114d41a3390dcfabf042fc8993313ed0c7e7/plaso/parsers/powershell_transcript.py#L318
However, I think that this is still an open issue, right? I stumbled upon this due to the still existing comment in the test file.
https://github.com/log2timeline/plaso/blob/e8c36a99e7ce727e03ec6ab1a30978f82b21cc13/tests/parsers/text_plugins/powershell_transcript.py#L61
In short: if a transcript is exited suddenly, and no footer info is added, currently the last entry is not parsed. E.g. in the powershell_transcript_ger.txt
, the last entry PS C:\Users\User> rm .\filename
is not added as an event, since the finalizing separator line is not there. This is also the case in the most recent version.
As I wrote in my original comment, I think we needed to know that the currently parsed line is the last line in the file, so that we could wrap up the last event. Do you see any way to do this or do you maybe have another idea how to fix this? Or do we have to just accept this limitation?
Besides that, now everything seems to be working as expected to me.
I found that you removed my TODO comment regarding the last entry in a transcript that was not exited cleanly
I replaced it with https://github.com/log2timeline/plaso/pull/4168/files#diff-d263ca460ae4064ef96c2981245da460bd3a4cf0438265403bb7715a14a305e3R128 but needed to know some more context first. I might have an idea to add support for it be as processing warnings to begin with.
Note to self add footer support / incomplete file warning