startbootstrap Add parser for PowerShell transcript files

One line description of pull request

This PR adds a parser for PowerShell transcript files.

Description:

If enabled (e.g. via Start-Transcript) PowerShell transcript files record all or part of a PowerShell session to a text file. This PR adds a parser for these transcript files. Unfortunately the format can vary depending on system language and configuration. I tried to write this parser as general as possible and tested it with transcripts from different systems.

Notes:

All contributions to Plaso undergo code review. This makes sure that the code has appropriate test coverage and conforms to the Plaso style guide.

One of the maintainers will examine your code, and may request changes. Check off the items below in order, and then a maintainer will review your code.

Checklist:

[ ] Automated checks (Travis, Codecov, Codefactor )pass
[x] No new new dependencies are required or l2tdevtools has been updated
[x] Reviewer assigned

Jul 21 '22 13:07 FabFaeb

Codecov Report

Patch coverage: 89.92% and project coverage change: +0.01 :tada:

Comparison is base (b035163) 85.88% compared to head (3225147) 85.89%.

:exclamation: Current head 3225147 differs from pull request most recent head 39ca2c1. Consider uploading reports for the commit 39ca2c1 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4168      +/-   ##
==========================================
+ Coverage   85.88%   85.89%   +0.01%     
==========================================
  Files         411      412       +1     
  Lines       35097    35226     +129     
==========================================
+ Hits        30143    30259     +116     
- Misses       4954     4967      +13

Impacted Files	Coverage Δ
...laso/parsers/text_plugins/powershell_transcript.py	`89.92% <89.92%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

Jul 24 '22 09:07 codecov[bot]

Hi @jleaniz - thanks for your fast review! I took care of your style guide comments and hopefully fixed all occurrences. If you happen to find any more: don't hesitate to give me a hint.

Regarding the parser interface: I experimented with the PyparsingMultiLineTextParser in the beginning but was not able to make it work. My problem was that the structure of the transcript files tends to vary quite a bit. E.g. the detection of "is this line still a command or not" seemed very complicated with the TextParsers since these lines can contain basically everything - but that may also be due to me being inexperienced with writing Plaso Parsers... In the end, I felt that it was easier to map the lines to their respective field when being able to "view" the whole file at once which is why I chose to use the FileObjectParser.

Jul 26 '22 11:07 FabFaeb

Thanks for addressing my comments. Regarding the choice of parser interface, I think that's a fair reason to use FileObjectParser.

Jul 27 '22 18:07 jleaniz

@jleaniz thanks again for your review! I implemented a few of your comments and elaborated on the other ones directly under your comment.

Jul 29 '22 16:07 FabFaeb

@Onager thanks for your review. As I commented a while ago, I tried to implement this parser initially with the multilineparser but couldn't quite make it work. I will take a look at the multilineparser once again and see if I can find a way to get it working this time. Thanks also for your gist - if that's fine I would just reach out in case of further questions. I will also happily provide more sample data for testing.

Unfortunately, all of this might take some time - so if you prefer that, I could open another PR once I am done.

Aug 02 '22 15:08 FabFaeb

Apologies for missing the earlier discussion about using pyparsing - I'll take a look and see if I can spot any solutions. Definitely feel free to reach out directly if you have questions.

Aug 02 '22 15:08 Onager

Hi @Onager and @jleaniz, I found some time to look at this again and managed to rewrite the parser based on pyparsingmultilineparser. Please take a look when you have some time and let me know what you think.

A few preliminary remarks:

Unfortunately, it proved pretty hard to write a pyparsing grammar that would parse all lines correctly by itself. You can see an example for this in the new powershell_transcript_ger.txt test file: While line 35 is actually a transcript_line, it looks like a metadata_line for the parser. For this reason I had to add some logic to "re-classify" lines in some cases (see lines 309ff. in the parser).
Per line 300ff. in the parser, the last line of a transcript will currently not be parsed correctly, if this transcript is not closed cleanly (which it will be most of the time). To do this correctly, the parser would need to know, when it got the last line of a text file. I have not found a way to accomplish this - maybe one of you has an idea?

I'm looking forward to your review!

Aug 26 '22 12:08 FabFaeb

Hi @sydp , I just implemented most of the changes you suggested. The __del__ is indeed no longer needed after I switched to instance variables and the since the file is now parsed with the right encoding, special characters are shown correctly now - thanks for the pointing me in the right direction!

Sep 02 '22 12:09 FabFaeb

@Onager @FabFaeb what is the status of this PR?

Nov 01 '22 05:11 joachimmetz

@joachimmetz This is waiting for me, I started reviewing this the other day but ran into some format issues

Nov 01 '22 16:11 Onager

@Onager this PR needs to be brought up to date as well.

Nov 01 '22 17:11 joachimmetz

@FabFaeb PTAL I've updated the code to work with HEAD and various clean up. Please check if the parser still works with your understanding of the format. I also left a couple of questions / comment.

Jan 02 '23 08:01 joachimmetz

@Onager PTAL from a code review perspective and seeing you appear to have looked at the format a bit

Jan 02 '23 08:01 joachimmetz

Hi @joachimmetz, thanks for taking a look at this! I just had a few minutes to check your changes and it seems like the parser/plugin does not work correctly anymore. While it still produces the expected amount of events, the events all seem to be duplicates of the last entry in the transcript. E.g. for the provided powershell_transcript_ger.txt example, the ...ping 127.0.0.1 -n 5... event/entry is created two times.

Unfortunately, I don't have too much time to look deeper into this right now. Maybe you have a quick idea about this issue? If I have some time at my hands in the coming weeks, I'll also try to debug this further and report back here.

I'll answer to your inline comments in a minute.

Jan 06 '23 15:01 FabFaeb

the events all seem to be duplicates of the last entry in the transcript.

let me take a look at that, thx for flagging. Might be related to the use of copy previously.

Unfortunately, I don't have too much time to look deeper into this right now.

No hurry, coming weeks is fine.

Jan 07 '23 02:01 joachimmetz

re-added the use of copy.deepcopy of the event_data object

Jan 07 '23 08:01 joachimmetz

With your last commit, the events are created correctly again!

I found that you removed my TODO comment regarding the last entry in a transcript that was not exited cleanly https://github.com/log2timeline/plaso/blob/15de114d41a3390dcfabf042fc8993313ed0c7e7/plaso/parsers/powershell_transcript.py#L318 However, I think that this is still an open issue, right? I stumbled upon this due to the still existing comment in the test file. https://github.com/log2timeline/plaso/blob/e8c36a99e7ce727e03ec6ab1a30978f82b21cc13/tests/parsers/text_plugins/powershell_transcript.py#L61 In short: if a transcript is exited suddenly, and no footer info is added, currently the last entry is not parsed. E.g. in the powershell_transcript_ger.txt, the last entry PS C:\Users\User> rm .\filename is not added as an event, since the finalizing separator line is not there. This is also the case in the most recent version.

As I wrote in my original comment, I think we needed to know that the currently parsed line is the last line in the file, so that we could wrap up the last event. Do you see any way to do this or do you maybe have another idea how to fix this? Or do we have to just accept this limitation?

Besides that, now everything seems to be working as expected to me.

Jan 09 '23 08:01 FabFaeb

I found that you removed my TODO comment regarding the last entry in a transcript that was not exited cleanly

I replaced it with https://github.com/log2timeline/plaso/pull/4168/files#diff-d263ca460ae4064ef96c2981245da460bd3a4cf0438265403bb7715a14a305e3R128 but needed to know some more context first. I might have an idea to add support for it be as processing warnings to begin with.

Jan 09 '23 16:01 joachimmetz

Note to self add footer support / incomplete file warning

Jan 09 '23 16:01 joachimmetz

startbootstrap startbootstrap copied to clipboard

Add parser for PowerShell transcript files

One line description of pull request

Description:

Notes:

Checklist:

Codecov Report

startbootstrap
startbootstrap copied to clipboard