Add parser for unsaved Windows Notepad tabs
Just like Notepad++, the Windows Notepad application for Windows 11 can now restore unsaved tabs when you re-open the application. This blog explains where the tab data is stored and notes that the contents should be viewable somehow.
It turned out that there was a bit more to it than running strings or grep on it. The application stores the tabs in different formats, depending on the size and some other unknown factors. I've even encountered a file where 26 characters of text were encoded as 34(?!) separate blocks; each block containing a length field, a single character and a CRC32 checksum of that very small block. Using grep or strings on that file would not have yielded any results. Information stored in these Notepad tabs may be helpful during forensic investigations and/or incident response cases.
The file format uses LEB128 variable-length integer encoding for the block sizes, which is not yet present in the dissect framework. Therefore, this PR depends on https://github.com/fox-it/dissect.cstruct/pull/69, and so ultimately on a new major release of dissect.cstruct.
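For readers unfamiliar with the encoding, here is a minimal Python sketch of unsigned LEB128: each byte carries seven bits of the value, least significant group first, and the high bit signals that another byte follows. The function name `encode_uleb128` is illustrative only and is not taken from dissect.cstruct.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F  # low seven bits of what remains
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)  # high bit clear: this is the last byte
            return bytes(out)


print(encode_uleb128(2).hex(" "))      # 02
print(encode_uleb128(65536).hex(" "))  # 80 80 04
```

Small values fit in a single byte, while larger ones simply grow by another byte per seven bits, so a size field encoded this way has no fixed upper limit.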
The dissect/target/plugins/apps/texteditor folder was created, with corresponding new record types, so unsaved tabs from other text editors can also be added in the future.
Refactored the code to work with cstruct v3 after https://github.com/fox-it/dissect.cstruct/pull/73 has been merged.
Heyo! You guys still working on this? I have gotten quite a bit labeled in my tabstate util. Found you guys mentioned this on John Hammond's tweet about the video going over this format.
I know there is a crc32 after the weird timestamp, but I have not implemented that.
I was wondering, because looking at the code you guys have, so far, there might be a few things you guys are missing. I don't really understand how this package interprets the C code, but if it's just like normal C structures, I think the magic and header start might be off, here.
This format seems to change a lot, depending on the "state" it is in (new tab, unsaved tab and saved tab). I was wondering if maybe @joost-j might have some more work done on this, and maybe we can collaborate to fill out the rest? Here is my tabstate util for Rust.
Hi @Nordgaren, I think it's a great idea to collaborate on this! I have observed different storage formats as well, but could not yet link them to any action with regard to the "state", so I'm really interested in that as well. Let's try to fill out the rest indeed!
As mentioned in your repo, I'll contact you on Discord.
Hi,
I recently caught on to the Notepad tab hype and this plugin seems premature, IMO. It appears that Notepad tab files are a single format with three states:
- Unsaved file
- Saved file
- Saved file with unsaved data
The file extension can also change slightly depending on whether the file is open somewhere else or not (.tmp files vs. just .bin).
As it's written, this plugin will attempt to parse unsaved and saved files and will miss a lot of things, even in the unsaved state. For example, these structs can't be right at all:
struct multi_block_entry {
    uint16 offset;
    uleb128 len;
    wchar data[len];
    char crc32[4];
};

struct single_block_entry {
    uint16 offset;
    uleb128 len;
    wchar data[len];
    char unk1;
    char crc32[4];
};
In a large file, the offset cannot be a uint16 (65535 max size) because files can have more than 65535 characters. It's more than likely a uleb128. Additionally, these structs don't address what happens when a character is deleted, because the structure is different. If I add the letter r at offset 2 in the unsaved state, the file will append: 02 00 01 72 00 A0 29 D5 3E
Where 02 is the offset, 00 is "unknown" (not really unknown, read below), 01 is the uleb128 length, 72 00 is the wchar, then the crc32.
But if I remove that r, the file appends: 02 01 00 E5 DE 3C 3D. Same offset "02" but the next value is a "01" which is the length of the item I removed (if you highlight and delete multiple characters, it'll be whatever the length of the deleted characters was).
What this all equates to is that if the second uleb128 value is 00, then a character has been added. If it's anything other than zero, that many characters have been removed.
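The two byte sequences above can be decoded with a short Python sketch. It assumes the record layout inferred in this comment (uleb128 offset, uleb128 deleted length, uleb128 added length, UTF-16LE characters, 4 CRC32 bytes). The names `EditRecord`, `parse_edit_record` and `apply_edit` are hypothetical, and the CRC32 is only extracted, not verified, because the thread does not state which bytes it covers.

```python
from __future__ import annotations

from typing import NamedTuple


def read_uleb128(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one unsigned LEB128 integer at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this integer
            return value, pos
        shift += 7


class EditRecord(NamedTuple):
    offset: int   # character offset the edit applies to
    deleted: int  # number of characters removed (0 for a pure insert)
    added: int    # number of characters inserted (0 for a pure delete)
    text: str     # the inserted characters, if any
    crc32: bytes  # raw checksum bytes, kept as stored


def parse_edit_record(buf: bytes, pos: int = 0) -> tuple[EditRecord, int]:
    """Parse one edit record (assumed layout) and return it with the next position."""
    offset, pos = read_uleb128(buf, pos)
    deleted, pos = read_uleb128(buf, pos)
    added, pos = read_uleb128(buf, pos)
    text = buf[pos:pos + 2 * added].decode("utf-16-le")  # wchar = 2 bytes each
    pos += 2 * added
    crc32 = buf[pos:pos + 4]
    return EditRecord(offset, deleted, added, text, crc32), pos + 4


def apply_edit(text: str, rec: EditRecord) -> str:
    """Replay a record on a text buffer (assumed semantics: delete, then insert)."""
    return text[:rec.offset] + rec.text + text[rec.offset + rec.deleted:]


insert_rec, _ = parse_edit_record(bytes.fromhex("02 00 01 72 00 A0 29 D5 3E"))
delete_rec, _ = parse_edit_record(bytes.fromhex("02 01 00 E5 DE 3C 3D"))

buffer = apply_edit("hello", insert_rec)  # "herllo"
buffer = apply_edit(buffer, delete_rec)   # back to "hello"
print(insert_rec, delete_rec, buffer)
```

Reading the offset as a uleb128 rather than a uint16 also removes the 65535 limit mentioned above; the starting text "hello" is made up purely for the demonstration.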
With all that said, I've actually managed to figure out the notepad tab format (with the exception of a single u8 field) in all three states. I'll be posting about it soon and I'd recommend waiting before adding this plugin.
@daddycocoaman will you be contributing your findings?
I have given all of this data to Joost, myself, a while ago. I will be making a full write-up shortly. I didn't have time when I told him I would, but everything listed here is in my util, except the crc. I just haven't taken the time to put it in yet. Unexpected life events. haha.
So you guys should be able to implement what I have so far, and the markdown file should follow this weekend. Maybe even a .bt file if I can manage!
I can confirm most of what you said here. It is uleb128, as Joost identified. You got the saved states, pretty much. There is the new tab state (which I believe you are calling unsaved) and additionally something like a "soft save", which happens when Notepad closes but the tab stays open without a filepath. This will write all of the buffer contents to the file, instead of the weird keystroke meme that it seems to be in the new tab state.
Did you get the entire metadata structure in the saved file state? That is pretty much the only data I haven't figured out all the way, but I know its size.
You can check out the tabstate-util crate on my GitHub, if you want to cross-reference your findings. Might be good. There are some weird curveballs in this format.
I believe there still might be some structures that only appear in special conditions. Those are kinda the hardest to work out.
I think a good idea now would be to consolidate test files so we can get all of the possible structures available for testing/parsing.
There is also this issue on my repo which has a lot of good information
That is a great thread. To answer your question, yes I have all the fields of the saved state, including the cursor locations, timestamp, and a few other things in the format. There's no special delimiters in the format. Everything has a specific value.
To answer @Schamper, I hadn't actually heard of dissect before the comments on the John Hammond video. I think I might just put out the format and let people adopt it however they liked. Personally, as a red teamer, I have my own reasons. 😂
What did you find the single byte and then 4 byte int that mirrors it around the cursor start and end, to be?
@daddycocoaman
Personally, as a red teamer, I have my own reasons. 😂
Heya, fellow red teamer here: Could you maybe elaborate on this part? I'm genuinely curious why you wouldn't just share the information if you have it.
Dissect isn't just a blue-team tool, I use it as a red teamer myself all the time as it ships with some pretty sick file parsing capabilities (especially nice if you don't have 4TB of bandwidth to spare when all you need is like 200 bytes or something). Hell, I'd like to see this plugin get implemented as well, can think of some pretty cool stuff you could do with it.
If it's losing out on credit you're worried about, external contributors are always credited (as I was here and here) :)
Hi there, @ogmini and I both have implementations of the tabs that are near completion. ogmini has the C# implementation here and I have the ImHex pattern implementation here. Neither is 100% complete yet, but most of it is there.
My .bt is uploaded, now.
It's missing new tab state files (the files with keystrokes instead of just characters) and tabstate files with extra buffers after the main one, for now, but it should be otherwise accurate.
Heya, fellow red teamer here: Could you maybe elaborate on this part? I'm genuinely curious why you wouldn't just share the information if you have it.
I'm putting out a blog post through my employer (hopefully reviewed and posted this week once I submit it today). It's less about credit on this repo and more like making sure the work I put in to RE the format is written up and distributed more formally.
What did you find the single byte and then 4-byte int that mirrors it around the cursor start and end, to be?
Sorry, I misspoke when I said "all" (I was trying to be careful about not saying that). At any rate, the bytes after the cursor end are not a single field, but 4 separate byte fields that represent different boolean options. I've labeled them as wordWrapEnabled, rightToLeftEnabled, showUnicodeControlChars, and I'm missing the last one at the moment.
The byte before the cursor start always appears to be 01 and is the field I mentioned earlier that I hadn't figured out. There are a couple of other boolean options related specifically to the Notepad UWF app but I have no idea how to configure them in either Notepad or the registry (like GhostFile or ClassicEditor, which I thought would mean using the old Notepad). I even tried turning off the integrated Copilot 🥲
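As a rough illustration of the four option bytes described above, the following Python sketch maps them onto named fields. The class and function names are hypothetical, the field order simply follows the order in which the labels are listed in this comment, and the fourth byte is kept as a raw value because its meaning is still unknown. Later comments in this thread report that newer Notepad builds add further bytes, so this would only match the layout as understood at this point.

```python
from dataclasses import dataclass


@dataclass
class TabOptions:
    word_wrap_enabled: bool
    right_to_left_enabled: bool
    show_unicode_control_chars: bool
    unknown: int  # fourth byte, meaning not yet identified


def parse_tab_options(raw: bytes) -> TabOptions:
    """Interpret the 4 bytes following the cursor end as separate one-byte flags."""
    if len(raw) != 4:
        raise ValueError(f"expected 4 option bytes, got {len(raw)}")
    return TabOptions(
        word_wrap_enabled=bool(raw[0]),
        right_to_left_enabled=bool(raw[1]),
        show_unicode_control_chars=bool(raw[2]),
        unknown=raw[3],
    )


# Made-up sample bytes: word wrap on, everything else off.
print(parse_tab_options(bytes([0x01, 0x00, 0x00, 0x00])))
```

Treating the bytes as four separate flags rather than one 4-byte integer keeps each option independently readable, which matches the observation that they are not a single field.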
Ah! More Notepad options... Hadn't thought to even test those yet. I hope you'll link the blog post when it's published. This has been an interesting exercise and learning experience for me.
Not to change topics, have you looked at the Windowstate files? I started to take a stab at those to stop from fixating on the Tabstate files too much. It stores window size and position as one would guess. I've only started trying to figure out the rest of the file. https://github.com/ogmini/Notepad-Windowstate-Buffer
Amazing. Thank you! So now we just need the byte before the cursors and the byte at the end of those bools, then! Thank you for sharing! I will go label the 3 bools. The last byte could also be padding, maybe? Also, did you find anything for the 0x00 that comes after the sha256 hash and before the unknown 0x01 before the cursors?
I need to reconfirm but I think that first one is a null terminator for the SHA256. I remember seeing a comparison for 0x00 when it was comparing hashes but that could have been the data.
Also, unfortunately, it's not the only thing left. The third byte in the format after "NP" is not part of the magic header. It's used by the .0 and .1 temp state files for unsaved files.
I've tagged that as the sequence number for the .0 and .1 files. It goes up incrementally. It is also a uLEB128.
https://github.com/ogmini/Notepad-Tabstate-Buffer?tab=readme-ov-file#0bin--1bin
*Edit
Oh, I think I misread your post. You already know about the .0 and .1 file. I've still assumed it to be a sequence number for the bin file. Just that it always appears to be 0x00.
I have actually noticed that the 4th byte in the file, which is supposed to be the saved state, is also the count of remaining characters in a file, I think? Could be a uleb128, too, as you mentioned. The 3rd byte in the file does seem to change, and I think I have seen it change with the correct magic, as well. I can't remember for certain. Will try to dig a bit, shortly!
I need to reconfirm but I think that first one is a null terminator for the SHA256. I remember seeing a comparison for 0x00 when it was comparing hashes but that could have been the data.
Also, unfortunately, it's not the only thing left. The third byte in the format after "NP" is not part of the magic header. It's used by the .0 and .1 temp state files for unsaved files.
It would be weird for it to be a null terminator for the sha 256 hash, though, as it is a fixed sized hash in the file. 32 bytes, or 256 bits. But I also wouldn't put it past microsoft. It could also be padding, as well? One thing I find weird is that they are using a varint for a fixed sized int. I assume that is for future proofing, though. So maybe they just have a fixed sized varint (lol) and some extra padding to compensate? That also seems like an out there idea, but just throwing it out there in case it helps anyone else see something different.
So far it doesn't look like they care about padding or even alignment, tbh, although I haven't actually sat down to check. Plus, they could be reading the individual bytes for the sha 256 hash, anyhow, so thus no real alignment issues. Should only be an issue if they are reading the type as multiple u32/u64/u128s or a single u256
Hi @ogmini, @daddycocoaman, @JustArion , thanks for all the suggestions and tips! The PR was aimed at getting initial support for unsaved tabs into dissect. Then the John Hammond video was published, after which it seemed that more and more about this file format was being researched by multiple people. I had indeed not yet uncovered/reversed all of the fields and structures that were suggested above, which e.g. also takes into account the state of the application (closed/opened). Looks like I can include some more test files covering all different states, to make this plugin more complete. Obviously, I will also incorporate the suggestions that were made in this thread into my code.
Unfortunately I have not been able to work on this project for the past few weeks, but I'm planning to pick up the research again in the near future.
Fixed the remaining code comments as of now @Schamper @Horofic. However, will include more functionality later this week, as a lot more about this file format is now known.
Added some more test cases, including some with the application closed/opened and some with character deletions. Also simplified the code a bit: refactored some unnecessarily complex operations and reduced the number of structs.
The main goal for this plugin is to recover the text contents of the tab files and be able to parse and collect those during IR / Forensic investigations. So there are still some unk variables in the struct, of which I don't know all the details, but for this use-case that seems to be acceptable. Again, thanks all for the suggestions!
@daddycocoaman do you have a possible ETA when your research will be published?
Sorry for the delay. At some point during this, I think the tab state format added a few new fields and it broke my ImHex parser. For example, there is now one for spell check being enabled that comes right before the length of the content. I'm looking into the new fields in my spare time.
What version of Notepad is it reporting?
Had a few moments to quickly poke around. There definitely are some changes to the format that appear to line up with new options in what I call the options block. It appears that 1 byte was added, which probably lines up with the spellcheck/autocorrect option.
Top is the new state format with the extra byte. Bottom is the previous state format. Both are just an empty unsaved tab.
My test machine is running the stable build of Windows 11 23H2 OS Build - 22631.3527 and Windows Notepad version 11.2402.22.0. The format has changed but I don't have the options to enable/disable spellcheck or autocorrect within Windows Notepad. I'll spin up another test system later on a Release Preview build to see if I get the options.
https://blogs.windows.com/windows-insider/2024/03/21/spellcheck-in-notepad-begins-rolling-out-to-windows-insiders/
Edit:
Spun up a quick VM on the Beta Release channel. Windows 11 23H2 OS Build - 22635.3566 and Windows Notepad version 11.2402.22.0. Again, the format is different from the above. Looks like another byte was added. Screenshot is of an empty unsaved tab.
Looking at your screenshots @ogmini, the byte at offset 0A is different between the versions and may indicate some kind of optionsVersion number. Thank you both again for the updates, I'll edit my dissect parser accordingly as well!
That was one of my theories. But it doesn't make total sense. Why would you have the version so late in the stream? The files do store content length, so it knows how far to read the stream. Is it possible that the byte at offset 0A is letting us know how many more bytes to read? We'll probably need to wait for MS to make more changes in the future to really verify.
Consolidated screenshot with some coloring for easier reading.
I added the optionsVersion field in the latest commit. Although we can't be 100% sure that this is correct, we can always fill in these gaps later on. At least the dissect parser now seems to be able to parse the newer format as well. The parser is now also able to handle variable-length data buffers that follow a fixed-length data buffer.