puremagic
2024-08-25: Lots of new formats
Updates
This update is primarily for new formats, and it's a bit of a mix this time around. Moving forward my update texts are changing as well: I'm using VSCode to create the blurb, aiming for a clearer layout 😎
Lots of new matches and some fixes for older entries. Some of the ASCII translations of the hex have been left out as they broke GitHub when I was pasting them in.
On the subject of matching, it's now becoming clearer that, as part of the v2 rebuild plan, the .zip, .xml and Microsoft Compound File formats all need some form of unpacking/decoding to allow for better matching and fewer alternative confidences. Some of these are now giving 20-30 matches.
Formats
Canon Camera RAW 2
Extensions: .cr2
Magic: Intel TIFF* then a second marker of 0x435202 / CR at byte 8
An update to Canon's original RAW format, this uses a beefed-up Intel* TIFF file. Like TIFF there is a lot of info we can extract if we wanted to in later v2.0 expansion ideas.
*There may possibly be Motorola encoded .cr2
files out there as well going by one source, but my 350D files are Intel flavoured so I've only added that for now.
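As a rough illustration of the two-stage match described above, here is a minimal sketch in plain Python. The function name and the synthetic sample are hypothetical and are not part of PureMagic; the only facts used are the little-endian TIFF header (`II*\x00`) and the 0x435202 marker at byte 8.

```python
TIFF_INTEL = b"II*\x00"   # little-endian ("Intel") TIFF header, 0x49492a00
CR2_MARKER = b"CR\x02"    # 0x435202, expected at byte 8


def looks_like_cr2(data: bytes) -> bool:
    """Return True when both the TIFF header and the CR2 marker are present."""
    return data[:4] == TIFF_INTEL and data[8:11] == CR2_MARKER


# Synthetic header for illustration only, not taken from a real camera file.
sample = TIFF_INTEL + b"\x10\x00\x00\x00" + CR2_MARKER + b"\x00"
print(looks_like_cr2(sample))
```

A Motorola flavoured file, if one turns up, would need a second header constant (`MM\x00*`) added to the check.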
Panasonic RAW and RAW2 and LEICA RAW
Extensions: .raw
.rw2
.rwl
Magic: 0x49495500
There are entries for these file extensions in the .json; however, I suspect they are either duff entries or they only match the files from which they were sourced. Files from my own Panasonic FZ1000 and the various test files on the links below all start with the magic above. I have not removed the existing entries as I may be wrong about them not being valid. The LEICA cameras are basically posh Panasonics and use the same file format with a different extension; all other details are the same.
If anyone comes across Panasonic RAWs that don't match, please leave a comment so we can take a look.
Comic Book Archives
Extensions: .cb7
.cba
.cbr
.cbt
.cbz
These are simply archives containing image files in numerical order; the extension gives away the parent format: 7-Zip, ACE, RAR, TAR or Zip respectively. Headers are identical to those of the parent formats.
PRC, Mobipocket and Amazon Kindle eBooks
Extensions: .prc
.mobi
.azw
.azw1
.azw3
.azw4
.tpz
.kfx
.kcr
This is a weird hodgepodge of formats. Starting with the original AportisDoc document .pdc, U.S. Robotics .prc and Mobipocket SA .mobi formats, which hail from the PalmPilot era, they eventually morphed into Amazon Kindle files with only the extensions to tell them apart (there are deeper changes but on the surface they are essentially the same). Annoyingly, a lot of eBooks have the .mobi or .azw extension when they should really have something else; this will affect FILE based scores, as those use the extension as part of the scoring.
Starting with the KF8 .azw3 format, files could be MOBI or dual format MOBI/EPUB but still have the same extension. KF10/KFX .azw4 / .kfx files are a completely new format. There are even more subformats/subversions than I have added, but I need to learn more about them or get samples, or PureMagic needs new features to dig deeper into the files.
PalmDoc
Extensions: .prc
Magic: 0x5445587452454164 / TEXtREAd at byte 60
Pretty much the granddaddy of them all, the PalmDOC eBook format from the PalmPilot series of handhelds. Technically they are a subformat of the Palm OS Database .pdb (see below), but the header is what classes them as a PRC eBook. They will conflict match with AportisDoc document .pdc as they are one and the same filetype; U.S. Robotics used the format as the basis for the Palm operating system. Just to be awkward, a .prc may be a .pdb and vice-versa.
I've also added .prc as an extension-only entry, as it is used to store all manner of data on Palm Pilots.
MOBI and early Kindle eBooks
Extensions: .mobi, .azw, .azw3 (MOBI)
Magic: 0x424f4f4b4d4f4249 / BOOKMOBI at byte 60 and a footer of 0xe98e0d0a at -4
The most common of the formats in this batch: most eBooks you come across are MOBI (aka Mobi6 format) regardless of extension. Some old MOBI files may have the extension .prc or .pdb from their PalmPilot roots.
Topaz DRM eBooks
Extensions: .azw1 and .tpz
Magic: 0x54505a
/ TPZ
at different offsets per file
These are DRM encrypted files delivered via Whispernet or downloaded to your PC. I have a single .azw file in my Kindle library which is DRM'd (the others are newer .azw4), so I need more samples, but based on DeDRM we should be looking for the TPZ magic. It's not at a fixed position, so I'm adding these as extension only for now; v2 upgrades should let us test for this.
Kindle KF8 eBooks
Extensions: .azw3
Magic: 0x424f4f4b4d4f4249 / BOOKMOBI at byte 60 and a footer of 0x434f4e54424f554e44415259e98e0d0a / CONTBOUNDARYé at -16
These are dual format MOBI/ePub eBooks that have the tag BOUNDARY at the end of the MOBI data. However, this is not at a fixed position, so it would require a v2 upgrade to search for it. Handily, they also have a longer footer than regular MOBI files, so we'll use that instead. 😊
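Since the MOBI and KF8 entries share the BOOKMOBI header and differ only in the footer, the order of the footer checks matters. The sketch below is a hypothetical helper, not PureMagic code: it tests the longer KF8 footer first, because that footer itself ends with the 4-byte MOBI footer. The sample buffers are synthetic.

```python
from typing import Optional

MOBI_HEADER = bytes.fromhex("424f4f4b4d4f4249")  # BOOKMOBI at byte 60
MOBI_FOOTER = bytes.fromhex("e98e0d0a")          # last 4 bytes
KF8_FOOTER = bytes.fromhex("434f4e54424f554e44415259e98e0d0a")  # last 16 bytes


def classify_mobi(data: bytes) -> Optional[str]:
    """Return "KF8", "MOBI" or None using the header and footer magics above."""
    if data[60:68] != MOBI_HEADER:
        return None
    # Check the longer footer first: it also ends with the plain MOBI footer.
    if data.endswith(KF8_FOOTER):
        return "KF8"
    if data.endswith(MOBI_FOOTER):
        return "MOBI"
    return None


# Synthetic buffers for illustration only.
prefix = b"\x00" * 60 + MOBI_HEADER + b"\x00" * 16
mobi_sample = prefix + MOBI_FOOTER
kf8_sample = prefix + KF8_FOOTER
print(classify_mobi(mobi_sample), classify_mobi(kf8_sample))
```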
Amazon Print Replica eBook (aka Kindle Format 10/KF10/KFX)
Extensions: .azw4
, .kfx
Magic: 0xea44524d494f4eeee00100eaee9e8183de9a86be97de95848d50726f74656374656444617461
This is the current Kindle format. All my files downloaded through Kindle for Windows still use .azw for the extension, so again FILE based scores will be affected. However, with a ridiculously long match you'll be more than certain it's this format. There is a version number, but we'd need a regex to ensure correct reporting of just the digits, as they seem to follow the pattern v1.1blahblahblah or v1.85blahblahblah. Given how many versions there could be, that would mean a lot of extra data in the database if we went with fixed strings.
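To show what the regex idea might look like, here is a minimal sketch. It assumes the version string has already been pulled out of the file as text and that it follows the `v<major>.<minor>` pattern quoted above; the input strings are the placeholder examples from this write-up, not real file content, and the helper name is made up.

```python
import re
from typing import Optional

# "v", then digits, a dot, and more digits; anything after that is ignored.
VERSION_RE = re.compile(r"v(\d+\.\d+)")


def extract_version(text: str) -> Optional[str]:
    """Return just the numeric part of a leading version tag, or None."""
    match = VERSION_RE.match(text)
    return match.group(1) if match else None


print(extract_version("v1.1blahblahblah"))
print(extract_version("v1.85blahblahblah"))
```

If the text following the version can itself start with a digit, this pattern would swallow it, so a real implementation would need a stricter rule once more samples are available.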
Kindle Cloud Reader and Kindle for Mac
Extensions: .kcr
As the label suggests these are another wrapper for Kindle files. From limited info they are an .azk
wrapped in DRM. I have no samples for these, so adding the extension only for now.
Kindle Preview file
Extensions: .azk
A PK zip based file format used by Kindle Previewer and older iOS Kindle apps. Again, no samples available so extension only for now.
Sundry files
These are files that you'll find with some eBooks, none are eBooks themselves but provide functionality to them.
- .voucher appears to be the DRM key for KFX eBooks, all start 0xe00100eaee9e8183de9a86be97de95848d50726f74656374656444617461
- .mbpV2 is a metadata file that stores the last position and annotations. It's a basic JSON data file starting 0x7b226d6435223a22 / {"md5":"
- .mbp is the original MOBI metadata file; like its newer brother above, it does the same job. I have no sample files so adding as extension only for now.
- .azw.res are Resource Containers that hold external data files such as high-res images, part of the AZW6 specification originally aimed at Japanese manga and graphic novels; western comics adopted the same format to offer higher quality images. Header of 0x434f4e540200
- .azw.md are Metadata Containers, they use the same header as .azw.res.
- .phl are Amazon Kindle Popular Highlights files, XML files that show how many people highlighted certain passages etc. All start with 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d3822207374616e64616c6f6e653d22796573223f3e / <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
- .azw9.res are Resource Containers for Kindle on the Mac, no samples so adding as an extension for now.
- .azw9.md are Metadata Containers for Kindle on the Mac, no samples so adding as an extension for now.
This is a proper Rabbit Hole job, it took way longer than I thought it would and there is still more to uncover. This pile of links covers most of what I dug out. As samples become available and new features are added to PureMagic we can do more with this bunch of formats. Some information is contradictory so I expect there will be tweaks to this lot over time.
- Wikipedia: Mobipocket
- FileFormat: AZW
- FileFormat: AZW1
- FileFormat: AZW3
- FileFormat: AZW4
- FileFormat: PRC
- FileFormat: PHL
- Kindle Unpack
- DeDRM
- File.org: MBP File
- MBP Reader* *Make sure you have an ad-blocker running as it's an old site with many popups
- Justsolve: MOBI
- Justsolve: PalmDOC
- kiehinen
- Rough Script to dump High Res images from AZW6
- Difference Between Kindle KFX, KCR, AZW, AZW3, PRC, Mobi, Topaz, AZW6
- Forum post regarding KCR files
- MobileRead Wiki: AZK
- Forum post about differences in MOBI formats
Palm OS Database
Extensions: .pdb
The primary data format for the PalmPilot (also Handspring Visor and Sony CLIÉ) series of handheld devices. A bit like RIFF and IFF, it's a container format that wraps around many types of data. .prc, .mobi and AportisDoc document .pdc files are a form of PDB, but as they get lumped in with all the other eBook formats I left them above. All files share the same extension, with just the byte 60 header changing. There are later PDB files that use zTXT (such as Weasel), but that is another kettle of fish entirely. All PDB files use the same mimetype, with PalmOS deciding what to do once it looks at the subformat tag.
Much like the eBooks above, the extension does not mean a lot: a Palm file could easily be an application and still have a .prc extension, for example.
Subformats
All these start at byte 60
- Palm Pilot Applications: 0x6170706c / appl
- Palm Pilot zTXT Compressed file: 0x7a545854 / zTXT
- GrayPaint: 0x444154414772503f / DATAGrP?
- Adobe Reader: 0x2e70646641444245 / .pdfADBE
- BDicty (Dictionary Reader): 0x42566f6b42444943 / BVokBDIC
- DB (Database program): 0x4442393944424f53 / DB99DBOS
- eReader (aka Palm Reader): 0x504e526450507273 / PNRdPPrs
- eReader: 0x4461746150507273 / DataPPrs
- FireViewer (ImageViewer): 0x76494d4756696577 / vIMGView
- HanDBase: 0x506d4442506d4442 / PmDBPmDB
- InfoView: 0x496e666f494e4442 / InfoINDB
- iSilo: 0x546f476f546f476f / ToGoToGo
- iSilo 3: 0x53446f6353696c58 / SDocSilX
- JFile: 0x4a6244624a426173 / JbDbJBas
- JFile Pro: 0x4a6644624a46696c / JfDbJFil
- LIST: 0x444154414c536462 / DATALSdb
- MobileDB: 0x4d6462314d646231 / Mdb1Mdb1
- Plucker: 0x44617461506c6b72 / DataPlkr
- PQA: 0x70716120636c7072 / pqa clpr
- QuickSheet: 0x4461746153707264 / DataSprd
- SuperMemo: 0x534d3031534d656d / SM01SMem
- TealDoc: 0x54455874546c4463 / TEXtTlDc
- TealInfo: 0x496e666f546c4966 / InfoTlIf
- TealMeal: 0x44617461546c4d6c / DataTlMl
- TealPaint: 0x44617461546c5074 / DataTlPt
- ThinkDB: 0x6461746154444250 / dataTDBP
- Tides: 0x5464617454696465 / TdatTide
- TomeRaider: 0x546f526154525057 / ToRaTRPW, these may also have a .tr extension
- Weasel: 0x7a54585447506c6d / zTXTGPlm
- WordSmith: 0x42444f4357726453 / BDOCWrdS
Not an exhaustive list but like RIFF and IFF there are going to always be more.
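The byte 60 lookup described in this section can be sketched as a simple table lookup. This is an illustrative helper only, not how PureMagic stores or matches its entries, and it carries just a handful of the tags listed above. Eight-byte tags are tried before four-byte ones so that Weasel (zTXTGPlm) is not reported as a generic zTXT file.

```python
from typing import Optional

PDB_TAG_OFFSET = 60  # all subformat tags start at byte 60

# A small subset of the tags listed above, for illustration.
EIGHT_BYTE_TAGS = {
    b"TEXtREAd": "PalmDoc",
    b"BOOKMOBI": "MOBI",
    b"DataPlkr": "Plucker",
    b"ToRaTRPW": "TomeRaider",
    b"zTXTGPlm": "Weasel",
}
FOUR_BYTE_TAGS = {
    b"appl": "Palm Pilot Application",
    b"zTXT": "Palm Pilot zTXT Compressed file",
}


def pdb_subformat(data: bytes) -> Optional[str]:
    """Return a subformat name for a PDB buffer, preferring the longer tag."""
    tag = data[PDB_TAG_OFFSET:PDB_TAG_OFFSET + 8]
    if tag in EIGHT_BYTE_TAGS:
        return EIGHT_BYTE_TAGS[tag]
    return FOUR_BYTE_TAGS.get(tag[:4])


# Synthetic buffers: 60 filler bytes followed by a tag.
print(pdb_subformat(b"\x00" * 60 + b"zTXTGPlm"))
print(pdb_subformat(b"\x00" * 60 + b"zTXT\x00\x00\x00\x00"))
```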
TomeRaider eBooks
Extensions: .tr
.tr2
.tr3
Magic:
.pdb have 0x546f526154525057 / ToRaTRPW at byte 60 (as above)
.tr and .tr2 have 0x370000106d000010d2160010dcf4ddfcd1 at byte 0
.tr3 have 0x5452334454523343 / TR3DTR3C at byte 60
This came up while doing the Palm Doc entries. TomeRaider is another eBook format that started life on the PalmPilot series of devices. There are three formats, the .pdb
version, then later on TR2 and TR3. TR2's and the old PDB version may both use .tr
when not on a Palm device. Calibre cannot read any of these files (not that I can find a TR2 sample but I imagine it also does not work) which is a shame, maybe a new project for me to look into...
- TomeRaider via Webarchive Don't go beyond 2011, it becomes a blog, then a spammy link site.
- FileFormat: TR
- FileFormat: TR3
- Justsolve: TomeRaider
FictionBook 2 and FictionBook 3
Extensions: .fb2
.fb2.zip
.fbz
Magic:
.fb2 has 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e0a3c46696374696f6e426f6f6b / <?xml version="1.0" encoding="UTF-8"?> <FictionBook
.fbz and .fb2.zip are just normal PK Zip files with an FB2 inside
.fb3 are also just normal PK Zip files with a similar structure to an ePub
Another eBook format that is popular in Russia but nearly unused anywhere else. FictionBook 2's are an XML file with everything stored within it as a monolithic block, the compressed variants are simply a zip file with a single FB2 inside. FictionBook 3's are a similar idea to ePub in that they are just a zip file with a structured layout. Yay! more PK Zip matches....
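As a sketch of the kind of unpacking that could tell a compressed FictionBook apart from any other PK Zip, the snippet below uses Python's standard `zipfile` module to check for an archive holding a single .fb2 member. This is not an existing PureMagic feature; the helper name and the in-memory sample are made up for illustration.

```python
import io
import zipfile


def zip_holds_single_fb2(data: bytes) -> bool:
    """Return True if the buffer is a zip whose only file member is an .fb2."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Ignore directory entries, which end with a slash.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return len(names) == 1 and names[0].lower().endswith(".fb2")


# Build a tiny archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("book.fb2", '<?xml version="1.0" encoding="UTF-8"?>\n<FictionBook/>')
fbz_sample = buffer.getvalue()
print(zip_holds_single_fb2(fbz_sample))
```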
Windows Help Files
Extensions: .hlp
.gid
.cnt
Magic:
.hlp and .gid both have 0x3f5f0300 at byte 0, then 0x0000ffffffff at byte 6
.cnt has 0x3a42617365 / :Base at byte 0
This was already in the database, but .hlp was split over two entries; I've condensed them into one superior match. .gid files are metadata files that store the last window position and size (but not read position). There is not much info, but looking at the samples created when I open .hlp files, they all have the same starting layout. There are some other tags we can look for later to enhance confidence between both files.
.cnt files are plain text files containing the chapters for a Help file; they add a graphical Table of Contents (TOC) tab to the Search/Find tabs under Win95. Assuming there are no blank lines or lines without a colon (:) before it, the first line should always match the magic.
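The two-part .hlp/.gid match and the .cnt match can be sketched as below. The helper and the synthetic buffers are hypothetical and only mirror the offsets given in the Magic block; the two filler bytes at offsets 4 and 5 are arbitrary.

```python
from typing import Optional

HLP_MAGIC = bytes.fromhex("3f5f0300")          # at byte 0
HLP_SECONDARY = bytes.fromhex("0000ffffffff")  # at byte 6
CNT_MAGIC = b":Base"                           # 0x3a42617365 at byte 0


def classify_winhelp(data: bytes) -> Optional[str]:
    """Return "hlp/gid", "cnt" or None based on the magics described above."""
    if data[:4] == HLP_MAGIC and data[6:12] == HLP_SECONDARY:
        return "hlp/gid"
    if data[:5] == CNT_MAGIC:
        return "cnt"
    return None


# Synthetic buffers for illustration only.
hlp_sample = HLP_MAGIC + b"\x00\x00" + HLP_SECONDARY
cnt_sample = b":Base example.hlp\r\n:Title Example\r\n"
print(classify_winhelp(hlp_sample), classify_winhelp(cnt_sample))
```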
- Justsolve: HLP
- Wikipedia: WinHelp
- Solving the "Cannot display this help file" error. Handy if you get: Cannot display this help file. Try opening the help file again, and if you still get this message, copy the help file to a different drive, and try again
- FileEXT: GID
- Windows Help File Format
- Adding a CNT to a WinHelp3 file
- List of programs that can open .cnt files
MS Reader eBook
Extensions: .lit
Already in the .json, just added the mimetype application/x-ms-reader
Sony Broad Band eBook (aka BBeB)
Extensions: .lrf
.lrs
.lrx
Magic:
.lrf
has 0x4c00520046000000
at byte 0
.lrx and .lrs have no samples, extension only for now
A proprietary eBook format from Sony and Canon, mainly aimed at the Sony Librié. .lrs are XML files that can be read as an eBook, but are aimed at being the source files for the other two extensions. .lrf and .lrx are the compiled versions, without and with DRM respectively.
- Wikipedia: BBeB
- FileSamples: LRF
- LrfFormat via WayBack
- KeyView Viewing SDK Programming Guide: Supported Formats
Rocket eBook
Extensions: .rb
Magic:
eBooks have 0xb00cb00c
/ °°
(bookbook)
System files use 0xb00cc0de
/ °ÀÞ
(bookcode) or 0xb00cf00d
/ °ð
(bookfood)
Another proprietary format, this is for the NuvoMedia Rocket eBook reading device, reportedly the first dedicated eBook reader released in 1997. There are possibly DRM versions of the file that may differ from these entries.
- GITHUB: Calibre/rb.txt
- Justsolve: Rocket eBook
- MobileRead Wiki: RB
- KeyView Viewing SDK Programming Guide: Supported Formats
Text Compression for Reader eBook (aka Psion Series 3 eBook)
Extensions: .tcr
Magic: 0x2121382d4269742121
/ !!8-Bit!!
at byte 0
A text compression format I stumbled across while looking into Rocket eBooks. Quite possibly the oldest format in this PR, it harks back to the days of the Psion Series 3 and 5.
Shanda Bambook eBook (aka SuperNote Book)
Extensions: .snb
Magic: 0x534e425030303042
/ SNBP000B
at byte 0
This is an eBook format for the amazingly named Shanda Bambook, a Chinese eBook reader. All info and test files I've got fail to work in Calibre, would be nice to have a working sample file.
Cheat Engine Trainer Data
Extensions: .CETRAINER
Magic: 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d227574662d38223f3e0d0a3c43686561745461626c65
/ <?xml version="1.0" encoding="utf-8"?> <CheatTable
at byte 0
These are another XML format document used by CheatEngine for storing a trainer before being compiled into an executable. Longer match to help prevent false positives against other XML based files.
Quake PAK files
Extensions: .pak
.bsp
.mdl
.lmp
.dem
.map
.rc
.spr
Magic:
.pak
has 0x5041434b
/ PACK
at byte 0, then may have 0x4944504f
/ IDPO
, 0x52494646
/ RIFF
or 0x49425350
/ IBSP
at byte 12
.bsp
has 0x1d000000
or 0x1c000000
typically at byte 0, other versions may exist
.mdl
has 0x4944504f
/ IDPO
at byte 0
.map
has 0x7b0a22
/ { "
at byte 0 assuming no comment lines before
.spr
has 0x49445350
/ IDSP
at byte 0
.lmp
, .dem
.rc
have no fixed headers, extensions only
The Quake PACK format shares a header with many other file types, we have entries in the JSON already but I've added these extra markers to help boost confidences. Not all files have them but it helps those that do.
Other Quake files
- .bsp are compiled level files
- .mdl are 3D models used for characters, monsters, weapons etc.
- .lmp are various image related files
- .dem are recorded demos (or movies) of levels
- .map are un-compiled map files that are used to make .bsp, they look a little like JSON at a glance
- .rc are Resource files, basically a scripting language
- .spr are sprite files
Python Pickle
Extensions: .pickle
Magic:
Protocol 0 has 0x28 / ( at byte 0
Protocol 1 has 0x7d71 / }q at byte 0
Protocol 2 has 0x8002 at byte 0
Protocol 3 has 0x8003 at byte 0
Protocol 4 has 0x8004 at byte 0
Protocol 5 has 0x8005 at byte 0
All end 0x2e / . at -1
Pickle is a data dump format for Python. There is an existing extension-only entry, but we can remove that now. The headers are small, but thanks to the footer always being a .
there should be no issues. Justsolve's protocol 1 file seems to not match my generated files when using protocol=1
(looks like it's a 0), going with my files for the magic. I've left the extension in for now to allow for fringe cases or later
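The magics above can be reproduced with the standard `pickle` module. The sketch below pickles a small dict with each protocol and checks the header and footer. Note the assumption: the protocol 0 and 1 headers depend on the type of the top-level object, and a dict is used here because that is what produces `(` and `}q`. Protocol 5 needs Python 3.8 or later.

```python
import pickle

# Expected leading bytes per protocol, taken from the Magic list above.
HEADERS = {
    0: b"(",
    1: b"}q",
    2: b"\x80\x02",
    3: b"\x80\x03",
    4: b"\x80\x04",
    5: b"\x80\x05",
}
FOOTER = b"."  # the STOP opcode, always the final byte


def matches_pickle_magic(data: bytes, protocol: int) -> bool:
    """Return True if the buffer has the expected header and footer."""
    return data.startswith(HEADERS[protocol]) and data.endswith(FOOTER)


payload = {"answer": 42}  # top-level dict, an assumption for protocols 0 and 1
results = {
    protocol: matches_pickle_magic(pickle.dumps(payload, protocol=protocol), protocol)
    for protocol in range(6)
}
print(results)
```

Pickling a list or a string at protocol 0 or 1 gives a different first byte, which may be why some third-party sample files do not line up with these headers.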
Smacker video
Extensions: .smk
Magic: Either 0x534d4b32
/ SMK2
or 0x534d4b34
/ SMK4
at byte 0
A popular video file format from the mid 90's; loads of early CD games used it due to its decent compression and, for the time, fairly decent quality. There are two versions, not sure what the later one added.
Bink video
Extensions: .bik
.bk2
.bik2
Magic: 0x42494b
/ BIK
at byte 0
Another popular video file format from early to mid CD era games, this replaced Smacker. There seems to be some confusion over the number of FourCCs this format has: BINK
BIKb
BIKf
BIKg
BIKh
BIKi
BIKd
are all considered valid. The samples I found all used BIKi
but for now I have gone with just BIK
and the extension .bik
until more samples appear, this covers most potential files out there.
AmigaGuide
Extensions: .guide
Magic: 0x40646174616261736520616d69676167756964652e6775696465
/ @database amigaguide.guide
at byte 0
The AmigaGuide document was made for creating navigable help files, they work much like Windows Help files.
CRI Movie 2
Extensions: .usm
Magic: 0x43524944
/ CRID
at byte 0
Another proprietary video format used in various games, especially those coming from Japanese studios. It's an annoying format, as on later Windows versions the audio no longer plays back due to the weird semi off-standard codecs they used.
Adobe flash video file
Extensions: .flv
Magic: 0x464c5601
/ FLV
at byte 0, then 04
, 01
or 05
for audio, video or both at byte 4
This is a tidy-up and improvement of existing entries. There were two .flv entries, but one lacked the last byte pair so I've removed that from the JSON. Added little secondary matches for extra confidence boosts.
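The secondary match on the byte after the FLV header can be sketched like this. The helper is hypothetical and only decodes the three values listed in the Magic block; any other value is reported as unknown rather than guessed at.

```python
from typing import Optional

FLV_MAGIC = b"FLV\x01"  # 0x464c5601 at byte 0
FLV_FLAGS = {
    0x04: "audio",
    0x01: "video",
    0x05: "audio and video",
}


def describe_flv(data: bytes) -> Optional[str]:
    """Return the stream description from byte 4, or None if not an FLV."""
    if len(data) < 5 or data[:4] != FLV_MAGIC:
        return None
    return FLV_FLAGS.get(data[4], "unknown")


# Synthetic headers for illustration only.
print(describe_flv(b"FLV\x01\x05"))
print(describe_flv(b"FLV\x01\x04"))
```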
Microsoft Works files
Extensions: .wdb
.wks
.xlr
.wps
Magic:
Early .wdb
versions have 0x20540200000005540200
at byte 0
Later .wdb
.wps
and .xlr
versions have 0xd0cf11e0a1b11ae1
/ ÐÏࡱá
at byte 0
Early .wps
have 0x01fe
/ þ
at byte 0
Microsoft Works was a cut-down budget office suite that offered everything you needed in one package. It saved documents in a semi-proprietary format that MS did support in Office but later dropped. .wdb were the Works equivalent to Access databases, .wks/.xlr were spreadsheets, and .wps was a word processor document.
There are some differing versions: early formats were just for Works, while later ones were still Works specific but used the Microsoft Compound File format, so identifying them may be trickier as we need to decode the CLSID identifier from the file. In fact that format is the basis for many other formats, much like RIFF or IFF, so expect match conflicts. Definitely a candidate for v2 identification upgrades.
- Justsolve: Microsoft Works Database
- Justsolve: Microsoft Compound File
- Justsolve: Microsoft Works Word Processor
- Justsolve: Microsoft Works Spreadsheet
JPEG XR, Windows Media Photo and Microsoft HD Photo File Format
Extensions: .jxr
.wdp
.hdp
Magic:
All files should have 0x4949bc01
/ II¼
at byte 0
.jxr
also has 0x574d50484f544f00
/ WMPHOTO
at byte 90
.hdp
I cannot find any samples, extension only for now.
Another member of the JPEG family, derived from the Windows Media Photo and Microsoft HD Photo formats: it's part MS, part JPEG, part butchered TIFF. The format is a mess.
- Justsolve: JPEG XR
- LOC: JPEG XR Image Encoding
- LOC: JPEG XR File Format (JXR)
- LOC: HD Photo, Version 1.0 (Windows Media Photo)
- Wikipedia: JPEG XR
- JPEG XR
JPEG-LS
Extensions: .jls
Magic: 0xffd8fff7
/ ÿØÿ÷
at byte 0
Another JPEG format that is also not quite a format: it's a subset of regular JPEG and also has roots in HP's own lossless codec (which apparently is in one of the old Mars rovers). JustSolve's magic suggestions match the output from the HP reference encoder linked there, and from the CharLS WebAssembly demo linked below. XnView would not view them despite claiming support, but the online demo could successfully read images converted with the HP encoder. I've gone with the longer magic based on the test files; this should allow it to win confidence over regular .jpg
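To illustrate why the longer magic helps, here is a toy ranking that simply prefers the longest matching header. This is an assumption made for the sake of the example and is not PureMagic's actual confidence calculation. The only facts used are the 3-byte JPEG header 0xffd8ff and the 4-byte JPEG-LS header above.

```python
from typing import Optional

# (magic, label) pairs; the JPEG-LS magic extends the regular JPEG one.
MAGICS = [
    (bytes.fromhex("ffd8ff"), "JPEG"),
    (bytes.fromhex("ffd8fff7"), "JPEG-LS"),
]


def best_match(data: bytes) -> Optional[str]:
    """Return the label whose magic is the longest prefix of the buffer."""
    hits = [(len(magic), label) for magic, label in MAGICS if data.startswith(magic)]
    return max(hits)[1] if hits else None


print(best_match(bytes.fromhex("ffd8fff7") + b"\x00"))  # JPEG-LS style header
print(best_match(bytes.fromhex("ffd8ffe0") + b"\x00"))  # regular JPEG header
```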
Fixes
There are also some small changes to various entries, fixing spelling errors, unifying names or adding mimetypes