Metadata-Anonymisation-Toolkit

Fork of: https://gitweb.torproject.org/user/jvoisin/mat.git

METADATA: Metadata consist of information that characterizes data. Metadata are used to provide documentation for data products. In essence, metadata answer who, what, when, where, why, and how about every facet of the data that are being documented.

METADATA AND PRIVACY: Metadata within a file can tell a lot about you. Cameras record data about when a picture was taken and what camera was used. Document formats such as PDF or Office files automatically embed author and company information in documents and spreadsheets. You may not want to disclose this information on the web.

WARNING: MAT only removes metadata from your files; it does not anonymise their content, nor can it handle watermarking, steganography, or overly custom metadata fields or systems.

If you really want to be anonymous, use a format that does not contain any
metadata, or better: use plain text.

DEPENDENCIES:
    python2.6 (at least)
    python-hachoir-core and python-hachoir-parser
    shred (should already be installed)

OPTIONAL DEPENDENCIES:
    python-poppler and python-cairo : for pdf support
    python-mutagen : for extensive audio format support
    exiftool : for extensive image format support

USAGE: mat-cli --help or mat-gui

SUPPORTED FORMATS:

Portable Network Graphics (.png)
    support : full
    metadata : textual metadata + date
    method : removal of harmful fields is done with hachoir

Jpeg (.jpeg, .jpg)
    support : full
    metadata : comment + exif/photoshop/adobe
    method : removal of harmful fields is done with hachoir


Open Document (.odt, .odx, .ods, ...)
    support : full
    metadata : a meta.xml file
    method : removal of the meta.xml file
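
A minimal sketch of the "removal of the meta.xml file" idea, using only the
standard zipfile module. This is not MAT's own code: the function name
strip_meta_xml and its parameters are made up for illustration, and the sketch
does not touch the manifest or the per-member timestamps.

```python
import io
import zipfile


def strip_meta_xml(src, dst, unwanted=("meta.xml",)):
    """Copy the zip container `src` to `dst`, skipping members in `unwanted`."""
    with zipfile.ZipFile(src, "r") as zin:
        with zipfile.ZipFile(dst, "w") as zout:
            for item in zin.infolist():
                if item.filename in unwanted:
                    continue  # drop the metadata member
                # Passing the ZipInfo keeps each member's compression type,
                # so a stored (uncompressed) 'mimetype' entry stays stored.
                zout.writestr(item, zin.read(item.filename))


# Usage with an in-memory stand-in for a real .odt file
src = io.BytesIO()
with zipfile.ZipFile(src, "w") as z:
    z.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    z.writestr("content.xml", "<office:document-content/>")
    z.writestr("meta.xml", "<office:document-meta/>")
src.seek(0)

dst = io.BytesIO()
strip_meta_xml(src, dst)
names = zipfile.ZipFile(dst).namelist()
print(names)
```

The same function could be pointed at other members by changing `unwanted`;
whether that is enough for a given format is something to check per format.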


Office Open XML (.docx, .pptx, .xlsx, ...)
    support : full
    metadata : a docProps folder containing XML metadata files
    method : removal of the docProps folder


Portable Document Format (.pdf)
    support : full
    metadata : a lot
    method : rendering of the pdf file on a cairo surface with the help of
            poppler in order to remove all the internal metadata,
            then removal of the remaining metadata fields of the pdf itself with
            pdfrw (the next version of python-cairo will support metadata,
            so we should get rid of pdfrw)


Tape ARchive (.tar, .tar.bz2, .tar.gz)
    support : full
    metadata : metadata from the archive file itself, metadata from the files
            contained in the archive, and metadata added by tar to each file at
            the creation of the archive
    method : extraction of each file, treatment of the file, addition of the
        treated file to a new archive; right before the addition, the metadata
        added by tar itself is removed. When the new archive is complete, all
        of its metadata is removed.
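
The "remove the metadata added by tar" step can be sketched with the standard
tarfile module. This is an illustration rather than MAT's implementation: the
helper names clean_tarinfo and rebuild_tar are hypothetical, the choice of
fields to reset (owner ids, owner names, modification time) is an assumption,
and the per-file cleaning step is only marked by a comment.

```python
import io
import tarfile


def clean_tarinfo(info):
    """Reset the owner and timestamp fields that tar records for a member."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0
    return info


def rebuild_tar(src, dst):
    """Copy regular files from `src` into a new archive with neutral headers."""
    with tarfile.open(fileobj=src, mode="r") as tin:
        with tarfile.open(fileobj=dst, mode="w") as tout:
            for info in tin.getmembers():
                if not info.isfile():
                    continue  # this sketch ignores directories and links
                data = tin.extractfile(info)
                # A real tool would clean the member's own content here.
                tout.addfile(clean_tarinfo(info), data)


# Usage: build a small archive in memory, then rebuild it
src = io.BytesIO()
with tarfile.open(fileobj=src, mode="w") as t:
    payload = b"hello"
    member = tarfile.TarInfo("note.txt")
    member.size = len(payload)
    member.uid, member.uname, member.mtime = 1000, "alice", 1234567890
    t.addfile(member, io.BytesIO(payload))
src.seek(0)

dst = io.BytesIO()
rebuild_tar(src, dst)
dst.seek(0)
with tarfile.open(fileobj=dst, mode="r") as t:
    cleaned = t.getmembers()[0]
    content = t.extractfile(cleaned).read()
print(cleaned.name, cleaned.uid, repr(cleaned.uname), cleaned.mtime)
```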


Zip (.zip)
    support : partial
    metadata : metadata from the archive file itself, metadata from the files
            contained in the archive, and metadata added by zip to each file
            when it is added to the archive
    method : extraction of each file, treatment of the file, addition of the
        treated file to a new archive. When the new archive is complete, all
        of its metadata is removed.
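
A rough sketch of rebuilding a zip archive with neutral per-member headers,
using the standard zipfile module. It is not MAT's code: rebuild_zip is a
hypothetical name, and the fixed 1980-01-01 timestamp is simply the earliest
date the zip format can store. Other header fields that zipfile writes, such
as create_system, are left untouched here, so the result should not be read
as a full cleanup.

```python
import io
import zipfile

EPOCH = (1980, 1, 1, 0, 0, 0)  # earliest timestamp a zip header can hold


def rebuild_zip(src, dst):
    """Copy members of `src` into `dst` with fresh, comment-free headers."""
    with zipfile.ZipFile(src, "r") as zin:
        with zipfile.ZipFile(dst, "w") as zout:
            for item in zin.infolist():
                # A new ZipInfo starts with no comment and no extra field.
                clean = zipfile.ZipInfo(item.filename, date_time=EPOCH)
                clean.compress_type = zipfile.ZIP_DEFLATED
                # A real tool would clean the member's own content here.
                zout.writestr(clean, zin.read(item.filename))
            # The archive comment of the new file is left at its empty default.


# Usage: an in-memory archive carrying a timestamp and two comments
src = io.BytesIO()
with zipfile.ZipFile(src, "w") as z:
    info = zipfile.ZipInfo("a.txt", date_time=(2011, 6, 1, 12, 0, 0))
    info.comment = b"written by alice"
    z.writestr(info, b"hello")
    z.comment = b"my archive"
src.seek(0)

dst = io.BytesIO()
rebuild_zip(src, dst)
out = zipfile.ZipFile(dst)
entry = out.infolist()[0]
print(entry.filename, entry.date_time, entry.comment, out.comment)
```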


MPEG Audio (.mp3, .mp2, .mp1)
    support : full
    metadata : id3
    method : removal of harmful fields is done with hachoir


Ogg Vorbis (.ogg)
    support : full
    metadata : Vorbis
    method : removal of harmful fields is done with mutagen


Free Lossless Audio Codec (.flac)
    support : full
    metadata : Flac, Vorbis
    method : removal of harmful fields is done with mutagen

HOW TO IMPLEMENT NEW FORMATS:
    1. add the format's mimetype to the STRIPPER list in mat.py
    2. inherit the GenericParser class (parser.py)
    3. read the parser.py module
    4. implement at least these three methods:
        - is_clean(self)
        - remove_all(self)
        - get_meta(self)
    5. don't forget to call the do_backup() method when necessary
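
As a rough illustration of the steps above, here is a self-contained sketch of
a parser for a toy text format. Everything in it apart from the method names
is_clean, remove_all, get_meta and do_backup is an assumption: the
GenericParser class below is a stand-in (the real one in parser.py may have a
different constructor and behaviour), the "#meta key=value" format is
invented, and registering the mimetype in STRIPPER is not shown because its
layout is not described here.

```python
import os
import shutil
import tempfile


class GenericParser(object):
    """Stand-in for the base class in parser.py; the real one may differ."""

    def __init__(self, filename, backup=True):
        self.filename = filename
        self.backup = backup

    def do_backup(self):
        # Hypothetical behaviour: keep a copy of the original next to it.
        if self.backup:
            shutil.copy2(self.filename, self.filename + ".bak")


class MetaLineStripper(GenericParser):
    """Toy format: lines starting with '#meta ' are treated as metadata."""

    PREFIX = "#meta "

    def _lines(self):
        with open(self.filename) as handle:
            return handle.readlines()

    def get_meta(self):
        """Return the metadata found in the file as a dict."""
        meta = {}
        for line in self._lines():
            if line.startswith(self.PREFIX):
                key, _, value = line[len(self.PREFIX):].partition("=")
                meta[key.strip()] = value.strip()
        return meta

    def is_clean(self):
        """True when no metadata line is left."""
        return not self.get_meta()

    def remove_all(self):
        """Back up the file, then rewrite it without the metadata lines."""
        self.do_backup()
        kept = [l for l in self._lines() if not l.startswith(self.PREFIX)]
        with open(self.filename, "w") as handle:
            handle.writelines(kept)
        return True


# Usage on a temporary file
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample.txt")
with open(path, "w") as handle:
    handle.write("#meta author=alice\nhello\n")

parser = MetaLineStripper(path)
before = parser.get_meta()
parser.remove_all()
after = parser.is_clean()
print(before, after)
```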

ALTERNATIVES AND COMPLEMENTS:
    for images:
        exiftool (perl) : metadata manipulation
        exiv2 (C++) : metadata manipulation
        graphicsmagick (a fork of imagemagick) : cli image manipulation

    for pdf:
        pdfminer (python) : pdf manipulation

    other tools:
        a hexadecimal editor

NOTES: Formats that are not in the test suite are not well tested; please don't trust MAT with them!

LICENSE: This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
MA 02110-1301, USA.

THANKS:
    MAT would not exist without:
        - the Google Summer of Code,
        - the Python language,
        - the amazing (and messy) hachoir library,
        - poppler and cairo's python bindings,
        - and the mutagen library.
    Many thanks to them!

KNOWN BUGS: Zipfiles are not totally cleaned, I know. I am working on a patch for zipfile.py