Add feature that removes XMP metadata from pdf(s)

Open ThiloteE opened this issue 4 years ago • 9 comments

Problem: My goal was to write XMP metadata to a PDF, but the PDF already had metadata attached (because I had previously written another entry to it, and that entry had wrong data). Now the file contains both the data I do NOT want and the data I DO want.

Describe the solution you'd like: Add a feature that removes all XMP metadata from one or more PDFs.

ThiloteE avatar Nov 26 '21 15:11 ThiloteE

When I write

@Test{test,
  author = {test},
  date   = {2021},
  file   = {:Thompson (2020-03-30) middle-class-remorse-re-embracing-liberal-democracy-in-the-philippines-and-thailand.pdf:PDF},
  title  = {test},
}

to the PDF file, some leftover XMP metadata remains, as can be seen here:

image

The current JabRef implementation is fine if one wants to keep this old metadata, but if one wants to remove it, that is currently not possible via JabRef.

ThiloteE avatar Nov 27 '21 10:11 ThiloteE

Related fruitful discussion at https://gist.github.com/hubgit/6078384, but I am not sure to what extent tools like these can be integrated into JabRef.

ThiloteE avatar Nov 27 '21 11:11 ThiloteE

Current "feature" of JabRef: Remove specified fields. Maybe this helps in your case? For example, by listing "language", "number", ... in the fields to clear? (Feature highly requested by @adaerr a few years ago.)

image

koppor avatar Dec 06 '21 20:12 koppor

Yes! Indeed, Koppor, this is a step in the right direction. Thanks! I did some tests:

All of these conditions need to be fulfilled for a metadata field to be deleted:

  1. The option "Do not write the following fields to XMP Metadata" needs to be ticked.
  2. The field that is supposed to be deleted needs to be in the list Koppor posted in https://github.com/JabRef/jabref/issues/8277#issuecomment-987172127 .
  3. The field that is supposed to be deleted needs to be present in the library data of the entry that is written to the linked file.

Example:

@Test{test,
  author   = {test2},
  date     = {2021},
  file     = {:test/test2 (2021).pdf:PDF},
  language = {korean},
  number   = {2},
  title    = {test},
}

If I put "Number" into the list while "Do not write the following fields to XMP Metadata" is ticked, JabRef deletes the existing number metadata instead of writing the number 2.

If I do the same, but remove the number field from the entry (number = {2},), it will NOT delete the metadata.
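To make the three conditions easier to reason about, here is a minimal Python sketch that models the behaviour observed in the tests above. It is not JabRef code; the function name `xmp_field_is_cleared` and its parameters are made up for illustration, and the case-insensitive comparison of field names is an assumption.

```python
def xmp_field_is_cleared(field, filter_enabled, filter_list, entry_fields):
    """Model of the observed behaviour (not JabRef's actual code).

    The old XMP value of `field` is removed only if
      1. the "Do not write the following fields" option is ticked,
      2. the field is in the filter list, and
      3. the field is present in the entry being written.
    """
    name = field.lower()
    in_list = name in {f.lower() for f in filter_list}      # condition 2
    in_entry = name in {f.lower() for f in entry_fields}    # condition 3
    return filter_enabled and in_list and in_entry


entry = {"author": "test2", "date": "2021", "language": "korean",
         "number": "2", "title": "test"}

# "Number" is in the list and the entry has a number field: metadata is deleted.
print(xmp_field_is_cleared("number", True, ["Number"], entry))           # True

# Same setup, but the entry no longer has a number field: nothing is deleted.
without_number = {k: v for k, v in entry.items() if k != "number"}
print(xmp_field_is_cleared("number", True, ["Number"], without_number))  # False
```

Dropping condition 3 in this model would amount to returning `filter_enabled and in_list`.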

So we have found out how to delete metadata with JabRef, and the way it currently works allows very fine-grained usage. This is good. It is not perfect, but it is good. A possible low-hanging fruit for improvement would be to ease the workflow by cutting down on the conditions that need to be fulfilled to delete something; condition 3 in particular seems tedious.

ThiloteE avatar Dec 07 '21 03:12 ThiloteE

The next question I asked myself: How would it be possible to delete the metadata not only for a single entry, but for ALL PDF files linked to the entries in my library?

My aim is to replace all the 'bad' and 'false' metadata currently attached to my PDFs with the (maybe not perfect, but at least more correct) metadata I got from importing via DOI and from manual corrections.

Prototype (untested) workaround

  1. Create an entry with ALL fields available.
  2. Link ALL PDFs that I want to remove metadata from to this one entry, probably via the main file directory.

Problems with workaround:

  1. Tedious (files would need to be moved, fields need to be entered in the list, and fields need to be entered in the entry).
  2. I am not yet sure what kind of search expression would find ALL files within that directory.
  3. Knowledge about this method is needed. New users may not be aware.
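Regarding problem 2: this does not answer which search expression JabRef itself would need, but outside of JabRef, collecting every PDF below a directory is straightforward. A minimal Python sketch, where `find_pdfs` is a hypothetical helper name:

```python
from pathlib import Path


def find_pdfs(directory):
    """Return all PDF files below `directory` (subfolders included), sorted by path.

    The suffix check is case-insensitive, so "paper.PDF" is found as well.
    """
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # Example: list every PDF below the current working directory.
    for pdf in find_pdfs("."):
        print(pdf)
```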

Edit:

Proper solutions would be:

Specify in the preferences which metadata should be deleted. Give option to delete all the metadata JabRef is able to write. (e.g. at least all bibtex fields). Then:

  • A) Entry-based solution(s):

    • user has PDFs attached/linked to entries. Linkage DOES matter.
    • user selects entries.
    • user presses button SOMEWHERE to delete specified metadata.
      • the button could be in the right click menu OR
      • here, beneath the F6 option (image): add a button "Delete Metadata from PDFs"

    Strengths of this solution:
      • Workflow is very fast.
      • Single entries/PDFs are easily changed.
      • PDFs do not have to be in the same folder.

and / or

  • B) Folder-based solution(s):

    • Have PDFs in a folder. They MAY or MAY NOT be linked to an entry. Linkage does NOT matter.
    • Have user point to this folder within JabRef (e.g. via "main file directory")
      • the main file directory can be changed in the preferences at options > preferences > linked files
    • Add a button somewhere: "Delete Metadata from PDFs within the Main File Directory"
    • Add a button somewhere: "Delete Metadata from PDFs within folders ..."
      • opens a dialogue that allows to specify the directory

    Strength of this solution: PDFs that are not yet linked to entries are included with this approach.

How to do this, I don't know.

I favour an entry based solution. The current tedious method as explained in https://github.com/JabRef/jabref/issues/8277#issuecomment-987540344 is also an entry based solution. If push comes to shove, Exiftool and other tools exist that are folder based, so I think it is alright if JabRef goes the entry based direction.
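For the entry-based direction, the first step is to get from the selected entries to their linked PDFs. Below is a rough Python sketch of that step, not JabRef code. It assumes the `file` field has the shape `description:link:type` (as in the examples in this thread, e.g. `:test/test2 (2021).pdf:PDF`), that several links are separated by `;`, and that a backslash escapes the next character. The function name `linked_paths` is hypothetical.

```python
def linked_paths(file_field):
    """Extract the link (path) part of every item in a BibTeX `file` field.

    Assumed format per item: description:link:type, items separated by ';',
    backslash escapes the following character.
    """
    links, parts, current = [], [], []
    escaped = False

    def finish_item():
        # Close the current part and, if the item has a non-empty link, keep it.
        parts.append("".join(current))
        current.clear()
        if len(parts) >= 2 and parts[1]:
            links.append(parts[1])
        parts.clear()

    for ch in file_field:
        if escaped:
            current.append(ch)      # take the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == ":":
            parts.append("".join(current))
            current.clear()
        elif ch == ";":
            finish_item()
        else:
            current.append(ch)
    finish_item()
    return links


print(linked_paths(":test/test2 (2021).pdf:PDF"))
# ['test/test2 (2021).pdf']
```

The returned paths are relative to whatever file directory is configured, so they would still have to be resolved before a PDF can be opened.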

ThiloteE avatar Dec 07 '21 03:12 ThiloteE

Hi, I am new to open source and would like to contribute to the project. Is it okay for me to try working on this issue?

Hey, I edited my last comment!

Yes you may :) Thanks for your interest!

Check out https://github.com/JabRef/jabref/blob/main/CONTRIBUTING.md for a start. Also, https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace is a good start. Feel free to ask if you have any questions here on GitHub or also at gitter.

Try to open a (draft) pull request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.

ThiloteE avatar Apr 06 '22 10:04 ThiloteE

Noted. I will open a draft PR once I make some progress.

For testing your changes, I can recommend ExifTool. https://www.exiftool.org/

ExifTool is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files.

The FAQ explains how to extract (read) all available metadata attached to a PDF:

  1. "How do I extract absolutely all metadata from a file?"

    By default, duplicate tags, unknown tags, embedded tags, and System tags that require external utilities are not extracted. The main reason for this is performance; extracting these tags will significantly increase processing time for some files. The following command extracts everything possible with ExifTool:

    exiftool -ee3 -U -G3:1 -api requestall=3 -api largefilesupport FILE

    (The -G3:1 option is included in the above command only to give an indication of where the metadata was stored.)
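When testing a change, it can help to dump the metadata before and after the operation and compare the two dumps. A small Python sketch of that idea, assuming `exiftool` is installed and on the PATH; it uses only the flags from the FAQ command above, and the helper names (`build_dump_command`, `dump_metadata`, `lines_only_in_before`) are made up for illustration:

```python
import shutil
import subprocess


def build_dump_command(pdf_path):
    """Build the 'extract everything' command from the ExifTool FAQ for one file."""
    return ["exiftool", "-ee3", "-U", "-G3:1",
            "-api", "requestall=3", "-api", "largefilesupport", str(pdf_path)]


def dump_metadata(pdf_path):
    """Run ExifTool and return its textual output (requires exiftool on the PATH)."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool not found on PATH")
    result = subprocess.run(build_dump_command(pdf_path),
                            capture_output=True, text=True, check=True)
    return result.stdout


def lines_only_in_before(before, after):
    """Return the lines of the first dump that are missing from the second one."""
    remaining = set(after.splitlines())
    return [line for line in before.splitlines() if line not in remaining]
```

Typical use: call `dump_metadata` on the PDF, run the clean-up in JabRef, call `dump_metadata` again, and pass both outputs to `lines_only_in_before` to see which tags disappeared.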

ThiloteE avatar Apr 06 '22 11:04 ThiloteE

Some code was written in https://github.com/JabRef/jabref/pull/8681; however, the contributors did not continue working on it. Potential contributors can use the code and the discussion there as a basis.

koppor avatar Apr 03 '23 14:04 koppor

This is a CleanUpJob similar to org.jabref.logic.cleanup.MoveFilesCleanup.

koppor avatar Mar 10 '24 18:03 koppor