jabref icon indicating copy to clipboard operation
jabref copied to clipboard

Add option to delete XMP metadata

Open vithlaithla opened this issue 2 years ago • 26 comments

Fixes #8277

  • [ ] Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • [ ] Tests created for changes (if applicable)
  • [ ] Manually tested changed features in running JabRef (always required)
  • [x] Screenshots added in PR description (for UI changes)
  • [ ] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • [ ] Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Hi, I am working with Jefferson on this issue and we made some changes to fix issue #8277. Our changes added a button to delete metadata that are selected under xmpPreferences in the screenshot here. Screen Shot 2022-04-17 at 5 36 21 PM We also added a checkbox to include all fields inside the xmpPreferences, which should select all fields of an entry.The UI works as follows:

  • To enable the "Select All Fields" checkbox, the "Do not write the following fields to XMP metadata" must be checked.
  • The checkbox, when enabled, disable the filter list. Screen Shot 2022-04-17 at 5 36 46 PM

There are some problems that we face:

  1. When we make changes to the metadata, sometimes it corrupts the file's metadata. For our XmpUtilRemover, we mainly reuse the XmpUtilWriter, so we might be misusing some functionalities.
  2. We are not sure on how to test the functions yet. We hope to get some guidance.
  3. We changed the xmpPreferences's class, it might be a disruptive change.

vithlaithla avatar Apr 17 '22 11:04 vithlaithla

Thanks for your contribution, please setup Jabrefs code style for intellij https://github.com/JabRef/jabref/tree/main/config

Siedlerchr avatar Apr 17 '22 12:04 Siedlerchr

  1. When we make changes to the metadata, sometimes it corrupts the file's metadata.

    What do you mean by that? Can you give an example?

    Up until 2022-04-11, the development versions of JabRef were reproducibly corrupting pdf files. There was an issue with pdfbox version 3, with was (hopefully!) fixed by https://github.com/JabRef/jabref/pull/8658. Corrupt pdf files BEFORE that date can be explained by 8658. Please make sure to retest with the latest commits from AFTER that date.

  2. We are not sure on how to test the functions yet.

    You could try to create an entry with all supported fields. Write XMP metadata to PDF. Check with exiftool if metadata was written. Then, try to remove the metadata. Check with exiftool, if it was removed. See https://github.com/JabRef/jabref/issues/8277#issuecomment-1090167690 for info about exiftool.

    This method might be inefficient and is prone to user error. Maybe there are other ways how this could be tested, but if push comes to shove...

  3. This is above my head.

ThiloteE avatar Apr 17 '22 13:04 ThiloteE

Regarding the code style warnings. You can go to the "Files" tab in this PR and see the warnings directly. Deep link: https://github.com/JabRef/jabref/pull/8681/files

Screenshot:

grafik

May we ask that you DO NOT reformat the JavaDoc comments when you do not touch them? Reformatting then to a single line makes them more hard to read.

koppor avatar Apr 18 '22 10:04 koppor

@ThiloteE

  1. Thanks for mentioning that, we look into the PR you mentioned and it had the changes to resolve said problem.
  2. We wrote some test cases for the function, so hopefully those should cover most use cases. Since there were in-built functions, we used them to craft our test cases.

vithlaithla avatar Apr 19 '22 16:04 vithlaithla

For the changes:

  • we refine the initially added functions by following @koppor's requests and some changes ourselves,
  • resolves the first problem we faced by following https://github.com/JabRef/jabref/pull/8658 that @ThiloteE mentioned,
  • added tests to make it work as intended

vithlaithla avatar Apr 19 '22 16:04 vithlaithla

You are welcome! I am impressed with the amount of changes that needed to be done for this feature. Thank you! Will try out this pull request within the next 3 days.

ThiloteE avatar Apr 19 '22 17:04 ThiloteE

@vithlaithla Please merge the latest upstream/main branch to ensure you have the latest checkstyle rules in place (and that the FetcherTests don't fail at this PR)

koppor avatar Apr 20 '22 22:04 koppor

@koppor I fixed the check style and it passed the check style test. However, the fetcher test failed here but passed on my fix branch 'revise-8277' on my repository. Which, I then merged to this branch. I read the log, and there seems to be a network problem if I'm not mistaken. Can I do something here?

vithlaithla avatar Apr 21 '22 13:04 vithlaithla

I just did some testing:

Test A)

  1. Have image

  2. choose image

    Result: No metadata got deleted. Comment: I think this is counterintuitive, as users explicitly press a button that says "delete metadata".

    • Suggestion A) Either delete all (JabRef supported) metadata fields in this case.
    • Suggestion B) OR change the button to read "Delete metadata from PDFs (as specified in preferences)" or something similar.

    This is a non-critical remark and could also be adressed in a follow up pull request.

Test B)

  1. Have following entry:

    @Article{test444differentfilet,
    author = {test444differentfile},
    title  = {title444},
    file   = {:test 444 (2021).pdf:PDF},
    }   
    
  2. Have following preferences: image

  3. Press: Tools > Delete metadata from PDFs

    Result: image

ThiloteE avatar Apr 21 '22 13:04 ThiloteE

Problem: Selecting the field "Do not write the following fields to XMP metadata" will turn out to be ambiguous to the user. It will fulfill three separate functions:

  1. Fields from entry will not be writtten to XMP metadata
  2. Existing metadata will be removed from the PDFs, IF the entry does have the same field as specified in the preferences included in the entry. (I know, this sounds complicated. How to explain to users?).
  3. ALL existing JabRef supported metadata fields will be removed from the PDFs, IF the option select all fields is selected.

This seems a lot for one single option and will lead to confusion. At the very least, there needs to be some kind of documentation. Maybe this can be separated into separate tickable preference?

Another option would be renaming. How about: "Prevent writing following metadata fields to PDFs / Remove following metadata fields from PDFs"?

Underneath the option, put following text:

      - Choosing this option and using "write metadata to PDF files" will not write the selected XMP metadata fields to PDFs. 
        Additionally, it will also remove existing metadata from PDFs, if one or multiple of the following fields
        are ALSO present in the entry, whose metadata is supposed to be written to PDF.
      - Selected XMP metadata fields will be completely removed from PDFs if choosing "Delete metadata from PDFs".

ThiloteE avatar Apr 21 '22 14:04 ThiloteE

I would be happy, if you could address my remarks!

ThiloteE avatar Apr 21 '22 14:04 ThiloteE

@ThiloteE Regarding the unsuccessful metadata removal, we suspect the the reason that only author and title field are deleted is because they are included in the entry.

So when deleting, we don't need to consider any information about the library data of entry (whether fields to be deleted are also in the entry...). Is that right?

Regarding ambiguity, we are leaning more toward the second option (Clearer description) because it seems simpler for us.

  • Suggestion A) Either delete all (JabRef supported) metadata fields in this case.

+1 for that. - I think, the intention of the issue is to clean-up the meta data of a PDF (as you shown in the screenshots).

The preference setting was more a privacy setting - maybe, we should remove that setting in the future. --> @ThiloteE Let's discuss the use cases and workflows soon.

koppor avatar Apr 21 '22 19:04 koppor

So when deleting, we don't need to consider any information about the library data of entry (whether fields to be deleted are also in the entry...). Is that right?

  • When pressing the button Tools > Delete metadata from PDFs, it does NOT need to be considered. (But one should be able to choose which fields are supposed to be deleted.)
  • When pressing the button Tools > Write metadata to PDF files fields within the entry SHOULD be considered.
    • Maybe we can drop existing fields being deleted when pressing this button, though. Why delete fields, when we press a button that says "write metadata ..."?, when we now we have the button "delete metadata from PDFs" instead, right? So this functionality is maybe not necessary anymore.

Ideally, I think the following (or something similar) would be the least confusing solution: image

But I know you have done a lot! Feel free to do whatever you think will work and whatever you think you may manage in your given time or with the knowledge you possess! :-)

ThiloteE avatar Apr 21 '22 23:04 ThiloteE

@ThiloteE In this change, we now don't consider whether fields are also in entries when deleting. Demo:

  1. Have following entry @Misc{, author = {Jefferson}, year = {2022}, file = {:JabRef_multipleMetaEntries.pdf:PDF}, }

  2. Have following preferences privacy settings

  3. PDF file's initial XMP metadata

  • linked file view: intial_metadata1

  • exiftool result: initial_exiftool_result

  1. Press: Tools > Delete metadata from PDFs

  2. PDF file's XMP metadata after deleting

  • linked file view: modified_metadata

  • exiftool result: modified_exiftool_result

Regarding ambiguity, we added the description: privacy settings

Problem/Confusion we have:

  1. Linked file view model's XMP metadata column is empty and it says "Could not extract Metadata from XMP metadata". This is probably because all XMP metadata have been deleted. We are not sure if this behavior is desirable.
  2. We have difficulty passing the fetcher test. Do you have any suggestion for this problem?

I will try it. What do I have to do to start "Linked file view model"? E.g. how can I find the dialogue for what you show in the picture beneath 5.?

ThiloteE avatar Apr 22 '22 13:04 ThiloteE

@ThiloteE Here is how you can access it

linked file view

Test C:

image BEFORE:

image

I then clicked on Tools > Write metadata to PDF files (F6)

AFTER: image

Exiftool AFTER:

image

ThiloteE avatar Apr 23 '22 13:04 ThiloteE

Test D: I went to get another fresh PDF (same data was attached to the PDF as in test C BEFORE doing stuff) image

I then clicked Tools > Delete metadata from PDFs

AFTER:

image

exiftool AFTER: image

ThiloteE avatar Apr 23 '22 13:04 ThiloteE

Ah, and I had a lot of Typos in my text. It does not really look good.

Here, this is a better text:

Underneath the option, put following (or something similar) text and format it in a way that looks nice! :D

      - Choosing this option and using "write metadata to PDF files" will not write the selected XMP metadata fields to PDFs. 
        Additionally, it will also remove existing metadata from PDFs, if one or multiple of the following fields
        are ALSO present in the entry, whose metadata is supposed to be written to PDF.
      - Selected XMP metadata fields will be completely removed from PDFs if choosing "Delete metadata from PDFs".

To do in a follow up pull-request:

  • [ ] Add this text to the JabRef documentation page and link to it via following button:

    image

    But, this can be done later, not now.

ThiloteE avatar Apr 23 '22 13:04 ThiloteE

But looking already pretty good. You are definitely getting closer to the desired outcome!

ThiloteE avatar Apr 23 '22 13:04 ThiloteE

got it, we will follow up very soon

vithlaithla avatar Apr 24 '22 16:04 vithlaithla

what is the status here?

ThiloteE avatar May 24 '22 05:05 ThiloteE

@ThiloteE There is a delay on the work, but we are still working on it! We will let you know once we have the changes.

vithlaithla avatar May 25 '22 11:05 vithlaithla

ok thanks! :)))

ThiloteE avatar May 25 '22 11:05 ThiloteE

any update here?

calixtus avatar Aug 14 '22 05:08 calixtus

Attempt at summarizing this PR:

As far as I remember

Done:

  • [x] 1. Deleting XMP metadata mostly WORKS.

To Do:

  • [ ] 2. This PR causes a regression and breaks "parse metadata from pdf" in "General" tab of entry editor. See https://github.com/JabRef/jabref/pull/8681#issuecomment-1107469686.

    • The regression was caused by one of the latest commits.
    • JabRef and Exiftool now show different attached metadata.
    • When working on this PR, don't trust JabRef. Exiftool seems to be more reliable.
  • [ ] 3. "Linked file view model's XMP metadata column is empty and it says "Could not extract Metadata from XMP metadata". This is probably because all XMP metadata have been deleted. We are not sure if this behavior is desirable." - See https://github.com/JabRef/jabref/pull/8681#issuecomment-1106503459

To Do In Follow Up PR:

  • [ ] 4. Some exceptions to deleting XMP metadata remain. Some fields cannot be deleted yet. See https://github.com/JabRef/jabref/pull/8681#issuecomment-1107472121

  • [ ] 5. Explanation of preferences looks slightly out of place/ugly. See https://github.com/JabRef/jabref/pull/8681#issuecomment-1107473608

    • If push comes to shove, the explanation could be moved to the docs

ThiloteE avatar Nov 16 '22 11:11 ThiloteE

Closing this issue due to inactivity :zzz: Please reopen if you plan to continue on this PR

calixtus avatar Apr 03 '23 14:04 calixtus