FR: Support comparing PDF files

Open Gitoffthelawn opened this issue 4 years ago • 28 comments

It would be incredible if WinMerge supported comparing PDF files.

Even if it isn't perfect, it will be a great feature.

Thank you again for WinMerge!

Gitoffthelawn avatar Apr 03 '21 09:04 Gitoffthelawn

You can use the xdocdiff WinMerge plugin to extract and compare text from PDF files.

http://freemind.s57.xrea.com/xdocdiffPlugin/en/index.html

(WinMerge version 2.16.13 includes a new ApacheTika plugin that allows you to extract and compare text content from PDF files.)
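The text-based approach amounts to extracting plain text from each PDF and then running an ordinary line diff on the result. As a rough illustration only (this is not how the xdocdiff or ApacheTika plugins are implemented, and the function name `diff_extracted_text` is hypothetical), the comparison step could be sketched in Python, assuming the text has already been extracted by some tool:

```python
import difflib
from typing import List


def diff_extracted_text(old_text: str, new_text: str,
                        old_name: str = "old.pdf",
                        new_name: str = "new.pdf") -> List[str]:
    """Return a unified diff of two already-extracted PDF text strings.

    Assumption: extraction happened elsewhere (e.g. via a plugin or a
    command-line extractor); this function only compares the text.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name,
                                     lineterm=""))


if __name__ == "__main__":
    # Stand-in strings; real input would come from a text extractor.
    old = "Invoice 1001\nTotal: 250.00\nThank you"
    new = "Invoice 1001\nTotal: 275.00\nThank you"
    for line in diff_extracted_text(old, new):
        print(line)
```

A diff like this only sees the extracted text, so layout, fonts, and images are ignored.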

You can also compare PDF files as image files by specifying the .pdf extension in the "Image File Patterns" combo box in the Compare/Image category of the Options dialog.

sdottaka avatar Apr 03 '21 12:04 sdottaka

@sdottaka Thank you! I will take a look at both of your suggestions, and report back after I have done so.

I still think a proper PDF compare will be great, but one of these ideas might be a big improvement.

Is there anything similar for JSON files, or should I create another issue report for that format?

Gitoffthelawn avatar Apr 04 '21 10:04 Gitoffthelawn

For JSON and XML files, I am planning to create a Formatter plugin.

sdottaka avatar Apr 04 '21 13:04 sdottaka

> For JSON and XML files, I am planning to create a Formatter plugin.

Wonderful! I'm guessing the XML plugin will be able to handle HTML well. Is that correct?

Gitoffthelawn avatar Apr 04 '21 23:04 Gitoffthelawn

I would like to be able to handle HTML files as well.

sdottaka avatar Apr 05 '21 00:04 sdottaka

Here is another HTML plugin

jrathlev avatar Apr 07 '21 11:04 jrathlev

xdocdiffPlugin_1_0_6d.zip is flagged as Trojan.Malware.300983.susgen by MaxSecure on VirusTotal.com.

Has anyone else noticed this?

outsidecoder avatar Jul 20 '21 23:07 outsidecoder

WinMerge supports comparing PDF files via the Apache Tika plugin, but the plugin is currently broken because the download URLs for Tika and JAI no longer work.

The fix is relatively easy. Find tika.bat, which is installed under your WinMerge program folder in the 'Commands\Apache-Tika' folder, e.g. 'C:\Program Files\WinMerge\Commands\Apache-Tika' (if you've installed the x64 version).

You will probably need elevated privileges to back up and edit 'tika.bat'.

Just in case, back up tika.bat before you edit it.

Replace the contents with the following:

```bat
@echo off
setlocal EnableDelayedExpansion
set TikaVer=2.1.0
set JaiVer=1.4.0
set TikaJar=tika-app-%TikaVer%.jar
set JaiJar=jai-imageio-jpeg2000-%JaiVer%.jar
set DOWNLOAD_URL=https://repo1.maven.org/maven2/org/apache/tika/tika-app/%TikaVer%/%TikaJar%
set DOWNLOAD_URL_JPEG2000=https://repo1.maven.org/maven2/com/github/jai-imageio/jai-imageio-jpeg2000/%JaiVer%/%JaiJar%
set TIKA_PATH=Commands\Apache-Tika\%TikaJar%
set JAI_IMAGEIO_JPEG2000_PATH=Commands\Apache-Tika\%JaiJar%
set MESSAGE='Apache Tika is not installed. Do you want to download it and its dependencies from %DOWNLOAD_URL% and %DOWNLOAD_URL_JPEG2000%?'
set TITLE='Apache Tika Plugin'

cd "%APPDATA%\WinMerge"
if not exist %TIKA_PATH% (
  cd "%~dp0..\.."
  if not exist %TIKA_PATH% (
    mkdir "%APPDATA%\WinMerge" 2> NUL
    cd "%APPDATA%\WinMerge"
    for %%i in (%TIKA_PATH%) do mkdir %%~pi 2> NUL
    powershell "if ((New-Object -com WScript.Shell).Popup(%MESSAGE%,0,%TITLE%,1) -ne 1) { throw }" > NUL
    if errorlevel 1 (
      echo "download is canceled" 1>&2
    ) else (
      start "Downloading..." /WAIT powershell -command "[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; Invoke-WebRequest %DOWNLOAD_URL% -Outfile %TIKA_PATH%"
      start "Downloading..." /WAIT powershell -command "[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; Invoke-WebRequest %DOWNLOAD_URL_JPEG2000% -Outfile %JAI_IMAGEIO_JPEG2000_PATH%"
    )
  )
)
java -Xbootclasspath/a:%JAI_IMAGEIO_JPEG2000_PATH% -jar %TIKA_PATH% %3 %4 %5 %6 %7 %8 %9 "%~1" > "%~2"
```
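The two download URLs in the script follow the standard Maven Central layout (group ID with dots turned into slashes, then artifact, version, and `artifact-version.jar`), so changing `TikaVer` or `JaiVer` is enough to point at a different release. The following Python sketch just reproduces that pattern to make it easier to check a URL before editing the batch file; the helper names `maven_jar_url` and `local_jar_path` are hypothetical and not part of WinMerge:

```python
from pathlib import PureWindowsPath

MAVEN_BASE = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the Maven Central URL for a jar, mirroring DOWNLOAD_URL in the script."""
    group_path = group_id.replace(".", "/")  # org.apache.tika -> org/apache/tika
    return (f"{MAVEN_BASE}/{group_path}/{artifact_id}/{version}/"
            f"{artifact_id}-{version}.jar")


def local_jar_path(artifact_id: str, version: str) -> str:
    """Relative Windows path in the style the script uses for the Tika jar."""
    return str(PureWindowsPath("Commands", "Apache-Tika",
                               f"{artifact_id}-{version}.jar"))


if __name__ == "__main__":
    print(maven_jar_url("org.apache.tika", "tika-app", "2.1.0"))
    print(maven_jar_url("com.github.jai-imageio", "jai-imageio-jpeg2000", "1.4.0"))
    print(local_jar_path("tika-app", "2.1.0"))
```

The relative path is resolved against whichever folder the script has changed into (`%APPDATA%\WinMerge` or the WinMerge program folder), which is why the script creates the target folder before downloading.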

JohnF-tfa avatar Oct 11 '21 02:10 JohnF-tfa

@JohnF-tfa Thank you for those details John!

Gitoffthelawn avatar Oct 11 '21 04:10 Gitoffthelawn

I find being able to do an image compare and a text compare of PDFs very useful.

JohnF-tfa avatar Oct 11 '21 04:10 JohnF-tfa

@JohnF-tfa wrote:

> I find being able to do an image compare and a text compare of PDFs very useful.

Hmmm... what do you mean?

I could see a text compare perhaps being helpful because often many of the strings within a PDF file are stored as plain text. But how do you find an image compare helpful?

Gitoffthelawn avatar Oct 11 '21 08:10 Gitoffthelawn

@JohnF-tfa Thank you for the patch. I changed the URL in commit 2e5cdec. I'm sure the previous URL was fine before, but when I tried it now, it did indeed fail to download.

sdottaka avatar Oct 11 '21 11:10 sdottaka

> @JohnF-tfa wrote:
>
> > I find being able to do an image compare and a text compare of PDFs very useful.
>
> Hmmm... what do you mean?
>
> I could see a text compare perhaps being helpful because often many of the strings within a PDF file are stored as plain text. But how do you find an image compare helpful?

I work for a company that sells (and integrates) software for medium/large businesses and government agencies that interfaces with their IT systems (like billing, loans, etc.) that can create huge quantities of customised PDFs. When upgrading their software, we need to compare PDFs from the new software with PDFs from their previous software to make sure that everything is exactly the same. Settings can vary and cause images (logos, etc.) to be slightly different, and by the time you get to the bottom of the page the issue has magnified. It's good to be able to easily check that the page image as well as the page text are correct. Once upon a time I had to print both PDFs and hold them up to the light (overlaid) to compare. This is not something that we do all the time, but occasionally we get clients who want to upgrade from very old (unsupported) software to more current software using current technologies. Some of this software is nearly 30 years old and passed EOL years ago, but it is still in use because it still works.
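To make the image-compare idea concrete: once both pages are rendered to pixels, a layout drift shows up as a region of differing pixels even when the extracted text is identical. Here is a minimal Python sketch of that check, assuming the pages have already been rasterised to same-sized grayscale grids (rendering the PDF is out of scope here, and `diff_bounding_box` is a hypothetical name, not WinMerge's image compare):

```python
from typing import List, Optional, Tuple

Raster = List[List[int]]  # rows of grayscale pixel values, 0-255


def diff_bounding_box(a: Raster, b: Raster,
                      tolerance: int = 0) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the area where two rasters differ.

    Returns None when no pixel differs by more than `tolerance`.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("rasters must have the same dimensions")
    # Rows that contain at least one differing pixel.
    rows = [y for y, (ra, rb) in enumerate(zip(a, b))
            if any(abs(pa - pb) > tolerance for pa, pb in zip(ra, rb))]
    if not rows:
        return None
    # Columns of every differing pixel within those rows.
    cols = [x for y in rows
            for x, (pa, pb) in enumerate(zip(a[y], b[y]))
            if abs(pa - pb) > tolerance]
    return rows[0], min(cols), rows[-1], max(cols)


if __name__ == "__main__":
    old_page = [[255] * 6 for _ in range(5)]
    new_page = [row[:] for row in old_page]
    new_page[3][2] = 0  # a pixel that moved near the bottom of the page
    print(diff_bounding_box(old_page, new_page))
```

A small `tolerance` can help ignore anti-aliasing noise, at the cost of missing very faint differences.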

JohnF-tfa avatar Oct 12 '21 00:10 JohnF-tfa

Hi, @JohnF-tfa

> You will probably need elevated privileges to back up and edit 'tika.bat'

Thank you for your guidance. Following your approach, I was able to use tika-app-2.4.1.jar to compare PDFs. However, when I try to use jai-imageio-jpeg2000-1.4.0.jar, the following error is always reported:

[screenshot: CopyQ oUXsnL]

May I ask if my settings are incorrect?

This is my tika.zip file, looking forward to your reply. Best regards.

tika.zip

tyf2018 avatar Aug 19 '22 21:08 tyf2018

Hi @sdottaka, as you know, WinMerge already supports PDF file comparison via Apache Tika, and it works beautifully!! (Thank you for your many contributions, btw!)

If possible, I would suggest editing your previous comment on this issue to remove the mentions of xdocdiffPlugin.

That plugin was last updated in 2013, and many users (like myself) who search for a PDF comparison tool are directed to this issue. Reading the comments, they may think that PDFs are not supported out of the box by WinMerge.

Best regards.

roimvargas avatar Feb 18 '23 20:02 roimvargas

The xdocdiffPlugin is still available and fast enough that we do not feel the need to remove it. I added a note about the ApacheTika plugin.

sdottaka avatar Feb 18 '23 23:02 sdottaka