winmerge
FR: Support comparing PDF files
It would be incredible if WinMerge supported comparing PDF files.
Even if it isn't perfect, it will be a great feature.
Thank you again for WinMerge!
You can use the xdocdiff WinMerge plugin to extract and compare text from PDF files.
http://freemind.s57.xrea.com/xdocdiffPlugin/en/index.html
(WinMerge version 2.16.13 includes a new ApacheTika plugin that allows you to extract and compare text content from PDF files.)
You can also compare PDF files as image files by specifying the .pdf extension in the "Image File Patterns" combo box in the Compare/Image category of the Options dialog.
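To give a rough idea of what the text-based route amounts to once the text has been extracted, here is a minimal Python sketch. It assumes the text of each PDF is already available as a string (for example, as produced by one of the plugins above). The function name `diff_extracted_text` and the sample strings are hypothetical, and this is not how WinMerge itself performs the comparison.

```python
import difflib


def diff_extracted_text(left: str, right: str) -> list[str]:
    """Return unified-diff lines for two blocks of already-extracted text."""
    return list(
        difflib.unified_diff(
            left.splitlines(),
            right.splitlines(),
            fromfile="old.pdf (text)",  # labels only; no files are read
            tofile="new.pdf (text)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical extracted text from two versions of the same document.
    old_text = "Invoice 1001\nTotal: 10.00\n"
    new_text = "Invoice 1001\nTotal: 12.00\n"
    for line in diff_extracted_text(old_text, new_text):
        print(line)
```

Identical inputs produce an empty list, so any output at all means the extracted text differs.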

@sdottaka Thank you! I will take a look at both of your suggestions, and report back after I have done so.
I still think a proper PDF compare will be great, but one of these ideas might be a big improvement.
Is there anything similar for JSON files, or should I create another issue report for that format?
For JSON and XML files, I am planning to create a Formatter plugin.
Wonderful! I'm guessing the XML plugin will be able to handle HTML well. Is that correct?
I would like to be able to handle HTML files as well.
Here is another HTML plugin
xdocdiffPlugin_1_0_6d.zip is flagged as Trojan.Malware.300983.susgen by MaxSecure on VirusTotal.com.
Has anyone else noticed this?
WinMerge supports comparing PDF files via the Apache Tika plugin, BUT it is currently broken because the URLs it uses to download Tika and JAI no longer work.
The fix is relatively easy: find tika.bat, which is installed in the 'Commands\Apache-Tika' folder under your WinMerge program folder, e.g. 'C:\Program Files\WinMerge\Commands\Apache-Tika' (if you've installed the x64 version).
You will probably need elevated privileges to back up and edit 'tika.bat'.
Just in case, back up tika.bat before you edit it.
Replace the contents with the following:
```bat
@echo off
setlocal EnableDelayedExpansion
rem Versions to fetch from Maven Central
set TikaVer=2.1.0
set JaiVer=1.4.0
set TikaJar=tika-app-%TikaVer%.jar
set JaiJar=jai-imageio-jpeg2000-%JaiVer%.jar
set DOWNLOAD_URL=https://repo1.maven.org/maven2/org/apache/tika/tika-app/%TikaVer%/%TikaJar%
set DOWNLOAD_URL_JPEG2000=https://repo1.maven.org/maven2/com/github/jai-imageio/jai-imageio-jpeg2000/%JaiVer%/%JaiJar%
rem Both jars are kept in the same folder, relative to the current directory
set TIKA_PATH=Commands\Apache-Tika\%TikaJar%
set JAI_IMAGEIO_JPEG2000_PATH=Commands\Apache-Tika\%JaiJar%
set MESSAGE='Apache Tika is not installed. Do you want to download it and its dependencies from %DOWNLOAD_URL% and %DOWNLOAD_URL_JPEG2000%?'
set TITLE='Apache Tika Plugin'
rem Look for Tika in the per-user folder first, then in the program folder
cd "%APPDATA%\WinMerge"
if not exist %TIKA_PATH% (
  cd "%~dp0..\.."
  if not exist %TIKA_PATH% (
    rem Not found in either place: ask, then download into the per-user folder
    mkdir "%APPDATA%\WinMerge" 2> NUL
    cd "%APPDATA%\WinMerge"
    for %%i in (%TIKA_PATH%) do mkdir %%~pi 2> NUL
    powershell "if ((New-Object -com WScript.Shell).Popup(%MESSAGE%,0,%TITLE%,1) -ne 1) { throw }" > NUL
    if errorlevel 1 (
      echo "download is canceled" 1>&2
    ) else (
      start "Downloading..." /WAIT powershell -command "[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; Invoke-WebRequest %DOWNLOAD_URL% -Outfile %TIKA_PATH%"
      start "Downloading..." /WAIT powershell -command "[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; Invoke-WebRequest %DOWNLOAD_URL_JPEG2000% -Outfile %JAI_IMAGEIO_JPEG2000_PATH%"
    )
  )
)
rem Run Tika on the input file and write the extracted output to the second argument
java -Xbootclasspath/a:%JAI_IMAGEIO_JPEG2000_PATH% -jar %TIKA_PATH% %3 %4 %5 %6 %7 %8 %9 "%~1" > "%~2"
```
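For anyone who wants to check the download URLs before editing the batch file, the two URLs above follow the usual Maven Central layout (group path, artifact, version, then `artifact-version.jar`). Below is a small Python sketch that rebuilds them from the version numbers. The helper name `maven_central_jar_url` is hypothetical and is not part of WinMerge or Tika; the group and artifact IDs are simply read off the URLs in the script above.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_central_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the jar URL for an artifact using the standard Maven repository layout."""
    group_path = group_id.replace(".", "/")  # dots in the group ID become path separators
    return (
        f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


if __name__ == "__main__":
    # Same versions as TikaVer and JaiVer in the batch file above.
    print(maven_central_jar_url("org.apache.tika", "tika-app", "2.1.0"))
    print(maven_central_jar_url("com.github.jai-imageio", "jai-imageio-jpeg2000", "1.4.0"))
```

Pasting a printed URL into a browser is a quick way to confirm that a given version actually exists before changing `TikaVer` or `JaiVer`.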
@JohnF-tfa Thank you for those details John!
I find being able to do an image compare and a text compare of PDFs very useful.
@JohnF-tfa wrote:
> I find being able to do an image compare and a text compare of PDFs very useful.
Hmmm... what do you mean?
I could see a text compare perhaps being helpful because often many of the strings within a PDF file are stored as plain text. But how do you find an image compare helpful?
@JohnF-tfa Thank you for the patch. I changed the URL in commit 2e5cdec. I'm sure the previous URL was fine before, but when I tried it now, it did indeed fail to download.
> @JohnF-tfa wrote:
> > I find being able to do an image compare and a text compare of PDFs very useful.
>
> Hmmm... what do you mean?
> I could see a text compare perhaps being helpful because often many of the strings within a PDF file are stored as plain text. But how do you find an image compare helpful?
I work for a company that sells and integrates software for medium and large businesses and government agencies. The software interfaces with their IT systems (billing, loans, etc.) and can create huge quantities of customised PDFs. When upgrading their software, we need to compare PDFs from the new software with PDFs from their previous software to make sure that everything is exactly the same. Settings can vary in ways that make images (logos, etc.) slightly different, and by the time you get to the bottom of the page the discrepancy has been magnified. It's good to be able to easily check that the page image as well as the page text are correct. Once upon a time I would have had to print both PDFs and hold them up to the light (overlaid) to compare. This is not something that we do all the time; occasionally we get clients who want to upgrade from very old (unsupported) software to more current software using current technologies. Some of this software is nearly 30 years old and passed EOL years ago, but it is still in use because it still works.
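To illustrate the "overlay" idea, here is a minimal Python sketch that counts how many pixels differ between two rendered pages. It assumes each page has already been rasterised to a same-sized grid of grayscale values (0 = black, 255 = white); the function name `count_differing_pixels`, the `tolerance` parameter, and the tiny sample pages are all hypothetical, and this is not WinMerge's image compare.

```python
def count_differing_pixels(page_a, page_b, tolerance=0):
    """Count pixels whose grayscale values differ by more than `tolerance`."""
    same_shape = len(page_a) == len(page_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(page_a, page_b)
    )
    if not same_shape:
        raise ValueError("pages must have the same dimensions")
    return sum(
        1
        for row_a, row_b in zip(page_a, page_b)
        for px_a, px_b in zip(row_a, row_b)
        if abs(px_a - px_b) > tolerance
    )


if __name__ == "__main__":
    # A dark line of "text" that has slipped down by one row in the new output.
    old_page = [[255, 255, 255], [0, 0, 0], [255, 255, 255]]
    new_page = [[255, 255, 255], [255, 255, 255], [0, 0, 0]]
    print(count_differing_pixels(old_page, new_page))
```

The extracted text of these two pages would be identical, which is why a text compare alone would not catch this kind of layout drift.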
Hi, @JohnF-tfa
> You will probably need elevated privileges to back up and edit 'tika.bat'

Thank you for your guidance. Following your suggestion, I was able to call tika-app-2.4.1.jar to compare the PDFs. But when I try to call jai-imageio-jpeg2000-1.4.0.jar, the following error is always reported:

May I ask if my settings are incorrect?
This is my tika.zip file, looking forward to your reply. Best regards.
Hi @sdottaka, as you know, WinMerge already supports PDF file comparison via Apache Tika, and it works beautifully!! (Thank you for your many contributions, btw!)
If possible, I would recommend editing your previous comment on this issue to remove the mentions of xdocdiffPlugin.
That plugin was last updated in 2013, and many users (like myself) who are searching for a PDF comparison tool are directed to this issue and, after reading the comments, may think that PDFs are not supported out of the box by WinMerge.
Best regards.
The xdocdiffPlugin is still available and fast enough that we do not feel the need to remove it. I added a note about the ApacheTika plugin.