
Create a service to diff PDF files

Open Mr0grog opened this issue 8 years ago • 28 comments

We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.

This should be a simple web service that takes two query arguments:

  • a: A URL for the “before” version of the PDF
  • b: A URL for the “after” version of the PDF

It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.
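
As a rough illustration of that interface, here is a minimal sketch using only the Python standard library. It reads the `a` and `b` query arguments, checks that both look like http(s) URLs, and passes any other arguments through as options. The names `parse_diff_query` and `app` are hypothetical, and the diffing step is left as a placeholder; nothing here is an existing implementation.

```python
from urllib.parse import parse_qs, urlsplit


def parse_diff_query(query_string):
    """Return (a, b, options) from a query string; `a` and `b` are required."""
    # If an argument is repeated, keep the last value given.
    params = {key: values[-1] for key, values in parse_qs(query_string).items()}
    for name in ("a", "b"):
        value = params.get(name)
        if not value:
            raise ValueError(f"missing required query argument: {name}")
        if urlsplit(value).scheme not in ("http", "https"):
            raise ValueError(f"query argument {name} must be an http(s) URL")
    a = params.pop("a")
    b = params.pop("b")
    # Whatever is left over is treated as optional, differ-specific arguments.
    return a, b, params


def app(environ, start_response):
    """Hypothetical WSGI entry point; the actual PDF diffing is only a stub."""
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    try:
        a, b, options = parse_diff_query(environ.get("QUERY_STRING", ""))
    except ValueError as error:
        start_response("400 Bad Request", headers)
        return [str(error).encode("utf-8")]
    # Placeholder: a real service would fetch both PDFs and render a diff here.
    start_response("200 OK", headers)
    return [f"would diff {a} against {b} with options {options}".encode("utf-8")]
```

The `app` callable could be served locally with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` and queried as `/?a=<before URL>&b=<after URL>`.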

If you need it to function in a different way to be feasible, let’s talk about it! We can make other interfaces work so long as they can be accessible as a web service.

Some open source libraries for diffing PDFs that might be useful:

  • https://github.com/vslavik/diff-pdf
  • https://github.com/JoshData/pdf-diff

Mr0grog avatar Jun 01 '17 04:06 Mr0grog

Here’s an example of a small, hard to see change:

  • Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/2d2ccc52-f467-4775-a034-bea5271c0b9f
  • Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11512540.pdf
  • Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11522529.pdf

Here’s an interesting graphic page with changes:

  • Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/c0307603-0bae-4a6c-bf12-52cc6482b0bc
  • Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-9608983.pdf
  • Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-11239564.pdf

Here’s one that’s just hard to scan by eye because it’s mostly reams of data:

  • Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/3edef8ea-de3f-4771-89f2-92840dad026b
  • Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-9920428.pdf
  • Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-10713675.pdf

And another:

  • Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/563b013c-883f-4099-8c98-ce6059a0b823
  • Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11023958.pdf
  • Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11255938.pdf

Mr0grog avatar Jun 01 '17 08:06 Mr0grog

I'm looking into this; see how I get on!

neiljp avatar Jun 01 '17 21:06 neiljp

@neiljp Awesome, thanks so much!

Mr0grog avatar Jun 01 '17 22:06 Mr0grog

I have the second library working for all 4 examples you listed. I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out. Would it be helpful to show these images somewhere?

neiljp avatar Jun 01 '17 23:06 neiljp

Sure! Go ahead and post them here. If you have this work in a repo, go ahead and link it, too.

Mr0grog avatar Jun 01 '17 23:06 Mr0grog

Are you on the Archivers Slack group? There’s more “live” conversation there and workflow, process, etc.

Mr0grog avatar Jun 01 '17 23:06 Mr0grog

I'm generally not on Slack; is there an IRC mirror somewhere?

neiljp avatar Jun 01 '17 23:06 neiljp

These are the results I have for the 4 tests, with the caveats as above: 1 2 3 4

neiljp avatar Jun 02 '17 00:06 neiljp

These are wonderful. :thumbsup: 🎉

Unfortunately, I don’t think there is any mirror of the Slack :\

Mr0grog avatar Jun 02 '17 00:06 Mr0grog

> I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out.

No worries. I should have been clearer that this doesn’t have to be perfect. Even if there are false positives, being able to identify space people can definitely ignore is a big deal. This is super, super helpful.

Mr0grog avatar Jun 02 '17 00:06 Mr0grog

Hey @neiljp this looks great, thx. Great to have new people stepping in!

We have been talking about an IRC bridge for a while but haven't set one up - doh!

titaniumbones avatar Jun 02 '17 01:06 titaniumbones

@neiljp I’m headed out for the night, but will be back on tomorrow at 9-ish Pacific Time if you are planning to do more work on it. I will also try and sign into the global sprint Gitter.im if you are using that (I did not do a good job of paying attention to it today, sorry).

Looking forward to getting this integrated as a running service!

Mr0grog avatar Jun 02 '17 02:06 Mr0grog

@Mr0grog I'm back and on the gitter chat now. Re chat: I'm currently on IRC (freenode, oftc), matrix.org and also experimenting with zulip (after some pycon sprints). While I'm moving on with looking into this, were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?

neiljp avatar Jun 02 '17 16:06 neiljp

> were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?

No—diffing PDFs is something that we simply haven't had time to get to at all yet.

In general, we haven’t found any great diffing services that either we can deploy feasibly or third party ones that we can integrate with and easily display the diff results in our own UI alongside forms and other visualizations for analysts.

Mr0grog avatar Jun 02 '17 16:06 Mr0grog

Progress today: my Flask implementation is working locally with the library and generating a PNG in the browser. How would you deploy that? I could try and deploy to a server I have access to, in theory.

neiljp avatar Jun 02 '17 20:06 neiljp

We don’t have a great deploy process for anything that’s not Heroku yet—it’s very ad-hoc on Amazon EC2. If you can deploy to a server you manage and document the process, that’d be great.

Mr0grog avatar Jun 02 '17 20:06 Mr0grog

Apparently Flask works on Heroku; the trick might be installing the other module(s), including one that I built as a binary, though it might not strictly need to be.

neiljp avatar Jun 02 '17 21:06 neiljp

Ah, yeah, binaries can be complicated on Heroku. You have to create a “buildpack”: https://devcenter.heroku.com/articles/buildpacks

Mr0grog avatar Jun 02 '17 22:06 Mr0grog

@neiljp Did you get anywhere on this? If not, do you mind posting what code you’ve got somewhere so others can help on this? Thanks!

Mr0grog avatar Jun 05 '17 16:06 Mr0grog

@Mr0grog I didn't get any further than getting it to work locally in the end, but have submitted some PRs against the lib I used, and hope to document the process ASAP.

neiljp avatar Jun 06 '17 05:06 neiljp

@neiljp Any updates on this?

Mr0grog avatar Jun 20 '17 18:06 Mr0grog

@Mr0grog Apologies, I got swept up in contributing to Zulip after PyCon. I'm now getting back to this, though I note there is other progress?

neiljp avatar Jul 11 '17 01:07 neiljp

@neiljp Yeah, we sorta have a more defined way to do this now. You can add your work as a module in the https://github.com/edgi-govdata-archiving/web-monitoring-processing repo, in the web_monitoring folder. There’s not much documentation on how the built-in diff server there works yet, but you can look at PR #59 in that repo. @danielballan can probably also help you out.

Mr0grog avatar Jul 11 '17 18:07 Mr0grog

Hey, @neiljp, just checking in. Any updates or anything I can help with here?

Mr0grog avatar Jul 18 '17 22:07 Mr0grog

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

stale[bot] avatar Jan 23 '19 18:01 stale[bot]

Well, this is still pretty critical. It would be lovely to get some help from someone on this, but it does need to get done.

Mr0grog avatar Jan 23 '19 19:01 Mr0grog

Hey, if this issue is still alive, I would like to contribute.

0xrishabh avatar Dec 23 '19 17:12 0xrishabh

Hey @cYph3r1337, that would be great. These days, all the diff-related code lives in the web-monitoring-processing repo in the web_monitoring/diff directory.

You can then make your differ accessible via HTTP by adding it to the server here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L30-L53

Basically, this just maps a part of the URL path to a function. The server will examine your argument names to figure out what to send it. More info on that here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L455-L465
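
To give a feel for the general idea (this is only a sketch of the technique, not the code in `server.py`), a dispatcher can map a path segment to a function and use `inspect.signature` to pass along only the values that function asks for by name. Everything named here (`ROUTES`, `pdf_changed`, `a_body`, `b_body`, `dispatch`) is a hypothetical stand-in; check the linked lines for the names the server really supports.

```python
import inspect


def call_with_matching_args(func, available):
    """Call `func` with only the entries of `available` named by its parameters."""
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue  # *args / **kwargs cannot be matched by name
        if name in available:
            kwargs[name] = available[name]
        elif param.default is inspect.Parameter.empty:
            raise TypeError(f"{func.__name__} needs an unknown argument: {name}")
    return func(**kwargs)


def pdf_changed(a_body, b_body):
    """Hypothetical differ: reports only whether the raw bytes differ."""
    return {"changed": a_body != b_body}


# Hypothetical routing table: URL path segment -> diff function.
ROUTES = {"pdf_changed": pdf_changed}


def dispatch(path_segment, available):
    """Look up the differ for a path segment and call it with matching values."""
    return call_with_matching_args(ROUTES[path_segment], available)
```

With this shape, a differ declares what it needs just by naming its parameters, and the dispatcher ignores any values the differ did not ask for.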

Mr0grog avatar Jan 02 '20 21:01 Mr0grog