web-monitoring-processing

Create an analyzer that checks for simple, ignorable non-text changes

Open Mr0grog opened this issue 6 years ago • 4 comments

As a first test of everything needed to automatically rate a change’s significance and priority, let’s start with something simple that looks for changes we can pretty confidently say aren’t meaningful:

  • No changes to the page’s text (except whitespace changes and punctuation, like ')
  • Attribute changes that are not for title, alt, href, or src (any others?) are not important
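
A minimal sketch of the check described in the two bullets above, using only the Python standard library. The names `IMPORTANT_ATTRIBUTES`, `IGNORED_CHARS`, and `is_ignorable_change` are hypothetical, and treating curly quotes as ignorable punctuation is an assumption; this is not code from the project.

```python
import re
import string
from html.parser import HTMLParser

# Attributes the issue names as meaningful; all others are treated as ignorable.
IMPORTANT_ATTRIBUTES = {"title", "alt", "href", "src"}
# Assumption: ignore ASCII punctuation plus curly single/double quotes.
IGNORED_CHARS = string.punctuation + "\u2018\u2019\u201c\u201d"
_STRIP_TABLE = str.maketrans("", "", IGNORED_CHARS)


class _Extractor(HTMLParser):
    """Collects text outside <script>/<style> and the important attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.attributes = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if name in IMPORTANT_ATTRIBUTES:
                self.attributes.append((tag, name, value))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def _fingerprint(html):
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).translate(_STRIP_TABLE)
    # Collapse all whitespace so that reflowed text compares equal.
    text = re.sub(r"\s+", " ", text).strip()
    return text, parser.attributes


def is_ignorable_change(old_html, new_html):
    """True if the versions differ only in whitespace, punctuation,
    or attributes outside IMPORTANT_ATTRIBUTES."""
    return _fingerprint(old_html) == _fingerprint(new_html)
```

The sketch errs on the side of reporting a change: markup that splits a word into separate text nodes, for example, makes the two versions compare as different.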

Example: https://monitoring.envirodatagov.org/page/b2b0b8cb-5e9b-4178-91c0-b8cb4466d2bd/b76dd1ab-a7aa-41d6-89f3-c45117a80dc5..2b55beed-db97-4249-b30a-600f61d94eb5

This is an easy analysis to do (and covers a lot of the kinds of changes I think we see), so it’s a good way to make sure we’ve built out:

  • a working pipeline for proactively analyzing new versions added to the DB
  • a style and format for organizing analysis code

Mr0grog Mar 19 '18 20:03

At this week’s analyst meeting, CAPTCHAs came up as another constantly changing thing that is hopefully easy to identify.

Also:

  • ASP.NET postback/session data
  • Invisible form fields (would cover the above ASP.NET stuff)
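
One way the hidden-field idea could be sketched with the standard library: sort `<input>` elements into hidden and visible buckets and report when only the hidden ones differ. `collect_fields` and `only_hidden_fields_changed` are hypothetical names, and the sketch looks only at `<input>` elements, not at the rest of the page.

```python
from html.parser import HTMLParser


class _FormFieldCollector(HTMLParser):
    """Sorts <input> elements into hidden and visible buckets by name."""

    def __init__(self):
        super().__init__()
        self.hidden = {}
        self.visible = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        value = attributes.get("value") or ""
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        (self.hidden if is_hidden else self.visible)[name] = value


def collect_fields(html):
    """Return (hidden, visible) dicts mapping input names to values."""
    collector = _FormFieldCollector()
    collector.feed(html)
    collector.close()
    return collector.hidden, collector.visible


def only_hidden_fields_changed(old_html, new_html):
    """True if some hidden input changed while every visible input stayed the same."""
    old_hidden, old_visible = collect_fields(old_html)
    new_hidden, new_visible = collect_fields(new_html)
    return old_visible == new_visible and old_hidden != new_hidden
```

ASP.NET state such as `__VIEWSTATE` is carried in `type="hidden"` inputs, so it lands in the hidden bucket without any special casing.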

More far out:

  • Simple heuristics for identifying “related links” sections?
  • Allowing selectors for sections of the page to ignore as an argument?
    • To be usable, we need to add the ability to store a list of ignorable selectors in the DB, but that’s separate work

We should probably turn this issue into an umbrella/epic issue for all these different ideas and pieces of work.

Mr0grog Aug 01 '18 17:08

From some BLM examples @jschell42 sent me:

  • Cache-breaking hashes/unique values in subresource URLs (e.g. for CSS, JS)

  • Changes to id, class, name attributes (and moving those attributes).

  • Changes to title attributes probably should be accounted for somehow, but a) are hard to see and b) probably aren’t a big deal (so they should only matter a tiny bit, if they matter at all).

  • Addition/removal of empty title or maybe any attribute? (Might need a special list of attributes that have meaning just by their presence, like checked.)

  • Maybe just anything that’s non-text/image?

  • Amount of textual change?

    • % of total words?
    • Simhash?
    • Zhang-Shasha (tree edit distance)?
    • ?
  • <meta> modified date? e.g.:

    <meta name="dcterms.modified" content="2018-06-11T11:58:04-04:00" />
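
For the cache-breaking URL item in the list above, a rough sketch of one possible normalization: drop the query string and fragment, and strip hash-like segments from the file name, so that two versions of a subresource URL compare equal. The function name and the "8 or more hex characters" heuristic are assumptions, not something taken from the BLM examples.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a "hash" is a run of 8+ hex characters set off by '.', '-' or '_'
# inside a file name, e.g. "app.3f2a9c1b.css" or "main-0123456789abcdef.js".
_HASH_SEGMENT = re.compile(r"[._-][0-9a-f]{8,}(?=[._-]|$)", re.IGNORECASE)


def normalize_subresource_url(url):
    """Strip likely cache-breaking parts of a CSS/JS/image URL."""
    parts = urlsplit(url)
    directory, separator, filename = parts.path.rpartition("/")
    filename = _HASH_SEGMENT.sub("", filename)
    path = directory + separator + filename
    # Drop the query string and fragment entirely.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

An analyzer could apply this to `src` and `href` values on `<script>`, `<link>`, and `<img>` elements before comparing them. Dropping the whole query string is aggressive and would hide changes on URLs where the query selects the actual resource.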
    

There’s definitely an interesting thing here I wasn’t thinking about before… we could make a big split in prioritization based simply on textual (+ images and such) content changes. I can see some super-useful annotation data we could display for analysts (especially in their sheets) like:

  • Did text change? y/n
  • % text change
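
The two annotations above could be computed roughly like this, assuming the visible text of each version has already been extracted. Defining "% text change" as one minus `difflib`'s word-level similarity ratio is an assumption; the issue does not settle on a definition.

```python
from difflib import SequenceMatcher


def text_change_summary(old_text, new_text):
    """Return the two annotations for a pair of extracted page texts."""
    # Splitting on whitespace means whitespace-only edits count as no change.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    percent_changed = round((1 - matcher.ratio()) * 100, 1)
    return {
        "text_changed": old_words != new_words,
        "percent_changed": percent_changed,
    }
```

For example, `text_change_summary("a b c d", "a b c e")` reports `text_changed` as `True` and `percent_changed` as `25.0`.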

Some diffs for examples:

  • https://monitoring.envirodatagov.org/page/06347058-b727-468b-910f-d0bb1ff7a765/5e895b22-938d-45f7-bb24-2bd7fc0885a7..9c032ddc-6bfe-4d99-a650-5c81a38bdeea
  • https://monitoring.envirodatagov.org/page/0a00eb76-2d5a-49ba-99a3-e532f4306690/e7f4b907-1e9f-48e9-84e2-90660ab4e1c9..74fb35e8-9019-4488-8a24-0401f8476c1d
  • https://monitoring.envirodatagov.org/page/3ce1df73-126a-446c-b17e-2933fa42596b/864a4f4b-ac8b-402b-8241-162942050966..43491156-0a5d-48ce-a0ef-b0ef0f1c2323
  • https://monitoring.envirodatagov.org/page/3d012582-b86c-4d00-85ba-8322a4c6b4d0/56f02206-6f54-46cf-9924-8ac82037f11d..c85ec25e-0c35-404f-849e-f4f51b771217

Mr0grog Aug 30 '18 17:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

stale[bot] Mar 25 '19 16:03

Another example of something that should really be totally ignored: https://monitoring.envirodatagov.org/page/c4328d30-cada-452f-8642-4bff721f5fc2/9a448c37-9285-4107-9ffd-ea72214561a4..a8fab661-07bb-4409-92f7-f73deadf4e29 (change to class attribute)

Mr0grog Mar 27 '19 01:03