web-monitoring Incorporate Cluster in the schema

An aspect of the conversation in SF that I didn't fully absorb until chatting with @aleatha is the importance of surfacing a Cluster of related Changes as a concept in the database and in the UI.

The app should request a set of Clusters of Changes for a user to check. The UI should present one representative Change with the option of drilling down into the rest. Here's one way we could adjust the schema. I haven't thought about this very long -- just trying to kick off discussion:

Cluster
    uuid
    priority
    created_at
    updated_at

Then adjust the Change table to add cluster_uuid (starts as NULL, is assigned later by an ETL job) and remove priority, which is a property of a Cluster, not a Change.
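A minimal sketch of that proposal, using Python's built-in sqlite3 purely for illustration. The table names, the column types, the `uuid_from` column, and the `assign_cluster` helper are assumptions, not the project's actual schema:

```python
import sqlite3
import uuid

# Hypothetical schema: a clusters table, plus a nullable cluster_uuid on changes.
SCHEMA = """
CREATE TABLE clusters (
    uuid       TEXT PRIMARY KEY,
    priority   REAL,  -- the numeric type is an assumption
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE changes (
    uuid         TEXT PRIMARY KEY,
    uuid_from    TEXT NOT NULL,
    uuid_to      TEXT NOT NULL,
    cluster_uuid TEXT REFERENCES clusters(uuid)  -- NULL until the ETL job runs
);
"""


def new_id():
    return str(uuid.uuid4())


def assign_cluster(conn, change_ids, priority):
    """Create one cluster and attach the given changes to it (the ETL step)."""
    cluster_id = new_id()
    conn.execute(
        "INSERT INTO clusters (uuid, priority) VALUES (?, ?)",
        (cluster_id, priority),
    )
    conn.executemany(
        "UPDATE changes SET cluster_uuid = ? WHERE uuid = ?",
        [(cluster_id, change_id) for change_id in change_ids],
    )
    return cluster_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Changes start out with no cluster.
change_ids = [new_id() for _ in range(3)]
conn.executemany(
    "INSERT INTO changes (uuid, uuid_from, uuid_to) VALUES (?, ?, ?)",
    [(change_id, new_id(), new_id()) for change_id in change_ids],
)
unassigned = conn.execute(
    "SELECT COUNT(*) FROM changes WHERE cluster_uuid IS NULL"
).fetchone()[0]

# Later, the ETL job groups two of them into a cluster.
cluster_id = assign_cluster(conn, change_ids[:2], priority=0.9)
```

Priority lives only on the cluster row here, so reprioritizing a group is a single update rather than one per Change.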

Where to put Annotation is a sticky question: does it belong to a Cluster or a Change? I think it's analogous to the (famously sticky) problem of regularly-occurring events on a calendar. If an Annotation belongs to a Change, then it's easy to customize individual Annotations when needed but it's hard to safely update the whole set. If Annotation belongs to a Cluster, we'd need to provide some UI for subdividing errant clusters into sub-clusters. My guess is that leaving an Annotation as a property of a Change is the easier place to start.

Mar 12 '17 17:03 danielballan

attn @Mr0grog @lightandluck @ambergman

Mar 12 '17 17:03 danielballan

How are the clusters generated? Prior to any interaction by an analyst?

Mar 14 '17 14:03 dcwalk

Yeah, I'm wondering that too. Is this an ML layer that maybe we don't have at first, but eventually are able to generate? Or is the Versionista hash function (which identifies identical changes as a single "change") already an example of a cluster, which should be treated as such in the DB and the UI?

On 03/14/2017 10:36 AM, dcwalk wrote:

How are the clusters generated? Prior to any interaction by an analyst?

Mar 14 '17 14:03 titaniumbones

Yes, prior to any interaction with an analyst. The Versionista hash function is already an example of a Cluster, which makes me think we can and should build the idea of a Cluster in from the start.

Mar 14 '17 14:03 danielballan

I don’t feel like I understand how a cluster is meaningfully different than a change, at least in how we’ve defined them. We already have the issue mentioned here just in thinking about changes by themselves. If we have versions (for a single page):

Date	uuid
2017-03-01	1
2017-03-02	2
2017-03-03	3
2017-03-04	4

We don’t have 3 changes, we have up to 6:

1 → 2, 1 → 3, 2 → 3, 1 → 4, 2 → 4, 3 → 4

…and we already need to handle the situation where a user is concerned about all of these because we are analyzing changes in 3-day chunks. That is to say, we already have a cluster in the “1 → 4” change above (or the “2 → 4” or the “1 → 3” etc etc etc). Annotations can already be attached to all six of these, and the UI or any other component using the DB has to account for this. (This is kind of why I was saying last week that I was worried disambiguating changes and versions might make the API painful to use—this is a powerful concept, but really hard to create a sane UI around.)
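For illustration, the count of six comes from pairing every version with every later one, so n versions give n(n-1)/2 possible changes. A small Python sketch (the function name is hypothetical):

```python
from itertools import combinations


def possible_changes(version_ids):
    """Return every (from, to) pair where `from` comes earlier in the ordered list."""
    return list(combinations(version_ids, 2))


pairs = possible_changes([1, 2, 3, 4])
# Four versions yield 4 * 3 / 2 = 6 possible changes.
```

The number grows quadratically: 10 versions of a page would already allow 45 distinct changes.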

Anyway, what I’m trying to say is: I think we already answered “does [an annotation] belong to a Cluster or a Change?” because we made those two concepts functionally the same.

More honestly, if I were going to rethink the schema again now, 5 days after the last discussion on it, I’d jam changes and versions back together the way I originally had them. Any annotation on a version would be meant to annotate [previous version] → [annotated version].

There’s no reason the API couldn’t then surface any virtual “cluster/changeset/timeseries/timeperiod/whathaveyou” that is a convenient way to sum up all the annotations across version X → version Y.

I’m also a little worried we are trying to push more concepts into our storage/database than belong.

Mar 14 '17 17:03 Mr0grog

I’m also a little worried we are trying to push more concepts into our storage/database than belong.

That point is well-taken. It's a balance between locking ourselves into assumptions that will be hard to pull apart later vs overbuilding before we fully understand the problem. I'm all for more discussion.

I think we're talking past each other a bit about the Clusters. A Cluster here refers to Changes from different Pages, not multiple Changes to the same Page. For example, it might contain 50 Changes that all had the same new link added to a sidebar. We want to present these as a group.
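A rough sketch of that kind of cross-page grouping, assuming each Change carries some hash of its diff. The `diff_hash` field stands in for whatever the Versionista hash produces, and the `Change` class, URLs, and hash values are all made up for the example:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    uuid: str
    page_url: str
    diff_hash: str  # hypothetical stand-in for a hash of the diff content


def cluster_by_hash(changes):
    """Group Changes from any page that share the same diff hash."""
    clusters = defaultdict(list)
    for change in changes:
        clusters[change.diff_hash].append(change)
    return dict(clusters)


changes = [
    Change("c1", "https://example.gov/a", "sidebar-link"),
    Change("c2", "https://example.gov/b", "sidebar-link"),
    Change("c3", "https://example.gov/c", "new-paragraph"),
]
clusters = cluster_by_hash(changes)

# One representative per cluster for the UI; picking the first is an arbitrary choice.
representatives = {key: members[0] for key, members in clusters.items()}
```

The UI could then show each representative and let the analyst drill down into the remaining members of its cluster.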

If I were going to rethink the schema again now, 5 days after the last discussion on it, I’d jam changes and versions back together the way I originally had them. Any annotation on a version would be meant to annotate [previous version] → [annotated version].

When Versions are coming from multiple sources, I think it might not be unusual for them to arrive into the system out of chronological order, so it's useful to know for sure which Versions a given Diff or Annotation was based on. "The previous one" could change when new data arrives, and then we'd have to manually search backward to figure out where the change occurred. To start, could the Rails app ignore the Change/Version distinction (just always use change.uuid_to) but leave it in the database in case we need it?

Mar 14 '17 18:03 danielballan

I think we're talking past each other a bit about the Clusters. A Cluster here refers to Changes from different Pages, not multiple Changes to the same Page.

Ah, this makes more sense now. So a cluster here is purely for analytic purposes and not meant to have anything to do with how we actually manage tracked pages and their versions? If so, I’ll bow out of this discussion; I would like to maintain web-monitoring-db’s focus on providing that which we are utterly failing at right now: a simple database & API for tracking page changes and saving annotations of them.

I think it might not be unusual for them to arrive into the system out of chronological order, so it's useful to know for sure which Versions a given Diff or Annotation was based on.

Yeah, I definitely understand this technical argument; it’s what we talked about last week and why I agreed about separating the version and change. Now that I have re-written the code for this twice in web-monitoring-db, I am feeling very much like we are adding a lot of complication in order to solve an edge case.

could the Rails app ignore the Change/Version distinction (just always use change.uuid_to) but leave it in the database in case we need it?

I think something along those lines makes sense. A version would have a theoretical 1:1 mapping with a change (i.e. the change representing [previous version] → [annotated version]), though there might be N “invalidated” changes. A change would be invalidated by adding a new version between two existing versions—i.e. if its uuid_from does not match [version identified by uuid_to].previous.uuid.
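As a sketch of that invalidation rule in Python (the data shapes and function names are assumptions; versions are plain `(uuid, capture date)` tuples and a change is a `(uuid_from, uuid_to)` tuple):

```python
from datetime import date


def previous_uuid(versions, uuid_to):
    """Return the uuid of the version captured just before `uuid_to`, or None."""
    ordered = [uuid for uuid, _ in sorted(versions, key=lambda v: v[1])]
    index = ordered.index(uuid_to)
    return ordered[index - 1] if index > 0 else None


def is_invalidated(change, versions):
    """A change is invalidated when uuid_from is no longer the version before uuid_to."""
    uuid_from, uuid_to = change
    return uuid_from != previous_uuid(versions, uuid_to)


versions = [("v1", date(2017, 3, 1)), ("v2", date(2017, 3, 3))]
change = ("v1", "v2")
before = is_invalidated(change, versions)

# A version arrives late and lands between the two existing ones.
versions.append(("v1.5", date(2017, 3, 2)))
after = is_invalidated(change, versions)
```

Here `before` is false and `after` is true: the stored change still points at real versions, but it no longer describes adjacent ones.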

One of the things my second revision of all this (about to push it) does is to only create change records when annotating them. So in the case where we:

Create version 1
Create version 2
Create version 1.5 (between 1 and 2)
Annotate version 2’s change

…we only wind up with one change record (1.5 → 2). That should hopefully aid in reducing noise in this scenario.
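The scenario above can be sketched in Python as follows. This is only an in-memory illustration of creating change records lazily; the `VersionStore` class and its methods are hypothetical and not the web-monitoring-db code:

```python
from datetime import date


class VersionStore:
    """Toy store that creates a change record only when it is annotated."""

    def __init__(self):
        self.versions = {}  # version uuid -> capture date
        self.changes = {}   # (uuid_from, uuid_to) -> list of annotations

    def add_version(self, uuid, captured_at):
        self.versions[uuid] = captured_at

    def _previous(self, uuid):
        ordered = sorted(self.versions, key=self.versions.get)
        index = ordered.index(uuid)
        return ordered[index - 1] if index > 0 else None

    def annotate(self, uuid_to, annotation):
        """Annotate the change from the current previous version to `uuid_to`."""
        uuid_from = self._previous(uuid_to)
        if uuid_from is None:
            raise ValueError("the first version has no previous version to diff against")
        self.changes.setdefault((uuid_from, uuid_to), []).append(annotation)
        return (uuid_from, uuid_to)


store = VersionStore()
store.add_version("1", date(2017, 3, 1))
store.add_version("2", date(2017, 3, 3))
store.add_version("1.5", date(2017, 3, 2))  # arrives late, sits between 1 and 2
key = store.annotate("2", {"significant": True})
```

Because no record was created when versions 1 and 2 were first stored, there is no stale 1 → 2 change to clean up afterwards.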

Mar 14 '17 20:03 Mr0grog

@Mr0grog @danielballan - This is a really great conversation, and it might be worth discussing again sometime soon. If I'm understanding it properly, I think the solution you suggested for change records makes a lot of sense.

Regarding the Clusters, I agree with @danielballan that the Versionista hash function is our first example of a Cluster, and so it does make sense to build the idea of a Cluster in from the start. In particular, in terms of how the end-user will interact with a Cluster, I've suggested an "Identical Diffs View" in my preliminary mockups (on page 6 of the linked slide deck in the issue). What I haven't included, however, is any kind of interface (either for an admin or a regular user), as Dan was suggesting, for sub-dividing errant clusters. I don't think that needs to occur in version 1, but I think this "Identical Diffs View" would be the place to do that, when we do want the functionality.

As a slightly separate point - I know you were saying you didn't want to mess more with the current DB, Rob, and I think that makes total sense. But I do think it might be nice to have another DB set up to actually store the information about Clusters, without actually storing any HTML data. This would just mean storing something like lists of pointers to Changes - meaning the page, two snapshot dates, and their sources OR just the stored locations of two snapshots. I think this will be especially important if we do want to start being able to sub-divide Clusters more manually or make Clusters in a few different ways.

I'm still willing to believe, as you initially suggested Dan, that it makes sense to annotate the Change and not the Cluster, especially since we may end up breaking things up (or, potentially, even having different levels of Clusters at some point). But that may all have to change at some point as well.

Mar 15 '17 06:03 ambergman

@Mr0grog More thoughts on the "Changes" table:

We can maintain all the information I am advocating for without actually having a Changes table. Would it be better to make Diffs and Annotations point to the two versions directly (uuid_to, uuid_from)? This probably saves us a network hit (and schema complexity) at the cost of having to carry around two uuids instead of one. I can't remember if we explicitly considered this possibility. If you like it better, let's do it.

I thought of another reason we might want this info, more compelling than my un-chronological edge case argument: Maybe the GUI wants to look at Diffs spanning a couple of Versions, to help analysts zero in on where a given change occurred. It would be nice to have a way to express that.

Mar 17 '17 20:03 danielballan

Sorry for the slow reply here; I had some last-minute travel come up on Wednesday and just got back yesterday.

I’ve updated https://github.com/edgi-govdata-archiving/web-monitoring-db/pull/15 with bits and pieces of work I did while traveling.

Would it be better to make Diffs and Annotations point to the two versions directly (uuid_to, uuid_from)?

I’m not sure that would be great. I do think having a single thing to attach them all to is helpful. What I wound up doing over on the PR for this is:

The API (and model objects) for versions have shortcuts to get/create annotations. You can ask for:

page/{page id}/versions/{version id}/annotations (or POST a new annotation to it)

…and it will be the same as asking for:

page/{page id}/versions/{version id}/changes/{id of change between version and previous version}/annotations

Similarly, the API response for a version includes a current_annotation field, which is just the current_annotation from the change from the most recent previous version.

There is no diff record. As I’m thinking about this, I’m not sure it makes any sense for us to have one. Insofar as we might be testing different diffs, I’m feeling like it should be up to the UI to choose what to show. If we want to have support for it in the DB API, it can just be a read-through cache to whatever differ is requested.
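To make the read-through cache idea concrete, here is a minimal Python sketch. The `DiffCache` class, the differ registry, and the use of `difflib` as the differ are all assumptions for illustration; nothing here reflects an actual web-monitoring-db API:

```python
import difflib


def unified(text_from, text_to):
    """Example differ: a unified diff of two text blobs, line by line."""
    lines = difflib.unified_diff(
        text_from.splitlines(), text_to.splitlines(), lineterm=""
    )
    return "\n".join(lines)


class DiffCache:
    """Compute a diff on first request, then serve it from memory."""

    def __init__(self, differs):
        self.differs = differs  # differ name -> callable(text_from, text_to)
        self.misses = 0
        self._cache = {}

    def get(self, differ_name, text_from, text_to):
        key = (differ_name, text_from, text_to)
        if key not in self._cache:
            # Cache miss: read through to the requested differ.
            self.misses += 1
            self._cache[key] = self.differs[differ_name](text_from, text_to)
        return self._cache[key]


cache = DiffCache({"unified": unified})
first = cache.get("unified", "home\nabout", "home\nabout\nnew link")
second = cache.get("unified", "home\nabout", "home\nabout\nnew link")
```

The second call returns the stored result without running the differ again, and no diff is ever persisted as a first-class record.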

…should we move any further conversation on this (as opposed to clustering) over to https://github.com/edgi-govdata-archiving/web-monitoring-db/pull/15 ?

Mar 20 '17 19:03 Mr0grog

Sure, we'll continue the conversation there.

Mar 20 '17 19:03 danielballan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Jan 09 '19 22:01 stale[bot]

The changes issues discussed here are all long since done and implemented, but the clustering is important and still relevant. I feel like it’s lacking a clear example use-case in this discussion, so here’s one that we encounter often:

A change was made to the site-wide nav, so hundreds or even thousands of pages have a “change,” even though those changes are all the same and only need to be looked at once, not once for each page. (To get even more real: we just experienced this in an even-more-intense-than-usual way with gov’t shutdown notices.)

Not sure if this deserves a new issue so we can escape from the mess of conversation about “Changes” above. @danielballan?

Jan 10 '19 00:01 Mr0grog
