
updated topic creation dataflow diagram

Open rahulbot opened this issue 4 years ago • 5 comments

As part of the ongoing documentation / support process, I created an updated data flow diagram to chart how data flows when multi-platform topics are created. I think it will be helpful as another resource to provide to our researchers, and when we roll this out more broadly.

Can you take a look at the attached and let me know if you see any errors or major omissions?

MC Topic Creation Dataflow.pdf

rahulbot avatar Jun 10 '20 18:06 rahulbot

This is a great start, Rahul. At a glance, it looked great, but I think it is actually missing a lot of what the topic system does. Apologies if I'm being overly critical.

Comments:

  • We are pulling from google web search, not google news.
  • We should include the csv import as well as a source.
  • The spidering process as described is missing the relevancy pattern matching, so it looks like we are just importing all urls into the topic (as issue crawler does).
  • The spidering process is not distinct from the html extraction. We actually match for relevancy against the raw html (and throw away anything that doesn't match), then extract content from the html and create an actual story in our database, then check for relevancy again and only add to the topic if the extracted content matches. We do this for performance optimization (it's much cheaper to do an html regex match than to do the html content extraction). The most important thing to make clear to the reader is that we are ultimately doing relevancy matching against the extracted content before adding a story to the topic.
  • We don't ever deduplicate based on content. We only dedup (or match, depending on how you see it) based on normalized urls and normalized titles.
  • This doesn't capture: a) the fact that we are storing hyperlinks and processing them into graph data and metrics, b) network map generation, c) subtopic generation, d) timespan based analysis, e) frozen snapshot generation, f) generation of url sharing subtopics, g) generation of url sharing metrics within the larger topic, h) date guessing.
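
To make the relevancy and dedup points above concrete, here is a rough sketch of the flow: a cheap regex match on the raw html, then content extraction, then a second match on the extracted content, then matching on normalized url and normalized title. This is only an illustration, not the actual topic mapper code. Every name in it (`extract`, `normalize_url`, `normalize_title`, `add_to_topic`, the `topic` dict) is made up for the example, the normalization rules are simplified guesses, and the extractor is a toy stand-in for the real html content extraction.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _TextExtractor(HTMLParser):
    """Toy stand-in for real content extraction: keeps the title and text nodes."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.chunks.append(data)


def extract(html):
    """Return (title, text) for an html document (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), " ".join(parser.chunks)


def normalize_url(url):
    """Assumed normalization: drop scheme, leading 'www.', query, trailing slash."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return host + parts.path.rstrip("/")


def normalize_title(title):
    """Assumed normalization: lowercase and collapse punctuation/whitespace."""
    return re.sub(r"\W+", " ", title.lower()).strip()


def add_to_topic(topic, url, html, pattern):
    """Return True if the page was added to the topic as a new story."""
    # Stage 1: cheap regex match on the raw html; discard non-matches early.
    if not pattern.search(html):
        return False
    # The more expensive extraction only runs for pages that passed stage 1.
    title, text = extract(html)
    # Stage 2: the match that actually decides relevance uses extracted content.
    if not pattern.search(text):
        return False
    # Dedup on normalized url and normalized title only, never on content.
    keys = {("url", normalize_url(url))}
    if normalize_title(title):
        keys.add(("title", normalize_title(title)))
    if keys & topic["seen"]:
        return False
    topic["seen"] |= keys
    topic["stories"].append({"url": url, "title": title})
    return True
```

With a pattern such as `re.compile(r"vaccin", re.I)`, a page that only mentions the term in a link attribute passes the raw html check but is dropped after extraction, which is the behavior the diagram should make clear to the reader.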

hroberts avatar Jun 10 '20 20:06 hroberts

Yeah, I know. I was trying not to get into too much detail, but also include a bunch. So I think I've ended up with a rather arbitrary list of the things included vs. excluded. For instance, I intentionally didn't include any of the snapshotting process (subtopics, timespans) because I thought that wouldn't be that useful to know from a metadata/data gathering perspective. Lemme take another pass at including some of those steps and corrections that are pre-snapshotting. I'm still not sure how much detail is useful to a researcher trying to understand how the data is gathered and filtered (vs. to a developer). There's so much going on in the system that it'll take a few rounds to catch it all on a single diagram I think 👍🏽

rahulbot avatar Jun 10 '20 23:06 rahulbot

I took another pass at adding in more of the features of the full topic mapper engine. Give this one a look over for errors/omissions. MC Topic Creation Dataflow-2.pdf

rahulbot avatar Jun 12 '20 19:06 rahulbot

This looks great. Very impressive work, rahul!

hroberts avatar Jun 17 '20 18:06 hroberts

Thx 👍🏽 I'm gonna share it with the rest of the Civic MC team to get feedback. After another rev or two it'll be ready to add into the repo somewhere.

rahulbot avatar Jun 17 '20 18:06 rahulbot