metaforecast icon indicating copy to clipboard operation
metaforecast copied to clipboard

[RFC] New event-based `history` storage

Open berekuk opened this issue 3 years ago • 1 comments

This is something I've had in mind for the last few days, but it's still incomplete.

Right now the history table is populated with question snapshots. This seems suboptimal:

  • We'll probably want to normalize the questions data eventually, e.g. extract options in a separate table (for performance and to unlock the possibility of more complex SQL queries), and it's unclear how to adapt the current history schema for that
  • history table is denormalized and includes a lot of duplicate data; also, it grows proportionally to the frequency with which we fetch the sources, and so doesn't play well with plans from #35
  • Performance will probably suffer too; this might affect #28, though I'm not sure by how much

Alternative: implement an event-based storage which tracks only the changes in fields.

E.g., list of fields for the new table:

  • pk (serial id)
  • question_id
  • field (can be title, description, stars)
  • value (new value)
  • timestamp

Unique index by question_id + field.

This table would be populated only if the field value has changed. If the field hasn't changed from the previous fetch then there's no need to save it again.

This proposal is incomplete:

  • it doesn't explain how to track "deep" properties, e.g. if question had a change in one of the option titles, it's unclear what to put in field
  • I'm still confused on "we just fetched the new question data with its entire forecasts history from the platform" (because the platform provides the historical data) vs "we fetched the new question data and store its snapshot in our history table" — these are two different scenarios, ideally we need to handle both and abstract it away from the end users

I'll think about this some more before doing any code changes, and I'll wait until I become more familiar with the specifics of different platforms that we support. Just throwing this idea out there to gestate for now.

berekuk avatar Apr 07 '22 21:04 berekuk

Also, this is a better approach if we ever get more platforms with realtime capabilities (e.g., with webhooks for every event that happens on the platform), or if we implement pseudo-realtime capabilities ourselves (e.g., "fetch metaculus frontpage ordered by activity, and refetch only the questions which we haven't seen yet").

berekuk avatar Apr 07 '22 21:04 berekuk