[RFC] New event-based `history` storage
This is something I've had in mind for the last few days, but it's still incomplete.
Right now the history table is populated with question snapshots. This seems suboptimal:
- We'll probably want to normalize the questions data eventually, e.g. extract
optionsin a separate table (for performance and to unlock the possibility of more complex SQL queries), and it's unclear how to adapt the currenthistoryschema for that historytable is denormalized and includes a lot of duplicate data; also, it grows proportionally to the frequency with which we fetch the sources, and so doesn't play well with plans from #35- Performance will probably suffer too; this might affect #28, though I'm not sure by how much
Alternative: implement an event-based storage which tracks only the changes in fields.
E.g., list of fields for the new table:
pk(serial id)question_idfield(can betitle,description,stars)value(new value)timestamp
Unique index by question_id + field.
This table would be populated only if the field value has changed. If the field hasn't changed from the previous fetch then there's no need to save it again.
This proposal is incomplete:
- it doesn't explain how to track "deep" properties, e.g. if question had a change in one of the option titles, it's unclear what to put in
field - I'm still confused on "we just fetched the new question data with its entire forecasts history from the platform" (because the platform provides the historical data) vs "we fetched the new question data and store its snapshot in our history table" — these are two different scenarios, ideally we need to handle both and abstract it away from the end users
I'll think about this some more before doing any code changes, and I'll wait until I become more familiar with the specifics of different platforms that we support. Just throwing this idea out there to gestate for now.
Also, this is a better approach if we ever get more platforms with realtime capabilities (e.g., with webhooks for every event that happens on the platform), or if we implement pseudo-realtime capabilities ourselves (e.g., "fetch metaculus frontpage ordered by activity, and refetch only the questions which we haven't seen yet").