
Implement a global database where every question is stored upon being fetched

Open Malabarba opened this issue 9 years ago • 1 comments

This is something I've been chewing on for a while. It's related to all the cache stuff discussed here as well, but it's not a direct consequence of that. In particular, this data will be large, so I think it should not be saved to cache.

The issue

  1. When we display the frontpage, viewing a question is very fast because all the question data is stored by the question list, so there's no need for another request. OTOH, when we view a question from the inbox or from a link, we always perform a new API request, even though we may already have that question data stored somewhere (in an open buffer or in one of the list tabs).
  2. When the user does a write operation (commenting, answering, editing, voting, etc), we alter the properties of the question object (stored in sx-question-mode--data) and then refresh the buffer. This works fine as a way to immediately display what has changed, and it even updates the value stored in the question list (because they are the same list object). However, this question's data could be stored redundantly in many different places (if the user has multiple tabs open, or if the question was viewed from the inbox or from a link instead of from a tab), and these won't be updated.

The solution

A viable solution is to have a database like:

((site . #[hashmap (question_id question-object)
                   (question_id question-object)
                   (question_id question-object)])
 (site . #[hashmap (question_id question-object)
                   (question_id question-object)
                   (question_id question-object)])
 (site . #[hashmap (question_id question-object)
                   (question_id question-object)
                   (question_id question-object)]))

(maybe we can just use a vector instead of hashmap, since the key is an integer, but I worry this vector would have to be HUGE for a site like SO).
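
For concreteness, here is a minimal sketch of how such a database might look in Emacs Lisp. It assumes each question object is an alist with a `question_id` entry, and it uses a hash table for the outer site map as well (rather than an alist) purely for brevity. All `my-sx-db-` names are hypothetical and not part of sx.el.

```elisp
;; Hypothetical sketch: none of these `my-sx-db-' names exist in sx.el.

(defvar my-sx-db (make-hash-table :test #'equal)
  "Map from site name (a string) to that site's question table.
Each inner table maps an integer question_id to a question object.")

(defun my-sx-db-site-table (site)
  "Return the question table for SITE, creating it if needed."
  (or (gethash site my-sx-db)
      (puthash site (make-hash-table :test #'eql) my-sx-db)))

(defun my-sx-db-store (site question)
  "Store QUESTION under SITE, keyed by its `question_id'.
QUESTION is assumed to be an alist.  Return QUESTION."
  (puthash (cdr (assq 'question_id question))
           question
           (my-sx-db-site-table site)))

(defun my-sx-db-lookup (site question-id)
  "Return the stored question QUESTION-ID for SITE, or nil if absent."
  (let ((table (gethash site my-sx-db)))
    (and table (gethash question-id table))))
```

Because the lookup returns the stored object itself and not a copy, any code that alters its properties would be altering the one shared instance, which is the property the rest of this proposal relies on.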

  1. This structure is populated whenever sx-question-get-questions or sx-question-get-question are used.
  2. The latter (sx-question-get-question) is adapted to take another argument, an integer DATE, and only performs an API request if the requested question_id is not present in the database, or if the version contained is older than DATE.
  3. sx-display is simplified to not care about the full data under point (title, body_markdown, etc). Instead, it only checks the site, question_id, and last_edit values under point, and uses sx-question-get-question to look these up in the database (and fetch if necessary, but it won't be necessary if we're inside a tab).
  4. Similarly, sx-question-list--print-info also shouldn't use the full data under point. Instead, it only checks the question_id and looks it up in the database. Since this will be done approx 100 times per page, and always for the same site, it might be good if sx-question-list-mode stores the site's hashmap (or vector) in a local variable for faster lookup.
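
Items 1 and 2 could be sketched roughly as follows. This is only an illustration under a few assumptions: question objects are alists, the age of a stored question is read from a `last_edit_date` entry (falling back to `creation_date`), and the actual API request is passed in as a function argument. The `my-sx-` names are hypothetical and do not reflect the real signature of sx-question-get-question.

```elisp
;; Hypothetical sketch: the `my-sx-' names are not part of sx.el.

(defvar my-sx-db (make-hash-table :test #'equal)
  "Site name -> hash table of question_id -> question alist.")

(defun my-sx-question-date (question)
  "Return the integer time QUESTION was last changed.
Assumes QUESTION carries `last_edit_date' or `creation_date';
returns 0 when neither is present."
  (or (cdr (assq 'last_edit_date question))
      (cdr (assq 'creation_date question))
      0))

(defun my-sx-get-question (site question-id date fetch-function)
  "Return question QUESTION-ID of SITE, fetching only when needed.
Use the stored copy unless it is missing or older than DATE (an
integer timestamp, or nil to accept any stored copy).  Otherwise
call FETCH-FUNCTION with SITE and QUESTION-ID and store the result."
  (let* ((table (or (gethash site my-sx-db)
                    (puthash site (make-hash-table :test #'eql) my-sx-db)))
         (stored (gethash question-id table)))
    (if (and stored
             (or (null date)
                 (>= (my-sx-question-date stored) date)))
        stored
      ;; Missing or stale: fetch, store, and return the fresh object.
      (puthash question-id
               (funcall fetch-function site question-id)
               table))))
```

With something like this, a caller inside a tab would pass the last_edit value under point as DATE and would normally get the stored object back without any request.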

All of the above should ensure we only ever use one copy of each object. This will cut down on unnecessary API queries and keep multiple buffers in sync.

Malabarba avatar Jan 18 '15 14:01 Malabarba

I toyed with this idea in the very first prototypes of SX (the git repo I sent you a while back should have a history of that), but here's the gist: This data structure would get very large as time goes on. However, with what I know about elisp these days, this shouldn't be too big an issue. (We can invalidate the data structure every day at some customizable time, for instance.) I have a few random thoughts that I'd like to get down before moving help arrives:


Storing it as a vector isn't a good idea. If you run a search, for instance, you could retrieve wildly differing numbers as vector indices. Even if we were smart about it and started with 0 = lowest-index, we could still easily have a difference of >10000 for a specific search on the very first page of results. I'd conjecture the 'worst case' for a single day's use would be about 1% of SO's question database, but that's still nearly three hundred thousand questions. Keeping it as a hashmap just makes more sense to me. (Again, this is one of the structures I didn't know elisp had a year ago.)
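
To illustrate the size argument, here is a small sketch (the function name is hypothetical) that compares how many entries a hash table holds for a set of question ids with how many slots a vector indexed from the lowest id would need:

```elisp
;; Hypothetical helper, only to illustrate the sparse-key argument.
(defun my-sx-storage-cost (ids)
  "Return (HASH-ENTRIES VECTOR-SLOTS) for the question IDS.
HASH-ENTRIES is the number of entries a hash table keyed by id
holds.  VECTOR-SLOTS is the length a vector would need if index
0 were the lowest id in IDS."
  (let ((table (make-hash-table :test #'eql)))
    (dolist (id ids)
      (puthash id (list (cons 'question_id id)) table))
    (list (hash-table-count table)
          (1+ (- (apply #'max ids) (apply #'min ids))))))

(my-sx-storage-cost '(1000 11500 4200))
;; → (3 10501)
```

The hash table only grows with the number of questions actually stored, while the vector grows with the spread of their ids.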


I've thought about separating the data from the getters/setters in the caching system, so this idea could easily be integrated into that paradigm. Having different 'caches' would just amount to different accessors of the underlying data. (Obviously, the terminology would need to be changed, since the proper 'cache' is the data itself.)


All of your ideas are good ones. The ability to just return the key information needed to retrieve the data from the database is a great bonus to this idea.


Hopefully I'll have more time for further input next week, but you can expect the caching branch to incorporate a framework for these ideas.

vermiculus avatar Jan 18 '15 15:01 vermiculus