Cytomine-core icon indicating copy to clipboard operation
Cytomine-core copied to clipboard

Optimize project activity storage

Open urubens opened this issue 2 years ago • 5 comments

Currently the endpoint api/project.json with the parameter withLastActivity needs to make a join with a group by on the command_history table. On Cytomine instances with large data sets, with many history, this join becomes very very slow.

However, this request is very frequently used in the webUI to list the project by last activity (default behavior). This PR proposes a other solution, with an intermediary table to store the last project activity. It is much faster than before.

The drawback of this solution is that it uses a SQL trigger on command_history, which probably has a negative effect on all add/edit/delete commands, but is a not been measured.

In all cases, we need to find a solution to solve this performance issue, because current implementation makes the webUI very unresponsive and unuasable.

A possible alternative maybe to study could be the usage of a PostgreSQL view ? I don't know how this is managed internally and if it's more efficient than a trigger.

urubens avatar May 11 '22 09:05 urubens

@loic911 : I didn't collect this domain because I didn't like the "postgresql cache of mongo data" solution... and as I didn't have the perf problem on our server, it was not a priority to me.

I ping you because it can be an addition to your current reflexion about command, triggers and optimization

geektortoise avatar May 11 '22 12:05 geektortoise

This is not a cache of MongoDB data. The commands & command_history table are stored in PostgreSQL database.

urubens avatar May 11 '22 12:05 urubens

Yes ! Sorry, I mixed the concepts.

geektortoise avatar May 11 '22 13:05 geektortoise

But that's true that command history could be a good candidate to be stored in a nosql database (but I don't have the whole command system in mind, so maybe it is not the case). However, it then would raise the question on how to efficiently join data from SQL DB and data from noSQL DB :/

urubens avatar May 11 '22 13:05 urubens

However, this request is very frequently used in the webUI to list the project by last activity (default behavior). This PR proposes a other solution, with an intermediary table to store the last project activity. It is much faster than before.

It seems to be the best solutions but This will indeed probably slow down add/edit/delete.

I see a potential probblem here (to be confirmed/test): The transaction does not lock the last_activity row. This means that if 2 requests are run in // on the same project:

  • You may have two INSERT for a single project
  • You may have issues with UPDATE because the two transactions will do the update at the same time.

A possible way to test this is to run the annotation benchmark written by Ba Thien as it insert lots of annotations on multiple threads.

A possible alternative maybe to study could be the usage of a PostgreSQL view ? I don't know how this is managed internally and if it's more efficient than a trigger.

I don't think so, I think a view is only a way to "encapsulate" a request. So you will simply replace this from += "LEFT OUTER JOIN " + "( SELECT project_id, MAX(created) max_date " + " FROM command_history " + " GROUP BY project_id " + ") activities ON p.id = activities.project_id

By something like that: from += "LEFT OUTER JOIN last_activity_view ON p.id = activities.project_id Not a performance improvement but a "readability/less redundancy" improvement.

Another possibility is to avoid the use of triggers and to keep in memory a structure that maps project and the last modification date and to sync it frequently (let says every min) in the database. BTW that's the only solutions I see to remove trigger for (user/algo/reviewed)annotations count.

loic911 avatar May 11 '22 13:05 loic911