Cytomine-core Optimize project activity storage

Optimize project activity storage

Open urubens opened this issue 2 years ago • 5 comments

Currently the endpoint api/project.json with the parameter withLastActivity needs to make a join with a group by on the command_history table. On Cytomine instances with large data sets, with many history, this join becomes very very slow.

However, this request is very frequently used in the webUI to list the project by last activity (default behavior). This PR proposes a other solution, with an intermediary table to store the last project activity. It is much faster than before.

The drawback of this solution is that it uses a SQL trigger on command_history, which probably has a negative effect on all add/edit/delete commands, but is a not been measured.

In all cases, we need to find a solution to solve this performance issue, because current implementation makes the webUI very unresponsive and unuasable.

A possible alternative maybe to study could be the usage of a PostgreSQL view ? I don't know how this is managed internally and if it's more efficient than a trigger.

May 11 '22 09:05 urubens

@loic911 : I didn't collect this domain because I didn't like the "postgresql cache of mongo data" solution... and as I didn't have the perf problem on our server, it was not a priority to me.

I ping you because it can be an addition to your current reflexion about command, triggers and optimization

May 11 '22 12:05 geektortoise

This is not a cache of MongoDB data. The commands & command_history table are stored in PostgreSQL database.

May 11 '22 12:05 urubens

Yes ! Sorry, I mixed the concepts.

May 11 '22 13:05 geektortoise

But that's true that command history could be a good candidate to be stored in a nosql database (but I don't have the whole command system in mind, so maybe it is not the case). However, it then would raise the question on how to efficiently join data from SQL DB and data from noSQL DB :/

May 11 '22 13:05 urubens

However, this request is very frequently used in the webUI to list the project by last activity (default behavior). This PR proposes a other solution, with an intermediary table to store the last project activity. It is much faster than before.

It seems to be the best solutions but This will indeed probably slow down add/edit/delete.

I see a potential probblem here (to be confirmed/test): The transaction does not lock the last_activity row. This means that if 2 requests are run in // on the same project:

You may have two INSERT for a single project
You may have issues with UPDATE because the two transactions will do the update at the same time.

A possible way to test this is to run the annotation benchmark written by Ba Thien as it insert lots of annotations on multiple threads.

A possible alternative maybe to study could be the usage of a PostgreSQL view ? I don't know how this is managed internally and if it's more efficient than a trigger.

I don't think so, I think a view is only a way to "encapsulate" a request. So you will simply replace this from += "LEFT OUTER JOIN " + "( SELECT project_id, MAX(created) max_date " + " FROM command_history " + " GROUP BY project_id " + ") activities ON p.id = activities.project_id

By something like that: from += "LEFT OUTER JOIN last_activity_view ON p.id = activities.project_id Not a performance improvement but a "readability/less redundancy" improvement.

Another possibility is to avoid the use of triggers and to keep in memory a structure that maps project and the last modification date and to sync it frequently (let says every min) in the database. BTW that's the only solutions I see to remove trigger for (user/algo/reviewed)annotations count.

May 11 '22 13:05 loic911

Cytomine-core Cytomine-core copied to clipboard

Optimize project activity storage

Cytomine-core
Cytomine-core copied to clipboard