Cytomine-core
Optimize project activity storage
Currently the endpoint api/project.json with the parameter withLastActivity needs to perform a join with a GROUP BY on the command_history table. On Cytomine instances with large data sets and a long command history, this join becomes very slow.
However, this request is used very frequently in the web UI to list projects by last activity (the default behavior). This PR proposes another solution: an intermediary table that stores the last activity of each project. It is much faster than before.
The drawback of this solution is that it uses a SQL trigger on command_history, which probably has a negative effect on all add/edit/delete commands, but this has not been measured.
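To make the intermediary-table idea concrete, here is a minimal runnable sketch using SQLite via Python's sqlite3 (the real implementation targets PostgreSQL, whose trigger syntax differs: it needs a plpgsql trigger function plus CREATE TRIGGER). The table and trigger names here are assumptions for illustration, not the PR's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE command_history (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    created TEXT NOT NULL
);
CREATE TABLE last_activity (
    project_id INTEGER PRIMARY KEY,
    last_activity TEXT NOT NULL
);
-- After every insert on command_history, refresh that project's last
-- activity, so reads never have to GROUP BY the full history table.
-- Assumes the history is append-only with increasing 'created' values.
CREATE TRIGGER update_last_activity AFTER INSERT ON command_history
BEGIN
    INSERT OR REPLACE INTO last_activity(project_id, last_activity)
    VALUES (NEW.project_id, NEW.created);
END;
""")

conn.execute("INSERT INTO command_history(project_id, created) VALUES (1, '2023-01-01')")
conn.execute("INSERT INTO command_history(project_id, created) VALUES (1, '2023-06-01')")
row = conn.execute(
    "SELECT last_activity FROM last_activity WHERE project_id = 1").fetchone()
print(row[0])  # '2023-06-01'
```

Listing projects by last activity then becomes a join against a table with one row per project instead of an aggregate over the whole history.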
In any case, we need to find a solution to this performance issue, because the current implementation makes the web UI very unresponsive and unusable.
A possible alternative to study could be the use of a PostgreSQL view? I don't know how views are managed internally and whether they are more efficient than a trigger.
@loic911: I didn't pick up this domain because I didn't like the "PostgreSQL cache of Mongo data" solution... and since I didn't have the performance problem on our server, it was not a priority for me.
I'm pinging you because it could feed into your current thinking about commands, triggers and optimization.
This is not a cache of MongoDB data. The command and command_history tables are stored in the PostgreSQL database.
Yes! Sorry, I mixed up the concepts.
But it's true that the command history could be a good candidate for storage in a NoSQL database (though I don't have the whole command system in mind, so maybe it is not). However, that would then raise the question of how to efficiently join data from the SQL DB with data from the NoSQL DB :/
However, this request is used very frequently in the web UI to list projects by last activity (the default behavior). This PR proposes another solution: an intermediary table that stores the last activity of each project. It is much faster than before.
It seems to be the best solution, but it will indeed probably slow down add/edit/delete.
I see a potential problem here (to be confirmed/tested): the transaction does not lock the last_activity row. This means that if two requests run in parallel on the same project:
- You may have two INSERTs for a single project
- You may have issues with UPDATEs because the two transactions will do the update at the same time.
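One way to rule out both race conditions (not from the PR, just a sketch) is to make the write a single atomic upsert: in PostgreSQL, INSERT ... ON CONFLICT ... DO UPDATE (available since 9.5) cannot produce two rows for the same key, and the row-level lock serializes concurrent updates. A runnable illustration using SQLite via Python's sqlite3, which shares the ON CONFLICT syntax (in PostgreSQL you would use GREATEST instead of SQLite's two-argument MAX):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE last_activity (
    project_id INTEGER PRIMARY KEY,
    last_activity TEXT NOT NULL
)""")

def record_activity(conn, project_id, created):
    # A single atomic upsert: two concurrent calls for the same project can
    # never insert two rows, and the second update waits for the first.
    # Keeping the MAX also makes out-of-order events harmless.
    conn.execute("""
        INSERT INTO last_activity(project_id, last_activity)
        VALUES (?, ?)
        ON CONFLICT(project_id) DO UPDATE
        SET last_activity = MAX(last_activity, excluded.last_activity)
    """, (project_id, created))

record_activity(conn, 1, '2023-01-01')
record_activity(conn, 1, '2023-06-01')  # same project: updates, no duplicate row
record_activity(conn, 1, '2023-03-01')  # out-of-order event: keeps the max
rows = conn.execute(
    "SELECT project_id, last_activity FROM last_activity").fetchall()
print(rows)  # [(1, '2023-06-01')]
```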
A possible way to test this is to run the annotation benchmark written by Ba Thien, as it inserts lots of annotations from multiple threads.
A possible alternative to study could be the use of a PostgreSQL view? I don't know how views are managed internally and whether they are more efficient than a trigger.
I don't think so; I think a view is only a way to "encapsulate" a query. So you would simply replace this:
from += "LEFT OUTER JOIN " + "( SELECT project_id, MAX(created) max_date " + " FROM command_history " + " GROUP BY project_id " + ") activities ON p.id = activities.project_id";
with something like this:
from += "LEFT OUTER JOIN last_activity_view activities ON p.id = activities.project_id";
Not a performance improvement, but a readability/less-redundancy improvement.
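A quick runnable illustration of that point, using SQLite via Python's sqlite3 (last_activity_view is the hypothetical name from the snippet above): the view only wraps the aggregate, so every SELECT on it still re-runs the GROUP BY over command_history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE command_history (
    id INTEGER PRIMARY KEY,
    project_id INTEGER,
    created TEXT
);
-- The view is just a named query: reading it re-executes the same
-- GROUP BY, so it tidies the code but does not speed anything up.
CREATE VIEW last_activity_view AS
    SELECT project_id, MAX(created) AS max_date
    FROM command_history
    GROUP BY project_id;
""")

conn.execute("INSERT INTO project VALUES (1, 'demo')")
conn.executemany(
    "INSERT INTO command_history(project_id, created) VALUES (?, ?)",
    [(1, '2023-01-01'), (1, '2023-06-01')])
rows = conn.execute("""
    SELECT p.id, v.max_date
    FROM project p
    LEFT OUTER JOIN last_activity_view v ON p.id = v.project_id
""").fetchall()
print(rows)  # [(1, '2023-06-01')]
```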
Another possibility is to avoid triggers entirely and to keep in memory a structure that maps each project to its last modification date, syncing it frequently (let's say every minute) to the database. BTW, that's the only solution I see to remove the triggers for the (user/algo/reviewed) annotation counts.
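A minimal sketch of that in-memory approach, assuming a hypothetical LastActivityTracker class (not Cytomine code), again using SQLite via Python's sqlite3 as the flush target; in production the flush would run on a scheduler rather than be called by hand:

```python
import sqlite3
import threading

class LastActivityTracker:
    """Keeps the last activity per project in memory; flushed periodically."""

    def __init__(self, conn):
        self.conn = conn
        self.lock = threading.Lock()
        self.pending = {}  # project_id -> latest activity date seen

    def record(self, project_id, created):
        # Called on every add/edit/delete command: O(1), no SQL at all.
        with self.lock:
            prev = self.pending.get(project_id)
            if prev is None or created > prev:
                self.pending[project_id] = created

    def flush(self):
        # Would run e.g. every minute; swaps the map out under the lock
        # so commands are never blocked by the database write.
        with self.lock:
            batch, self.pending = self.pending, {}
        self.conn.executemany(
            "INSERT OR REPLACE INTO last_activity(project_id, last_activity)"
            " VALUES (?, ?)",
            batch.items())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE last_activity "
    "(project_id INTEGER PRIMARY KEY, last_activity TEXT)")
tracker = LastActivityTracker(conn)
tracker.record(1, '2023-01-01')
tracker.record(1, '2023-06-01')
tracker.flush()
row = conn.execute(
    "SELECT last_activity FROM last_activity WHERE project_id = 1").fetchone()
print(row[0])  # '2023-06-01'
```

The trade-off versus a trigger is that the stored value can lag behind reality by up to one sync interval, and pending updates are lost if the server stops before a flush.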