Do file resource processing via background job
Fixes #471
I did some Gatling testing to check whether running FileResource processing in the background makes a difference in terms of performance and request throughput. Please note that these tests were done on my local development environment: the Open VSX server and Gatling were running on the same machine (which can cause interference). To confirm these findings, it would be a good idea to test against the staging server.
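The exact simulation isn't included here; as a rough sketch of the load pattern (Gatling Java DSL, with a hypothetical token and sample .vsix file, not the actual test behind the numbers below), it boils down to hammering the publish endpoint with a fixed number of concurrent users:

```java
// Rough sketch of the load pattern only — not the actual test used for these results.
// Assumptions: server on localhost:8080, publish endpoint POST /api/-/publish?token=...,
// and a sample package "extension.vsix" next to the simulation.
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import java.time.Duration;
import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class PublishExtensionSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    ScenarioBuilder publishExtension = scenario("RegistryAPI: Publish Extension")
            .exec(http("publishExtension")
                    .post("/api/-/publish")
                    .queryParam("token", "my_access_token") // hypothetical token
                    .header("Content-Type", "application/octet-stream")
                    .body(RawFileBody("extension.vsix"))
                    .check(status().is(201)));

    {
        // 1 concurrent publisher for this run; the later runs raise this to 2 and 4.
        setUp(publishExtension.injectClosed(
                constantConcurrentUsers(1).during(Duration.ofMinutes(10))))
                .protocols(httpProtocol);
    }
}
```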
Below are the results when only 1 extension is published at a time. The current (OLD) publishing process performs a bit better in this case.
== OLD =========================================================================
---- Global Information --------------------------------------------------------
> request count 1137 (OK=661 KO=476 )
> min response time 3 (OK=23 KO=3 )
> max response time 10549 (OK=10549 KO=10029 )
> mean response time 208 (OK=323 KO=49 )
> std deviation 664 (OK=753 KO=470 )
> response time 50th percentile 40 (OK=75 KO=8 )
> response time 75th percentile 99 (OK=267 KO=15 )
> response time 95th percentile 970 (OK=1375 KO=51 )
> response time 99th percentile 3146 (OK=3529 KO=779 )
> mean requests/sec 4.075 (OK=2.369 KO=1.706 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms 593 ( 52%)
> 800 ms < t < 1200 ms 25 ( 2%)
> t > 1200 ms 43 ( 4%)
> failed 476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400 476 (100,0%)
================================================================================
== NEW =========================================================================
---- Global Information --------------------------------------------------------
> request count 1137 (OK=661 KO=476 )
> min response time 3 (OK=16 KO=3 )
> max response time 10057 (OK=8724 KO=10057 )
> mean response time 234 (OK=353 KO=68 )
> std deviation 790 (OK=917 KO=527 )
> response time 50th percentile 33 (OK=64 KO=10 )
> response time 75th percentile 90 (OK=180 KO=19 )
> response time 95th percentile 1074 (OK=1882 KO=72 )
> response time 99th percentile 3958 (OK=4885 KO=1526 )
> mean requests/sec 3.598 (OK=2.092 KO=1.506 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms 588 ( 52%)
> 800 ms < t < 1200 ms 27 ( 2%)
> t > 1200 ms 46 ( 4%)
> failed 476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400 476 (100,0%)
================================================================================
However, when 2 or more extensions are published at a time, the current (OLD) publishing process starts to throw exceptions, and I aborted the test because it was taking too long to complete. With ovsx now supporting target platforms, a single ./ovsx publish command can perform up to 10 requests (one for each target platform) concurrently. This means the server has to be able to reliably handle multiple publish operations at the same time.
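To illustrate that request pattern (the real ovsx client is a Node.js CLI; this is only a Java sketch against a hypothetical local server and token):

```java
// Illustration of the client-side request pattern only — ovsx itself is a Node.js CLI.
// Assumptions: local server, hypothetical token, one .vsix package per target platform.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ConcurrentPublish {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        List<Path> packages = List.of(
                Path.of("example-linux-x64.vsix"),
                Path.of("example-win32-x64.vsix"),
                Path.of("example-darwin-arm64.vsix"));

        List<CompletableFuture<HttpResponse<String>>> responses = new ArrayList<>();
        for (Path vsix : packages) {
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("http://localhost:8080/api/-/publish?token=my_access_token"))
                    .header("Content-Type", "application/octet-stream")
                    .POST(HttpRequest.BodyPublishers.ofFile(vsix))
                    .build();
            // All publish requests are in flight at the same time, one per target platform.
            responses.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString()));
        }
        for (CompletableFuture<HttpResponse<String>> response : responses) {
            System.out.println(response.join().statusCode()); // 201 Created on success
        }
    }
}
```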
== OLD =========================================================================
2022-07-18 22:27:34 955s elapsed
---- Requests ------------------------------------------------------------------
> Global (OK=363 KO=278 )
> publishExtension (OK=363 KO=278 )
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400 270 (97,12%)
> i.g.h.c.i.RequestTimeoutException: Request timeout to localhos 8 ( 2,88%)
t/127.0.0.1:8080 after 180000 ms
---- RegistryAPI: Publish Extension --------------------------------------------
[--------------------------------------------------------------------------] 0%
waiting: 0 / active: 2 / done: 0
================================================================================
2022-07-18 22:26:08.220 WARN 55744 --- [ Timer-0] o.j.storage.AbstractStorageProvider : Error notifying JobStorageChangeListeners
org.jobrunr.storage.StorageException: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30007ms.
at org.jobrunr.storage.sql.common.DefaultSqlStorageProvider.getJobStats(DefaultSqlStorageProvider.java:346) ~[jobrunr-5.1.2.jar:5.1.2]
at org.jobrunr.storage.AbstractStorageProvider.notifyJobStatsOnChangeListeners(AbstractStorageProvider.java:73) ~[jobrunr-5.1.2.jar:5.1.2]
at org.jobrunr.storage.AbstractStorageProvider$NotifyOnChangeListeners.run(AbstractStorageProvider.java:177) ~[jobrunr-5.1.2.jar:5.1.2]
at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[na:na]
at java.base/java.util.TimerThread.run(Timer.java:506) ~[na:na]
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30007ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:695) ~[HikariCP-3.4.5.jar:na]
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197) ~[HikariCP-3.4.5.jar:na]
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162) ~[HikariCP-3.4.5.jar:na]
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) ~[HikariCP-3.4.5.jar:na]
at org.jobrunr.storage.sql.common.DefaultSqlStorageProvider.getJobStats(DefaultSqlStorageProvider.java:343) ~[jobrunr-5.1.2.jar:5.1.2]
... 4 common frames omitted
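For context on the HikariCP error above: it means every connection in the pool stayed checked out for the full 30-second connection timeout, so JobRunr's stats poller could not obtain one. A hedged sketch of the settings involved (plain HikariCP API with illustrative values; the server would normally configure these via Spring Boot properties, and this PR does not change them):

```java
// Illustrative values only — not the server's actual configuration.
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSizing {
    public static HikariDataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/openvsx"); // hypothetical URL
        config.setMaximumPoolSize(10);       // HikariCP default; web requests and JobRunr
                                             // background workers all share this pool
        config.setConnectionTimeout(30_000); // the ~30s after which the error above is thrown
        return new HikariDataSource(config);
    }
}
```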
== NEW =========================================================================
---- Global Information --------------------------------------------------------
> request count 1136 (OK=660 KO=476 )
> min response time 4 (OK=20 KO=4 )
> max response time 15753 (OK=15753 KO=3597 )
> mean response time 348 (OK=552 KO=65 )
> std deviation 1145 (OK=1450 KO=272 )
> response time 50th percentile 51 (OK=108 KO=16 )
> response time 75th percentile 159 (OK=343 KO=32 )
> response time 95th percentile 1606 (OK=2577 KO=158 )
> response time 99th percentile 5948 (OK=8473 KO=1622 )
> mean requests/sec 4.876 (OK=2.833 KO=2.043 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms 572 ( 50%)
> 800 ms < t < 1200 ms 19 ( 2%)
> t > 1200 ms 69 ( 6%)
> failed 476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400 476 (100,0%)
================================================================================
And it gets even better when publishing 4 extensions at a time:
================================================================================
---- Global Information --------------------------------------------------------
> request count 1136 (OK=660 KO=476 )
> min response time 4 (OK=18 KO=4 )
> max response time 14156 (OK=14156 KO=3215 )
> mean response time 463 (OK=736 KO=83 )
> std deviation 1427 (OK=1803 KO=322 )
> response time 50th percentile 62 (OK=140 KO=17 )
> response time 75th percentile 238 (OK=521 KO=33 )
> response time 95th percentile 2042 (OK=3613 KO=230 )
> response time 99th percentile 8461 (OK=10606 KO=2158 )
> mean requests/sec 6.927 (OK=4.024 KO=2.902 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms 536 ( 47%)
> 800 ms < t < 1200 ms 43 ( 4%)
> t > 1200 ms 81 ( 7%)
> failed 476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400 476 (100,0%)
================================================================================
@amvanbaren would this PR help with the errors we've had while trying to deploy new openvsx to prod?
@jeanp413 No, not directly, but it can improve the overall stability of the server. It does help to make the time it takes to publish an extension version less variable, i.e. large extension packages take about the same time to be published as small extension packages. This is aimed at fixing #471.
I've been working on the deployment issues at https://github.com/amvanbaren/openvsx/tree/bugfix/extension-versions-transactional. I'd like to test these improvements on staging to confirm that they will improve performance in production. You can follow progress at https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/1483
@amvanbaren Is the purpose of this PR to completely fix HikariPool-1 - Connection is not available, request timed out after 30000ms. when too many extensions are published, or just to improve the current behavior? If the latter, then I'll approve :slightly_smiling_face:
@akosyakov should we cherry-pick this to our fork?
@jeanp413
Is the purpose of this PR to completely fix HikariPool-1 - Connection is not available, request timed out after 30000ms.
No, not completely. I've only tested publishing up to 4 extensions at the same time, not 30.
The question is also how this will run on staging or production.
@akosyakov should we cherry-pick this to our fork?
Not yet; our fork is for learning, not for resolving anything else. It is good to know that we have an option to optimise data sync in the future.
@amvanbaren We observed the connection is not available error while testing the mirror mode. Can we land this PR to try? How can we ensure that syncing all extensions in parallel does not degrade the user experience? We need to make sure that user requests have higher priority and don't break.
Actually @iQQbot shared that there was only one job.
We need to make sure that user requests have higher priority and don't break.
That's a bit too high level. Can you further explain what you mean by this?
Actually @iQQBot shared that there was only one job.
There's one type of job that extracts all file resources and uploads them to cloud storage. At runtime there can be multiple jobs queued up to extract the file resources. Especially when the server is running in mirror mode, you can expect a couple hundred jobs to be queued up.
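A rough sketch of that job model (class and method names are hypothetical, not the actual ones in this PR): one JobRunr job per extension version extracts the file resources and uploads them to storage, so a mirror sync simply enqueues one such job per version:

```java
// Hypothetical sketch of the job model described above; names are illustrative,
// not the actual classes introduced by this PR.
import java.util.List;
import org.jobrunr.scheduling.JobScheduler;
import org.springframework.stereotype.Component;

@Component
public class ExtractResourcesJobQueue {

    // Hypothetical service, shown only so the sketch is self-contained.
    public interface FileResourceService {
        void extractAndUploadResources(long extVersionId);
    }

    private final JobScheduler jobScheduler;
    private final FileResourceService fileResourceService;

    public ExtractResourcesJobQueue(JobScheduler jobScheduler, FileResourceService fileResourceService) {
        this.jobScheduler = jobScheduler;
        this.fileResourceService = fileResourceService;
    }

    // One background job per extension version: extract the file resources from the
    // package and upload them to cloud storage, outside of the publish request thread.
    public void enqueue(long extVersionId) {
        jobScheduler.enqueue(() -> fileResourceService.extractAndUploadResources(extVersionId));
    }

    // In mirror mode a sync run may call this for hundreds of versions at once,
    // which is where the queued-up jobs mentioned above come from.
    public void enqueueAll(List<Long> extVersionIds) {
        extVersionIds.forEach(this::enqueue);
    }
}
```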
That's a bit too high level. Can you further explain what you mean by this?
I synced yesterday with Pudong about it. Nobody was using our SH installation, but we had 4 instances running with one sync job. One instance got stuck for some reason. I would hit it while using the web UI and it always hung on connections to the DB; connections were timing out. Pudong looked at the DB and there were no connections or requests at all. So we restarted the instance and it was alright afterwards. It got us thinking that maybe there is some kind of connection leak in the server or something like that. cc @iQQbot correct me if I misunderstood something.
Ok, and you think this PR might be able to resolve this connection leak?
Can we land this PR to try?
Yes, let's try it out.
Ok, and you think this PR might be able to resolve this connection leak?
@amvanbaren Sorry, in the original comment I was thinking it was a load problem from too many syncs, but it does indeed seem to be something else. I've seen that you added the missing transactional annotations PR. I will rebase our branch and try your latest changes.