
Do file resource processing via background job

amvanbaren opened this pull request 2 years ago • 4 comments

Fixes #471

I did some Gatling testing to check if running FileResource processing in the background makes a difference in terms of performance and request throughput. Please note that these tests were done on my local development environment. The Open VSX server and Gatling were running on the same machine (which can cause interference). To confirm these findings it would be a good idea to test against the staging server.
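For reference, a publish scenario along these lines can be expressed with the Gatling Java DSL. The sketch below is illustrative only: the token, the package file and the injection profile are placeholders, not the exact simulation that produced the numbers below.

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

// Illustrative sketch only; not the simulation used for the results in this PR.
public class PublishExtensionSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    ScenarioBuilder publishExtension = scenario("publishExtension")
        .exec(http("publish")
            .post("/api/-/publish")
            .queryParam("token", "<personal-access-token>") // placeholder access token
            .body(RawFileBody("packages/example.vsix"))     // placeholder .vsix package
            .check(status().is(201)));

    public PublishExtensionSimulation() {
        // Placeholder injection profile: a steady stream of publish requests.
        setUp(publishExtension.injectOpen(constantUsersPerSec(2).during(300)))
            .protocols(httpProtocol);
    }
}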

Below are the results when only 1 extension is published at a time. The current (OLD) publishing process performs a bit better in this case.

== OLD =========================================================================
---- Global Information --------------------------------------------------------
> request count                                       1137 (OK=661    KO=476   )
> min response time                                      3 (OK=23     KO=3     )
> max response time                                  10549 (OK=10549  KO=10029 )
> mean response time                                   208 (OK=323    KO=49    )
> std deviation                                        664 (OK=753    KO=470   )
> response time 50th percentile                         40 (OK=75     KO=8     )
> response time 75th percentile                         99 (OK=267    KO=15    )
> response time 95th percentile                        970 (OK=1375   KO=51    )
> response time 99th percentile                       3146 (OK=3529   KO=779   )
> mean requests/sec                                  4.075 (OK=2.369  KO=1.706 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                           593 ( 52%)
> 800 ms < t < 1200 ms                                  25 (  2%)
> t > 1200 ms                                           43 (  4%)
> failed                                               476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400                       476 (100,0%)
================================================================================
== NEW =========================================================================
---- Global Information --------------------------------------------------------
> request count                                       1137 (OK=661    KO=476   )
> min response time                                      3 (OK=16     KO=3     )
> max response time                                  10057 (OK=8724   KO=10057 )
> mean response time                                   234 (OK=353    KO=68    )
> std deviation                                        790 (OK=917    KO=527   )
> response time 50th percentile                         33 (OK=64     KO=10    )
> response time 75th percentile                         90 (OK=180    KO=19    )
> response time 95th percentile                       1074 (OK=1882   KO=72    )
> response time 99th percentile                       3958 (OK=4885   KO=1526  )
> mean requests/sec                                  3.598 (OK=2.092  KO=1.506 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                           588 ( 52%)
> 800 ms < t < 1200 ms                                  27 (  2%)
> t > 1200 ms                                           46 (  4%)
> failed                                               476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400                       476 (100,0%)
================================================================================

However, when 2 or more extensions are published at a time, the current (OLD) publishing process starts to throw exceptions, and I aborted the test because it took too long. Now that ovsx supports target platforms, a single ./ovsx publish command can perform up to 10 requests (one per target platform) concurrently. This means the server has to be able to reliably handle multiple publish operations at the same time.

== OLD =========================================================================
2022-07-18 22:27:34                                         955s elapsed
---- Requests ------------------------------------------------------------------
> Global                                                   (OK=363    KO=278   )
> publishExtension                                         (OK=363    KO=278   )
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400                       270 (97,12%)
> i.g.h.c.i.RequestTimeoutException: Request timeout to localhos      8 ( 2,88%)
t/127.0.0.1:8080 after 180000 ms

---- RegistryAPI: Publish Extension --------------------------------------------
[--------------------------------------------------------------------------]  0%
          waiting: 0      / active: 2      / done: 0
================================================================================
2022-07-18 22:26:08.220  WARN 55744 --- [        Timer-0] o.j.storage.AbstractStorageProvider      : Error notifying JobStorageChangeListeners

org.jobrunr.storage.StorageException: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30007ms.
        at org.jobrunr.storage.sql.common.DefaultSqlStorageProvider.getJobStats(DefaultSqlStorageProvider.java:346) ~[jobrunr-5.1.2.jar:5.1.2]
        at org.jobrunr.storage.AbstractStorageProvider.notifyJobStatsOnChangeListeners(AbstractStorageProvider.java:73) ~[jobrunr-5.1.2.jar:5.1.2]
        at org.jobrunr.storage.AbstractStorageProvider$NotifyOnChangeListeners.run(AbstractStorageProvider.java:177) ~[jobrunr-5.1.2.jar:5.1.2]
        at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[na:na]
        at java.base/java.util.TimerThread.run(Timer.java:506) ~[na:na]
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30007ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:695) ~[HikariCP-3.4.5.jar:na]
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197) ~[HikariCP-3.4.5.jar:na]
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162) ~[HikariCP-3.4.5.jar:na]
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) ~[HikariCP-3.4.5.jar:na]
        at org.jobrunr.storage.sql.common.DefaultSqlStorageProvider.getJobStats(DefaultSqlStorageProvider.java:343) ~[jobrunr-5.1.2.jar:5.1.2]
        ... 4 common frames omitted
== NEW =========================================================================
---- Global Information --------------------------------------------------------
> request count                                       1136 (OK=660    KO=476   )
> min response time                                      4 (OK=20     KO=4     )
> max response time                                  15753 (OK=15753  KO=3597  )
> mean response time                                   348 (OK=552    KO=65    )
> std deviation                                       1145 (OK=1450   KO=272   )
> response time 50th percentile                         51 (OK=108    KO=16    )
> response time 75th percentile                        159 (OK=343    KO=32    )
> response time 95th percentile                       1606 (OK=2577   KO=158   )
> response time 99th percentile                       5948 (OK=8473   KO=1622  )
> mean requests/sec                                  4.876 (OK=2.833  KO=2.043 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                           572 ( 50%)
> 800 ms < t < 1200 ms                                  19 (  2%)
> t > 1200 ms                                           69 (  6%)
> failed                                               476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400                       476 (100,0%)
================================================================================

And it gets even better when publishing 4 extensions at a time:

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1136 (OK=660    KO=476   )
> min response time                                      4 (OK=18     KO=4     )
> max response time                                  14156 (OK=14156  KO=3215  )
> mean response time                                   463 (OK=736    KO=83    )
> std deviation                                       1427 (OK=1803   KO=322   )
> response time 50th percentile                         62 (OK=140    KO=17    )
> response time 75th percentile                        238 (OK=521    KO=33    )
> response time 95th percentile                       2042 (OK=3613   KO=230   )
> response time 99th percentile                       8461 (OK=10606  KO=2158  )
> mean requests/sec                                  6.927 (OK=4.024  KO=2.902 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                           536 ( 47%)
> 800 ms < t < 1200 ms                                  43 (  4%)
> t > 1200 ms                                           81 (  7%)
> failed                                               476 ( 42%)
---- Errors --------------------------------------------------------------------
> status.find.is(201), but actually found 400                       476 (100,0%)
================================================================================

amvanbaren avatar Jul 19 '22 09:07 amvanbaren

@amvanbaren would this PR help with the errors we've had while trying to deploy new openvsx to prod?

jeanp413 avatar Aug 12 '22 04:08 jeanp413

@jeanp413 No, not directly, but it can improve the overall stability of the server. It also makes the time it takes to publish an extension version less variable, i.e. large extension packages take about the same time to publish as small ones. This is aimed at fixing #471.
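In rough terms, the change moves FileResource processing out of the publish request and into a JobRunr background job, along the lines of the sketch below. The class and method names are placeholders, not the actual code in this PR.

import org.jobrunr.scheduling.JobScheduler;

// Sketch only; PublishService and FileResourceJobs are placeholder names,
// not the actual classes touched by this PR.
public class PublishService {

    private final JobScheduler scheduler;
    private final FileResourceJobs resourceJobs;

    public PublishService(JobScheduler scheduler, FileResourceJobs resourceJobs) {
        this.scheduler = scheduler;
        this.resourceJobs = resourceJobs;
    }

    public void publish(long extensionVersionId) {
        // ... validate the uploaded package and persist the new extension version ...

        // Hand file resource processing to a background worker instead of doing it
        // in the request thread, so publish latency no longer depends on package size.
        scheduler.enqueue(() -> resourceJobs.processResources(extensionVersionId));
    }

    // Placeholder for the job implementation; see the extraction sketch further below.
    public interface FileResourceJobs {
        void processResources(long extensionVersionId);
    }
}

Because the job only carries the extension version id, the publish endpoint can return as soon as the version is persisted, and any available worker can pick up the extraction later.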

I've been working on the deployment issues at https://github.com/amvanbaren/openvsx/tree/bugfix/extension-versions-transactional. I'd like to test these improvements on staging to confirm that they will improve performance in production. You can follow progress at https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/1483

amvanbaren avatar Aug 12 '22 11:08 amvanbaren

@amvanbaren Is the purpose of this PR to completely fix HikariPool-1 - Connection is not available, request timed out after 30000ms. when too many extensions are published, or just to improve the current behavior? If the latter, I'll approve :slightly_smiling_face:

@akosyakov should we cherry-pick this to our fork?

jeanp413 avatar Aug 12 '22 15:08 jeanp413

@jeanp413

Is the purpose of this PR to completely fix HikariPool-1 - Connection is not available, request timed out after 30000ms.

No, not completely. I've only tested publishing up to 4 extensions at the same time, not 30. The question is also how this will run on staging or production.
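For what it's worth, one knob that matters here is the ratio between JobRunr's worker count and the Hikari pool size, since the background job server borrows connections from the same pool that serves web requests. Below is a minimal sketch using JobRunr's fluent Java configuration API; the worker count and the setup style are illustrative assumptions, not how this PR actually wires things up.

import javax.sql.DataSource;

import org.jobrunr.configuration.JobRunr;
import org.jobrunr.storage.sql.common.SqlStorageProviderFactory;

import static org.jobrunr.server.BackgroundJobServerConfiguration.usingStandardBackgroundJobServerConfiguration;

// Illustrative only: keep the number of JobRunr workers well below the Hikari
// maximum pool size so background jobs cannot starve request handling of
// database connections. The worker count of 4 is an assumption, not a tested value.
public class JobRunrSetup {

    public static void configure(DataSource dataSource) {
        JobRunr.configure()
            .useStorageProvider(SqlStorageProviderFactory.using(dataSource))
            .useBackgroundJobServer(
                usingStandardBackgroundJobServerConfiguration().andWorkerCount(4))
            .initialize();
    }
}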

amvanbaren avatar Aug 12 '22 15:08 amvanbaren

@akosyakov should we cherry-pick this to our fork?

Not yet. Our fork is for learning, not for resolving anything else. It is good to know that we have an option to optimise data sync in the future.

akosyakov avatar Aug 15 '22 12:08 akosyakov

@amvanbaren We observed the connection is not available error while testing the mirror mode. Can we land this PR to try it? How can we ensure that syncing all extensions in parallel does not degrade the user experience? We need to make sure that user requests have higher priority and don't break.

akosyakov avatar Sep 05 '22 08:09 akosyakov

Actually @iQQbot shared that there was only one job.

akosyakov avatar Sep 05 '22 08:09 akosyakov

We need to make sure that user requests have higher priority and don't break.

That's a bit too high level. Can you further explain what you mean by this?

Actually @iQQBot shared that there was only one job.

There's one type of job that extracts all file resources and uploads them to cloud storage. At runtime there can be multiple jobs queued up to extract the file resources. Especially when the server is running in mirror mode, you can expect a couple hundred jobs to be queued up.
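To make that concrete, the kind of job described here boils down to walking the entries of a published .vsix archive and uploading each one to external storage, roughly as in the sketch below. The names (ExtractResourcesJob, BlobStore) are placeholders standing in for the server's actual storage services.

import java.io.InputStream;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch of the kind of job described above: extract the entries of a published
// .vsix package and upload them as file resources. Names are placeholders.
public class ExtractResourcesJob {

    public void run(Path vsixPackage, BlobStore storage, String prefix) throws Exception {
        try (ZipFile zip = new ZipFile(vsixPackage.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // Upload each file resource; BlobStore stands in for the actual
                    // cloud storage client used by the server.
                    storage.upload(prefix + "/" + entry.getName(), in, entry.getSize());
                }
            }
        }
    }

    // Placeholder abstraction over the real cloud storage service.
    public interface BlobStore {
        void upload(String key, InputStream content, long contentLength);
    }
}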

amvanbaren avatar Sep 05 '22 11:09 amvanbaren

That's a bit too high level. Can you further explain what you mean by this?

I synced yesterday with Pudong about it. Nobody was using our SH installation, but we had 4 instances running with one sync job. One instance got stuck for some reason. I would hit it while using the web UI and it always hung on connections to the DB. Connections were timing out. Pudong looked at the DB and there were no connections or requests at all. So we restarted the instance and it was alright afterwards. It got us thinking that maybe there is some kind of connection leak in the server or something like that. cc @iQQbot correct me if I misunderstood something

akosyakov avatar Sep 06 '22 07:09 akosyakov

That's a bit too high level. Can you further explain what you mean by this?

I synced yesterday with Pudong about it. Nobody was using our SH installation, but we had 4 instances running with one sync job. One instance got stuck for some reason. I would hit it while using the web UI and it always hung on connections to the DB. Connections were timing out. Pudong looked at the DB and there were no connections or requests at all. So we restarted the instance and it was alright afterwards. It got us thinking that maybe there is some kind of connection leak in the server or something like that. cc @iQQBot correct me if I misunderstood something

Ok, and you think this PR might be able to resolve this connection leak?

amvanbaren avatar Sep 06 '22 07:09 amvanbaren

Can we land this PR to try?

Yes, let's try it out.

amvanbaren avatar Sep 06 '22 17:09 amvanbaren

Ok, and you think this PR might be able to resolve this connection leak?

@amvanbaren Sorry, in the original comment I was thinking it was a load problem from too many syncs, but it indeed seems to be something else. I've seen that you added a PR with the missing transactional annotations. I will rebase our branch and try your latest changes.

akosyakov avatar Sep 08 '22 08:09 akosyakov