Beta support for xk6-browser in the cloud
As discussed internally, several things need to happen to move this forward:
- [x] A new AWS AMI needs to be created that bundles k6+xk6-browser and Chromium.
- [x] Testing should be done to determine the minimum system resources (EC2 instance type) needed for browser tests. [See test results below.]
- [ ] A secure execution model should be implemented that ensures proper isolation of test data between test runs, as well as a way of setting resource usage quotas, such as disk space. E.g. consider running in a container. [See internal discussion. This won't be available for the initial private beta, but should be done for the public beta.]
- [ ] Deploy the AMI, create a new k6 release that uses it, and assign it to internal users.
- [ ] Internal testing should be done to ensure the system is relatively stable before opening it up to limited customer use.
I ran some tests on a `t3.large` (2 vCPU, 8GB RAM) EC2 instance in the dev environment, using the `ami-0bee0b325707adf45` image (with xk6-browser 021fb2f, Google Chrome stable 102.0.5005.61), in order to gather some performance statistics, mainly RAM usage.

I ran the following 3 scripts to get an idea of how memory usage scales with sites of different complexity:
- `k6-io.js`: Loads https://k6.io/, scrolls down the page, clicks on a JS docs link and scrolls again.
- `test-k6-io.js`: Essentially the same as `fillform.js`, just with `slowMo: '1s'`.
- `blank.js`: Loads `about:blank` and sleeps for 60s.
All 3 scripts were run with `k6 run <script>` and `k6 run -i 10 -u 10 <script>`, so there were 6 total runs.
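For reference, a minimal sketch of what `blank.js` might look like is below. Only its behavior (load `about:blank`, sleep for 60s) is taken from the description above; the import path and launch signature are assumptions based on the xk6-browser API of roughly that era and may differ at commit 021fb2f (older builds used `import launcher from 'k6/x/browser'` with `launcher.launch('chromium', opts)`).

```js
// Sketch only: API names assumed from xk6-browser examples of that period,
// not verified against commit 021fb2f.
import { chromium } from 'k6/x/browser';
import { sleep } from 'k6';

export default function () {
  // Launch a headless Chromium; a slowMo value (e.g. '1s', as in test-k6-io.js)
  // could be passed here to slow down each browser action.
  const browser = chromium.launch({ headless: true });
  const context = browser.newContext();
  const page = context.newPage();

  page.goto('about:blank');
  sleep(60); // keep the browser open for ~60s so memory usage can be sampled

  page.close();
  browser.close();
}
```

Since each VU runs this function and launches its own browser, memory usage roughly scales with the VU count (minus whatever the Chrome processes share), which is what the 10 VU rows below measure.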
To gather the memory statistics, I initially wrote `mem.sh`, but this turned out not to take shared memory usage into account, so the numbers were much larger than actual physical memory usage. Then I stumbled upon `smem`, which does take shared memory usage into account (PSS) and gives a more realistic result, so I used the script `mem2.sh` to gather the numbers below.

All scripts take approximately 1m to run, and since `smem` takes a while (~30s!) to compute (Python...), depending on the number of processes, I ran `mem2.sh` twice: at `t=1s` and 30s later.
The PSS values are in KiB. The Chrome column is a summary of all Chrome processes (including subprocesses). The Time column is the approximate time since test start at which the measurement script returned its results.
| Script | VUs | Time | k6 PSS | Chrome PSS | Total PSS |
|---|---|---|---|---|---|
| `k6-io.js` | 1 | 2s | 77452 | 320333 | 397785 |
| | 1 | 32s | 89700 | 310763 | 400463 |
| | 10 | 43s | 82484 | 2230585 | 2313069 |
| | 10 | 1m11s | 128200 | 1626171 | 1754371 |
| `test-k6-io.js` | 1 | 2s | 73324 | 170923 | 244247 |
| | 1 | 33s | 73660 | 192115 | 265775 |
| | 10 | 4s | 78696 | 950836 | 1029532 |
| | 10 | 48s | 84732 | 837489 | 922221 |
| `blank.js` | 1 | 1s | 80628 | 167470 | 248098 |
| | 1 | 34s | 81212 | 166766 | 247978 |
| | 10 | 4s | 83500 | 730164 | 813664 |
| | 10 | 35s | 83564 | 723724 | 807288 |
Conclusions
To analyze the results, it would've been nice to visualize the numbers in a spreadsheet graph, which I initially planned to do, but since `smem` takes so long to run, there would be too few values to make it interesting. So I'll just mention some interesting points below. :)
- It's quite apparent that Chrome memory usage can vary wildly between sites, depending on their complexity. Compare the 2.2GB of the `k6-io.js` script with 10 VUs with the 950MB of `test-k6-io.js`. Even the 1 VU test runs show almost 2x the memory usage for the `k6-io.js` script.
- Because Chrome processes share memory, the scaling from 1 to 10 VUs is less than 10x.
- I didn't report results for >10 VUs, but I did try running 50 and 100 VUs. With 50, the memory usage for the `k6-io.js` script remained relatively stable at around 5GB, but both CPU cores remained pegged at 100% during the initial half of the test, which would certainly cause issues with browser performance and the script itself, so this is not advisable on `t3.large`. With 100 VUs, the 8GB of RAM were quickly exhausted, the CPU cores were constantly at 100%, and the system became unresponsive (it needed to be force-stopped :facepalm:). It's worth noting that this instance didn't have any swap enabled, but we likely wouldn't want to rely on it anyway. Needless to say, 100 VUs on `t3.large` for complex sites is not doable. :) Depending on the site and the complexity of the script, I'd say the reasonable limit on `t3.large` is around 30 VUs, and possibly 50 for instances with more vCPUs (e.g. `t3.xlarge`).
@inancgumus @ankur22 @sniku See the AWS performance test results above, and let me know if you have ideas for other tests we should do (on larger instances?).
@imiric Some ideas:
- Testing a website that plays a video and another that loads many large images/assets to see the impact on the instances.
- Testing a website with dozens of pages and rerunning the test repeatedly to see whether we're leaking memory. For example, test and see what happens when you run the Vistaprint script (maybe this one).
IDK. Just shooting some ideas. Since you're trying to determine the minimum system resources, these ideas are probably not that useful.
So these ideas might make more sense to have for the "Internal testing should be done to ensure the system is relatively stable before opening it up to limited customer use" task.
Yeah, both tests would be useful. The video and large-assets one is probably atypical for the kinds of tests we would see, but it would give us an idea of the upper bound we can expect, similar to running with 50 or 100 VUs. And the Vistaprint one would be a more real-world scenario we can expect for complex sites, and would also stress test the system.
I kind of didn't want to test with 3rd party sites, and https://k6.io/ and our JS docs are quite heavy as well, though the script could probably be improved.
> So these ideas might make more sense to have for the "Internal testing should be done to ensure the system is relatively stable before opening it up to limited customer use" task.
Agreed. I think we can run these and other tests once we have an internal Cloud deployment done, and then we can decide how to scale from there. Things will likely need lots of adjustments before the private beta.
Looks good to me.
Is there a reason why tests were performed only on a `t3.large` instance? I guess we don't know how many VUs our users will want.

I think it would be useful to chart the theoretical 30 VU limit.

I don't quite understand how one test run with 1 VU took 2s, and another test with 1 VU took 32s. Is the second test with 1 VU running with 10 iterations?
I've plotted this in spreadsheets in the hopes it would help identify anything: https://docs.google.com/spreadsheets/d/1dJhTk_3hCGN6Gg-j4
> Is there a reason why tests were performed only on a `t3.large` instance?

No particular reason, other than picking a reasonably spec'd one. `t3.xlarge` would've likely performed better, but we use `t3.large` in Cloud tests as well. All the `*.micro` instances would lack the memory for browser tests, so I think 8GB is the minimum.

Oh, if you mean why only on `t3.large` and not on others... well, to simplify things. We wanted to get a rough idea of the resource utilization, but we can run tests on other instance types later.
> I think it would be useful to chart the theoretical 30 VU limit.
:+1: Now that we have an idea of what to expect, I think we can do this once we deploy the AMI to staging. Although, as these results show, picking a maximum VU number is kind of meaningless without knowing the type of site and script that's running. 30 VUs testing k6.io is much more resource intensive than 30 VUs testing test.k6.io. It's the same problem that plain k6 has, but on a larger scale, since there are browser processes involved, which consume a lot of resources.
> I don't quite understand how one test run with 1 VU took 2s, and another test with 1 VU took 32s. Is the second test with 1 VU running with 10 iterations?

Sorry, maybe I didn't explain it properly, and couldn't express it well with GH's Markdown. The Time column is the approximate time since test start when the `mem2.sh` script returned results, not the overall test duration.

I ran the `mem2.sh` script twice for each test run: at ~1s, and ~30s after that. These are the first two rows, and why you see the VUs repeated. So the first four rows represent two test runs of `k6-io.js`, and 4 samples of `mem2.sh`. Since `smem` takes much longer to compute PSS when there are more processes, the first sample of the 10 VU test runs returns much later than for the 1 VU runs (e.g. 43s vs 2s). Hope that clears it up.
Thanks for the spreadsheet, ~~but I'm getting a message that it doesn't exist (while logged in)~~. Got it, looks good, but since there are so few data points, the graphs aren't that useful IMO.
@sniku The large tasks for this feature are mostly done, but I wanted to confirm with you when the "secure execution" (point 3 in the description) should be implemented. I'd say we're still a couple of weeks of R&D away from making that happen, so how critical is it for the initial beta?
To be clear, when I talk about the "beta", I have a couple of phases in mind:
- The initial (private) beta release will be for our internal use and for specific customers only.
- Once that is relatively stable and we feel we're ready, we can open it up to all customers as a public beta.
The secure execution model should definitely be done before the public beta is released, but I'm wondering whether it should also be in place for the private one. Or maybe before we allow any external customers to use it? From the talk today it sounded like you were OK with delaying this until the public beta, but let us know.
EDIT: As discussed in today's (2022-06-16) sync, a secure execution environment is not a requirement for the initial (private) beta, and can wait for the public beta. We'll address initial security concerns in a different way.