
Improving handling of large files in k6

Open oleiade opened this issue 1 year ago • 11 comments

Story

Problem Statement

Handling large files in k6, whether binary or structured formats such as CSV, leads to high memory usage. As a result, our users' experience, especially in the cloud, degrades as soon as they need to handle large data sets, such as lists of user IDs.

Our users experience this issue in various situations, and it stems from several design and implementation decisions in the current state of the k6 open-source tool.

Objectives

Product-oriented

The product-oriented objective of this story, and its definition of success, is to land a new streaming CSV parser in k6, allowing users to parse and use big CSV files (> 500MB) that would not fit in memory and would likely cause their k6 scripts to crash. We are keen to make any technical improvements to k6 along the way that make this possible.
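To make the goal concrete, here is a hypothetical sketch of what streaming CSV consumption could look like in a script. None of this API exists at the time of writing; the parser, its next() method, and the file name are all invented for illustration, and the point is only the pattern: pulling one record at a time instead of loading the whole file.

// Hypothetical illustration only: `csv.streamingParser` is not a real k6 API.
export default async function () {
    const parser = csv.streamingParser('users.csv'); // hypothetical helper
    for (;;) {
        const record = await parser.next(); // pulls a single row from disk
        if (record === null) break; // end of file
        // use the record's fields (e.g. a user id) in the test iteration
    }
}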

Technology Oriented

From a technological standpoint, the objective of this story is to make all the design and technical changes necessary to complete the product story. Our primary objective is to let users work with large data set files in their k6 scripts without running into out-of-memory errors. As we pursue this goal, we aim to pick the solutions with the least overhead and are willing to take on technical debt if necessary.

Resolution

Through internal workshops with @sniku and @mstoykov, we surfaced various topics and issues that must be addressed to fulfill the objective.

  • [ ] https://github.com/grafana/k6/issues/2976: our product-level end goal

Must-have

The bare minimum must-haves to even start tackling the top-level product objective are:

  1. [ ] https://github.com/grafana/k6/issues/2975: We end up caching files in memory because we cannot read them directly from a tar archive without decompressing them first.
  2. [ ] Reduce the caching of files inside k6. As a result of 👆, k6 caches the files users use in memory and duplicates them per VU, and the behavior is inconsistent across the codebase. If a more convenient tar library allowing direct access to files inside archives were available, we might want to revisit this behavior.

Nice to have

While we're at it, another set of features and refactors would benefit the larger story of handling large files in k6:

  1. [x] https://github.com/grafana/k6/issues/2977. Currently, k6's open() method is somewhat misnamed and actually performs a readFile() operation. This is partly a result of k6 archiving users' content in a single tar archive and having to access resources through it. With more efficient and flexible access to the tar archive's content, we believe k6 would also benefit from a more "standard" file API to open, read, and seek through files conveniently. This would support streaming use cases by providing more flexible navigation through a file's content, while also benefiting from OS-level optimizations such as the buffer cache.
  2. [x] https://github.com/grafana/k6/issues/2978. Another key aspect of handling files more efficiently in k6 is how we access them. As illustrated 👆, we currently only have a way to load the whole content of a file into memory. To support the product goal, and other endeavors such as #2273 or our work towards a new HTTP API, we believe that adding even partial (read operations only) support for the Streams API to k6 would be beneficial. It would establish a healthy baseline API for streaming IO in k6; a rough sketch of the consumption pattern follows this list.
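In the sketch below, the reader loop follows standard WHATWG Streams semantics; the openAsStream helper is invented for illustration, standing in for whatever API #2978 eventually lands.

// Sketch: consuming a hypothetical read-only stream in a k6 script.
// `openAsStream` is a made-up helper; only the getReader()/read() loop
// reflects the actual Streams API.
export default async function () {
    const stream = openAsStream('users.csv'); // hypothetical, returns a ReadableStream
    const reader = stream.getReader();
    for (;;) {
        const { done, value } = await reader.read();
        if (done) break; // the stream is exhausted
        // process `value` (a chunk of bytes) without ever holding the whole file
    }
}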

Problem Space

This issue approaches the problem at hand with a pragmatic, product-oriented objective. However, this specific set of issues has already been approached from various angles in the past and is connected to longer-term plans, as demonstrated by this list:

  • https://github.com/grafana/k6/issues/2311
  • https://github.com/grafana/k6/pull/2971
  • #1539
  • #1079

oleiade avatar Mar 13 '23 10:03 oleiade

While thinking about https://github.com/grafana/k6/issues/2975#issuecomment-1495761640, I realized that I probably disagree with something here. I think the first points of the "Must-have" and "Nice to have" sections are somewhat flipped and need to be exchanged :sweat_smile: That is, https://github.com/grafana/k6/issues/2975 is probably nice to have, while https://github.com/grafana/k6/issues/2977 seems like a must.

What will happen if we only implement https://github.com/grafana/k6/issues/2975? We'll have a very efficient archive bundle format (.tar or otherwise) that doesn't load everything into memory when k6 executes it. That would be awesome! It would mean users would be able to potentially cram huge static files in these archives. However, they would have no way to actually use these files in their scripts besides using open(). If it's data (and not, say, HTTP request bodies), maybe they can also use a SharedArray, which will still make at least two (one temporary and one permanent, with extra JSON overhead) copies of the whole contents in memory... :sweat_smile:

Whereas, if we don't touch .tar archives and still load their contents fully in memory, but we stop copying all of the file contents everywhere, and we add a way to open files without fully reading them into memory, users will still be able to work with them somewhat efficiently. Loading 500 MB in memory is not great, but as long as it happens only once, it's fairly tolerable and fixing it becomes "nice to have", not a must.

na-- avatar Apr 04 '23 11:04 na--

TL;DR: the tar improvements are not as obviously or immediately valuable. I agree with that, and agree that we should prioritize #2977 over them 👍🏻

I think part of the idea with the tar archive lib "on steroids" was to do it hand in hand with #2977, under the assumption that one would then be able to obtain a file handle to anything inside the tar archive without having to keep it in memory twice (once for the in-memory tar archive itself, and once for the copy of the file's content).

Having done some research on this in the past, simply having a more transparent API around files would help tremendously, giving users more granularity in how they handle data in their scripts. As of today, users don't have much choice; it's all or nothing: load all the data into memory, or nothing. With a File API, one could instead open a file once and read its content just in time, whenever it's needed (modern OSes all have some flavor of buffer cache, which caches the content of read syscalls, so that when you read a file N times in a row, all reads past the first are served from this cache and are much, MUCH quicker). This also has the potential to improve memory usage.

oleiade avatar Apr 04 '23 12:04 oleiade

LGTM. Generating a randomString with a very large size may increase bootstrap time, but load tests against databases or key-value stores often need to generate large values.
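For instance, a large random value could be generated once per VU in the init context rather than on every iteration; a sketch, with an illustrative 1 MiB size:

import crypto from 'k6/crypto';

// Generate the payload once per VU, so iterations reuse it instead of
// paying the generation cost every time.
const payload = crypto.randomBytes(1024 * 1024); // ArrayBuffer

export default function () {
    // write `payload` to the database / key-value store under test
}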

fly3366 avatar May 19 '23 10:05 fly3366

I need to load test an API that takes in PDFs. I was hoping we could use SharedArray to share blobs (ArrayBuffer) across VUs. We are trying to load test with a wide range of PDF sizes, up to 50MB, and we need to simulate 2000 VUs. Simple math shows that the feasibility is not in our favor if every VU needs to hold a 50MB blob in memory. Having a way to share blobs across VUs, or having streaming support, would make this more feasible. I personally think a streaming option in the k6/http functions would be the most flexible and scalable; it would follow a pattern seen in many code bases.

We just started using k6 a few months ago to load test the mission-critical services where I'm employed, and so far it's been a really good experience. I think k6 would benefit greatly, and could really expand its capabilities, if it can crack the handling of large sets of unstructured data like PDFs, JPEGs, etc.

Below is our POC to test SharedArray with ArrayBuffer. It currently doesn't work. Note: we are using TypeScript and transpiling with webpack using ts-loader; not that it should make a difference, I don't think. Also, if I am doing something wrong in this POC, please leave a comment, as there could be others thinking of the same approach as we did.

import { SharedArray } from "k6/data";

const path = "./sample.pdf"; // placeholder path to the PDF under test

const data: ArrayBuffer[] = new SharedArray<ArrayBuffer>("pdfs", function (): ArrayBuffer[] {
    // open(path, 'b') reads the whole file as an ArrayBuffer in the init context
    const data: ArrayBuffer[] = [open(path, 'b')];
    console.info("Bytes in LoadTestFiles: " + data[0].byteLength);
    return data;
});

// Start our k6 test
export default (): void => {
    console.info("Number of items in data: " + data.length);
    console.info("Bytes in VU: " + data[0].byteLength);
};

Output (screenshot): notice the size being undefined in the test.

jade-lucas avatar May 23 '23 16:05 jade-lucas

Hi @jade-lucas

Thanks a lot for your constructive feedback and your concrete use case. We are currently experimenting with this topic, and we expect the first improvements to land in k6 in the not-so-distant future (no ETA yet). We have prioritized #2977 and have #2978 on our radar.

We expect #2977 might help with your issue handling PDF files. Streaming in HTTP is, unfortunately, further down the road, as it is expected to be part of the next http module we're working on at the moment (research phase). I'll make sure to keep you posted when something concrete lands in k6 🤝

oleiade avatar May 24 '23 11:05 oleiade

I have run into the same issue as jade-lucas. We need to load test an API on a large scale with binary file uploads. Having little knowledge of JavaScript buffers, I at first couldn't understand why open(file) worked and open(file, "b") didn't when used with SharedArray. I think a note about this in the documentation could help folks unfamiliar with the underlying implementation.
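The underlying reason is that SharedArray JSON-serializes whatever its constructor function returns: a string (the default open() result) survives the round trip, while an ArrayBuffer (the open(file, "b") result) serializes to an empty object, which is why its byteLength comes back undefined. One possible workaround, sketched below under the assumption that a per-iteration decode cost is acceptable, is to base64-encode the binary content so the SharedArray holds a plain string (the file path is illustrative):

import { SharedArray } from "k6/data";
import encoding from "k6/encoding";

const pdfs = new SharedArray("pdfs", function () {
    // base64-encode so the value survives SharedArray's JSON serialization
    return [encoding.b64encode(open("./sample.pdf", "b"))];
});

export default function () {
    // decode back to an ArrayBuffer; note this creates a per-call copy
    const pdf = encoding.b64decode(pdfs[0]);
    console.info("Bytes in VU: " + pdf.byteLength);
}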

As mentioned, being able to share the contents of binary files between VUs and stream their contents over HTTP would be awesome and would greatly aid our use case.

Anyway, our experience with k6 has been amazing except for this one hurdle. Thanks for the great OSS 🙌🏻

dhbrojas avatar Oct 21 '23 01:10 dhbrojas

Quick update on this:

  • The upcoming version v0.48 of k6 will provide a k6/experimental/fs module, which allows for a better memory footprint when dealing with binary files (a usage sketch follows this list).
  • We have started actively working towards #2978 and expect to deliver it within the span of one or two releases.
  • This also serves #3038, in that we should eventually be able to provide an HTTP client that streams data from a k6/experimental/fs.File and avoids some of the memory usage issues the current module has.
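For reference, a minimal sketch of the new module's usage, based on its documentation at the time (the chunk size and file name are illustrative):

import { open } from 'k6/experimental/fs';

// Files must be opened in the init context; since top-level await is not
// available there, the call is wrapped in an immediately-invoked async function.
let file;
(async function () {
    file = await open('./large-file.bin');
})();

export default async function () {
    // Read the file in small chunks instead of loading it whole into memory.
    const buffer = new Uint8Array(4096);
    let totalBytesRead = 0;
    while (true) {
        const bytesRead = await file.read(buffer);
        if (bytesRead == null) break; // end of file reached
        totalBytesRead += bytesRead;
    }
    console.log(`read ${totalBytesRead} bytes`);
}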

oleiade avatar Nov 30 '23 10:11 oleiade

  • The upcoming version v0.48 of k6 will provide a k6/experimental/fs module which allows for a better memory footprint when dealing with binary files.

Any docs on this? I can't find a description at https://k6.io/docs/javascript-api/k6-experimental/

nk-tedo-001 avatar Dec 20 '23 14:12 nk-tedo-001

Hi @nk-tedo-001 👋🏻

Our docs have recently migrated to Grafana's site; you can find more information there: https://grafana.com/docs/k6/latest/javascript-api/k6-experimental/fs/ 🙇🏻

oleiade avatar Dec 20 '23 14:12 oleiade

With the fs module, k6 no longer exceeds memory limits!

Great job!

nk-tedo-001 avatar Dec 25 '23 13:12 nk-tedo-001

Thank you 🙇🏻 I'm glad it was helpful 🎉

oleiade avatar Dec 27 '23 15:12 oleiade