
[Bug] Unexpected `JFR_UPLOAD_FILE_TOO_LARGE_ERROR` for Async-Profiler task

Open fandreuz opened this issue 2 months ago • 9 comments

Search before asking

  • [x] I had searched in the issues and found no similar issues.

Apache SkyWalking Component

Java Agent (apache/skywalking-java)

What happened

I'm getting JFR_UPLOAD_FILE_TOO_LARGE_ERROR as the result of my Async-Profiler tasks. I'm profiling the Renaissance all benchmark.

I set a large value for SW_RECEIVER_ASYNC_PROFILER_JFR_MAX_SIZE (hundreds of GBs just to be sure), so this is quite unexpected.

The problem happens frequently with duration=15mins, sporadically with duration=10mins, never with duration=5mins. Selecting other parallel profiling modes (alloc, lock, wall) gives the same problem even with a 5mins profiling window.

What you expected to happen

This behavior is unexpected since the JFR file I get from plain Async-Profiler is smaller than 100 MB.

How to reproduce

  • Launch Renaissance all benchmark
  • Create an Async-Profiler task in SkyWalking OAP server
  • Select any CPU sampling mode (CPU, ITIMER or CTIMER)
  • Select 15mins duration
  • Start the task

Anything else

I started the OAP server with a slightly modified quickstart-docker.sh to set SW_RECEIVER_ASYNC_PROFILER_JFR_MAX_SIZE from an env-file:

docker compose -f "$temp_dir/docker-compose.yml" \
  --project-name=skywalking-quickstart \
  --profile=$SW_STORAGE \
  --env-file=/home/fandreuz/sky-test/env \
  up \
  --detach=${DETACHED:-true} \
  --wait

/home/fandreuz/sky-test/env:

SW_RECEIVER_ASYNC_PROFILER_JFR_MAX_SIZE=1000524288000000000

Are you willing to submit a pull request to fix on your own?

  • [ ] Yes I am willing to submit a pull request on my own!

Code of Conduct

fandreuz avatar Oct 23 '25 09:10 fandreuz

The setting you use,

SW_RECEIVER_ASYNC_PROFILER_JFR_MAX_SIZE=1000524288000000000

is much larger than the INT_MAX.

lujiajing1126 avatar Oct 23 '25 14:10 lujiajing1126
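
The comparison in the comment above can be checked directly. The following is a minimal Java sketch: the class and method names are hypothetical, and whether the OAP server reads this setting as a 32-bit or a 64-bit integer is not stated in this thread, so the 32-bit case is an assumption used only for illustration.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of SkyWalking: checks which integer width
// a configured size value would fit into.
final class JfrMaxSizeCheck {

    /** Returns true if the decimal string fits in a signed 32-bit int. */
    static boolean fitsInInt(String value) {
        BigInteger v = new BigInteger(value.trim());
        return v.compareTo(BigInteger.valueOf(Integer.MIN_VALUE)) >= 0
                && v.compareTo(BigInteger.valueOf(Integer.MAX_VALUE)) <= 0;
    }

    /** Returns true if the decimal string fits in a signed 64-bit long. */
    static boolean fitsInLong(String value) {
        BigInteger v = new BigInteger(value.trim());
        return v.compareTo(BigInteger.valueOf(Long.MIN_VALUE)) >= 0
                && v.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) <= 0;
    }

    public static void main(String[] args) {
        String configured = "1000524288000000000"; // value from the env-file in this issue
        System.out.println("INT_MAX        = " + Integer.MAX_VALUE);
        System.out.println("fits in int32? " + fitsInInt(configured)); // false
        System.out.println("fits in int64? " + fitsInLong(configured)); // true
    }
}
```

The value from the env-file is far above INT_MAX (2147483647) but still within the range of a 64-bit long, so how it is handled depends on the type used to parse it.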

> is much larger than the INT_MAX.

It's so large because I added zeros in subsequent iterations; I tried smaller values as well.

fandreuz avatar Oct 23 '25 14:10 fandreuz

> > is much larger than the INT_MAX.
>
> It's so large because I added zeros in subsequent iterations; I tried smaller values as well.

What kind of values have you tried?

lujiajing1126 avatar Oct 23 '25 14:10 lujiajing1126

@wu-sheng I checked the Java Agent code: the contentSize field defined in the protocol is int32, so the maximum file size cannot exceed ~1.999 GB. Shall we modify the protocol first?

lujiajing1126 avatar Oct 27 '25 01:10 lujiajing1126
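
To illustrate the int32 limit mentioned above, here is a small Java sketch of the arithmetic. It does not reproduce the agent's actual code; the class and method names are hypothetical, and only the fact that the field is a signed 32-bit integer is taken from the comment.

```java
// Hypothetical illustration of what a signed 32-bit size field can represent.
final class ContentSizeLimit {

    /** Largest byte count a signed 32-bit field can hold: 2^31 - 1. */
    static final long MAX_INT32_BYTES = Integer.MAX_VALUE;

    /** Returns true if a file size in bytes is representable as int32. */
    static boolean fitsInInt32(long fileSizeBytes) {
        return fileSizeBytes >= 0 && fileSizeBytes <= MAX_INT32_BYTES;
    }

    /** Plain narrowing cast: silently wraps around for sizes above 2^31 - 1. */
    static int narrow(long fileSizeBytes) {
        return (int) fileSizeBytes;
    }

    /** Checked conversion: throws ArithmeticException instead of wrapping. */
    static int narrowChecked(long fileSizeBytes) {
        return Math.toIntExact(fileSizeBytes);
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024 * 1024 * 1024;
        double limitGiB = MAX_INT32_BYTES / (1024.0 * 1024.0 * 1024.0);
        System.out.println("int32 limit in GiB: " + limitGiB); // just under 2
        System.out.println("3 GiB fits? " + fitsInInt32(threeGiB)); // false
        System.out.println("3 GiB after (int) cast: " + narrow(threeGiB)); // -1073741824
    }
}
```

A size above the limit either has to be rejected or it wraps to a negative number, which is why raising a server-side maximum alone cannot lift a limit imposed by the field type.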

Is it reasonable to upload 2 GB of profiling data and ask OAP to analyze it?

I am fine with that, but TBH using TCP to upload 2G data seems a little crazy.

wu-sheng avatar Oct 27 '25 02:10 wu-sheng

> Is it reasonable to upload 2 GB of profiling data and ask OAP to analyze it?
>
> I am fine with that, but TBH using TCP to upload 2G data seems a little crazy.

It depends... With allocation profiling, it is possible even for a short run.

Any ideas for transport optimization?

lujiajing1126 avatar Oct 28 '25 12:10 lujiajing1126

If it is a big profiling file, I would say at least file-based analysis is preferred. How is the file system working now?

wu-sheng avatar Oct 28 '25 13:10 wu-sheng

> file-based analysis

What do you mean by "file-based analysis"?

lujiajing1126 avatar Oct 29 '25 05:10 lujiajing1126

Such as:

  • Is this large file proper for gRPC transportation?
  • During the analysis, do we need to load all the contents of files into the memory?

wu-sheng avatar Oct 29 '25 06:10 wu-sheng
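
As a point of reference for the two questions above, the following Java sketch shows one generic way to process a large file in fixed-size chunks, so that at most one chunk is held in memory at a time. It is not how SkyWalking currently uploads or analyzes JFR files; the class name, method names, and chunk size are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: stream a file chunk by chunk instead of loading it whole.
final class ChunkedFileReader {

    /** Feeds the file to the sink in chunks of at most chunkSize bytes; returns total bytes read. */
    static long forEachChunk(Path file, int chunkSize, Consumer<byte[]> sink) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long total = 0; // long, so the running total is not bound by the int32 range
        byte[] buffer = new byte[chunkSize];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // readNBytes fills the buffer unless end of stream is reached; 0 means EOF
            while ((n = in.readNBytes(buffer, 0, chunkSize)) > 0) {
                sink.accept(Arrays.copyOf(buffer, n));
                total += n;
            }
        }
        return total;
    }

    /** Demo helper: writes a temp file of fileSize zero bytes and returns the chunk sizes seen. */
    static List<Integer> demoChunkSizes(int fileSize, int chunkSize) {
        try {
            Path tmp = Files.createTempFile("jfr-sketch", ".bin");
            try {
                Files.write(tmp, new byte[fileSize]);
                List<Integer> sizes = new ArrayList<>();
                forEachChunk(tmp, chunkSize, chunk -> sizes.add(chunk.length));
                return sizes;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoChunkSizes(2500, 1024)); // [1024, 1024, 452]
    }
}
```

With this shape, the sink could send each chunk as a separate message or feed it to an incremental parser; whether either fits the OAP design is exactly the open question in this thread.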