
Parallel Processing with JSON Lines files

Open yyaarix opened this issue 1 year ago • 5 comments

Hi,

I'm writing a stream processing application that reads files from S3, gunzips them, and processes the data. The files are in JSON Lines format, meaning each line in the file is a separate JSON object that I need to parse.

Currently, I decode each line into a string and parse each line in parallel. However, this approach is inefficient in terms of memory (creating many string objects results in significant GC pressure) and CPU usage.

I would like to explore using `scanJsonValuesFromStream`, but I'm unsure how to parallelize the work the way I do now, where the lines are split after decoding to a string and each line is parsed in parallel.

I've read the thread in this issue, which seems to have a somewhat similar format, but I couldn't figure out how to further parallelize the work.

Additionally, I am uncertain about the appropriate values for `preferredBufSize` and `preferredCharBufSize`.
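For reference, a minimal sequential sketch of the `scanJsonValuesFromStream` call in question, assuming gzip-compressed JSON Lines input. The `Event` type, the `ScanSketch` object, and the buffer sizes are hypothetical placeholders rather than the real schema or tuned values, and the sketch does not address parallelism, which is the open question here.

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical record type; the real schema is not part of this issue.
final case class Event(id: Long, name: String)

object ScanSketch {
  implicit val eventCodec: JsonValueCodec[Event] = JsonCodecMaker.make

  // Placeholder sizes only; choosing good values is exactly what is unclear.
  val config: ReaderConfig = ReaderConfig
    .withPreferredBufSize(1 << 20)     // bytes read from the stream per refill
    .withPreferredCharBufSize(1 << 12) // chars used while decoding strings

  // Decompresses the stream and parses whitespace-separated JSON values
  // one by one on the calling thread, without creating a String per line.
  def parseAll(gzipped: InputStream): Vector[Event] = {
    val out = Vector.newBuilder[Event]
    val in = new GZIPInputStream(gzipped)
    try {
      scanJsonValuesFromStream[Event](in, config) { event =>
        out += event
        true // returning false would stop the scan early
      }
    } finally in.close()
    out.result()
  }

  // Helper for trying the sketch locally: gzip a JSON Lines string in memory.
  def gzip(text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bytes)
    try gz.write(text.getBytes(UTF_8))
    finally gz.close()
    bytes.toByteArray
  }
}
```

Calling `ScanSketch.parseAll` on a `java.io.ByteArrayInputStream` wrapping `ScanSketch.gzip(...)` exercises it without S3. Everything runs on one thread here, so the callback is the only place where work could be handed off to other fibers.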

For context, I am using ZIO Streams for stream processing.

Any help would be much appreciated.

Thanks!

yyaarix · Aug 04 '24 06:08