jsoniter-scala
Parallel Processing with JSON Lines files
Hi,
I'm writing a stream processing application that reads gzipped files from S3, decompresses them, and processes the data. The files are in JSON Lines format, meaning each line in the file is a separate JSON object that I need to parse.
Currently, I decode each line into a string and then parse the lines in parallel. However, this approach is inefficient in both memory (allocating a String per line creates significant GC pressure) and CPU usage.
I would like to explore using scanJsonValuesFromStream, but I'm unsure how to parallelize the work the way I can today by splitting the lines after decoding to a string (i.e., parsing each line of the file in parallel).
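To make the question concrete, here is a rough, dependency-free sketch of the kind of pipeline I have in mind: split the decompressed stream into per-line byte arrays (never building a String), then parse batches of lines in parallel. The names splitLines and parseAll, the 32 KiB read buffer, and the batching scheme are all hypothetical, and it uses plain Futures rather than ZIO Streams only to keep the example self-contained. It assumes lines end in '\n' and skips blank lines.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelJsonLines {
  // Splits a stream into non-empty lines as raw bytes, without decoding to String.
  def splitLines(in: InputStream): Iterator[Array[Byte]] = new Iterator[Array[Byte]] {
    private val buf = new Array[Byte](32768) // arbitrary read buffer size
    private var pos = 0
    private var len = 0
    private var pending: Array[Byte] = readLine()

    // Returns the next non-empty line, or null once the stream is exhausted.
    private def readLine(): Array[Byte] = {
      val line = new ByteArrayOutputStream()
      while (true) {
        if (pos == len) { // buffer consumed: refill it
          len = in.read(buf)
          pos = 0
          if (len < 0) { // end of stream: emit a trailing line if there is one
            len = 0
            return if (line.size > 0) line.toByteArray else null
          }
        }
        val start = pos
        while (pos < len && buf(pos) != '\n'.toByte) pos += 1
        line.write(buf, start, pos - start)
        if (pos < len) { // stopped on a newline
          pos += 1
          if (line.size > 0) return line.toByteArray
        }
      }
      null // unreachable, needed for the return type
    }

    def hasNext: Boolean = pending != null
    def next(): Array[Byte] = { val l = pending; pending = readLine(); l }
  }

  // Reads sequentially, parses in parallel: at most `parallelism` batches of
  // `batchSize` lines are in flight at once. Output order matches input order.
  def parseAll[A](in: InputStream, batchSize: Int, parallelism: Int)(parse: Array[Byte] => A)(
      implicit ec: ExecutionContext): Vector[A] =
    splitLines(in)
      .grouped(batchSize)
      .grouped(parallelism)
      .flatMap { window =>
        val futures = window.map(batch => Future(batch.map(parse)))
        Await.result(Future.sequence(futures), Duration.Inf).flatten
      }
      .toVector
}
```

With jsoniter-scala, the parse function would presumably be something like readFromArray for my record type, and in ZIO the batches could feed a mapZIOPar stage instead of Futures. What I can't tell is whether this beats scanJsonValuesFromStream on a single thread, or whether there is a better way to combine the two.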
I've read the thread in this issue, which seems to have a somewhat similar format, but I couldn't figure out how to further parallelize the work.
Additionally, I am uncertain about the appropriate values for preferredBufSize and preferredCharBufSize.
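For reference, this is how I understand these two settings would be passed to the scanner. It is only a sketch: the Event case class is a hypothetical stand-in for my real record type, and the buffer sizes are arbitrary placeholders, not recommendations (picking sensible values is exactly what I'm asking about).

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
import java.io.InputStream

// Hypothetical record type, standing in for the real schema.
final case class Event(id: Long, name: String)

object ScanExample {
  implicit val codec: JsonValueCodec[Event] = JsonCodecMaker.make

  // Placeholder sizes: byte buffer for stream input, char buffer for string parsing.
  val config: ReaderConfig = ReaderConfig
    .withPreferredBufSize(1 << 20)
    .withPreferredCharBufSize(1 << 12)

  // Scans whitespace-separated JSON values sequentially on the calling thread;
  // returning true from the callback continues the scan.
  def countEvents(in: InputStream): Long = {
    var n = 0L
    scanJsonValuesFromStream[Event](in, config) { _ => n += 1; true }
    n
  }
}
```

As far as I can tell, the callback runs on the calling thread for each value in turn, which is why I don't see where the parallelism would fit.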
For context, I am using ZIO Streams for stream processing.
Any help would be much appreciated.
Thanks!