Speeding up reads
I am doing some benchmarking on a Databricks cluster where I use Cobrix to read EBCDIC files and write to parquet. I have an implementation of the same process which does not use this library. Reading a 2GB EBCDIC file with Cobrix takes two minutes longer than reading the file using sc.binaryRecords() and putting the right schema in place to create a DataFrame. The file has around 1400 columns and 9000 bytes per record.
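For reference, the non-Cobrix implementation is roughly the following sketch (field names, offsets and paths are simplified placeholders; the real job declares ~1400 fields):

```scala
import java.nio.charset.Charset

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

val recordLength = 9000                // fixed record size in bytes
val ebcdic = Charset.forName("Cp037")  // EBCDIC code page used by the file

// Hardcoded schema; only two of the ~1400 fields are shown here.
val schema = StructType(Seq(
  StructField("FIELD_1", StringType),
  StructField("FIELD_2", StringType)
))

val rows = sc.binaryRecords("/path/to/ebcdic/file", recordLength).map { bytes =>
  // Decode fixed slices of each record; offsets and lengths are illustrative only.
  val f1 = new String(bytes, 0, 10, ebcdic).trim
  val f2 = new String(bytes, 10, 20, ebcdic).trim
  Row(f1, f2)
}

spark.createDataFrame(rows, schema)
  .write.mode("overwrite").parquet("/path/to/output")
```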
Here is the cluster config:
Spark version: 2.4
Worker count: flexible, depending on the workload
RAM per worker: 14 GB
Cores per worker: 4
Executor count: flexible, depending on the workload
Executor memory: 7.4 GB
The benchmarks you have included in this project make me think that an increase in executor count would speed up the read throughput. However, I have currently enabled autoscaling in Databricks so it dynamically allocates executors on the fly depending on the workload.
Could you please provide some guidelines on optimising read speed? Configurations you have tried for your organisation's use cases would be a big help in speeding up the process.
For fixed-length files, binaryRecords() is used in Cobrix as well. Since you use binaryRecords(), I presume the file is a sequence of fixed-length records.
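For comparison, a basic fixed-length read with Cobrix looks roughly like this (paths are placeholders, and option names can vary slightly between Cobrix versions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read a fixed-length EBCDIC file, using the copybook to derive the schema.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cob")
  .load("/path/to/ebcdic/file")

df.write.mode("overwrite").parquet("/path/to/output")
```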
The time it takes to read these files in Cobrix is 2 minutes longer, but what is the total time? Just wondering about relative performance.
Increasing the number of workers should improve performance, yes, although for small files like 2 GB the difference probably won't be very big.
Not sure if autoscaling is very helpful for small jobs like converting a file from EBCDIC to Parquet. If possible, we'd be interested in comparing your current process vs Cobrix on the same number of executors. Maybe the exact number of executors used can be retrieved from the execution logs.
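If it helps, one way to pin the executor count for such a comparison is to switch off dynamic allocation in the Spark config. This is just a sketch using standard Spark properties; on Databricks the equivalent is disabling cluster autoscaling and fixing the worker count in the cluster settings:

```scala
import org.apache.spark.sql.SparkSession

// Pin resources so both pipelines run on an identical number of executors.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "false")  // no scaling up/down mid-job
  .config("spark.executor.instances", "4")             // fixed executor count
  .getOrCreate()
```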
What kind of fields does your copybook have? I mean, is there a dominant data type, like COMP-3 numbers, or are most fields strings (A or X)? The slowness may be due to inefficient parsing of particular data types. We encountered such an issue with COMP-3 numbers in earlier versions of Cobrix, for instance.
Yes, the file has fixed-length records. It takes around 4 minutes to read the 2 GB file and write to Parquet. When monitoring the read stage in the Spark UI, I can see that the read input size grows more slowly with Cobrix than with the other implementation that uses binaryRecords(). Ingesting the whole 2 GB input in 4 minutes seems slow.
Databricks uses 4 executors, each of which receives around 540 MB of input. The most common data types are S9, 9 and X. There are no COMP-3 numbers.
This is very interesting. So the fields mostly have DISPLAY format, right? Does the output Spark schema contain only String types, or are there numbers as well?
Because these types require only EBCDIC-to-ASCII decoding and string-to-number conversion, direct usage of binaryRecords() with a hardcoded schema might well be faster. But Cobrix running 2 times slower is hard to explain by generic schema traversal and decoding overhead alone.
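To illustrate, decoding such DISPLAY fields by hand boils down to something like the sketch below (offsets and lengths are made up): a charset decode per field plus a string-to-number conversion for the numeric ones. Signed PIC S9 fields carry the sign in the zone nibble of the last byte, so they need one extra step that Cobrix handles generically.

```scala
import java.nio.charset.Charset

val ebcdic = Charset.forName("Cp037")

// PIC X(n): plain EBCDIC-to-ASCII decode of a fixed slice of the record.
def decodeX(bytes: Array[Byte], offset: Int, len: Int): String =
  new String(bytes, offset, len, ebcdic).trim

// PIC 9(n) DISPLAY (unsigned): the digits decode to '0'-'9', so a simple
// string-to-number conversion is enough.
def decode9(bytes: Array[Byte], offset: Int, len: Int): Long =
  decodeX(bytes, offset, len).toLong
```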
I'm going to keep this issue open, and I'd like to generate a data file similar to the profile you described (~1.5k fields, 9000-byte records, S9, 9 and X DISPLAY types). We could then profile Cobrix to find out if there is a bottleneck that causes the slowdown.
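Something along these lines could generate a comparable test file; it is only a sketch, not an existing Cobrix utility, and the record layout, sizes and output path are arbitrary:

```scala
import java.io.{BufferedOutputStream, FileOutputStream}
import java.nio.charset.Charset
import scala.util.Random

val ebcdic = Charset.forName("Cp037")
val recordLength = 9000          // bytes per record, as in the report
val fieldSize = 6                // 1500 fields of 6 bytes each
val numRecords = 200000          // ~1.8 GB in total
val rnd = new Random(42)

val out = new BufferedOutputStream(new FileOutputStream("/tmp/synthetic_ebcdic.dat"))
try {
  for (_ <- 0 until numRecords) {
    val sb = new StringBuilder(recordLength)
    while (sb.length < recordLength) {
      // Alternate numeric (PIC 9) and alphanumeric (PIC X) field contents at random.
      if (rnd.nextBoolean()) sb.append(f"${rnd.nextInt(1000000)}%06d")
      else sb.append(rnd.alphanumeric.take(fieldSize).mkString)
    }
    // Encode the whole record to EBCDIC and write it out with no delimiters.
    out.write(sb.toString.take(recordLength).getBytes(ebcdic))
  }
} finally {
  out.close()
}
```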