Optimization of input file split

Open sree018 opened this issue 2 years ago • 4 comments

Background

I have a fixed-length file with 200-byte records. The copybook defines 100 segments (multi-segment) and about 2k columns. Processing takes almost 12 hours for a file of just 1 GB.

Feature

Can you add support for splitting the input file by size (in MB) for all file formats? Currently this works only for record format = VB, through input_split_size_mb.

I tried to adjust the block size in my Spark code, but Cobrix still uses the cluster's default block size.

Proposed Solution [Optional]

Solution Ideas

  1. Allow input_split_size_mb for all file formats.
  2. Honor a custom block size specified in the Spark configuration, e.g. spark.conf.set("dfs.blocksize", "32m").
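To make the request concrete, here is a rough sketch of the arithmetic behind size-based splitting of a fixed-length file. It is not Cobrix or Spark code: `plan_splits` is a hypothetical helper written only for illustration, and the 32 MB value is simply the figure from the example above. The sketch rounds the split size down to a whole number of 200-byte records so that no record straddles two splits.

```python
def plan_splits(file_size_bytes, record_length, split_size_mb):
    """Return (records_per_split, number_of_splits) for a fixed-length file.

    Hypothetical helper for illustration only; not part of Cobrix or Spark.
    """
    if record_length <= 0 or split_size_mb <= 0:
        raise ValueError("record_length and split_size_mb must be positive")
    split_bytes = split_size_mb * 1024 * 1024
    # Round the split down to a whole number of records (at least one).
    records_per_split = max(1, split_bytes // record_length)
    aligned_split_bytes = records_per_split * record_length
    # Ceiling division: the last split may be shorter than the others.
    num_splits = -(-file_size_bytes // aligned_split_bytes)
    return records_per_split, num_splits


# A file of about 1 GB with 200-byte records and a 32 MB target split size.
records_per_split, num_splits = plan_splits(1024 ** 3, 200, 32)
print(records_per_split, num_splits)
```

With more, smaller splits the work could be spread across more tasks, which is the motivation for the request. Whether that helps in practice depends on where the time is actually spent.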

sree018 avatar Feb 20 '23 11:02 sree018

Hi, could you share your code snippet?

Cobrix should be more efficient for F format than V or VB formats. Not sure why you are having performance issues.

yruslan avatar Feb 21 '23 08:02 yruslan

Hi @yruslan ,

I am using Cobrix version 2.6.3. Please see the Cobrix options in the screenshot: 2B8883B5-682A-41D5-806C-5513B4D80CD2

Please let me know how to optimize the job.

sree018 avatar Feb 22 '23 12:02 sree018

You can try:

  • Removing the 'segment_id_level0' option just to see if there are any performance improvements.
  • When the number of columns is this large, Spark can spend a lot of time creating the execution plan. You can try removing further processing (flattening etc.) for now to see if it improves the performance.

The idea of the above exercises is to understand what impacts the performance - data decoding or other transformations.
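One way to run such a comparison is to time each variant of the job separately. The sketch below is only an illustration: `time_stage` is a hypothetical helper, and the two functions are stand-ins for the real Spark actions (for example, an action on the DataFrame right after reading versus an action after flattening). Spark evaluates lazily, so each variant needs to end in an action for the timing to mean anything.

```python
import time


def time_stage(label, action):
    """Run a zero-argument callable and report its wall-clock time."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


# Stand-ins for the real jobs (hypothetical). In a Spark job each of these
# would trigger an action, e.g. a count or a write, on the relevant DataFrame.
def decode_only():
    return sum(range(1000))


def decode_and_transform():
    return [x * 2 for x in range(1000)]


_, t_decode = time_stage("decode only", decode_only)
_, t_full = time_stage("decode + transform", decode_and_transform)
```

If the two timings are close, decoding dominates. If the second is much larger, the extra transformations or the execution plan are the more likely bottleneck.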

The screenshot shows only 8 segment mappings. Are there 8 segments or 100?

Otherwise the code looks good.

yruslan avatar Feb 22 '23 15:02 yruslan

Hi @yruslan

For security reasons, I can't publish the copybook here. The file is fixed length (200 bytes per record) with 100 segments, which I grouped into 8 segments; a secondary split happens under each of those 8 segments.

Screenshots: 6208AE3E-B39E-4B2F-B529-C5D932475083 FFFEC5D1-F035-4422-9B7E-9BAEA1B1A3D5 925F636A-D16E-4588-81D1-36BD88B263CD 305E50E5-9CC1-471D-ADC5-BF11FA5B8C72

sree018 avatar Feb 23 '23 11:02 sree018