beam icon indicating copy to clipboard operation
beam copied to clipboard

[Bug][Python]: ReadFromCsv with the dtype argument is very slow

Open liferoad opened this issue 9 months ago • 0 comments

What happened?

Test code

import apache_beam as beam
import datetime

CSV_DATA = r"""a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
1,"text",1,21,5945023,376974,0,0,0,1,2,0,4,,,,,,
"""

filename = '/tmp/input.csv'
with open(filename, 'w') as f:
  f.write(CSV_DATA)

# less than 1 second
start = datetime.datetime.now()
with beam.Pipeline() as pipeline:
  ( pipeline | beam.io.ReadFromCsv('/tmp/input.csv')
    | beam.io.WriteToCsv('/tmp/output.csv')
  )
print('Elapsed time ', (datetime.datetime.now() - start))

# ~25 seconds with dtype=str 
start = datetime.datetime.now()
with beam.Pipeline() as pipeline:
  ( pipeline | beam.io.ReadFromCsv('/tmp/input.csv', dtype=str)
    | beam.io.WriteToCsv('/tmp/output.csv')
  )
print('Elapsed time ', (datetime.datetime.now() - start))

Problem

WriteToCsv does not matter here. When running this with %prun, output_partitioning_in_stage is dominated. We need to figure out why this happens and fix this performance issue.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • [X] Component: Python SDK
  • [ ] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner

liferoad avatar May 01 '24 20:05 liferoad