[Bug][Python]: ReadFromCsv with the dtype argument is very slow

Open liferoad opened this issue 9 months ago • 0 comments

What happened?

Test code

import apache_beam as beam
import datetime

CSV_DATA = r"""a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
1,"text",1,21,5945023,376974,0,0,0,1,2,0,4,,,,,,
"""

filename = '/tmp/input.csv'
with open(filename, 'w') as f:
  f.write(CSV_DATA)

# Without dtype: less than 1 second
start = datetime.datetime.now()
with beam.Pipeline() as pipeline:
  ( pipeline | beam.io.ReadFromCsv('/tmp/input.csv')
    | beam.io.WriteToCsv('/tmp/output.csv')
  )
print('Elapsed time ', (datetime.datetime.now() - start))

# With dtype=str: ~25 seconds
start = datetime.datetime.now()
with beam.Pipeline() as pipeline:
  ( pipeline | beam.io.ReadFromCsv('/tmp/input.csv', dtype=str)
    | beam.io.WriteToCsv('/tmp/output.csv')
  )
print('Elapsed time ', (datetime.datetime.now() - start))

Problem

WriteToCsv is not a factor here. Profiling the run with %prun shows that output_partitioning_in_stage dominates the run time. We need to figure out why this happens and fix the performance issue.
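For anyone reproducing this outside IPython, a similar profile can probably be collected with the standard-library cProfile module instead of %prun. The sketch below is only an illustration: `profile_top` and `stand_in_workload` are hypothetical names, and the stand-in workload exists so the snippet runs without Beam installed. To profile the actual issue, the callable would be a function that builds and runs the dtype=str pipeline from the test code.

```python
import cProfile
import io
import pstats


def profile_top(func, limit=15):
  """Run func() under cProfile and return a report sorted by cumulative time.

  Hypothetical helper, not part of Beam.
  """
  profiler = cProfile.Profile()
  profiler.enable()
  try:
    func()
  finally:
    profiler.disable()
  buffer = io.StringIO()
  stats = pstats.Stats(profiler, stream=buffer)
  # Sorting by cumulative time surfaces the calls that dominate the run.
  stats.sort_stats('cumulative').print_stats(limit)
  return buffer.getvalue()


def stand_in_workload():
  # Placeholder so the sketch runs without Beam. For the real reproduction,
  # replace the body with the `with beam.Pipeline() as pipeline:` block that
  # passes dtype=str to ReadFromCsv.
  return sum(i * i for i in range(100000))


report = profile_top(stand_in_workload)
print(report)
```

If the report for the dtype=str pipeline looks like the %prun output described above, output_partitioning_in_stage should appear near the top of the cumulative-time column.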

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

[X] Component: Python SDK
[ ] Component: Java SDK
[ ] Component: Go SDK
[ ] Component: Typescript SDK
[ ] Component: IO connector
[ ] Component: Beam YAML
[ ] Component: Beam examples
[ ] Component: Beam playground
[ ] Component: Beam katas
[ ] Component: Website
[ ] Component: Spark Runner
[ ] Component: Flink Runner
[ ] Component: Samza Runner
[ ] Component: Twister2 Runner
[ ] Component: Hazelcast Jet Runner
[ ] Component: Google Cloud Dataflow Runner

May 01 '24 20:05 liferoad
