beam
beam copied to clipboard
[Bug][Python]: ReadFromCsv with the dtype argument is very slow
What happened?
Test code
import apache_beam as beam
import datetime
CSV_DATA = r"""a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
1,"text",1,21,5945023,376974,0,0,0,1,2,0,4,,,,,,
"""
filename = '/tmp/input.csv'
with open(filename, 'w') as f:
f.write(CSV_DATA)
# less than 1 second
start = datetime.datetime.now()
with beam.Pipeline() as pipeline:
( pipeline | beam.io.ReadFromCsv('/tmp/input.csv')
| beam.io.WriteToCsv('/tmp/output.csv')
)
print('Elapsed time ', (datetime.datetime.now() - start))
# ~25 seconds with dtype=str
start = datetime.datetime.now()
with beam.Pipeline() as pipeline:
( pipeline | beam.io.ReadFromCsv('/tmp/input.csv', dtype=str)
| beam.io.WriteToCsv('/tmp/output.csv')
)
print('Elapsed time ', (datetime.datetime.now() - start))
Problem
WriteToCsv
does not matter here. When running this with %prun
, output_partitioning_in_stage is dominated. We need to figure out why this happens and fix this performance issue.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner