
[airbyte-cdk] Increase the maximum parseable field size for CSV files

Open blarghmatey opened this issue 1 year ago • 9 comments

Python's csv library defaults to a maximum allowed field size of 128 KiB (131,072 characters). Loading files that contain fields exceeding that limit fails with a parse error. This change updates the parser to register the maximum allowable field size supported by the runtime system.

Example error from an S3 source connector:

2024-03-17 17:35:01 source > Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. stream=raw__edxorg__s3__tables__certificates_generatedcertificate file=edxorg-raw-data/edxorg/raw_data/db_table/certificates_generatedcertificate/prod/MITx-6.041x_1-1T2015/bb970ead7355f6813844e92b66e80d6cbcfbff2dbdbcf49ca4daab5676632eaf.tsv line_no=14534 n_skipped=0
Stack Trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py", line 99, in read_records_from_slice
    for record in parser.parse_records(self.config, file, self.stream_reader, self.logger, schema):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 194, in parse_records
    for row in data_generator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 67, in read_data
    for row in reader:
  File "/usr/local/lib/python3.9/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)
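
The error above and the approach described in the PR can be sketched as follows. This is not the exact diff from the PR, just a minimal illustration: `set_max_csv_field_size` is a hypothetical name, and the loop halves `sys.maxsize` until `csv.field_size_limit` accepts a value that fits in a C long.

```python
import csv
import io
import sys

# Reproduce the failure: the csv module's default limit is 131072
# characters, so a 200,000-character field raises csv.Error.
assert csv.field_size_limit() == 131072
try:
    next(csv.reader(io.StringIO("x" * 200_000 + "\n")))
except csv.Error as exc:
    print(exc)  # field larger than field limit (131072)

def set_max_csv_field_size() -> int:
    """Raise the csv field size limit to the largest value the
    platform accepts. csv.field_size_limit() raises OverflowError
    for values that do not fit in a C long, so halve until accepted."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit //= 2

new_limit = set_max_csv_field_size()

# The oversized field now parses without error.
row = next(csv.reader(io.StringIO("x" * 200_000 + "\n")))
assert len(row[0]) == 200_000
```

Registering the limit from `sys.maxsize` (rather than a hard-coded constant) is what lets the parser adapt to the runtime system, which is the behavior the PR description asks for.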

blarghmatey avatar Mar 20 '24 17:03 blarghmatey


@brianjlai @marcosmarxm here's another patch to the CSV logic in the CDK

blarghmatey avatar Mar 20 '24 17:03 blarghmatey

Thanks for the contribution. @natikgadzhi can this be added to the next sprint?

marcosmarxm avatar Mar 21 '24 00:03 marcosmarxm

@marcosmarxm @natikgadzhi just checking in on the status for this PR

blarghmatey avatar Mar 29 '24 17:03 blarghmatey

Hey folks. @blarghmatey, sorry for the delay 🤦🏼 — my bad. And thank you for putting this together! Looking, give me a minute.

natikgadzhi avatar Apr 01 '24 22:04 natikgadzhi

@girarda @natikgadzhi let me know if there are any other changes that you need me to make to get this merged.

blarghmatey avatar Apr 09 '24 20:04 blarghmatey

@blarghmatey I'm going to coordinate with Natik and Alex about this change. Hope to get this merged soon.

marcosmarxm avatar Apr 10 '24 16:04 marcosmarxm

@marcosmarxm @natikgadzhi @girarda just checking in again on this. If we can get it merged and processed through into a new build of the S3 source this week that would be very helpful.

blarghmatey avatar Apr 17 '24 15:04 blarghmatey

@girarda @natikgadzhi I think I've resolved the lint failure it was running into. Can you do another round of review so we can hopefully get this merged?

blarghmatey avatar Apr 30 '24 17:04 blarghmatey

@natikgadzhi @marcosmarxm just another ping before the weekend. If this could be ready for next week that would be great, because this bug is currently blocking me from ingesting a chunk of data.

blarghmatey avatar May 03 '24 17:05 blarghmatey

The change looks fine to me.

natikgadzhi avatar May 05 '24 18:05 natikgadzhi

Thanks for the approval @natikgadzhi. @marcosmarxm is there anything I can do to help merge this and publish the CDK?

blarghmatey avatar May 06 '24 16:05 blarghmatey

Thanks for your contribution @blarghmatey! I kicked off a CDK publish https://github.com/airbytehq/airbyte/actions/runs/8994319435

girarda avatar May 08 '24 00:05 girarda