[airbyte-cdk] Increase the maximum parseable field size for CSV files
The Python csv library defaults to a maximum allowed field length of 128 KiB (131,072 characters). This can cause failures when loading files that contain fields exceeding that length. This change updates the parser to register the maximum allowable field size supported by the runtime system.
Example error from an S3 source connector:
```
2024-03-17 17:35:01 source > Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. stream=raw__edxorg__s3__tables__certificates_generatedcertificate file=edxorg-raw-data/edxorg/raw_data/db_table/certificates_generatedcertificate/prod/MITx-6.041x_1-1T2015/bb970ead7355f6813844e92b66e80d6cbcfbff2dbdbcf49ca4daab5676632eaf.tsv line_no=14534 n_skipped=0
Stack Trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py", line 99, in read_records_from_slice
    for record in parser.parse_records(self.config, file, self.stream_reader, self.logger, schema):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 194, in parse_records
    for row in data_generator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 67, in read_data
    for row in reader:
  File "/usr/local/lib/python3.9/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)
```
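For reference, the standard-library hook involved here is `csv.field_size_limit()`. Below is a minimal sketch of the approach described above (the helper name is hypothetical and this is not the exact CDK diff): it raises the per-field limit to the largest value the runtime accepts.

```python
import csv
import sys

# Hypothetical helper illustrating the idea: raise the csv module's per-field
# limit to the largest value the runtime supports. csv.field_size_limit()
# raises OverflowError when the value does not fit in the platform's C long,
# so the candidate value is shrunk until it is accepted.
def set_max_csv_field_size() -> None:
    candidate = sys.maxsize
    while True:
        try:
            csv.field_size_limit(candidate)
            return
        except OverflowError:
            candidate = candidate // 10
```

Calling something like this once before constructing the `csv` reader avoids the `_csv.Error: field larger than field limit (131072)` failure shown above.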
@brianjlai @marcosmarxm here's another patch to the CSV logic in the CDK
Thanks for the contribution. @natikgadzhi can this be added to the next sprint?
@marcosmarxm @natikgadzhi just checking in on the status for this PR
Hey folks. @blarghmatey, sorry for the delay 🤦🏼 — my bad. And thank you for putting this together! Looking, give me a minute.
@girarda @natikgadzhi let me know if there are any other changes that you need me to make to get this merged.
@blarghmatey I'm going to coordinate with Natik and Alex about this change. Hope to get this merged soon.
@marcosmarxm @natikgadzhi @girarda just checking in again on this. If we can get it merged and released in a new build of the S3 source this week, that would be very helpful.
@girarda @natikgadzhi I think I resolved the lint failure that it was running into. Can you do another round of review so we can hopefully get this merged?
@natikgadzhi @marcosmarxm just another ping before the weekend. If I could have this ready for next week that would be great because I'm currently blocked on ingesting a chunk of data due to this bug.
The change looks fine to me.
Thanks for the approval @natikgadzhi. @marcosmarxm is there anything I can do to help merge this and publish the CDK?
Thanks for your contribution @blarghmatey! I kicked off a CDK publish https://github.com/airbytehq/airbyte/actions/runs/8994319435