smarter_csv
Improved column separator detection by ignoring quoted sections
Summary:
This pull request improves how the column separator (delimiter) is detected in the CSV files we process. Previously, the guess_column_separator method simply counted occurrences of each candidate delimiter (comma, tab, semicolon, colon, and pipe) without considering context. This could lead to misidentification when candidate characters inside quoted fields, such as the comma in "Smith, John", were counted as if they were delimiters. The updated logic ignores anything inside quoted sections when counting, which makes delimiter detection more accurate.
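To make the failure mode described above concrete, here is a minimal sketch of context-free counting. It is not the gem's actual code: `NAIVE_CANDIDATES` and `naive_guess` are hypothetical names chosen for illustration, and the sample line is made up.

```ruby
# Hypothetical illustration, not smarter_csv's implementation:
# count every candidate delimiter, including those inside quotes.
NAIVE_CANDIDATES = [",", "\t", ";", ":", "|"].freeze

def naive_guess(line)
  # Build a { separator => occurrence count } hash for this line.
  counts = NAIVE_CANDIDATES.to_h { |sep| [sep, line.count(sep)] }
  # Pick the separator with the highest count.
  counts.max_by { |_sep, n| n }.first
end

# A semicolon-separated line whose quoted fields contain commas.
line = %("Smith, John, Jr.";"New York, NY, USA";42)

naive_guess(line) # => "," (4 commas inside quotes outnumber the 2 real semicolons)
```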
Changes:
- [x] Modified the guess_column_separator method to exclude text within quotes when counting delimiter occurrences.
- [x] Added regex-based splitting to remove quoted sections before counting delimiter occurrences in each line.
- [x] Added tests
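
The approach in the checklist above could be sketched roughly as follows. This is an assumption-laden illustration rather than the code in this pull request: `SEPARATOR_CANDIDATES`, `strip_quoted`, and `guess_separator` are hypothetical names, the quote character is assumed to be a double quote by default, and the input is assumed to be an array of already-read lines.

```ruby
# Illustrative sketch only; names and signatures are assumptions,
# not the gem's actual guess_column_separator.
SEPARATOR_CANDIDATES = [",", "\t", ";", ":", "|"].freeze

# Removes quoted sections from a line, keeping only the text outside quotes.
def strip_quoted(line, quote_char = '"')
  q = Regexp.escape(quote_char)
  # Split on: opening quote, any run of non-quote characters, closing quote.
  line.split(/#{q}[^#{q}]*#{q}/).join
end

# Counts candidate separators outside quoted sections across all lines
# and returns the most frequent one.
def guess_separator(lines)
  counts = Hash.new(0)
  lines.each do |line|
    unquoted = strip_quoted(line)
    SEPARATOR_CANDIDATES.each { |sep| counts[sep] += unquoted.count(sep) }
  end
  best = counts.max_by { |_sep, n| n }
  raise ArgumentError, "no candidate separator found" if best.nil? || best.last.zero?
  best.first
end

line = %("Smith, John, Jr.";"New York, NY, USA";42)
strip_quoted(line)      # => ";;42"
guess_separator([line]) # => ";"
```

Splitting on the quoted-section pattern and joining the remainder leaves only the characters that can legitimately act as delimiters, so the commas inside the quoted fields no longer contribute to the count.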
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 100.00%. Comparing base (07160c1) to head (b812d03).
Additional details and impacted files
@@            Coverage Diff            @@
##              main      #276   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           11        11
  Lines          380       382    +2
=========================================
+ Hits           380       382    +2
@nicastelo sorry for the delay - I'll have a look at it this week
looks good! 👍