smarter_csv

Improved column separator detection by ignoring quoted sections

Open nicastelo opened this issue 11 months ago • 1 comment

Summary:

This pull request improves how the column separator (delimiter) is detected in the CSV files being processed. Previously, the guess_column_separator method simply counted occurrences of each candidate delimiter (comma, tab, semicolon, colon, and pipe) without considering context. Delimiter characters that appeared inside quoted fields were counted too, which could cause the wrong separator to be chosen. The updated logic ignores candidate delimiters found within quoted sections, so detection is more accurate.

Changes:

  • [x] Modified the guess_column_separator method to exclude text within quotes when counting delimiter occurrences.
  • [x] Added regex-based splitting to remove quoted sections before counting delimiter occurrences in each line.
  • [x] Added tests
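
To illustrate the idea behind these changes, here is a minimal sketch in plain Ruby. It is not the code from this pull request: the names `CANDIDATE_DELIMITERS`, `count_delimiters_outside_quotes`, and `guess_separator_sketch` are hypothetical, and it removes quoted sections with `gsub` rather than the regex-based split mentioned above. It also assumes a quoted field does not span multiple lines.

```ruby
# Hypothetical illustration only; not the smarter_csv implementation.
CANDIDATE_DELIMITERS = [",", "\t", ";", ":", "|"].freeze

# Count each candidate delimiter in a line, ignoring anything inside quotes.
def count_delimiters_outside_quotes(line, quote_char = '"')
  q = Regexp.escape(quote_char)
  # Remove every quoted section (e.g. "Doe, John") before counting.
  unquoted = line.gsub(/#{q}[^#{q}]*#{q}/, "")
  CANDIDATE_DELIMITERS.to_h { |d| [d, unquoted.count(d)] }
end

# Sum the counts over several sample lines and pick the most frequent
# candidate. Returns nil when no candidate appears outside quotes.
def guess_separator_sketch(lines, quote_char = '"')
  totals = Hash.new(0)
  lines.each do |line|
    count_delimiters_outside_quotes(line, quote_char).each do |delimiter, n|
      totals[delimiter] += n
    end
  end
  best = totals.max_by { |_delimiter, n| n }
  best && best[1] > 0 ? best[0] : nil
end

# This line has four commas inside quotes but only two semicolons outside
# them, so a naive count would favor the comma.
sample = %q{"Doe, John, Jr.";42;"a, b, c"}
p guess_separator_sketch([sample]) # => ";"
```

Stripping the quoted sections first keeps the counting step unchanged, which is consistent with the small line delta shown in the coverage report below.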

nicastelo avatar Mar 14 '24 13:03 nicastelo

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (07160c1) to head (b812d03).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #276   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           11        11           
  Lines          380       382    +2     
=========================================
+ Hits           380       382    +2     


codecov[bot] avatar Mar 16 '24 16:03 codecov[bot]

@nicastelo sorry for the delay - I'll have a look at it this week

tilo avatar Jul 10 '24 01:07 tilo

looks good! 👍

tilo avatar Jul 10 '24 02:07 tilo