
feat(api): more robust CSV parsing and usage in run context for CSV parameters

jbleon95 opened this issue 1 year ago

Overview

Closes AUTH-419.

This PR fleshes out the work done in #15434 by adding a contents property to the CSV parameter interface as used in the run context (e.g., a parameter called csv_data would be accessed via protocol_context.params.csv_data), and by more robustly parsing the CSV file contents for use with the rows() function.
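As a rough, illustrative sketch of the interface shape described above (this is a hypothetical stand-in, not the actual Opentrons CSVParameter class, which also does dialect handling), the two access points look like:

```python
import csv
import io


class CSVParameterSketch:
    """Minimal stand-in for the CSV parameter interface described above."""

    def __init__(self, contents: str) -> None:
        self._contents = contents

    @property
    def contents(self) -> str:
        # The entire file contents as one string.
        return self._contents

    def rows(self) -> list:
        # Parsed rows as lists of strings; the real implementation
        # adds dialect sniffing on top of this.
        return list(csv.reader(io.StringIO(self._contents)))
```

In a protocol, the real object would be reached via protocol_context.params.csv_data and used the same way: .contents for the raw string, .rows() for parsed rows.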

CSV files can come in a number of different formats, called dialects by the Python csv library. The dialect determines things like the delimiter, how quotes are escaped, whether initial spaces are skipped, and so on. Because the format lacks strict standardization, any combination of those properties still constitutes a valid CSV file. Fortunately, the Python csv library provides a Sniffer class that can attempt to detect which dialect is being used when given a sample of the CSV file contents. For example, consider this CSV file (all examples in this description are taken from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html):

"Year", "Mileage (thousands)",  "Price"
1998,  27,    9991 
1997,  17,    9925 
1998,  28,   10491 
1998,   5,   10990 
1997,  38,    9493 
1997,  36,    9991 
1997,  24,   10490 
1997,  37,    9491 
1997,  38,    9491 
1997,  30,    9990 
1997,  38,    9491 
1997,  25,    9990 
1997,  39,    9990 
1997,  22,    9390 
1997,  24,    9990 
1997,  37,    9990 
1997,  29,    9990 
1997,  70,    8990 
1996,  29,    7990 
1995,  72,    5994 
1993,  72,    5994 
1994,  61,    5500 
1998,   7,   11000 

A pretty straightforward CSV file, with the header strings in quotes followed by three columns of numeric data. Yet without dialect detection, the header and the first three rows parse as:

['Year', ' "Mileage (thousands)"', '  "Price"'],
['1998', '  27', '    9991 '], 
['1997', '  17', '    9925 '], 
['1998', '  28', '   10491 '],

Notice how the initial spaces are taken as part of each field, so the quote characters end up included in the header elements, and how the variable leading spacing before the second and third elements of the data rows is also preserved. By providing the contents of the file to the sniffer, the output instead looks like:

['Year', 'Mileage (thousands)', 'Price'],
['1998', '27', '9991 '],
['1997', '17', '9925 '],
['1998', '28', '10491 ']

Again, not perfect, given the trailing space at the end of the third element of each data row, but much closer to what the writer of this CSV file intended.
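The difference between the two parses above can be sketched with the standard library directly, using a shortened version of the car-price file:

```python
import csv
import io

# Shortened car-price example: quoted headers, uneven spacing after commas.
sample = (
    '"Year", "Mileage (thousands)",  "Price"\n'
    "1998,  27,    9991\n"
    "1997,  17,    9925\n"
)

# Default parsing keeps the leading spaces, so the header quotes
# are treated as literal characters inside the fields.
default_rows = list(csv.reader(io.StringIO(sample)))

# Sniffing the dialect first detects the quote character and that
# initial spaces after the delimiter should be skipped.
dialect = csv.Sniffer().sniff(sample)
sniffed_rows = list(csv.reader(io.StringIO(sample), dialect))
```

Here default_rows reproduces the messy output shown above, while sniffed_rows yields the clean header ['Year', 'Mileage (thousands)', 'Price'].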

Unfortunately, this sniffer functionality is not perfect at determining the correct dialect. For another example, consider this headerless and slightly deranged CSV file:

John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123

Without any dialect sniffing, the CSV reader parses this mostly correctly barring some spacing issues:

['John', 'Doe', '120 jefferson st.', 'Riverside', ' NJ', ' 08075'],
['Jack', 'McGinnis', '220 hobo Av.', 'Phila', ' PA', '09119'],
['John "Da Man"', 'Repici', '120 Jefferson St.', 'Riverside', ' NJ', '08075'],
['Stephen', 'Tyler', '7452 Terrace "At the Plaza" road', 'SomeTown', 'SD', ' 91234'],
['', 'Blankman', '', 'SomeTown', ' SD', ' 00298'], 
['Joan "the bone", Anne', 'Jet', '9th, at Terrace plc', 'Desert City', 'CO', '00123']

Yet with dialect sniffing, the sniffer determines, due to the inconsistent spacing in the CSV file, that the delimiter is a literal space character, and then parses it wildly incorrectly:

['John,Doe,120', 'jefferson', 'st.,Riverside,', 'NJ,', '08075'],
['Jack,McGinnis,220', 'hobo', 'Av.,Phila,', 'PA,09119'],
['John "Da', 'Man""",Repici,120', 'Jefferson', 'St.,Riverside,', 'NJ,08075'],
['Stephen,Tyler,"7452', 'Terrace', 'At', 'the', 'Plaza""', 'road",SomeTown,SD,', '91234'],
[',Blankman,,SomeTown,', 'SD,', '00298'],
['Joan "the', 'bone"",', 'Anne",Jet,"9th,', 'at', 'Terrace', 'plc",Desert', 'City,CO,00123']

To accommodate this potential misreading of the dialect, the CSV row parser in the CSVParameter class will always set the delimiter to a comma. This is the one standard that we enforce.
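A minimal sketch of that enforced-comma approach (illustrative only, not the actual CSVParameter implementation) relies on the fact that csv.Sniffer().sniff() returns a Dialect whose attributes can be overridden before use:

```python
import csv
import io


def rows_with_enforced_comma(contents: str) -> list:
    """Sniff quoting/spacing behavior, but always force a comma delimiter."""
    dialect = csv.Sniffer().sniff(contents)
    dialect.delimiter = ","  # the one standard we enforce
    return list(csv.reader(io.StringIO(contents), dialect))
```

Even if the sniffer misidentifies the delimiter as a space, the override keeps the comma, so only the other sniffed attributes (quote character, skipinitialspace, etc.) vary.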

TL;DR: CSV files are super inconsistent, and we'll try our hardest to parse them, but no guarantees if your data is super funky.

As far as error handling: if a csv.Error is raised when we try to sniff the dialect, we'll default to parsing with no sniffed dialect. If a csv.Error is raised during parsing of the rows, or a UnicodeDecodeError is raised at any point, we'll fail the protocol.

Test Plan

Not a ton of end-to-end testing, since the data file endpoint and parameter setting are not fully implemented, but the parsing was tested on the above example files as well as others, plus unit testing.

Changelog

  • Added a contents property to CSVParameter that returns the entire contents of the file as a single string
  • More robust CSV file content parsing for CSVParameter.rows()

Review requests

If we can't parse the CSV rows, should we fail the protocol, or should we just disallow access to rows()? If the latter, how do we best communicate that?

Risk assessment

Low.

jbleon95 · Jun 25 '24 14:06