FastFEC icon indicating copy to clipboard operation
FastFEC copied to clipboard

Incorrect number of trailing commas when last field(s) are empty

Open afischer opened this issue 3 years ago • 9 comments

FastFEC export seems to be missing a trailing comma in lines that have one or more empty items at the end of a row.

Using homebrew version of fastfec on a M1 MacBook Pro running macOS Montery 12.4.

For example, you can reproduce this by running fastfec 876050 fastfec_output/ and checking the header.csv (should be an additional trailing comma after report_number 002), SB28A.csv (42 fields in line items vs 43 in header) or SB23.csv (43 fields in line items vs 44 in header)

header.csv:

record_type,ef_type,fec_version,soft_name,soft_ver,report_id,report_number,comment
HDR,FEC,8.0,Microsoft Navision 3.60 - AVF Consulting,1.00,FEC-840327,002

afischer avatar Jul 12 '22 22:07 afischer

Hey, thanks for opening an issue! At the time I wrote the bulk of the code, this was the intended behavior. There are some errant filings with incorrect numbers of columns per line. To make the code as accurate as possible, I made it just output as many fields as are observed in the filing (and if there were more columns than headers, the remaining would just be printed for completeness). This quite literally reflects what is actually in the filing itself (and could be useful if you care to, say, detect filings with incorrect numbers of columns).

If you rerun FastFEC with the --warn parameter, e.g. fastfec --warn 876050 fastfec_output/, it will show many warning messages corresponding to these column mismatches.

Whether this is the correct behavior may warrant rethinking. Does having fewer columns break downstream tooling for you? Do you think it's more accurate to put in trailing commas for missing fields (in this case, what should be done for extra fields in a given row)? Maybe we could expose an option to pad missing fields with commas, or make this the default behavior. I'm curious to hear your/others' thoughts!

freedmand avatar Jul 18 '22 07:07 freedmand

I'd vote for outputting the same number of commas as the header, because I suspect some import processes will choke otherwise. I vaguely remember participating in the discussion Dylan's talking about. WinRed seems to output one more separator than you'd expect, for instance. But I feel like a consistent number of commas would be the correct normalization.

chriszs avatar Jul 18 '22 17:07 chriszs

Hey @freedmand, thanks for the detailed response! I would also advocate for padding out to the correct number of commas regardless of the contents of the fecfile. I discovered this when attempting to use the COPY / FROM postgres command to easily pull a form into a database.

afischer avatar Jul 18 '22 17:07 afischer

Great, that makes sense! In the case of too many fields in a given row (i.e. more fields than header columns), do you think just truncating would make sense Andrew/Chris? We could make that the default and then expose a raw option to preserve the source filing as closely as possible

freedmand avatar Jul 18 '22 17:07 freedmand

I'd maybe draw a distinction between empty excess fields (like WinRed) and non-empty. Empty you can eat no problem. Non-empty typically means something has gone wrong and it might be useful to output those or error just so we catch it. I could see an argument for truncation regardless, though, but even then it should warn.

chriszs avatar Jul 18 '22 17:07 chriszs

So, essentially:

  • by default, always output exactly as many fields as headers
  • don't warn at all if there's a mismatch in field amount that entails too few fields or too many fields but the extra ones are empty
  • warn by default otherwise; have a silent option to suppress warnings
  • have a non-default raw option to output exactly what's in the filing even if it doesn't match the number of header cols

freedmand avatar Jul 18 '22 17:07 freedmand

That sounds great. I think having, by default, a set of CSVs that have a guaranteed number fields on all rows would be best here, though I agree it makes sense to have a "raw" option as well as warnings where helpful.

afischer avatar Jul 27 '22 13:07 afischer

I have this fixed in https://github.com/washingtonpost/FastFEC/pull/58. Please @chriszs @afischer take a look and give @freedmand some help.

NickCrews avatar Apr 13 '23 19:04 NickCrews

FYI, in the new version of DuckDB's CSV parser, they added an option that makes the parser more resilient to messy CSV data such as this. I am able to parse these CSVs with their missing trailing commas by using the null_padding option:

import duckdb

conn = duckdb.connect()
conn.read_csv(p, all_varchar=True,  quotechar='"', null_padding=True)

So this issue is a lot less pressing for me now.

NickCrews avatar Nov 25 '23 22:11 NickCrews