agate: csvlook: ignore unequal columns between rows

Open die4live opened this issue 8 years ago • 23 comments

I like to use csvlook -t to view tab-separated files with a nice layout, even when they aren't in strict TSV format. For example: one word on the first line as a title, then several more lines with three columns as content. An older version of csvkit could parse this with csvlook, but in an updated version the same command throws an error like 'Row 0 has 1 values, but Table only has 3 columns' and stops, and there's NO option to disable this rule.

So is it possible to add an argument to ignore unequal column counts in a file? All I want is to align the columns vertically for a nice view. Thanks!

die4live avatar Jan 13 '17 00:01 die4live

To reproduce: csvlook examples/bad.csv

jpmckinney avatar Jan 17 '17 05:01 jpmckinney

Is there a particular reason why this design choice was made, @jpmckinney? Just in case it matters: I have an appropriate CSV file, but if I do csvcut -c 31,32,39 example.csv | csvstat I am still stuck with the same error, Row 0 has 11 values, but Table only has 3 columns. In effect, this makes csvkit essentially unusable for me in general, and I would imagine many other people feel the same.

I would be happy to fix this if I understand the reasons behind this design choice.

denzilc avatar Jun 14 '17 08:06 denzilc

@denzilc We switched to agate for most of csvkit's functionality (see the above linked issue https://github.com/wireservice/agate/issues/666). However, agate doesn't (yet) support printing tables with ragged rows. We would either have to improve agate, or author some new code in csvkit (simply reverting back to the old implementation wouldn't be straightforward as we don't want to restore all the code that agate helped clean up).
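
For anyone curious where the error originates, here is a minimal, illustrative sketch (assuming agate is installed; examples/bad.csv is the repro file mentioned above):

import agate

try:
    agate.Table.from_csv('examples/bad.csv')
except ValueError as e:
    # agate rejects any row whose cell count differs from the header's;
    # this is the error that csvlook surfaces.
    print(e)  # e.g. "Row 0 has 1 values, but Table only has 3 columns"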

jpmckinney avatar Jun 14 '17 14:06 jpmckinney

This issue is just about csvlook. Please create a new issue and upload your sample file if you can produce a similar error with csvstat.

jpmckinney avatar Jun 14 '17 14:06 jpmckinney

Hello all,

I also managed to reproduce an error that occurs at least in csvjoin and csvsort, and which I believe is related to this issue. The error, I guess, is related to doubled double quotes (escaped quotes inside a quoted field). Please see the test case below:

Consider the inputs:

csvtest1.csv

colid,colb,colc
DSCK940189,P,"An escaped ""column value"" here"

csvtest2.csv

colb,colc,colid
P,"An escaped ""column value"" here",DSCK940189

csvtest3.csv

colid,colb,colc,cold
DSCK940189,P,aa,bb

Notice that csvtest1.csv and csvtest2.csv have exactly the same contents, with just the order of the columns exchanged.

Now, consider the commands and outputs:

$ csvsort.exe csvtest2.csv -c "colc" # Successful result
colb,colc,colid
P,"An escaped ""column value"" here",DSCK940189
$ csvsort.exe csvtest1.csv -c "colc" # Error
Row 0 has 5 values, but Table only has 1 columns.
$ csvjoin --left csvtest3.csv csvtest2.csv -c "colid" # Successful result
colid,colb,colc,cold,colb2,colc2
DSCK940189,P,aa,bb,P,"An escaped ""column value"" here"
$ csvjoin --left csvtest3.csv csvtest1.csv -c "colid" # Error
Row 0 has 5 values, but Table only has 1 columns.

Using csvclean -n in all files shows no errors.

Notice that the results weren't supposed to differ, considering that both the csvtest1 and csvtest2 files have the same column values. I tried using --escapechar, --quoting and other arguments, but got the same results.

It took me some time to debug this in a big CSV file. Is csvkit able to output the exact row number when it encounters an error?

Please, let me know how can I help.

petersonjr avatar Jun 24 '17 01:06 petersonjr

@petersonjr In the small samples you gave, the issue is that csvkit is sniffing an incorrect CSV dialect. (In most cases, sniffing works well. In edge cases, it doesn't. So, it's still best to have sniffing on by default.) Anyway, if you add --snifflimit 0 to the commands, they will work again.
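
As an illustration of how that misdetection can happen, here is a stdlib-only sketch (csvkit's sniffing is built on Python's csv.Sniffer under the hood, if I read the code right; the dialect guessed for a given sample may vary):

import csv

sample = 'colid,colb,colc\nDSCK940189,P,"An escaped ""column value"" here"\n'
dialect = csv.Sniffer().sniff(sample)
# On a short sample like this the sniffer may guess a delimiter other
# than ',' (see the follow-up below: it picked whitespace here), which
# is why the parse then reports the wrong number of values per row.
print(repr(dialect.delimiter))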

jpmckinney avatar Jun 24 '17 03:06 jpmckinney

Hello @jpmckinney, that did the trick, thank you! I only tested with the small examples I gave, but I will also test with the files I was working with. As these samples were taken from them, I expect this will work as well.

Would you mind answering these questions?

  • What's the default dialect of csvkit? Looking through the code, I realized it is the Python default, which is excel, except that the default line break is '\n' (excel uses '\r\n'). Is this assumption correct?
  • Can parameters like --quotechar, --quoting or --no-doublequote be overridden by csvkit's sniffing? Which prevails? Why does the example I gave fail even if I explicitly pass the correct arguments (--quotechar and so on)?

Also, what do you think of making the default dialect more explicit in the documentation, or is this obvious information that I missed?

Thank you for your fast reply

petersonjr avatar Jun 24 '17 05:06 petersonjr

@petersonjr

  • Yes, if sniffing is turned off, csvkit will use the Python defaults, which are described here (see also the quick check after this list). Note that when reading, the Python defaults are flexible about whether the line terminator is \r\n, \r or \n. When writing, csvkit uses \n instead of \r\n, as you wrote.
  • The documentation for csv.reader says more specific parameters override the dialect parameter. I did some debugging, and the issue is that the sniffer was determining that whitespace was the delimiter. If you set --delimiter ",", the output will be correct.
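
As a quick, illustrative check of those defaults (stdlib only, nothing csvkit-specific):

import csv

d = csv.excel
print(d.delimiter)                      # ','
print(d.quotechar)                      # '"'
print(d.doublequote)                    # True: quotes in fields are doubled
print(repr(d.lineterminator))           # '\r\n' (csvkit writes '\n' instead)
print(d.quoting == csv.QUOTE_MINIMAL)   # True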

jpmckinney avatar Jun 24 '17 16:06 jpmckinney

I created #848 to improve documentation.

jpmckinney avatar Jun 24 '17 16:06 jpmckinney

I'm having this problem with csvsql in csvkit (1.0.2).

Running csvsql --db postgresql://p:p@localhost:5432/p_unified --table PartD_Prescriber_PUF_NPI_15_raw --quoting 3 --insert --snifflimit 0 --delimiter "," '/Users/alison/Sites/prescriber/PartD_Prescriber_PUF_NPI_15.txt'

Returns this error: Row 20 has 69 values, but Table only has 67 columns.

This error happens even when I remove one or both of --snifflimit 0 and --delimiter ",". Running csvclean -n returns no errors.

awk 'FNR>19 && FNR <21' /Users/alison/Sites/p/PartD_Prescriber_PUF_NPI_15.txt returns: 1235283672,AADLAND,AARON,D.D.S.,M,I,1729 S CLIFF AVE,,SIOUX FALLS,57105,2126,SD,Dentist,T,N,51,51,313.82,302,31,,35,35,265.27,229,#,,,,,#,,,,0,0,,,,#,,,*,,,#,,,,,,,,41,181.78,29,,0,0,,0,,0,0,,0,71

@jpmckinney Any ideas besides downgrading csvkit?

alison985 avatar Dec 19 '17 17:12 alison985

@alison985 One note: awk's FNR is 1-based and counts the header line, whereas the row number reported by csvkit is zero-based and skips the header row, which awk doesn't skip. So 'FNR>19 && FNR<21' is likely looking at the wrong line; can you check 'FNR>20 && FNR<22' ?

jpmckinney avatar Dec 20 '17 04:12 jpmckinney

It's strange though that csvclean -n returns no errors, but csvsql does. csvsql uses agate whereas csvclean doesn't, so there may be a difference in logic that we can hopefully narrow down with your sample data.

jpmckinney avatar Dec 20 '17 04:12 jpmckinney

@jpmckinney 'FNR>20 && FNR<22' is: 1427070549,AADLAND,KRISTINE,DMD,F,I,2403 E EVERGREEN BLVD,,VANCOUVER,98661,,WA,Dentist,T,E,19,19,150.36,107,17,,19,19,150.36,107,,17,,0,0,,19,150.36,,0,0,#,,,,,,,,,#,,,0,0,0,0,0,17,107.41,15,,0,0,,0,,0,0,,0,71

alison985 avatar Dec 20 '17 05:12 alison985

@alison985 Hmm, so taking the two rows you've shared, I can't reproduce the error. Can you share the full file or a larger sample that reproduces the error?

jpmckinney avatar Dec 20 '17 14:12 jpmckinney

@jpmckinney I know - I can't find what it's coughing on them either. :(

I'm taking "Part D Prescriber Summary table, CY2015, (Tab Delimited Format) [135MB]" on this page: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/PartD2015.html However, that file has problems on line 5 million-and-something (also related to number of columns) so I run csvcut -t -c 1,2,3,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69 PartD_Prescriber_PUF_NPI_15.txt > PartD_Prescriber_PUF_NPI_15.txt.1 && rm PartD_Prescriber_PUF_NPI_15.txt && mv PartD_Prescriber_PUF_NPI_15.txt.1 PartD_Prescriber_PUF_NPI_15.txt to get to the file I'm trying.

(BTW, the original works fine with the postgres COPY command but AWS RDSs don't have superuser privileges needed to run a COPY hence I started down this journey.)

alison985 avatar Dec 20 '17 17:12 alison985

Since the error above was on line 20, I ran the first few lines only, but didn't get an error:

curl -O http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_15.zip
unzip PartD_Prescriber_PUF_NPI_15.zip
head -n 50 PartD_Prescriber_PUF_NPI_15/PartD_Prescriber_PUF_NPI_15.txt | csvsql -t --insert --db sqlite:///dummy.db

So I then tried running the entire file:

csvsql -t --insert --db sqlite:///dummy.db PartD_Prescriber_PUF_NPI_15/PartD_Prescriber_PUF_NPI_15.txt

I got this error:

CSV contains fields longer than maximum length of 131072 characters. Try raising the maximum with the field_size_limit parameter, or try setting quoting=csv.QUOTE_NONE.
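
(For context, the 131072-character cap mentioned there is Python's CSV field size limit. A hedged, stdlib-only sketch of raising it, independent of whatever flags csvkit exposes:)

import csv
import sys

print(csv.field_size_limit())      # default: 131072 (128 KB)
csv.field_size_limit(sys.maxsize)  # raise the cap for subsequent reads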

I didn't investigate which row caused that error; I just set quoting explicitly:

csvsql -u 3 -t --insert --db sqlite:///dummy.db PartD_Prescriber_PUF_NPI_15.txt

Then I got this error:

Row 537315 has 85 values, but Table only has 84 columns.

Updated:

That row has too many values, but only because I set quoting to csv.QUOTE_NONE; that line actually contains a quoted tab character:

1033537345	KUTSENOK	ANNA		M.D.	F	I	101 NICOLLS RD	"STONY BROOK MEDICINE 	DEPARTMENT OF OB/GYN, HSC, T9"	STONY BROOK	11794	8091	NY	US	Student in an Organized Health Care Education/Training Program	T	N	11	11	195.82	123	11	*					*		*			*				0	0	*			*			*			*			0	0	0	0	0					0	0		0		0	0		0	56					11	0			0		0	0			1.0302
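
Here is a small stdlib illustration of that over-splitting, using made-up field values:

import csv
import io

line = 'a\t"first part\tsecond part"\tb\n'
# Default quoting keeps the quoted, tab-containing field intact:
print(next(csv.reader(io.StringIO(line), delimiter='\t')))
# -> ['a', 'first part\tsecond part', 'b']
# QUOTE_NONE ignores the quotes, so the embedded tab splits the field:
print(next(csv.reader(io.StringIO(line), delimiter='\t', quoting=csv.QUOTE_NONE)))
# -> ['a', '"first part', 'second part"', 'b']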

The file uses \r\n line endings, and I can't find any line longer than the header row, so I don't know why there's the error "CSV contains fields longer than maximum length of 131072 characters. Try raising the maximum with the field_size_limit parameter, or try setting quoting=csv.QUOTE_NONE." Anyway, I tried adding --snifflimit 0 to disable any incorrect dialect sniffing:

csvsql --snifflimit 0 -t --insert --db sqlite:///dummy.db PartD_Prescriber_PUF_NPI_15.txt

Now it seems the whole file was imported!

jpmckinney avatar Dec 20 '17 21:12 jpmckinney

Please also consider the use case where csvlook is piped the output of some other program that takes a while to run. I was hoping that csvlook would be very useful for formatting the output so it can be read easily within the terminal, but this limitation makes it useless for that.

coxymla avatar Dec 22 '17 04:12 coxymla

@coxymla If a process takes a while to run, it's generally a good idea to redirect the output to a file, to avoid losing that time.

csvlook can't print ragged rows presently, but if you don't mind losing the extra cells in rows that are longer than the header row, you can run cat your.csv | csvcut -C "" | csvlook

jpmckinney avatar Dec 22 '17 05:12 jpmckinney

Thanks for the reply; the problem is that typically the first lines are the copyright notice, progress statements, etc., until the real header is finally printed somewhere in the middle, followed by the rows of data.

coxymla avatar Dec 22 '17 05:12 coxymla

@coxymla If you know the number of lines that appear at the top of the file, you can try using --skip-lines.

jpmckinney avatar Dec 22 '17 05:12 jpmckinney

~~Reminder: Upstream issue is https://github.com/wireservice/agate/issues/666~~

jpmckinney avatar May 21 '18 16:05 jpmckinney

@jpmckinney I also had this issue, and put together the following Python 3 script, which "sniffs" the first n lines (you can change how many), decides on the maximum column count, and squares up the data with trailing commas. It's pretty fast and simple, takes no command-line arguments, and expects to be run as a filter, e.g.

./csvfix.py <myfile.csv

but it copes with enormous files, as it does not hold much in memory. https://gist.github.com/tuck1s/ed1c20ccb51b860b04dd838d7057364d
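
For reference, a minimal sketch of that squaring-up approach (not the gist itself; the sample size and the empty-string padding are assumptions):

import csv
import sys

SAMPLE_ROWS = 100  # assumption: how many rows to sample for the max width

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)

# Buffer a small sample to find the maximum column count...
sample = []
for i, row in enumerate(reader):
    sample.append(row)
    if i + 1 >= SAMPLE_ROWS:
        break
width = max((len(row) for row in sample), default=0)

# ...then pad every row to that width, streaming the rest of the file
# so that enormous inputs never have to fit in memory.
for row in sample:
    writer.writerow(row + [''] * (width - len(row)))
for row in reader:
    writer.writerow(row + [''] * max(0, width - len(row)))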

tuck1s avatar Jun 06 '18 22:06 tuck1s

@tuck1s You can achieve the same result with csvtool from the OCaml-CSV package:

csvtool setcolumns $(head -1000 "$inp" | csvtool width -) "$inp" 

The performance should be similar to your solution (it uses the same logic), but you have to have the input as a file in the variable $inp.

halloleo avatar Jun 07 '18 05:06 halloleo

Changing agate is not looking likely. Instead, I documented all the solutions in this thread at https://csvkit.readthedocs.io/en/latest/scripts/csvlook.html#examples

edf0de7

jpmckinney avatar Oct 17 '23 20:10 jpmckinney