csvkit
agate: csvlook: ignore unequal columns between rows
I like to use csvlook -t to view tab-separated files with a nice layout even when they are not in strict TSV format. For example, a file with one word on the first line as a title and a few more lines with three columns as content could be parsed by csvlook in older versions of csvkit. With an updated version, however, the same command throws an error like 'Row 0 has 1 values, but Table only has 3 columns' and stops, and there is no option to disable this rule.
So is it possible to add an argument to ignore unequal columns in a file? All I want is to align the columns vertically for a nice view, thanks!
To reproduce: csvlook examples/bad.csv
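For reference, here is a minimal sketch of the same strict-width check in Python, assuming the agate library that csvkit uses internally (the exact exception text may vary by version):

import io
import agate

# A one-column header followed by three-column rows, like bad.csv:
ragged = io.StringIO('title\na,b,c\nd,e,f\n')
# Raises an error along the lines of
# "Row 0 has 3 values, but Table only has 1 columns."
agate.Table.from_csv(ragged)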
Is there a particular reason why this design choice was made, @jpmckinney? Just in case: I have a well-formed CSV file, but if I run csvcut -c 31,32,39 example.csv | csvstat I am still stuck with the same error, Row 0 has 11 values, but Table only has 3 columns. In effect, this renders csvkit unusable for me in general, and I imagine many other people feel the same.
I would be happy to fix this if I understand the reasons behind this design choice.
@denzilc We switched to agate for most of csvkit's functionality (see the above linked issue https://github.com/wireservice/agate/issues/666). However, agate doesn't (yet) support printing tables with ragged rows. We would either have to improve agate, or author some new code in csvkit (simply reverting to the old implementation wouldn't be straightforward, as we don't want to restore all the code that agate helped clean up).
This issue is just about csvlook. Please create a new issue and upload your sample file if you can produce a similar error with csvstat.
Hello all,
I also managed to reproduce an error that occurs at least in csvjoin and csvsort, and which I believe is related to this issue. The error, I suspect, is related to doubled double quotes (escaped quotes). Please see the test case below:
Consider the inputs:
csvtest1.csv
colid,colb,colc
DSCK940189,P,"An escaped ""column value"" here"
csvtest2.csv
colb,colc,colid
P,"An escaped ""column value"" here",DSCK940189
csvtest3.csv
colid,colb,colc,cold
DSCK940189,P,aa,bb
Notice that csvtest1.csv and csvtest2.csv have exactly the same contents, with just the order of the columns exchanged.
Now, consider the commands and outputs:
$ csvsort.exe csvtest2.csv -c "colc" # Successful result
colb,colc,colid
P,"An escaped ""column value"" here",DSCK940189
$ csvsort.exe csvtest1.csv -c "colc" # Error
Row 0 has 5 values, but Table only has 1 columns.
$ csvjoin --left csvtest3.csv csvtest2.csv -c "colid" # Successful result
colid,colb,colc,cold,colb2,colc2
DSCK940189,P,aa,bb,P,"An escaped ""column value"" here"
$ csvjoin --left csvtest3.csv csvtest1.csv -c "colid" # Error
Row 0 has 5 values, but Table only has 1 columns.
Running csvclean -n on all files shows no errors.
Notice that the results aren't supposed to differ, since csvtest1 and csvtest2 contain the same column values. I tried --escapechar, --quoting and other arguments, but got the same results.
It took me some time to debug this in a big CSV file. Is csvkit able to output the exact row number when it encounters an error?
Please let me know how I can help.
@petersonjr In the small samples you gave, the issue is that csvkit is sniffing an incorrect CSV dialect. (In most cases, sniffing works well. In edge cases, it doesn't. So it's still best to have sniffing on by default.) Anyway, if you add --snifflimit 0 to the commands, they will work again.
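If you want to see what the sniffer decides without running a full csvkit command, here is a small probe using Python's csv.Sniffer; treating csvkit's detection as building on csv.Sniffer is an assumption about your version:

import csv

with open('csvtest1.csv', newline='') as f:
    sample = f.read(1024)

dialect = csv.Sniffer().sniff(sample)
# On edge-case files the sniffed delimiter may not be ',' at all,
# which is exactly the failure mode that --snifflimit 0 avoids.
print(repr(dialect.delimiter), repr(dialect.quotechar))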
Hello @jpmckinney, that did the trick, thank you! I only tested with the small examples I gave, but I will also test the files I was working with. Since these samples were taken from them, I expect this will work as well.
Would you mind answering these questions?
- What's the default dialect of csvkit? Looking through the code, I realized that it is the Python default, which is excel, except that the default line break is '\n' (excel's is '\r\n'). Is this assumption correct?
- Can parameters like --quotechar, --quoting or --no-doublequote be overridden by csvkit's sniffing? Which prevails? Why does the example I gave fail even when I explicitly pass the correct arguments (--quotechar and so on)?
Also, what do you think of making the default dialect more explicit in the documentation, or is this obvious information that I missed?
Thank you for your fast reply
@petersonjr
- Yes, if sniffing is turned off, csvkit will use the Python defaults, which are described here. Note that when reading, the Python defaults are flexible about whether the line terminator is \r\n, \r or \n. When writing, csvkit uses \n instead of \r\n, as you wrote.
- The documentation for csv.reader says that more specific parameters override the dialect parameter (see the sketch below). I did some debugging, and the issue is that the sniffer was determining that whitespace was the delimiter. If you set --delimiter ",", the output will be correct.
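As a quick illustration of that precedence rule, here is a minimal standard-library sketch (the sample data is made up):

import csv
import io

data = io.StringIO('a;b\n1;2\n')
# Per the csv module docs, explicit formatting parameters such as
# delimiter take precedence over whatever the dialect argument says.
rows = list(csv.reader(data, dialect='excel', delimiter=';'))
print(rows)  # [['a', 'b'], ['1', '2']]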
I created #848 to improve documentation.
I'm having this problem with csvsql in csvkit (1.0.2).
Running csvsql --db postgresql://p:p@localhost:5432/p_unified --table PartD_Prescriber_PUF_NPI_15_raw --quoting 3 --insert --snifflimit 0 --delimiter "," '/Users/alison/Sites/prescriber/PartD_Prescriber_PUF_NPI_15.txt'
Returns this error: Row 20 has 69 values, but Table only has 67 columns.
This error happens even when I remove one or both of --snifflimit 0 and --delimiter ",". Running csvclean -n returns no errors.
awk 'FNR>19 && FNR<21' /Users/alison/Sites/p/PartD_Prescriber_PUF_NPI_15.txt returns:
1235283672,AADLAND,AARON,D.D.S.,M,I,1729 S CLIFF AVE,,SIOUX FALLS,57105,2126,SD,Dentist,T,N,51,51,313.82,302,31,,35,35,265.27,229,#,,,,,#,,,,0,0,,,,#,,,*,,,#,,,,,,,,41,181.78,29,,0,0,,0,,0,0,,0,71
@jpmckinney Any ideas besides downgrading csvkit?
@alison985 Assuming FNR is zero-based, 'FNR>19 && FNR<21' is actually getting row 19; the row number reported by csvkit is zero-based, but it skips the header row, which awk doesn't. So, can you check 'FNR>20 && FNR<22'?
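For anyone mapping csvkit's reported row numbers back to physical file lines, here is a minimal sketch under that same assumption (zero-based rows, header excluded); the +2 offset is the part to double-check:

import sys

reported_row = 20  # the row number from csvkit's error message
# Assumption: reported rows are zero-based and skip the header, so
# reported row N should sit on physical line N + 2 (header is line 1).
with open(sys.argv[1], newline='') as f:
    for line_number, line in enumerate(f, start=1):
        if line_number == reported_row + 2:
            print(line, end='')
            break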
It's strange though that csvclean -n returns no errors, but csvsql does. csvsql uses agate whereas csvclean doesn't, so there may be a difference in logic that we can hopefully narrow down with your sample data.
@jpmckinney 'FNR>20 && FNR<22' is: 1427070549,AADLAND,KRISTINE,DMD,F,I,2403 E EVERGREEN BLVD,,VANCOUVER,98661,,WA,Dentist,T,E,19,19,150.36,107,17,,19,19,150.36,107,,17,,0,0,,19,150.36,,0,0,#,,,,,,,,,#,,,0,0,0,0,0,17,107.41,15,,0,0,,0,,0,0,,0,71
@alison985 Hmm, so taking the two rows you've shared, I can't reproduce the error. Can you share the full file or a larger sample that reproduces the error?
@jpmckinney I know - I can't find what it's choking on either. :(
I'm taking "Part D Prescriber Summary table, CY2015, (Tab Delimited Format) [135MB]" from this page: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/PartD2015.html However, that file has problems on line 5 million-and-something (also related to the number of columns), so I run csvcut -t -c 1,2,3,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69 PartD_Prescriber_PUF_NPI_15.txt > PartD_Prescriber_PUF_NPI_15.txt.1 && rm PartD_Prescriber_PUF_NPI_15.txt && mv PartD_Prescriber_PUF_NPI_15.txt.1 PartD_Prescriber_PUF_NPI_15.txt to get the file I'm working with.
(BTW, the original file works fine with the Postgres COPY command, but AWS RDS doesn't grant the superuser privileges needed to run COPY, hence I started down this journey.)
Since the error above was on line 20, I ran the first few lines only, but didn't get an error:
curl http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_15.zip
unzip PartD_Prescriber_PUF_NPI_15.zip
head -n 50 PartD_Prescriber_PUF_NPI_15/PartD_Prescriber_PUF_NPI_15.txt | csvsql -t --insert --db sqlite:///dummy.db
So I then tried running the entire file:
csvsql -t --insert --db sqlite:///dummy.db PartD_Prescriber_PUF_NPI_15/PartD_Prescriber_PUF_NPI_15.txt
I got this error:
CSV contains fields longer than maximum length of 131072 characters. Try raising the maximum with the field_size_limit parameter, or try setting quoting=csv.QUOTE_NONE.
I didn't investigate which row caused that error; I just set quoting explicitly:
csvsql -u 3 -t --insert --db sqlite:///dummy.db PartD_Prescriber_PUF_NPI_15.txt
Then I got this error:
Row 537315 has 85 values, but Table only has 84 columns.
Updated: That row has too many values, but only because I set quoting to csv.QUOTE_NONE, when that line actually quotes a tab character:
1033537345 KUTSENOK ANNA M.D. F I 101 NICOLLS RD "STONY BROOK MEDICINE DEPARTMENT OF OB/GYN, HSC, T9" STONY BROOK 11794 8091 NY US Student in an Organized Health Care Education/Training Program T N 11 11 195.82 123 11 * * * * 0 0 * * * * 0 0 0 0 0 0 0 0 0 0 0 56 11 0 0 0 0 1.0302
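Here is a minimal sketch of that failure mode, using a shortened stand-in for the row above:

import csv
import io

# A quoted field that contains the tab delimiter, as in the row above.
line = '1033537345\t"DEPARTMENT OF OB/GYN,\tHSC"\tNY\n'
with_quoting = next(csv.reader(io.StringIO(line), delimiter='\t'))
without = next(csv.reader(io.StringIO(line), delimiter='\t', quoting=csv.QUOTE_NONE))
print(len(with_quoting))  # 3: the quoted tab stays inside one field
print(len(without))       # 4: QUOTE_NONE splits on the quoted tab too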
The file uses \r\n line endings, and I can't find any line longer than the header row, so I don't know why there's the error "CSV contains fields longer than maximum length of 131072 characters. Try raising the maximum with the field_size_limit parameter, or try setting quoting=csv.QUOTE_NONE." Anyway, I tried adding --snifflimit 0 to disable any incorrect dialect sniffing:
to disable any incorrect dialect sniffing:
csvsql --snifflimit 0 -t --insert --db sqlite:///dummy.db PartD_Prescriber_PUF_NPI_15.txt
Now it seems the whole file was imported!
Please also consider the use case where csvlook is piped the output of some other program that takes a while to run. I was hoping csvlook would be very useful for formatting the output so it can be read easily in the terminal, but this limitation makes it useless for us.
@coxymla If a process takes a while to run, it's generally a good idea to redirect the output to a file, to avoid losing that time.
csvlook can't print ragged rows presently, but if you don't mind losing the extra cells in rows that are longer than the header row, you can run cat your.csv | csvcut -C "" | csvlook
Thanks for the reply; the problem is that typically the first lines are a copyright notice, progress statements, etc., until the real header is finally printed somewhere in the middle, followed by the rows of data.
@coxymla If you know the number of lines that appear at the top of the file, you can try using --skip-lines.
~~Reminder: Upstream issue is https://github.com/wireservice/agate/issues/666~~
@jpmckinney I also had this issue, and put together the following Python 3 script, which "sniffs" the first n lines (configurable), decides on the maximum width, and squares up the data with trailing commas. It's fast and simple, takes no command-line arguments, and expects to be run as a filter, e.g. ./csvfix.py <myfile.csv. It also copes with enormous files, as it does not hold much in memory: https://gist.github.com/tuck1s/ed1c20ccb51b860b04dd838d7057364d
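For readers who don't want to follow the gist link, here is a minimal sketch of the same idea, not the gist itself: buffer the first N rows to find the maximum width, then pad every row with empty trailing fields while streaming (N and the default comma dialect are assumptions):

#!/usr/bin/env python3
import csv
import itertools
import sys

SNIFF_LINES = 1000  # how many leading rows to examine for the width

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)

head = list(itertools.islice(reader, SNIFF_LINES))
width = max((len(row) for row in head), default=0)

# Re-emit the buffered head, then stream the rest; rows already at or
# beyond the chosen width pass through unchanged.
for row in itertools.chain(head, reader):
    writer.writerow(row + [''] * (width - len(row)))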
@tuck1s You can achieve the same result with csvtool from the OCaml CSV package:
csvtool setcolumns $(head -1000 "$inp" | csvtool width -) "$inp"
The performance should be similar to your solution (it uses the same logic), but you need the input as a file in the variable $inp.
Changing agate is not looking likely. Instead, I documented all the solutions in this thread at https://csvkit.readthedocs.io/en/latest/scripts/csvlook.html#examples
edf0de7