csvkit
Show sniffed delimiter on exception
# colA,colB
# aaaaa...aaaaa zzzzz...zzzzz  \
#            ...                } 10 or 100 rows
# aaaaa...aaaaa zzzzz...zzzzz  /
#
# \___________/ \___________/
#   1000 chars    1000 chars
# 10 rows
# "," is used as delimiter
python3 -c "print('colA,colB') ; [print('a'*1000 + ' ' + 'z'*1000) for _ in range(10)]" | csvstat
# => ok
# 100 rows
# " " is used as delimiter
python3 -c "print('colA,colB') ; [print('a'*1000 + ' ' + 'z'*1000) for _ in range(100)]" | csvstat
# => Row 0 has 3 values, but Table only has 2 columns.
In the latter case, the sample is trimmed and the header colA,colB is lost, so the space " " is used as the delimiter.
This behavior was hard for me to track down. So how about showing which delimiter is used in:
- Debug output
$ csvstat -v ...
inferred delimiter: ' '
- Error message
$ csvstat -v ...
Row 0 has 3 values, but Table only has 2 columns (delimiter: ' ').
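As a rough illustration of the debug-output idea, here is a minimal sketch using only Python's standard `csv.Sniffer`. The helper name `sniff_and_report` and the exact wording of the failure line are assumptions made for this example; this is not csvkit's actual code path.

```python
import csv
import sys


def sniff_and_report(sample, stream=sys.stderr):
    """Sniff a CSV dialect from `sample` and report the delimiter on `stream`.

    Hypothetical helper for illustration only; csvkit does not expose it.
    Returns the sniffed dialect class, or None if sniffing failed.
    """
    try:
        dialect = csv.Sniffer().sniff(sample)
    except csv.Error:
        # The sniffer raises csv.Error when it cannot settle on a delimiter.
        print("inferred delimiter: (sniffing failed)", file=stream)
        return None
    # repr() makes whitespace delimiters such as ' ' or '\t' visible.
    print(f"inferred delimiter: {dialect.delimiter!r}", file=stream)
    return dialect


if __name__ == "__main__":
    sniff_and_report("colA,colB\n1,2\n3,4\n")
```

Printing the `repr` of the delimiter is the important part here: a bare space or tab would otherwise be invisible in the output.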
Also, how about showing a warning when the input exceeds SNIFF_LIMIT? For example:
$ csvstat -v ...
warning: input (XXX bytes) exceeds SNIFF_LIMIT (YYY bytes), delimiter guessing may be incorrect (NOTE: SNIFF_LIMIT can be changed by -y flag)
warning: guessed delimiter: ' '
Row 0 has 3 values, but Table only has 2 columns.
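The size check behind such a warning could look roughly like the sketch below. The function name `sniff_limit_warning` is hypothetical, and it assumes the caller already knows the input size in bytes and passes a positive limit. How csvkit itself tracks these values is not shown here.

```python
def sniff_limit_warning(input_size, sniff_limit):
    """Return a warning string if the input is larger than the sniff limit.

    Hypothetical helper for illustration. Both arguments are byte counts.
    Returns None when no warning is needed or when no positive limit is set.
    """
    if sniff_limit is None or sniff_limit <= 0:
        # No usable limit to compare against, so stay silent.
        return None
    if input_size <= sniff_limit:
        return None
    return (
        f"warning: input ({input_size} bytes) exceeds SNIFF_LIMIT "
        f"({sniff_limit} bytes), delimiter guessing may be incorrect"
    )


if __name__ == "__main__":
    message = sniff_limit_warning(200_000, 1024)
    if message:
        print(message)
```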
Thanks - we'll try to do this as part of the next version.
Hmm, agate raises ValueError for "Row 0 has 3 values, but Table only has 2 columns." type errors in agate/table/__init__.py. We'd have to introduce a new error class (subclassing ValueError, in case anyone catches these). We'd also have to handle it all over the place, because we need access to the reader to print the dialect.
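To make the subclass idea concrete, here is a minimal sketch. The class name `RowLengthError` and its `delimiter` attribute are hypothetical and exist in neither agate nor csvkit. The point is only that existing `except ValueError` handlers keep working while the delimiter becomes available to callers.

```python
class RowLengthError(ValueError):
    """Hypothetical ValueError subclass that remembers the delimiter in use."""

    def __init__(self, message, delimiter=None):
        self.delimiter = delimiter
        if delimiter is not None:
            # Append the delimiter in repr() form so whitespace is visible.
            message = f"{message.rstrip('.')} (delimiter: {delimiter!r})."
        super().__init__(message)


if __name__ == "__main__":
    try:
        raise RowLengthError(
            "Row 0 has 3 values, but Table only has 2 columns.",
            delimiter=" ",
        )
    except ValueError as exc:  # existing handlers still catch the subclass
        print(exc)
```

The hard part described above remains: whichever code raises the error still needs access to the reader's dialect in order to pass the delimiter in.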
Debug output
This is a good idea. As above, we'd have to add it in a lot of places. Happy to merge a PR!
Also, how about showing a warning when the input exceeds SNIFF_LIMIT?
The snifflimit was reduced in 1.0.7 to avoid sniffing huge files (which is very slow). So, this warning would now be emitted too frequently to be useful.