ideas
ideas copied to clipboard
[discuss] Guess if a CSV File is CSV or Not
What: we want this to be a simple library (likely node and/or python) for guessing whether a given file is csv or not.
Why (is this not trivial): CSV is plain text and has no special markers. Since ',' (or even tabs) turn up everywhere (in HTML, PDF, Word) and CSV can have various file encodings it is not simple to give a confidence estimate as to whether a given byte stream is CSV or not.
Please add your thoughts here on features, design and research
Research
-
magic
(libmagic) - is that useful here?
See Also
- https://github.com/okfn/ideas/issues/71 (good data toolchain)
Anything in csvkit that can help?
@brew good suggestion but not specifically to my knowledge. My guess here is that you'll want to do a tiny bit of statistical analysis (similar but different to messytables). e.g. attempt to parse this as CSV and then based on output make a guess as to whether CSV (e.g. if you get columns with huge number of characters or massively varying numbers of characters that would imply you are just parsing something that is not CSV but thinking it is ...).
What's with CSV Lint? Thought about using this in the context of Farmsubsidy/Openspending?
@holgerd77 i believe that CSVLint is more of a validator of a CSV against a given schema - not checking if a CSV (i.e. more like https://github.com/okfn/json-table-schema-validator).
Hmm, I didn't dig into it very much, but from the README.md there is also various basic validation provided, below is an extract from error types detected. I think I'll definitely have a look on this a bit more closely.
Errors
The following types of error can be reported:
-
:wrong_content_type
-- content type is nottext/csv
-
:ragged_rows
-- row has a different number of columns (than the first row in the file) -
:blank_rows
-- completely empty row, e.g. blank line or a line where all column values are empty -
:invalid_encoding
-- encoding error when parsing row, e.g. because of invalid characters -
:not_found
-- HTTP 404 error when retrieving the data -
:stray_quote
-- missing or stray quote
...
@holgerd77 right - those are actual issues with the CSV itself (as tabular data - i.e. you have blank rows at the top etc). It is not about checking if this thing I'm looking at which I think is csv is actually xls or html or pdf ...
@holgerd77 you may also want to check out https://github.com/okfn/json-table-schema-validator which is has some similarities to csvlint but is library-only, nodejs version.
magic (libmagic) - is that useful here?
and
It is not about checking if this thing I'm looking at which I think is csv is actually xls or html or pdf ...
so essentially, a wrapper around file(1) like https://pypi.python.org/pypi/filemagic/1.6 ?
@adamamyl indeed that may be the best simple option. Extra pointers to the segment of the algorithm where they e.g. identify csv vs html vs xlsx. (maybe no one place!)
@adamamyl indeed that may be the best simple option. Extra pointers to the segment of the algorithm where they e.g. identify csv vs html vs xlsx. (maybe no one place!)
a quick ack on the latest source for file doesn't yield any results for 'csv'… odd.
There is also Apache Tika (written in Java) which has bindings in Python and Node.
Question:
Do we want to know whether a given file is CSV, or, if the contents of a file can be read as CSV?
Example:
If we detect XLS, open a sheet, and can parse the stream as CSV: is it CSV?
So, using libmagic (based on some checks I've done with the Python bindings) is quite useful for picking up the obvious stuff from a file.
One important point is that, when directly passed a stream which is valid CSV, libmagic can only identify it as plain text.
So, in cases where file extension and mime-type are wrong (say, plaintext .txt) yet the streamed data is CSV, we'd get a false negative.