ideas icon indicating copy to clipboard operation
ideas copied to clipboard

[discuss] Guess if a CSV File is CSV or Not

Open rufuspollock opened this issue 10 years ago • 12 comments

What: we want this to be a simple library (likely node and/or python) for guessing whether a given file is csv or not.

Why (is this not trivial): CSV is plain text and has no special markers. Since ',' (or even tabs) turn up everywhere (in HTML, PDF, Word) and CSV can have various file encodings it is not simple to give a confidence estimate as to whether a given byte stream is CSV or not.

Please add your thoughts here on features, design and research

Research

  • magic (libmagic) - is that useful here?

See Also

  • https://github.com/okfn/ideas/issues/71 (good data toolchain)

rufuspollock avatar Nov 10 '14 08:11 rufuspollock

Anything in csvkit that can help?

brew avatar Nov 10 '14 10:11 brew

@brew good suggestion but not specifically to my knowledge. My guess here is that you'll want to do a tiny bit of statistical analysis (similar but different to messytables). e.g. attempt to parse this as CSV and then based on output make a guess as to whether CSV (e.g. if you get columns with huge number of characters or massively varying numbers of characters that would imply you are just parsing something that is not CSV but thinking it is ...).

rufuspollock avatar Nov 10 '14 10:11 rufuspollock

What's with CSV Lint? Thought about using this in the context of Farmsubsidy/Openspending?

holgerd77 avatar Nov 10 '14 15:11 holgerd77

@holgerd77 i believe that CSVLint is more of a validator of a CSV against a given schema - not checking if a CSV (i.e. more like https://github.com/okfn/json-table-schema-validator).

rufuspollock avatar Nov 10 '14 16:11 rufuspollock

Hmm, I didn't dig into it very much, but from the README.md there is also various basic validation provided, below is an extract from error types detected. I think I'll definitely have a look on this a bit more closely.

Errors

The following types of error can be reported:

  • :wrong_content_type -- content type is not text/csv
  • :ragged_rows -- row has a different number of columns (than the first row in the file)
  • :blank_rows -- completely empty row, e.g. blank line or a line where all column values are empty
  • :invalid_encoding -- encoding error when parsing row, e.g. because of invalid characters
  • :not_found -- HTTP 404 error when retrieving the data
  • :stray_quote -- missing or stray quote

...

holgerd77 avatar Nov 10 '14 16:11 holgerd77

@holgerd77 right - those are actual issues with the CSV itself (as tabular data - i.e. you have blank rows at the top etc). It is not about checking if this thing I'm looking at which I think is csv is actually xls or html or pdf ...

@holgerd77 you may also want to check out https://github.com/okfn/json-table-schema-validator which is has some similarities to csvlint but is library-only, nodejs version.

rufuspollock avatar Nov 10 '14 16:11 rufuspollock

magic (libmagic) - is that useful here?

and

It is not about checking if this thing I'm looking at which I think is csv is actually xls or html or pdf ...

so essentially, a wrapper around file(1) like https://pypi.python.org/pypi/filemagic/1.6 ?

adamamyl avatar Nov 10 '14 16:11 adamamyl

@adamamyl indeed that may be the best simple option. Extra pointers to the segment of the algorithm where they e.g. identify csv vs html vs xlsx. (maybe no one place!)

rufuspollock avatar Nov 10 '14 16:11 rufuspollock

@adamamyl indeed that may be the best simple option. Extra pointers to the segment of the algorithm where they e.g. identify csv vs html vs xlsx. (maybe no one place!)

a quick ack on the latest source for file doesn't yield any results for 'csv'… odd.

adamamyl avatar Nov 10 '14 16:11 adamamyl

There is also Apache Tika (written in Java) which has bindings in Python and Node.

pwalsh avatar Nov 19 '14 08:11 pwalsh

Question:

Do we want to know whether a given file is CSV, or, if the contents of a file can be read as CSV?

Example:

If we detect XLS, open a sheet, and can parse the stream as CSV: is it CSV?

pwalsh avatar Nov 19 '14 08:11 pwalsh

So, using libmagic (based on some checks I've done with the Python bindings) is quite useful for picking up the obvious stuff from a file.

One important point is that, when directly passed a stream which is valid CSV, libmagic can only identify it as plain text.

So, in cases where file extension and mime-type are wrong (say, plaintext .txt) yet the streamed data is CSV, we'd get a false negative.

pwalsh avatar Nov 27 '14 06:11 pwalsh