
Allow disabling/forcing type inference for certain columns only

Open onyxfish opened this issue 12 years ago • 11 comments

A mapping?

--types int,varchar

A .csvt?

onyxfish avatar Mar 01 '12 01:03 onyxfish

A CSVT might be a little exotic, but it might be the most robust solution if you do the same tasks over and over. The example I had so far was that csvsql made something a date field that was a varchar, and I couldn't really get it to do what I wanted.

Could you just specify which fields to not guess, defaulting to varchar?

CSVT would also make you specify all columns, right? That would be daunting on a big dataset, probably defeat the purpose.

I think I like the "Don't guess on this column" option most.

mikejcorey avatar Mar 01 '12 01:03 mikejcorey

The latter is certainly a possibility although I'm inclined to implement a more general solution if one exists. I like .csvt because 1) it's still CSV and 2) it's an existing (albeit, as you say, exotic) convention. The somewhat annoying thing about it is that I'll be mandating a pretty specific list of supported Python types, which aren't going to match any other type system out there in the world.

Internally csvkit normalizes to:

NoneType, bool, int, float, datetime.datetime, datetime.time, datetime.date and unicode

onyxfish avatar Mar 01 '12 01:03 onyxfish

Am I right that you'd have to specify all columns if you went the .csvt route?

mikejcorey avatar Mar 01 '12 01:03 mikejcorey

That's true, that is def. a downside. Maybe a

--no-infer a,b,c

syntax is better after all.

It's also worth keeping in mind that for type coercion things can really only be cast "down", i.e. int -> unicode. If you were to try to use a csvt to specify a more granular type the thing would just blow up anyway.
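
To illustrate the "cast down" point, here is a minimal Python sketch. It is not csvkit's code: `cast_down` and `cast_up` are hypothetical helper names, and Python 3's `str` stands in for the `unicode` type mentioned above. Converting a typed value to text always succeeds, while forcing text into a more specific type can fail.

```python
def cast_down(value):
    """Cast a typed value down to text. Hypothetical helper, not csvkit's API."""
    # None is kept as None so that empty cells stay empty.
    return None if value is None else str(value)


def cast_up(text, target):
    """Try to cast text up to a more specific type; raises ValueError if it does not fit."""
    return target(text)


print(cast_down(42))  # int -> text always works

try:
    cast_up("2012-03-01", int)  # a date-like string is not an int
except ValueError as error:
    print("cannot cast up:", error)
```

This is why a declared type that is more granular than the data supports would fail, whereas falling back to text is always safe.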

onyxfish avatar Mar 01 '12 01:03 onyxfish

Yeah, I think that's OK -- it's more important to me that something fails over to generic rather than specific. So if I have to CAST (blah) AS INTEGER, that's no big deal.

Supporting .CSVT might be a nice feature as well, but would not really solve my particular problem, which I think I'd come across more often if my main use is to quickly start playing with some data.

In any case, csvsql is really cool. Navicat is obviously good at CSV imports, but still requires some configuration guesswork. It's a huge timesaver and probably nearly eliminates the need for certain types of users in our organization to even use Navicat, which would definitely save us some money.

mikejcorey avatar Mar 01 '12 01:03 mikejcorey

That's wonderful to hear. I'll look at hacking in a way to force values to strings sometime soon (possibly tonight, though I'm down other rabbit holes at the moment). Thanks for the feedback!

onyxfish avatar Mar 01 '12 01:03 onyxfish

Great, thanks! No rush from me, just wanted to say something while I was thinking of it.

mikejcorey avatar Mar 01 '12 02:03 mikejcorey

Noting that there's some discussion of possible solutions in the referenced issues above.

jpmckinney avatar Jan 25 '16 17:01 jpmckinney

So I think the simplest satisfactory solution for the reported feature request is to allow --no-inference to accept column names, e.g.:

--no-inference a,b,c
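
A rough sketch of what per-column behavior could look like, using only the Python standard library. This is an illustration under assumptions, not csvkit's implementation: `infer_value`, `read_rows`, and the `no_inference` parameter are hypothetical names, and the inference here only tries `int` and `float`.

```python
import csv
import io


def infer_value(text):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def read_rows(csv_text, no_inference=()):
    """Parse CSV text, leaving columns named in no_inference as plain strings."""
    skip = set(no_inference)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            name: (value if name in skip else infer_value(value))
            for name, value in row.items()
        }
        for row in reader
    ]


sample = "zip,count\n02134,7\n"
print(read_rows(sample))                        # zip is inferred as an int
print(read_rows(sample, no_inference=["zip"]))  # zip stays a string
```

With inference on, the leading zero in `02134` is lost when the value becomes an integer. Naming the column in `no_inference` keeps it as text while `count` is still inferred.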

jpmckinney avatar Jan 28 '17 18:01 jpmckinney

Is there a realistic plan to do this?

mingfang avatar Jun 27 '22 16:06 mingfang

There is no time planned to work on this issue. It remains open.

jpmckinney avatar Jun 28 '22 14:06 jpmckinney