semantic-csv
Create column type "sniffer"
Would look through and make a guess about what type each column should be. Should be triggered via a `:sniff true` flag, and be overridable by casters specified in `:cast-fns`.
@metasoarous - I am starting to work/think on this. With regards to primitive types - are you ok with defaulting to `long` for integers and `double` for all floating point numbers? Do we want to sniff out dates? Do we want to sniff out whether or not there is a header row?
Awesome! Thanks for starting on this.
Casting to `long` and `double` is the default I'd like.
Dates would be nice, though it's perhaps a little trickier, since there are so many formats, and even interpretation disagreement between regions (e.g. MM/DD/YY in US vs YY/MM/DD in EU and elsewhere)... So maybe save dates for last.
I think let's leave the header out of the picture here; I think it's best that stays explicit. But let's make sure to keep the ignore-first option, and try to make things work on both vectors and maps, like some of the other processing functions.
In general, I'm envisioning a hierarchy of types. For instance, a column could start off looking like `long`, but then have a `double` somewhere in the first `n` rows. All should be cast to `double` in that case. Deciding between YY/MM/DD vs MM/DD/YY could be done similarly, since 01/13/09 would make sense as the latter, but not the former. We'd need a default in that case though, if all sniffed values were ambiguous (EU makes more sense, IMO).
Something else to think about is how to make this very extensible/customizable. What could it look like to let people specify their own sniffers for custom types? How would this play with the default sniffers? I've put off thinking about this at anything more than a very abstract level, but I'd be game to brainstorm something more concrete over chat or back and forth here if you like.
Cheers
I am planning on using regular expressions to test the values. It might suffice to have a collection of maps like `'({:re #"^\d+$" :cast-fn 'semantic-csv.core/->long} ...)`, which could be updated via input parameters, allowing the caller to specify a collection of maps to be used in the sniffer.
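That suggestion could be sketched roughly as follows (`default-sniffers` and `sniff-one` are hypothetical names for illustration, not the actual implementation):

```clojure
;; Rough sketch of the regex-based sniffer idea (names are assumptions):
(def default-sniffers
  [{:re #"^-?\d+$"      :cast-fn #(Long/parseLong %)}
   {:re #"^-?\d+\.\d+$" :cast-fn #(Double/parseDouble %)}])

(defn sniff-one
  "Return the first sniffer map whose :re matches s, or nil."
  [sniffers s]
  (first (filter #(re-matches (:re %) s) sniffers)))

;; Callers could then prepend their own maps, e.g.
;; (sniff-one (concat custom-sniffers default-sniffers) "3.14")
```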
I agree with your comment about dates. We can add that downstream.
I can chat most days. I have been pairing at work a good bit lately, but when I am at my desk, I usually have gitter and slack running.
Sorry for the late response...
That's not a bad idea. But we want to think about how we specify hierarchies. Perhaps the map collection schema you suggest could be modified by having a `:name` attribute. This would let you refer to the `:name`s in a separate argument which specifies the hierarchy. And keep in mind that it would be great if there were nice semantics for only overriding/customizing certain parts of the parsing scheme. For instance, it would be nice if you could pass `{:name :decimal :cast-fn semantic-csv.core/->float}` if you wanted to change the parsing function but leave the regular expression unchanged. Also keep in mind that we want users to be able to specify scientific notation for their numerics (anything that would be supported by `Long/parseLong`, etc.).
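The override-by-`:name` semantics could look something like this (`merge-casters` is a hypothetical helper, not part of semantic-csv):

```clojure
;; Minimal sketch: merge user-supplied override maps onto the default
;; caster maps, keyed by :name, so an override can replace :cast-fn
;; while keeping the default :re untouched.
(defn merge-casters
  [defaults overrides]
  (let [by-name (into {} (map (juxt :name identity) defaults))]
    (vals (reduce (fn [m {:keys [name] :as o}]
                    (update m name merge o))
                  by-name
                  overrides))))

;; (merge-casters [{:name :decimal :re #"^\d+\.\d+$" :cast-fn ->double}]
;;                [{:name :decimal :cast-fn ->float}])
;; keeps the default :re, swaps in ->float
```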
Prioritizing casting is currently something I am trying to decide on. One option is a collection of maps, each containing a name, a regex, and a cast function as keys, plus a separate map whose keys are types (represented by keywords) and whose values are vectors defining the hierarchy for casting. For example:
```clojure
(def casters
  [{:name :float :re #"^\d+\.\d+$" :cast-fn 'semantic-csv.core/->float} ...])

(def cast-priority
  {:numeric [:float :double ...]
   :date    [:mmddyyyy :ddmmyyyy :yymmdd ...]})
```
With what you are saying regarding scientific notation, it may be nice to have an optional formatter, à la `[{:name :float :re #"^\d+\.\d+$" :cast-fn 'semantic-csv.core/->float :formatter "%e"} ...]`. Having a formatter will be really nice, I think. It would be useful for both numerics and date/time.
Does this make sense?
I'm not sure that formatting needs to fit into this part of the application. I only really envisioned this having an effect on the reading of data in, not data out (which is where I see `:formatter` being most useful, unless I'm misreading you). Still, I guess it's something to consider.
What I meant about scientific notation is that we want to make sure our regular expressions are able to capture those particular string representations. And while I like the idea of regular expressions, we might also want to support general `:match-fn`s, for added flexibility (we'd have to decide on a default in case both are present in the map, though).
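A sketch of that dispatch, assuming `:match-fn` wins when both keys are present (one possible choice of default; `matches?` is an illustrative name):

```clojure
(defn matches?
  "Check a string against one sniffer map; prefers :match-fn over :re
   when both are present."
  [{:keys [match-fn re]} s]
  (cond
    match-fn (boolean (match-fn s))
    re       (boolean (re-matches re s))
    :else    false))
```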
As for the rest of your data specification, I think you're generally on the right track. But let's be careful to separate the abstract type from the representation. As a case in point, I don't think there should be separate entries for `:float` and `:double`. There should be a single entry for `:decimal`, since this is really the abstract, mathematical type of interest; `Float` and `Double` are really computational representations, in my mind (albeit with limitations). Representation should be handled via `:cast-fn`: whatever concrete type gets returned by that function should be the representation type. This is nice, because it lets us do what the Clojure reader does; small integers are `Long`, but bigger ones are `BigInt`. We can try/catch here as the default.
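That reader-like default could be sketched like so (`->integer` is a hypothetical name, not an existing semantic-csv function):

```clojure
(defn ->integer
  "Cast a string to a Long when it fits, falling back to BigInt,
   much like the Clojure reader does for integer literals."
  [s]
  (try (Long/parseLong s)
       (catch NumberFormatException _ (bigint s))))

;; (->integer "42")                               ;=> a Long
;; (->integer "123456789012345678901234567890")   ;=> a BigInt
```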
In that picture, I think we'd just have something like a set of implications for specifying the hierarchy. An `:integer` is a `:decimal`, but not always the other way around. Just as a `:decimal` is always a `:rational`, but not always the other way around. Perhaps we should just have a collection of tuples:
```clojure
(def cast-hierarchy
  [[:integer :decimal]
   [:decimal :rational]
   ...])
```
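Those child/parent tuples map naturally onto Clojure's ad hoc hierarchies, which would give us transitive `isa?` queries for free (a sketch, not a committed design):

```clojure
(def pairs
  [[:integer :decimal]
   [:decimal :rational]])

;; Build a hierarchy from the tuples via derive.
(def type-hierarchy
  (reduce (fn [h [child parent]] (derive h child parent))
          (make-hierarchy)
          pairs))

;; isa? then answers "is an :integer also a :rational?" transitively:
;; (isa? type-hierarchy :integer :rational) ;=> true
```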
The situation with dates is a little more complicated, however, because it's all the same abstract type; the differences are about how we'd like to prioritize the interpretation of a string as said type. For that I think your scheme makes sense. It seems like we might have to have three parts to the full specification then. What do you think?
That all makes sense. I totally agree with the differentiation between representation and abstraction. I have been thinking about this recently (because I have had little time to work on it lately). I think I am going to start with some implementation functions, one for vectors and one for maps, that take a row, test the values in it, and return the same collection type, where the new values are the cast predictions. So for example, `[1 "bob" 100.3]` might return `[:integer nil :decimal]`. And the analogous map example, `{:index 1 :name "bob" :value 100.3}`, might return `{:index :integer :name nil :value :decimal}`. The significance of `nil` in these cases is that there is no casting needed/determinable - essentially, leave this index alone. The reason for this is that I want to separate the actual casting step and the testing step. This might prove valuable when creating transducers. Also, mapping the cast type to an index/key should make the actual casting code simpler. The functions might look like:
```clojure
;; Used for testing values in rows/vectors
(defn sniff-value [val] ...)

(defn sniff-test-vector [row-vec previous-test] ...) ;; => [...]
(defn sniff-test-map [row-map previous-test] ...)    ;; => {...}

;; Test rows with this function
(defn sniff-test [data test-rows-count]
  (let [test-fn   (if (map? (first data)) sniff-test-map sniff-test-vector)
        test-data (take test-rows-count data)]
    ...)) ;; => [...] or {...} depending on the data structure of the rows
```
The return of the `sniff-test` function could then be leveraged for casting. I figure separating out the concerns might make reuse easier. Mostly I want to make sure there is room to integrate with transducers in the future. Internally, this would leverage the previously mentioned `cast-hierarchy`, etc. Does this make sense? Criticism? I am going to try to get something put together soon, because looking at code is sometimes easier than talking about what it might look like.
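A concrete sketch of the vector case, filling in those stubs under simplified assumptions (the regexes and the widening rule are illustrative, not the planned implementation):

```clojure
(defn sniff-value
  "Guess a type keyword for one string value; nil means 'leave alone'."
  [v]
  (cond
    (nil? v)                       nil
    (re-matches #"^-?\d+$" v)      :integer
    (re-matches #"^-?\d+\.\d+$" v) :decimal
    :else                          nil))

(defn sniff-test-vector
  "Merge this row's guesses with the previous result, widening
   :integer to :decimal when a column mixes the two."
  [row-vec previous-test]
  (mapv (fn [v prev]
          (let [t (sniff-value v)]
            (if (and prev t (not= prev t)) :decimal (or t prev))))
        row-vec
        (or previous-test (repeat (count row-vec) nil))))

;; (reduce #(sniff-test-vector %2 %1) nil
;;         [["1" "bob" "100.3"] ["2" "sue" "7.5"]])
;; ;=> [:integer nil :decimal]
```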
That sounds perfect. I was thinking the exact same thing, actually. I'm not sure whether we should actually expose `sniff-test` in your scheme there as part of the public API, though. I could probably be talked into it. That would let you get the results and potentially override something via `assoc`, then pass that as an arg to `cast-with`. That's pretty composable. If we do that, though, I think it would still be nice to have a `sniff-cast` function that does all the work for you and accepts overrides.
Looking forward to seeing what you put together!