rio
Function to just read variable names/metadata
It might be useful to be able to just read metadata without loading an entire file. I don't think there's a way to do this consistently across file types, though.
Would a good way to handle that be to identify all methods that have a "limit"-style parameter for grabbing a certain percentage or number of lines, and then transform the result into a data.frame with the column index, column name, and column type of the data that was read?
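As a rough illustration of that idea for plain-text files only, here is a minimal sketch in base R. The function name `peek_columns()` and its `limit` argument are hypothetical and not part of rio; the types it reports are guessed from the first `limit` rows only, so they may differ from what a full read would produce.

```r
# Hypothetical helper (not part of rio): read only the first `limit` rows of a
# CSV file and summarise the columns that were found.
peek_columns <- function(file, limit = 100) {
  head_rows <- utils::read.csv(file, nrows = limit, stringsAsFactors = FALSE)
  data.frame(
    index = seq_along(head_rows),
    name  = names(head_rows),
    # Column types are guessed from the rows that were read, nothing more
    type  = unname(vapply(head_rows, function(col) class(col)[1], character(1))),
    stringsAsFactors = FALSE
  )
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(id = 1:3, label = c("a", "b", "c"), score = c(0.5, 1.5, 2.5)),
  tmp,
  row.names = FALSE
)
meta <- peek_columns(tmp, limit = 2)
print(meta)
```

A small `limit` keeps the read cheap, at the cost of less reliable type guesses.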
That could work, but what formats will that work for other than plain text?
I'll have to peek at which binary formats have some form of quick access. Unfortunately, the benefits of binary files (faster loading, smaller file size, and consistent type definitions) probably mean little performance benefit for slicing data, since it's stored in non-adjacent ways. We could still have a consistent interface that's simply slower on binary formats (but more accurate).
It may make sense to add a limit param that is overridden where it's not usable (and you get a message that limit = Inf, essentially). That would also let people trade speed for accuracy when using methods that do type-guessing.
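A minimal sketch of how such an override could behave, in base R. Both `resolve_limit()` and the `partial_read_formats` vector are made up for illustration; which formats really support partial reads would have to be checked format by format.

```r
# Hypothetical list of formats where a partial read is assumed to be possible
partial_read_formats <- c("csv", "tsv", "txt")

# Hypothetical helper: return the limit that will actually be used for a format
resolve_limit <- function(format, limit = 100) {
  if (!(format %in% partial_read_formats) && is.finite(limit)) {
    message(
      "'limit' is not usable for format '", format,
      "'; reading the whole file (limit = Inf)."
    )
    return(Inf)
  }
  limit
}

resolve_limit("csv", limit = 50)  # limit is kept as given
resolve_limit("sav", limit = 50)  # overridden to Inf, with a message
```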
We could probably also take a look at the haven codebase and see if there's a way to add some of this functionality upstream. I haven't looked, but I imagine it's possible, as a lot of these formats have metadata at the beginning of the file before any of the actual contents start.
What I think would be nice is if among the meta-data snooping functions there was one that listed tables/sheets/etc. It would return names if available, numbers otherwise. For formats that cannot have multiple tables or sheets it would always return 1. What do you think?
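To make the proposed return values concrete, here is a small base R sketch. `list_tables()` is a hypothetical name, and the set of single-table formats is an assumption. An .RData file, which can hold several named objects, stands in for a multi-table format here; note that `load()` reads the whole file, so this shows the interface only, not a fast metadata read.

```r
# Hypothetical helper (not part of rio): list the tables contained in a file
list_tables <- function(file) {
  fmt <- tolower(tools::file_ext(file))
  if (fmt %in% c("csv", "tsv", "txt")) {
    return(1L)  # flat formats can only ever hold one table
  }
  if (fmt %in% c("rdata", "rda")) {
    env <- new.env()
    load(file, envir = env)  # loads everything; fine for a sketch only
    return(ls(env))          # object names, sorted alphabetically
  }
  message("Listing tables is not supported for format '", fmt, "'.")
  invisible(NULL)
}

# Usage: one flat file and one file holding two objects
csv_file <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(x = 1:2), csv_file, row.names = FALSE)

first  <- data.frame(a = 1:2)
second <- data.frame(b = 3:4)
rdata_file <- tempfile(fileext = ".RData")
save(first, second, file = rdata_file)

list_tables(csv_file)
list_tables(rdata_file)
```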
How about we make a generic like get_col_names() and start creating methods for it. For some of these (like text/tabular), this is going to be easy. We should also be able to make it work with the new haven functionality in #248. It doesn't have to work for everything right away, as the long tail of formats isn't going to give us this metadata easily.
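A minimal sketch of what such a generic could look like, using S3 dispatch on a format class derived from the file extension. The internal generic `.get_col_names()`, the `rio_<ext>` class naming, and the default method's message are assumptions made for this illustration; only a CSV method is shown, and only the front end `get_col_names()` would be exported.

```r
# Front end: tag the file path with a format class and dispatch on it
get_col_names <- function(file, ...) {
  fmt <- tolower(tools::file_ext(file))
  classed <- structure(file, class = c(paste0("rio_", fmt), "character"))
  .get_col_names(classed, ...)
}

# Internal generic; in a package, its methods would not be exported
.get_col_names <- function(file, ...) UseMethod(".get_col_names")

# Text/tabular case: reading the header plus one row is enough for the names
.get_col_names.rio_csv <- function(file, ...) {
  names(utils::read.csv(unclass(file), nrows = 1, ...))
}

# Fallback for formats without a method yet
.get_col_names.default <- function(file, ...) {
  fmt <- sub("^rio_", "", class(file)[1])
  message("get_col_names() does not support format '", fmt, "' yet.")
  invisible(NULL)
}

# Usage
tmp_csv <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(id = 1:2, label = c("a", "b")), tmp_csv, row.names = FALSE)
get_col_names(tmp_csv)
```

New formats can then be covered by adding one method at a time, while unsupported ones fall through to the message in the default method.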
@bokov I'm not sure how useful that is, at least initially. Let's start with the simple/flat file types and then think about how much work is worthwhile to make it work for other file types.
I agree about starting with simple/flat file types and not necessarily supporting all types. But what do you think about following what seems to be the overall philosophy of this package and writing unified front-end functions that do this for the supported formats (with some kind of message for unsupported ones), instead of exporting format-specific functions?
Yeah, that's what I meant. Sorry for not being clear - we wouldn't export the methods, just like we don't export the import/export methods.