rio

Function to just read variable names/metadata

Open leeper opened this issue 6 years ago • 8 comments

It might be useful to be able to just read metadata without loading an entire file. I don't think there's a way to do this consistently across file types, though.

leeper • Jan 03 '18 09:01

Would a good way to handle that be to identify all methods that have a "limit"-style parameter for reading only a certain percentage/number of lines, and then transform the result into a data.frame with the column index, column name, and column type of the data that was read?
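A minimal sketch of that idea for plain-text files, using only base R. The helper name `peek_columns()` and its `limit` argument are hypothetical and not part of rio; the only real API relied on is the `nrows` argument of `utils::read.csv()`.

```r
# Hypothetical helper (not part of rio): read only the first `limit` rows of a
# delimited text file and summarise its columns.
peek_columns <- function(file, limit = 10, sep = ",") {
  # `nrows` caps how many data rows read.csv() parses
  head_rows <- utils::read.csv(file, nrows = limit, sep = sep,
                               stringsAsFactors = FALSE)
  data.frame(
    index = seq_along(head_rows),
    name  = names(head_rows),
    # types are guessed from the rows that were read, so a small `limit`
    # trades accuracy for speed
    type  = unname(vapply(head_rows, function(col) class(col)[1], character(1))),
    stringsAsFactors = FALSE
  )
}

# Usage: write a small CSV, then inspect it without reading every row
tmp <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c")),
                 tmp, row.names = FALSE)
peek_columns(tmp, limit = 2)
```

The result has one row per column of the file, not one row per record.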

jsonbecker • Mar 15 '18 16:03

That could work, but what formats will that work for other than plain text?

leeper • Mar 15 '18 16:03

I'll have to peek at which binary formats may or may not offer some form of quick access. Unfortunately, the things that make binary files attractive (faster loading, smaller file size, and consistent type definitions) probably also mean there is little performance benefit to slicing the data, since it's stored in a non-adjacent way. We could still have a consistent interface that's simply slower on binary formats (but more accurate).

It may make sense to add a limit param that is overridden where it's not usable (with a message saying that, essentially, limit = Inf). That would also let people trade speed for accuracy with readers that do type-guessing.
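A rough sketch of how that override could behave. Both `resolve_limit()` and the `partial_read_formats` vector are made up for illustration; the list of formats is an assumption, not rio's actual capabilities.

```r
# Hypothetical: formats assumed to support reading only the first rows.
# This list is illustrative, not a statement about what rio supports.
partial_read_formats <- c("csv", "tsv", "txt")

# Return the limit to use for `format`, falling back to Inf (read everything)
# with a message when a finite limit cannot be honoured.
resolve_limit <- function(format, limit = Inf) {
  if (is.finite(limit) && !(format %in% partial_read_formats)) {
    message("'limit' is not supported for format '", format,
            "'; reading the whole file (limit = Inf).")
    return(Inf)
  }
  limit
}

resolve_limit("csv", limit = 100)  # limit is honoured
resolve_limit("rds", limit = 100)  # emits a message and falls back to Inf
```

Using a message rather than a warning or error keeps the interface consistent across formats while still telling the user that the full file was read.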

jsonbecker • Mar 15 '18 17:03

We could probably also take a look at the haven codebase and see if there's a way to add some of this functionality upstream. I haven't looked, but I imagine it's possible, as a lot of these formats have metadata at the beginning of the file before any of the actual contents start.

leeper • Mar 15 '18 17:03

What I think would be nice is if, among the metadata-snooping functions, there was one that listed tables/sheets/etc. It would return names if available, numbers otherwise. For formats that cannot have multiple tables or sheets, it would always return 1. What do you think?
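As a sketch of what such a function might look like: `list_tables()` below is a hypothetical name, not an existing rio function. It assumes the readxl package is available for Excel files (`readxl::excel_sheets()` is a real function), and it uses base `load()` for .RData files, which reads the whole file rather than just its metadata. A format with unnamed tables would return their positions instead, which is not shown here.

```r
# Hypothetical helper: list the tables/sheets/objects a file contains.
list_tables <- function(file) {
  ext <- tolower(tools::file_ext(file))
  if (ext %in% c("xls", "xlsx")) {
    if (!requireNamespace("readxl", quietly = TRUE)) {
      stop("Package 'readxl' is required to list Excel sheets.")
    }
    return(readxl::excel_sheets(file))  # sheet names
  }
  if (ext %in% c("rdata", "rda")) {
    # load() returns the names of the objects it created; note that this
    # loads the full file, so it is not a cheap metadata-only read
    env <- new.env()
    objs <- load(file, envir = env)
    return(objs)
  }
  # Formats that cannot hold more than one table
  1L
}

# Usage with a single-table format and a multi-object format
csv_file <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(a = 1:2), csv_file, row.names = FALSE)
list_tables(csv_file)

rda_file <- tempfile(fileext = ".rda")
first <- data.frame(a = 1:2)
second <- data.frame(b = 3:4)
save(first, second, file = rda_file)
list_tables(rda_file)
```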

bokov • Dec 04 '19 23:12

How about we make a generic like get_col_names() and start creating methods for it. For some of these (like text/tabular), this is going to be easy. We should also be able to make it work with the new haven functionality in #248. It doesn't have to work for everything right away as the long tail of formats isn't going to give us this metadata easily.
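Something along these lines, as a rough sketch only. The `colnames_*` class names and the internal `.get_col_names()` generic are invented for illustration and do not reflect rio's actual internals; only the CSV method is filled in, and everything else falls through to a default that emits a message.

```r
# Front-end function: the only piece that would be exported.
get_col_names <- function(file, ...) {
  fmt <- tolower(tools::file_ext(file))
  # Tag the file with a format-specific class so S3 dispatch picks a method
  x <- structure(list(file = file), class = paste0("colnames_", fmt))
  .get_col_names(x, ...)
}

# Internal generic; methods would stay unexported
.get_col_names <- function(x, ...) UseMethod(".get_col_names")

# Text/tabular case: parse the header plus a single row
.get_col_names.colnames_csv <- function(x, ...) {
  names(utils::read.csv(x$file, nrows = 1, check.names = FALSE))
}

# Long tail of formats: no metadata yet, so say so and return NULL
.get_col_names.default <- function(x, ...) {
  message("Column names are not yet available for this format.")
  NULL
}

# Usage
tmp_csv <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(id = 1:3, value = c(0.1, 0.2, 0.3)),
                 tmp_csv, row.names = FALSE)
get_col_names(tmp_csv)
get_col_names("data.unknownformat")  # message, returns NULL
```

New formats can then be supported one at a time by adding a method, without changing the front end.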

@bokov I'm not sure how useful that is, at least initially. Let's start with the simple/flat file types and then think about how much work is worthwhile to make it work for other file types.

leeper • Dec 20 '19 14:12

I agree about starting with simple/flat file types and not necessarily supporting all types. But what do you think about following what seems to be the overall philosophy of this package and writing a unified front-end function that does this for the supported formats (with some kind of message for unsupported ones), instead of exporting format-specific functions?

bokov • Dec 20 '19 19:12

Yea, that's what I meant. Sorry for not being clear - we wouldn't export the methods, just like we don't export import/export methods.

leeper • Dec 20 '19 19:12