DataToolkit.jl

Multi-file loader

Open jfb-h opened this issue 1 year ago • 4 comments

As recently discussed on Zulip, it would be nice to have a loader which allows loading multiple files that have the same schema, which is already supported by e.g. CSV.jl or Arrow.jl. So I thought I'd make an issue to track this :)
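To make the request concrete, here is a minimal sketch of what "multiple files with the same schema" could mean in practice. It uses only Base Julia on simple comma-delimited files. The name `load_same_schema` is hypothetical and is not part of DataToolkit.jl, CSV.jl, or Arrow.jl; a real loader would delegate parsing to one of those packages.

```julia
# Hypothetical helper for illustration only (not an existing package API).
# Reads several comma-delimited files, checks that they all share one header,
# and returns the header together with the concatenated rows.
function load_same_schema(paths::AbstractVector{<:AbstractString})
    isempty(paths) && throw(ArgumentError("no files given"))
    header = nothing
    rows = Vector{Vector{String}}()
    for path in paths
        lines = filter(!isempty, readlines(path))
        isempty(lines) && throw(ArgumentError("$path is empty"))
        cols = String.(split(first(lines), ','))
        if header === nothing
            header = cols            # first file defines the schema
        elseif cols != header
            throw(ArgumentError("schema mismatch in $path"))
        end
        for line in lines[2:end]
            push!(rows, String.(split(line, ',')))
        end
    end
    return (header = header, rows = rows)
end

# Usage with two small temporary files
dir = mktempdir()
write(joinpath(dir, "a.csv"), "id,value\n1,10\n2,20\n")
write(joinpath(dir, "b.csv"), "id,value\n3,30\n")
table = load_same_schema(sort(readdir(dir; join = true)))
```

The schema check happens before any rows from a file are appended, so a mismatched file fails loudly instead of silently producing misaligned columns.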

jfb-h, Feb 17 '24 14:02

Thanks for the issue. It will probably take a while for me to get to this properly, but for the record, this is rolling around in the back of my mind.

I want to handle this, but I also want to handle it properly (using a cached Merkle-tree hash for starters, though more thought is needed).

tecosaur, May 16 '24 09:05

I've been thinking more about this, specifically about the case of having a directory of files. I'm wondering if introducing a DirPath as a counterpart to FilePath could be a good way of handling this.

tecosaur, May 22 '24 09:05

That sounds sensible. Would you then chain a directory loader and a specific file loader? Or would you just pass the directory to a loading function which is then free to process its contents in any way?
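A rough sketch of the two options, to make the question concrete. Both `load_chained` and `load_whole` are hypothetical names used only for illustration; neither is DataToolkit.jl API, and plain directory strings stand in for whatever DirPath ends up being.

```julia
# Both functions are hypothetical illustrations, not DataToolkit.jl API.

# Option 1: chain a directory step with a per-file loader.
# The directory step only enumerates files; `loadfile` handles each one.
function load_chained(loadfile::Function, dir::AbstractString)
    files = filter(isfile, readdir(dir; join = true))
    return [loadfile(f) for f in files]
end

# Option 2: hand the directory to a single loading function, which is
# free to traverse and combine the contents in any way it likes.
load_whole(loaddir::Function, dir::AbstractString) = loaddir(dir)

dir = mktempdir()
write(joinpath(dir, "a.txt"), "1\n2\n")
write(joinpath(dir, "b.txt"), "3\n")

# Option 1: one result per file
per_file = load_chained(f -> parse.(Int, readlines(f)), dir)

# Option 2: the function decides; here it sums every number in the directory
total = load_whole(dir) do d
    sum(sum(parse.(Int, readlines(f))) for f in readdir(d; join = true))
end
```

Option 1 keeps existing single-file loaders reusable, while option 2 leaves room for loaders that need to see all the files at once.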

jfb-h, May 22 '24 14:05

We now have DirPath! :partying_face:

This is a big step, and it's been done properly: Merkle tree hashing for integrity, with caching so that repeated accesses and checks don't have to wait for the same work to be redone.
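For readers unfamiliar with the idea, here is a minimal sketch of Merkle-style directory hashing with a cache, using only the SHA standard library. It is an illustration under assumptions, not the DataToolkit.jl implementation: the names `new_cache`, `file_hash`, and `merkle_hash` are hypothetical, and invalidating the cache on modification time plus file size is one simple choice among several.

```julia
using SHA

# Hypothetical sketch; DataToolkit.jl's actual hashing and caching may differ.
# The cache maps a file path to (mtime, size, digest) so that files which
# look unchanged are not read and hashed again.
new_cache() = Dict{String, Tuple{Float64, Int64, Vector{UInt8}}}()

function file_hash(path::AbstractString, cache)
    st = stat(path)
    cached = get(cache, path, nothing)
    if cached !== nothing && cached[1] == st.mtime && cached[2] == st.size
        return cached[3]                      # cache hit: skip reading the file
    end
    digest = open(sha256, path)               # hash the file contents
    cache[path] = (st.mtime, st.size, digest)
    return digest
end

# A directory's hash is the hash of its sorted (name, child hash) entries,
# so a change to any file propagates up to the root hash.
function merkle_hash(path::AbstractString, cache)
    isdir(path) || return file_hash(path, cache)
    ctx = SHA256_CTX()
    for name in readdir(path)                 # readdir returns sorted names
        update!(ctx, Vector{UInt8}(name))
        update!(ctx, UInt8[0x00])             # separator between name and hash
        update!(ctx, merkle_hash(joinpath(path, name), cache))
    end
    return digest!(ctx)
end

# Usage: hash a small directory tree twice, then change one file
dir = mktempdir()
mkdir(joinpath(dir, "sub"))
write(joinpath(dir, "a.txt"), "hello")
write(joinpath(dir, "sub", "b.txt"), "world")

cache = new_cache()
h1 = merkle_hash(dir, cache)
h2 = merkle_hash(dir, cache)                  # file digests come from the cache
write(joinpath(dir, "sub", "b.txt"), "world!!")
h3 = merkle_hash(dir, cache)                  # size changed, so b.txt is re-hashed
```

Only the changed file is re-read on the third call; the directory digests are cheap to recompute because they hash short name and digest pairs instead of file contents.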

Now that we have an easy way to arrive at a collection of items, we can start thinking about the next step: how to handle them in bulk...

tecosaur, Jun 16 '24 11:06