Support for Lance
🚀 Feature Request
I've seen the Lance format mentioned in a few places in the codebase, and there was a recent talk about Lance and Databricks working together to integrate Streaming Dataset with the Lance format. What is the status of this effort?
Motivation
The motivation is pretty much laid out in the video https://www.youtube.com/watch?v=MwF9aTnZmN4 that you guys gave.
[Optional] Implementation
Additional context
Hey @oceanusxiv, thanks for reaching out to us. We haven't started any active development yet. I am curious about your use case. Can you share more details on it? For example, why Lance and why not MDS? How does Lance + Streaming Dataset help your use case?
@karan6181 Hi, yes, basically MDS as a format is a bit too specialized for training and training only. It's hard to inspect or transform the data once it has been converted. Lance has much more robust tooling (integration with Polars, fast random access, etc.), which facilitates better data exploration and processing. MDS also cannot do any filtering after conversion. The use case is really well laid out in https://www.youtube.com/watch?v=MwF9aTnZmN4.
OTOH, Lance doesn't have very good support for multi-dataset weighted sampling or fast resumption of the DataLoader. It also has performance issues if you're not iterating through the fragments directly (which I can't do, because my data has some temporal correlation and I need to fetch multiple rows per sample). Streaming Dataset does provide these facilities.
@oceanusxiv Definitely agreed on all these points. While Lance integration isn't currently on our roadmap, if we have more interest from internal customers on this front, this would definitely be a great feature to have for all the reasons you've stated.
Enabling another file format may not be too difficult if you want to take an initial pass yourself. You would want to make classes similar to the MDS Reader and Writer, but for the Lance format instead.
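To make the shape of that work a bit more concrete, here is a rough, self-contained sketch of a writer/reader pair behind a common interface. Everything in it is an assumption for illustration: `ShardWriter`, `ShardReader`, `JsonlWriter`, `JsonlReader`, and the `index.json` layout are hypothetical names, not the actual Streaming base classes or on-disk format, and JSON Lines stands in for Lance so the example runs without extra dependencies. A Lance version would presumably replace only `encode_shard` and `decode_sample`.

```python
import bisect
import json
import os
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ShardWriter(ABC):
    """Hypothetical base writer: buffers samples and flushes them as shards."""

    format: str     # format name recorded in the index
    extension: str  # file extension for shard files

    def __init__(self, out: str, samples_per_shard: int = 2) -> None:
        self.out = out
        self.samples_per_shard = samples_per_shard
        self.buffer: List[Dict[str, Any]] = []
        self.shards: List[Dict[str, Any]] = []
        os.makedirs(out, exist_ok=True)

    def write(self, sample: Dict[str, Any]) -> None:
        self.buffer.append(sample)
        if len(self.buffer) >= self.samples_per_shard:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        name = f"shard.{len(self.shards):05d}{self.extension}"
        self.encode_shard(os.path.join(self.out, name), self.buffer)
        self.shards.append({"file": name, "samples": len(self.buffer)})
        self.buffer = []

    def finish(self) -> None:
        # Flush the last partial shard, then write an index describing all shards.
        self.flush()
        with open(os.path.join(self.out, "index.json"), "w") as f:
            json.dump({"format": self.format, "shards": self.shards}, f)

    @abstractmethod
    def encode_shard(self, path: str, samples: List[Dict[str, Any]]) -> None:
        """Format-specific: serialize one shard to `path`."""


class ShardReader(ABC):
    """Hypothetical base reader: maps a global sample index to (shard, local index)."""

    def __init__(self, root: str) -> None:
        self.root = root
        with open(os.path.join(root, "index.json")) as f:
            self.index = json.load(f)
        # offsets[i] is the global index of the first sample in shard i.
        self.offsets = [0]
        for shard in self.index["shards"]:
            self.offsets.append(self.offsets[-1] + shard["samples"])

    def __len__(self) -> int:
        return self.offsets[-1]

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        shard_id = bisect.bisect_right(self.offsets, idx) - 1
        path = os.path.join(self.root, self.index["shards"][shard_id]["file"])
        return self.decode_sample(path, idx - self.offsets[shard_id])

    @abstractmethod
    def decode_sample(self, path: str, local_idx: int) -> Dict[str, Any]:
        """Format-specific: read one sample out of the shard at `path`."""


class JsonlWriter(ShardWriter):
    """Stand-in format; a Lance writer would write a Lance file here instead."""

    format = "jsonl"
    extension = ".jsonl"

    def encode_shard(self, path: str, samples: List[Dict[str, Any]]) -> None:
        with open(path, "w") as f:
            for sample in samples:
                f.write(json.dumps(sample) + "\n")


class JsonlReader(ShardReader):
    """Stand-in format; a Lance reader would do a random-access row read here."""

    def decode_sample(self, path: str, local_idx: int) -> Dict[str, Any]:
        with open(path) as f:
            return json.loads(f.readlines()[local_idx])
```

Usage would look something like `w = JsonlWriter("out_dir")`, a `w.write({...})` call per sample, then `w.finish()`, followed by `JsonlReader("out_dir")[i]` to fetch a sample. The point of the split is that sharding, indexing, and index lookup live in the base classes, so a new format only has to supply the two encode/decode hooks.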
I am also interested in some form of integration between LanceDB + Mosaic Streaming.
+1. Is anyone willing to implement this? It sounds very useful.