lance
How to convert a public dataset (like ImageNet) into Lance
Hi, we are currently testing the performance of Lance in DL training, but we don't have a dataset to use for comparison.
Hi there, we're accumulating converters for open source datasets.
So far we have ones for:
- imagenet
- nuscenes
- oxford pet
https://github.com/eto-ai/lance/tree/main/python/lance/data/convert
There's also an old one for coco: https://github.com/eto-ai/lance/tree/main/python/benchmarks/coco
We're happy to build more if you have a particular dataset in mind. If you're keen to build your own, the workflow is usually the same:
- Read the metadata into a pandas dataframe
- Convert the dataframe into a pyarrow table:
tbl = pa.Table.from_pandas(df, schema)
- Write to lance:
lance.write_dataset(tbl, uri)
If you have Discord, we can discuss live: https://discord.gg/aA8j2Eee
or send me an email at [email protected]
There are a lot of training optimizations we can add to Lance, and we'd love to see your training workload to know which ones matter most.
@autumn0207 hi, curious if you have any results to share? Happy to work with you to make it go faster.
@changhiskhan sure, my colleague sent you an email last week to ask about some more technical issues, but hasn't gotten a reply yet. Can you give us another way to contact you?
Hi, thanks for this interesting effort!
Is there a way to tell which documentation/demos still work with the latest release? E.g. the above-mentioned https://github.com/eto-ai/lance/tree/main/python/lance/data/convert does not seem to exist anymore.
Similarly, the demo notebooks contained in older branches don't run with the latest pip install lance (e.g. the _write_dataset function).
Thanks for your help, and if I manage to get it to run I'm happy to contribute to the documentation (:
@luisoala that's a good catch. Ironic that I'm the one who forgot about direct pandas integration :) Let me make a quick fix for that.
Docs - I'm planning to do a rough revamp to bring docs back online today. Then I'll ping you here for feedback?
If you prefer, email me at [email protected] or join our discord for faster responses! Thanks!
@autumn0207 sorry dude! Which email address did you use? Send me an email at [email protected] or jump on our discord. Would be more than happy to dive into details!
Just made #543 for the fix if you'd like to take a look. I'm working on the docs now.
Got it, thanks. I wasn't sure which channel you prefer for keeping track of questions (I joined the discord last week but have not checked it yet, will do now :)
@autumn0207 ah btw, I made a simple test notebook a few weeks back: https://github.com/luisoala/lance-test/blob/main/lance_test-pen_dataset.ipynb
create on local client -> write to s3 -> query again from client -> load to client and display
It's rudimentary, just going through the motions (at the moment it writes raw DNG image bytes to the table, which I would not recommend in practice).
We've made a bunch of documentation improvements and have brought our docs/examples up to date. The older C++ branch contains a few extra things that we no longer support or that have changed since the Rust rewrite. If there's anything specific you'd like, please let us know.