How to convert a public dataset (like ImageNet) into Lance

Open autumn0207 opened this issue 2 years ago • 10 comments

hi, we are currently testing the performance of Lance in DL training, but we don't have a dataset to use as a comparison.

autumn0207 avatar Jan 12 '23 02:01 autumn0207

Hi there, we're accumulating converters for open source datasets.

So far we have ones for imagenet, nuscenes, and oxford pet:

https://github.com/eto-ai/lance/tree/main/python/lance/data/convert

there's also an old one for coco: https://github.com/eto-ai/lance/tree/main/python/benchmarks/coco

We're happy to build more if you have a particular dataset in mind. If you're keen to build your own, the workflow is usually the same:

  1. Read the metadata into a pandas dataframe
  2. Convert the dataframe into a pyarrow table: tbl = pa.Table.from_pandas(df, schema)
  3. Write to lance: lance.write_dataset(tbl, uri)

changhiskhan avatar Jan 12 '23 02:01 changhiskhan

If you have Discord, we can discuss live: https://discord.gg/aA8j2Eee

or send me an email to [email protected]

There are a lot of training optimizations we can add to Lance, and we'd love to see your training workload to learn which ones matter most.

changhiskhan avatar Jan 12 '23 03:01 changhiskhan

@autumn0207 hi, curious if you guys have any results to share? Happy to work with you to go faster

changhiskhan avatar Jan 18 '23 05:01 changhiskhan

@changhiskhan sure, my colleague sent you an email last week to ask about some more technical issues, but hasn't received a reply yet. Could you give us another way to contact you?

autumn0207 avatar Feb 08 '23 01:02 autumn0207

hi thanks for this interesting effort!

is there a way to tell which documentation/demos still work with the latest release? e.g. the above-mentioned https://github.com/eto-ai/lance/tree/main/python/lance/data/convert does not seem to exist anymore.

similarly, the demo notebooks contained in older branches don't run with the latest pip install lance (e.g. the _write_dataset function)

thanks for your help, and if I manage to get it to run I'm happy to contribute documentation (:

luisoala avatar Feb 09 '23 17:02 luisoala

@luisoala that's a good catch. Ironic that I'm the one who forgot about direct pandas integration :) Let me make a quick fix for that.

Docs - I'm planning to do a rough revamp to bring docs back online today. Then I'll ping you here for feedback?

If you prefer, email me at [email protected] or join our discord for faster responses! Thanks!

changhiskhan avatar Feb 09 '23 19:02 changhiskhan

@autumn0207 sorry dude! Which email address did you use? Send me an email at [email protected] or jump on our discord. Would be more than happy to dive into details!

changhiskhan avatar Feb 09 '23 19:02 changhiskhan

Just opened #543 with the fix if you'd like to take a look. I'm working on the docs now.

changhiskhan avatar Feb 09 '23 19:02 changhiskhan

got it, thanks. I wasn't sure which channel you prefer for keeping track of questions (I joined the discord last week but have not checked it yet, will do now) :)

luisoala avatar Feb 09 '23 19:02 luisoala

@autumn0207 ah btw, I made a simple test notebook a few weeks back: https://github.com/luisoala/lance-test/blob/main/lance_test-pen_dataset.ipynb

create on local client -> write to s3 -> query again from client -> load to client and display

It's rudimentary, just enough to go through the motions (at the moment it writes raw DNG image bytes to the table, which I would not recommend in practice).

luisoala avatar Mar 09 '23 00:03 luisoala

We've made a bunch of documentation improvements and have brought our docs/examples up to date. The older C++ branch contains a few extra things that we no longer support or that have changed since the Rust rewrite. If there's anything specific you'd like, please let us know.

changhiskhan avatar Jul 02 '23 22:07 changhiskhan