Added a connectors folder and a Hugging Face connector file.
The connector takes any Hugging Face dataset identifier and returns an AtlasDataset.
Updates:
- No longer fails when parsing the dataset config hits an error
- Updated example usage
- Updated the way it handles lists
Testing:
- Tested with 20 HF datasets, mostly smaller ones plus a couple of larger ones
Limitations:
- Audio files
- Image files
| :rocket: | This description was created by Ellipsis for commit 9ae14f42ed2de9aad1386f93bf75883349eb6b6c |
|---|---|
Summary:
Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.
Key points:
- Introduced a new connector for Hugging Face datasets.
- Added `connectors/huggingface_connector.py`.
- Implemented `connectors/huggingface_connector.get_hfdata` to load datasets and handle configuration issues.
- Added unique IDs to each dataset entry using a sequential counter.
- Implemented `connectors/huggingface_connector.hf_atlasdataset` to create an `AtlasDataset`.
- Included the data processing functions `connectors/huggingface_connector.convert_to_string` and `connectors/huggingface_connector.process_table`.
- Used Apache Arrow for data processing.
- Included a command-line interface in `connectors/huggingface_connector.py`.
- Updated `connectors/__init__.py` and `examples/HF_example_usage.py`.
- `add_data` accepts Arrow tables directly.
- Made an interactive script using argparse in the example file.
- Tested with ~80 different datasets, including small and large datasets.
- Works for text, lists, booleans, numbers, special symbols, file paths, and columns with special characters.
- Limitations: does not support images or audio.
Generated with :heart: by ellipsis.dev
Loads the dataset using load_dataset and assigns a unique ULID to each entry.
(andriy) These should be ULIDs, not SHA-256 hashes or UUIDs. @apage43 has a nice function for these.
The primary reason to avoid purely random or hash-based IDs is that they cause worst-case performance when used as keys in ordered data structures (such as the b-tree indexes in a database). ULIDs improve on this by making the beginning of the ID a timestamp, so IDs created around the same time have some locality to each other; but, like UUIDs, they are still fairly big.
Big (semi)random IDs like ULIDs are best used when you need uniqueness while also avoiding coordination, e.g. when multiple processes insert data into something and making them cooperate to assign non-overlapping IDs would add a lot of complexity. In situations where you can use purely sequential IDs, it is usually better to do so, as smaller IDs are cheaper to store and look up.
When using map_data, the nomic client already has functionality to create a sequential ID field (note that IDs are still required to be strings, so it base64-encodes their binary representation); it may make sense to copy that behavior. See here: https://github.com/nomic-ai/nomic/blob/1f042befc53892271bd0a0877070d47b2d3cb631/nomic/atlas.py#L77
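For illustration, here is a minimal sketch of sequential string IDs in that style. The exact encoding the nomic client uses is an assumption here, and `sequential_string_ids` is a hypothetical helper, not part of any library:

```python
import base64

def sequential_string_ids(n, start=0):
    """Generate n sequential string IDs by base64-encoding each
    counter's fixed-width binary representation. This sketches (as an
    assumption) the nomic client behavior linked above, where the ID
    field must contain strings."""
    ids = []
    for i in range(start, start + n):
        raw = i.to_bytes(8, "big")  # 8-byte big-endian counter value
        ids.append(base64.b64encode(raw).decode("ascii"))
    return ids

print(sequential_string_ids(3))
```

As the comment above notes, sequential IDs like these are cheap to store and look up, and they require no coordination as long as a single process assigns them.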
Tested with datasets smaller than 10k rows for speed, but it can work with larger datasets.
I believe this will not currently work when the dataset size exceeds the available RAM on the machine running it. HF datasets understands slice syntax when specifying a split, so you can test with a portion of a very large dataset: load_dataset("really-big-dataset", split="train[:100000]") gets only the first 100k rows.
Making it work should be possible by working in chunks and using IterableDatasets https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming
Here is a notebook where I'm uploading from an IterableDataset in chunks: https://gist.github.com/apage43/9e80b0f4378ed466ec5d1c0a4042c398 (Note that because I call load_dataset and then to_iterable_dataset, this still downloads the entire dataset. You can instead pass streaming=True to load_dataset to get an IterableDataset that only downloads as much as you actually read, which may be desirable if you're only working with a subset of a large dataset.)
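The chunking pattern from that notebook can be sketched with a small stdlib helper. The commented-out `add_data` call and the generator standing in for a streaming dataset are assumptions for illustration, not the notebook's exact code:

```python
from itertools import islice

def batched(iterable, n):
    """Yield lists of up to n items from any iterable; this works for
    streaming IterableDatasets, which cannot be indexed or sliced."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

# Generator standing in for load_dataset(..., streaming=True).
rows = ({"id": str(i), "text": f"doc {i}"} for i in range(10))

for chunk in batched(rows, 4):
    # dataset.add_data(data=chunk)  # assumed nomic client call
    print(len(chunk))
```

Because `batched` only materializes one chunk at a time, peak memory stays bounded by the chunk size rather than the dataset size.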
Going off @apage43's comment, I feel strongly that we should take advantage of Hugging Face datasets' use of Arrow to pass data to Atlas, which also speaks fluent Arrow. We should also take advantage of batching or chunking for arbitrarily large datasets; using plain Python iterators means this will break for larger datasets.
The current version of this has no create_index calls, so it will only create an AtlasDataset with data in it but no map. Is that intended?