
added connector folder and HF file

Open abhisomala opened this issue 1 year ago • 3 comments

The HF connector takes any Hugging Face dataset identifier and returns an AtlasDataset.

Updates:

  • Handles dataset config issues without parsing through the raw error message
  • Updated example usage
  • Updated the way it handles lists

Testing:

  • 20 HF datasets, mostly smaller ones plus a couple of larger ones

Limitations:

  • Audio files
  • Image files

:rocket: This description was created by Ellipsis for commit 9ae14f42ed2de9aad1386f93bf75883349eb6b6c

Summary:

Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.

Key points:

  • Introduced a new connector for Hugging Face datasets.
  • Added connectors/huggingface_connector.py.
  • Implemented connectors/huggingface_connector.get_hfdata to load datasets and handle configuration issues.
  • Added unique IDs to each dataset entry using a sequential counter.
  • Implemented connectors/huggingface_connector.hf_atlasdataset to create an AtlasDataset.
  • Included data processing functions connectors/huggingface_connector.convert_to_string and connectors/huggingface_connector.process_table.
  • Used Apache Arrow for data processing.
  • Included a command-line interface in connectors/huggingface_connector.py.
  • Updated connectors/__init__.py and examples/HF_example_usage.py.
  • add_data accepts arrow tables directly.
  • Made an interactive script using argparse in example file.
  • Tested with ~80 different datasets, including small and large datasets.
  • Works for text, lists, booleans, numbers, special symbols, file paths, columns with special characters.
  • Limitations: Does not support images or audio.
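The key points above mention a `convert_to_string` helper that normalizes heterogeneous cell values (lists, booleans, numbers, special symbols) to strings before upload. A minimal sketch of what such a helper might look like follows; the function name comes from the PR, but the body here is an illustrative assumption, not the actual implementation in `connectors/huggingface_connector.py`:

```python
def convert_to_string(value):
    """Normalize a dataset cell to a string for Atlas upload.

    Illustrative sketch only -- the real connector may differ.
    """
    if value is None:
        return ""
    if isinstance(value, bool):  # check bool before numbers: bool subclasses int
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return ", ".join(convert_to_string(v) for v in value)
    if isinstance(value, dict):
        return ", ".join(f"{k}: {convert_to_string(v)}" for k, v in value.items())
    return str(value)
```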

Generated with :heart: by ellipsis.dev

abhisomala avatar Jul 08 '24 18:07 abhisomala

Loads the dataset with load_dataset and assigns each entry a unique ULID

(andriy) These should be ULIDs, not SHA-256 hashes or UUIDs. @apage43 has a nice function for these

The primary reason to avoid purely random or hash-based IDs is that they cause worst-case performance when used as keys in ordered data structures (such as b-tree indexes in a database). ULIDs improve on this by making the beginning of the ID a timestamp, so IDs created around the same time have some locality to each other; but, like UUIDs, they are still fairly big.
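To make the timestamp-prefix property concrete, here is a minimal ULID generator following the public ULID spec (a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). This is a sketch for illustration; real code would use a maintained ULID library rather than hand-rolling it:

```python
import os
import time

# Crockford base32 alphabet used by ULIDs (no I, L, O, U),
# in ascending ASCII order so string sort matches numeric sort.
_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    """Return a 26-char ULID: 48-bit ms timestamp + 80 random bits."""
    ts = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    n = (ts << 80) | rand                          # 128 bits total
    # 26 chars * 5 bits = 130 bits; the top 2 bits are always zero.
    chars = []
    for _ in range(26):
        chars.append(_ALPHABET[n & 0x1F])
        n >>= 5
    return "".join(reversed(chars))
```

Because the first 10 characters encode the timestamp, IDs generated around the same time sort next to each other, which is what gives ULIDs their b-tree locality advantage over UUIDs.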

Big (semi-)random IDs like ULIDs are best when you need uniqueness while also avoiding coordination, e.g. when multiple processes insert data into something and it would add a lot of complexity to make them cooperate on assigning non-overlapping IDs. But in situations where you can use purely sequential IDs, it is usually better to: smaller IDs are cheaper to store and look up.

When using map_data, the nomic client already has functionality to create a sequential ID field (note that it's still required to be a string, so it base64-encodes the binary representation); it may make sense to copy that behavior. See here: https://github.com/nomic-ai/nomic/blob/1f042befc53892271bd0a0877070d47b2d3cb631/nomic/atlas.py#L77
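A sketch of that idea, sequential integer counters serialized as base64 strings, is below. The helper name and exact encoding are assumptions for illustration; the nomic client's actual implementation at the link above may differ:

```python
import base64
import struct

def sequential_string_ids(n: int, start: int = 0):
    """Yield n sequential IDs as compact base64 strings.

    Sketch only -- the nomic client's actual encoding may differ.
    Each ID is an 8-byte big-endian counter, URL-safe base64 encoded,
    satisfying the "IDs must be strings" requirement while keeping
    them small and strictly sequential underneath.
    """
    for i in range(start, start + n):
        packed = struct.pack(">Q", i)  # 8-byte big-endian counter
        yield base64.urlsafe_b64encode(packed).decode("ascii")
```

Since the underlying values are sequential, inserts into ordered index structures stay append-like instead of scattering across the keyspace the way random IDs do.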


Tested with datasets smaller than 10k rows for speed, but it can work with larger datasets

I believe this will not currently work when the dataset size exceeds the available RAM on the machine running it. HF datasets understands slice syntax when specifying a split, so you can test with a portion of a very large dataset: load_dataset("really-big-dataset", split="train[:100000]") fetches only the first 100k rows.

Making it work should be possible by processing in chunks and using IterableDataset: https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming

Here is a notebook where I upload from an IterableDataset in chunks (note that because I call load_dataset and then to_iterable_dataset, this still downloads the entire dataset; you can instead pass streaming=True to load_dataset to get an IterableDataset that only downloads as much as you actually read, which may be desirable if you're only working with a subset of a large dataset): https://gist.github.com/apage43/9e80b0f4378ed466ec5d1c0a4042c398
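The chunking pattern from that notebook can be sketched generically. `batch_iter` below is a hypothetical helper (a plain iterable stands in for the IterableDataset), so the memory footprint is one batch of rows at a time rather than the whole dataset:

```python
from itertools import islice

def batch_iter(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable.

    Works the same over a streaming HF IterableDataset, keeping only
    one batch of rows in memory at a time.
    """
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage with the connector (names assumed, not from the PR):
#   ds = load_dataset("some-big-dataset", split="train", streaming=True)
#   for batch in batch_iter(ds, 10_000):
#       atlas_dataset.add_data(batch)
```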

apage43 avatar Jul 09 '24 21:07 apage43

Going off @apage43's comment, I feel strongly that we should be taking advantage of Hugging Face Datasets' use of Arrow to pass data to Atlas, which also speaks fluent Arrow. We should also be taking advantage of batching or chunking for arbitrarily large datasets; relying on plain Python iterators means this will break on larger datasets.

RLesser avatar Jul 09 '24 23:07 RLesser

The current version of this has no create_index calls, so it will only create an AtlasDataset with data in it but no map - is that intended?

apage43 avatar Jul 11 '24 16:07 apage43