olmocr
Bump datasets from 3.0.0 to 3.2.0
Bumps datasets from 3.0.0 to 3.2.0.
Release notes
Sourced from datasets's releases.
3.2.0
Dataset Features
- Faster parquet streaming + filters with predicate pushdown by @lhoestq in huggingface/datasets#7309
  - Up to +100% streaming speed
  - Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.

        from datasets import load_dataset

        filters = [('date', '>=', '2023')]
        ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
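To make the predicate pushdown idea above concrete, here is a minimal pure-Python sketch of skipping row groups by their min/max statistics before reading any rows. It is an illustration only and not how `datasets` implements it: `RowGroup`, `may_match`, and `scan` are hypothetical names, and the in-memory row groups stand in for Parquet metadata.

```python
import operator
from dataclasses import dataclass

# Comparison operators supported by this sketch.
OPS = {">=": operator.ge, ">": operator.gt,
       "<=": operator.le, "<": operator.lt, "==": operator.eq}


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with column statistics."""
    min_date: str
    max_date: str
    rows: list


def may_match(group, op, value):
    """Return False only when the statistics prove no row can match."""
    if op in (">=", ">"):
        return OPS[op](group.max_date, value)
    if op in ("<=", "<"):
        return OPS[op](group.min_date, value)
    if op == "==":
        return group.min_date <= value <= group.max_date
    return True  # unknown operator: be conservative and read the group


def scan(groups, column, op, value):
    """Filter rows, counting how many row groups actually had to be read."""
    matched, groups_read = [], 0
    for group in groups:
        if not may_match(group, op, value):
            continue  # skipped without touching its rows
        groups_read += 1
        matched.extend(r for r in group.rows if OPS[op](r[column], value))
    return matched, groups_read


groups = [
    RowGroup("2021-01-01", "2021-12-31", [{"date": "2021-06-01"}]),
    RowGroup("2022-06-01", "2023-03-01",
             [{"date": "2022-06-01"}, {"date": "2023-03-01"}]),
    RowGroup("2023-04-01", "2024-01-01",
             [{"date": "2023-04-01"}, {"date": "2024-01-01"}]),
]
rows, groups_read = scan(groups, "date", ">=", "2023")
```

In this sketch the first row group is never read because its maximum date sorts below `'2023'`; the remaining groups are read and then filtered row by row.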
Other improvements and bug fixes
- fix conda release worlflow by @lhoestq in huggingface/datasets#7272
- Add link to video dataset by @NielsRogge in huggingface/datasets#7277
- Raise error for incorrect JSON serialization by @varadhbhatnagar in huggingface/datasets#7273
- support for custom feature encoding/decoding by @alex-hh in huggingface/datasets#7284
- update load_dataset doctring by @lhoestq in huggingface/datasets#7301
- Let server decide default repo visibility by @Wauplin in huggingface/datasets#7302
- fix: update elasticsearch version by @ruidazeng in huggingface/datasets#7300
- Fix typing in iterable_dataset.py by @lhoestq in huggingface/datasets#7304
- Updated inconsistent output in documentation examples for ClassLabel by @sergiopaniego in huggingface/datasets#7293
- More docs to from_dict to mention that the result lives in RAM by @lhoestq in huggingface/datasets#7316
- Release: 3.2.0 by @lhoestq in huggingface/datasets#7317

New Contributors
- @ruidazeng made their first contribution in huggingface/datasets#7300
- @sergiopaniego made their first contribution in huggingface/datasets#7293

Full Changelog: https://github.com/huggingface/datasets/compare/3.1.0...3.2.0
3.1.0
Dataset Features
- Video support by @lhoestq in huggingface/datasets#7230

      >>> from datasets import Dataset, Video, load_dataset
      >>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
      >>> # or from the hub
      >>> ds = load_dataset("username/dataset_name", split="train")
      >>> ds[0]["video"]
      <decord.video_reader.VideoReader at 0x105525c70>
- Add IterableDataset.shard() by @lhoestq in huggingface/datasets#7252

      >>> from datasets import load_dataset
      >>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
      >>> full_ds.num_shards
      2360
      >>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
      >>> ds.num_shards
      1
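As a rough illustration of what sharding means in the example above, the sketch below splits a list of source files into contiguous, near-equal shards and returns the one at a given index. This is an assumption-laden sketch, not the library's code: `contiguous_shard` and the file names are hypothetical, and the actual assignment strategy in `IterableDataset.shard()` may differ.

```python
def contiguous_shard(items, num_shards, index):
    """Return the `index`-th of `num_shards` contiguous slices of `items`.

    Earlier shards receive one extra item when the split is uneven.
    """
    if num_shards <= 0 or not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    base, extra = divmod(len(items), num_shards)
    start = base * index + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return items[start:end]


# Hypothetical file list standing in for a dataset's underlying shards.
files = [f"part-{i:04d}.parquet" for i in range(10)]

first = contiguous_shard(files, num_shards=3, index=0)
last = contiguous_shard(files, num_shards=3, index=2)

# Asking for as many shards as there are files leaves one file per shard,
# which mirrors the `num_shards` going from 2360 to 1 in the example above.
single = contiguous_shard(files, num_shards=len(files), index=0)
```

Each worker in a distributed job could call such a function with its own `index` so that no two workers read the same file.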
... (truncated)
Commits
- fba4758 Release: 3.2.0 (#7317)
- 8983782 More docs to from_dict to mention that the result lives in RAM (#7316)
- 661d7ba Faster parquet streaming + filters with predicate pushdown (#7309)
- b60ebb8 Updated inconsistent output in documentation examples for ClassLabel (#7293)
- c9d3450 Update iterable_dataset.py (#7304)
- 38d648e fix: update elasticsearch version (#7300)
- c8252f2 Let server decide default repo visibility (#7302)
- 06c3235 update load_dataset doctring (#7301)
- 17f17b3 support for custom feature encoding/decoding (#7284)
- 2049c00 Raise error for incorrect JSON serialization (#7273)
- Additional commits viewable in compare view
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)