vortex
Vortex should support reading from the Hugging Face Datasets API
For example, the following should just work:
import vortex as vx
url = "hf://datasets/danking00/statpopgen-benchmark/10000/vortex-file-compressed/gnomad.genomes.v3.1.2.hgdp_tgp.chr21.vortex"
f = vx.open(url)
arrays = list(f.scan())
array = vx.io.read_url(url)
This works fine for Parquet files:
import pyarrow.dataset as ds
import pyarrow.parquet as pq
table = pq.read_table("hf://datasets/danking00/statpopgen-benchmark/10000/parquet/gnomad.genomes.v3.1.2.hgdp_tgp.chr21.parquet")
dataset = ds.dataset(
    "hf://datasets/danking00/statpopgen-benchmark/10000/parquet/gnomad.genomes.v3.1.2.hgdp_tgp.chr21.parquet",
    format="parquet",
)
scanned = dataset.to_table()
See also
https://github.com/huggingface/datasets/issues/7863
Ideally this would also work with Polars, DataFusion & DuckDB.
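Until native `hf://` support exists, one possible workaround is to download the file through `huggingface_hub` and open the local copy. The sketch below is not part of Vortex: `parse_hf_dataset_url` and `open_vortex_from_hf` are hypothetical helper names, and the parser assumes the simple `hf://datasets/<namespace>/<repo>/<path>` form with no `@revision` suffix. It also assumes `vx.open` accepts a local path, and it gives up lazy remote reads because the whole file is fetched first.

```python
from typing import NamedTuple


class HfDatasetPath(NamedTuple):
    """Pieces of an hf://datasets/... URL (hypothetical helper type)."""
    repo_id: str   # "<namespace>/<repo>"
    filename: str  # path of the file inside the dataset repo


def parse_hf_dataset_url(url: str) -> HfDatasetPath:
    """Split hf://datasets/<namespace>/<repo>/<path> into repo id and file path.

    Assumption: no "@revision" suffix on the repo name.
    """
    prefix = "hf://datasets/"
    if not url.startswith(prefix):
        raise ValueError(f"not an hf://datasets/ URL: {url!r}")
    # Split into namespace, repo name, and the remaining in-repo path.
    parts = url[len(prefix):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected hf://datasets/<namespace>/<repo>/<path>: {url!r}")
    namespace, name, filename = parts
    return HfDatasetPath(repo_id=f"{namespace}/{name}", filename=filename)


def open_vortex_from_hf(url: str):
    """Download the file to the local Hugging Face cache, then open it with Vortex."""
    # Imported lazily so the URL parser above works without these packages.
    from huggingface_hub import hf_hub_download
    import vortex as vx

    parsed = parse_hf_dataset_url(url)
    local_path = hf_hub_download(
        repo_id=parsed.repo_id,
        filename=parsed.filename,
        repo_type="dataset",
    )
    return vx.open(local_path)
```

With the URL from the example above, `open_vortex_from_hf(url)` would return the same kind of object that `vx.open(url)` is expected to return once `hf://` URLs are supported directly.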