Implement glob patterns on IPFS

Open · davidgasquez opened this issue on May 30 '22 · 3 comments

For large datasets stored as multiple Parquet/CSV files, it would be much better to support glob patterns than to write multiple UNION ALL statements.
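
A minimal sketch of the difference in DuckDB (the CID and file names are placeholders, and the ipfs:// glob in the last query is the feature being requested here, not something DuckDB supports today):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")

# Today: every file in the dataset has to be unioned by hand,
# e.g. through a public gateway.
con.sql("""
    SELECT * FROM read_parquet('https://ipfs.io/ipfs/<CID>/part-0.parquet')
    UNION ALL
    SELECT * FROM read_parquet('https://ipfs.io/ipfs/<CID>/part-1.parquet')
""")

# Desired: one glob over the whole dataset.
con.sql("SELECT * FROM read_parquet('ipfs://<CID>/*.parquet')")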

davidgasquez · May 30 '22 15:05

Glob patterns could perhaps be supported through an S3 interface to IPFS.
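
A hedged sketch of how that could look, pointing DuckDB's httpfs extension at an S3-compatible IPFS pinning service (the endpoint, bucket, and credentials below are hypothetical placeholders):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")

# Point DuckDB's S3 client at an S3-compatible IPFS service
# (endpoint and credentials are placeholders).
con.sql("SET s3_endpoint = 's3.example-ipfs-service.com';")
con.sql("SET s3_access_key_id = '...';")
con.sql("SET s3_secret_access_key = '...';")

# Glob over every Parquet file in the pinned dataset.
con.sql("SELECT count(*) FROM read_parquet('s3://my-dataset/*.parquet')")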

davidgasquez · Jun 20 '22 07:06

Another alternative is to mount IPFS as a local filesystem directory and query the files from there (see the sketch after this list). Kubo can do that with ipfs mount, and these other projects might help:

  • https://github.com/SupraSummus/ipfs-api-mount
  • https://github.com/TheDiscordian/ipfs-sync
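
With a mount in place, querying becomes a plain local glob. A minimal sketch, assuming the Kubo daemon is running and ipfs mount has exposed /ipfs via FUSE (the CID is a placeholder):

import duckdb

# Assumes `ipfs daemon` is running and `ipfs mount` has mounted /ipfs.
con = duckdb.connect()
con.sql("SELECT * FROM read_csv_auto('/ipfs/<CID>/*.csv')")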

davidgasquez · Sep 08 '22 07:09

In theory, it should be possible to use the fsspec IPFS implementation (ipfsspec) to initialize a PyArrow dataset. In practice, it fails. :sweat_smile:

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler
import ipfsspec
import duckdb

# Wrap the fsspec IPFS filesystem so PyArrow can read through it.
fs = ipfsspec.IPFSFileSystem()
pa_fs = PyFileSystem(FSSpecHandler(fs))

con = duckdb.connect()

# Partition fields encoded in the file names.
partition_schema = pa.schema(
    [("year", pa.int16()), ("month", pa.int16()), ("day", pa.int16())]
)

# Full schema of the CSV files, including the partition fields.
data_schema = pa.schema(
    [
        ("height", pa.int64()),
        ("miner_id", pa.string()),
        ("sector_id", pa.string()),
        ("state_root", pa.string()),
        ("event", pa.string()),
        ("year", pa.int16()),
        ("month", pa.int16()),
        ("day", pa.int16()),
    ]
)

part = ds.partitioning(schema=partition_schema, flavor="filename")
dataset = ds.dataset(
    "bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe",
    schema=data_schema,  # pass the schema explicitly instead of inferring it
    filesystem=pa_fs,
    format="csv",
    partitioning=part,
)

# The dataset could then be queried through DuckDB's Arrow replacement scan:
# con.sql("SELECT count(*) FROM dataset")

davidgasquez · Sep 08 '22 07:09

It works when using https://github.com/AlgoveraAI/ipfspy!

davidgasquez · Sep 26 '22 14:09