Vladimir Rudnykh

Results 47 comments of Vladimir Rudnykh

> E.g. do we even need prefetch if file is local?

We don't need prefetch or cache (? different disks/file systems?) if the file is local, but I think this is...

> in reality it might be needed though in local mode - e.g. slow NAS mounted on some volume

Yes, exactly. This is still a local file, but cache/prefetch might be...

> What will users see when they run UDF with some bad files after this change?

They will see an exception:

```
datachain.lib.file.FileError: Error in file gs://datachain-test-vlad/.: path must not...
```

Closing this issue because we decided not to preserve ordering within a dataset.

Intermediate results:

`group_by.py`:

```python
import os

from datachain import DataChain, func


def path_ext(path):
    _, ext = os.path.splitext(path)
    return (ext.lstrip("."),)


(
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .map(
        path_ext,
        params=["file.path"],
        output={"path_ext": str},
    )
    .group_by(
        total_size=func.sum("file.size"),
        cnt=func.count(),...
```
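For readers unfamiliar with this kind of aggregation, here is a minimal plain-Python sketch of what a group-by over file extensions computes. It does not use DataChain at all. The grouping key (the file extension) is an assumption, since the snippet above is cut off before the partition key, and `group_by_ext` and `sample_files` are hypothetical names used only for illustration.

```python
import os
from collections import defaultdict


def path_ext(path):
    """Return the file extension without the leading dot ("" if none)."""
    _, ext = os.path.splitext(path)
    return ext.lstrip(".")


def group_by_ext(files):
    """Aggregate (path, size) pairs into per-extension totals and counts."""
    groups = defaultdict(lambda: {"total_size": 0, "cnt": 0})
    for path, size in files:
        group = groups[path_ext(path)]
        group["total_size"] += size  # analogous to a sum over file sizes
        group["cnt"] += 1            # analogous to a row count
    return dict(groups)


# Hypothetical sample data standing in for a bucket listing.
sample_files = [
    ("images/a.jpg", 100),
    ("images/b.jpg", 250),
    ("meta/a.json", 40),
    ("README", 10),
]
stats = group_by_ext(sample_files)
```

Files without an extension end up under the empty-string key, which may or may not match how the real pipeline treats them.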

Merged. Closing this issue, as work will continue in the follow-up issue https://github.com/iterative/datachain/issues/523.

Quick note: I have checked AWS S3, and it returns a public URL out of the box if no credentials are found:

```python
In [1]: from datachain.catalog import get_catalog

In [2]: catalog...
```

> How realistic it is to get this information from the real DBs.

Easy.

### SQLite

We can use [dbstat](https://www.sqlite.org/dbstat.html) to get table size:

```
sqlite> SELECT SUM("pgsize") FROM "dbstat"...
```
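As a rough illustration of the SQLite approach, the sketch below queries `dbstat` from Python's standard `sqlite3` module. The `WHERE "name" = ?` filter is an assumption, because the query above is truncated. `table_size_bytes` and `demo_rows` are hypothetical names. The `dbstat` virtual table only exists when SQLite is built with `SQLITE_ENABLE_DBSTAT_VTAB`, so the sketch falls back to the whole-database size (`page_count * page_size`) when it is unavailable.

```python
import os
import sqlite3
import tempfile


def table_size_bytes(conn, table):
    """Return bytes used by `table` via dbstat, or the whole-DB size as a fallback."""
    try:
        row = conn.execute(
            'SELECT SUM("pgsize") FROM "dbstat" WHERE "name" = ?', (table,)
        ).fetchone()
        return int(row[0] or 0)  # SUM() yields NULL when no pages match
    except sqlite3.OperationalError:
        # dbstat is not compiled into this SQLite build: approximate with
        # the size of the entire database file instead of a single table.
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        return page_count * page_size


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
    conn.execute("CREATE TABLE demo_rows (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO demo_rows (payload) VALUES (?)",
        [("x" * 100,) for _ in range(1000)],
    )
    conn.commit()
    size = table_size_bytes(conn, "demo_rows")
    conn.close()
```

Note that the fallback overestimates the size of any single table when the database holds more than one, so it is only a coarse upper bound.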

So we do have `num_objects` and `size` fields in the `dataset_version` table ([source code](https://github.com/iterative/datachain/blob/1de7bd322b1e5117553c404df3bf65f4ed23f91b/src/datachain/data_storage/metastore.py#L541-L542)). [Here](https://github.com/iterative/datachain/blob/1de7bd322b1e5117553c404df3bf65f4ed23f91b/src/datachain/catalog/catalog.py#L1130-L1171) we have a method to update these fields. And [here](https://github.com/iterative/datachain/blob/main/src/datachain/data_storage/warehouse.py#L394-L414) is the code for getting these...

> PS: it looks like there are 3 issues in this one :)

🤔 Indeed. Let me split it.