Ivan Shcheklein

Results 95 issues of Ivan Shcheklein

Since we list versions (do we?), we can use the first version of an object in a versioned bucket as its created date. Let's add it to the `File` object.
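A minimal sketch of the idea, assuming the version listing yields a key, a version ID, and a last-modified timestamp per version. `ObjectVersion` and `created_dates` are hypothetical names used only for illustration, not part of the datachain API, and the result is only as good as the listing: if old versions have been expired, the oldest remaining one is used.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable


@dataclass(frozen=True)
class ObjectVersion:
    """One entry of a versioned-bucket listing (hypothetical shape)."""

    key: str
    version_id: str
    last_modified: datetime


def created_dates(versions: Iterable[ObjectVersion]) -> Dict[str, datetime]:
    """Map each key to the timestamp of its oldest listed version."""
    created: Dict[str, datetime] = {}
    for version in versions:
        current = created.get(version.key)
        # Keep the earliest timestamp seen so far for this key.
        if current is None or version.last_modified < current:
            created[version.key] = version.last_modified
    return created


# Usage: two versions of the same object, listed newest first.
listing = [
    ObjectVersion("a.txt", "v2", datetime(2024, 6, 1, tzinfo=timezone.utc)),
    ObjectVersion("a.txt", "v1", datetime(2024, 1, 1, tzinfo=timezone.utc)),
]
created = created_dates(listing)
```

The function does not rely on the listing order, so it works whether the storage backend returns versions newest first or oldest first.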

enhancement

These days I quite often do this:

```python
if "dclm-raw-text" not in datasets:
    (
        DataChain.from_dataset("dclm-index")
        .settings(cache=True)
        .limit(1)
        .gen(extract, output={"file": File, "json": dict})
        .save("dclm-raw-text")
    )
```

to avoid running...
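The guard above can be factored into a small helper. This is only a sketch of the pattern: `save_if_missing` is a hypothetical name, not an existing datachain function, and the caller is assumed to supply the collection of existing dataset names and a callable that builds and saves the dataset.

```python
from typing import Callable, Iterable


def save_if_missing(
    name: str, existing: Iterable[str], build: Callable[[], None]
) -> bool:
    """Run build() only if no dataset called `name` exists yet.

    Returns True when build() was executed, False when it was skipped.
    """
    if name in set(existing):
        return False
    build()
    return True


# Usage with a stand-in builder that just records the call.
calls = []
ran_first = save_if_missing("dclm-raw-text", [], lambda: calls.append("built"))
ran_second = save_if_missing(
    "dclm-raw-text", ["dclm-raw-text"], lambda: calls.append("built")
)
```

In real use, `build` would wrap the chain that ends in `.save(name)`, so the expensive steps run only on the first invocation.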

Follow-up to https://github.com/iterative/datachain/pull/755. This is less critical to implement, since it affects only public buckets that need no credentials and Studio teams. It already works for Google Storage, since @dreadatour fixed it a while ago....

bug
enhancement

Context: https://arxiv.org/pdf/2406.11794 and https://www.datacomp.ai/dclm/

DCLM download is covered. Next steps are:

- [ ] research how they do deduplication
- [ ] apply model-based filtering
- [ ]...

enhancement

Come up with a higher-level LLM UDF. When analyzing data via LLMs (text, images), step by step we end up with quite a lot of repetitive code like:

```python
def extract_performance(chunk: Chunk)...
```

enhancement
triage

Motivated by this: https://github.com/iterative/datachain/issues/510#issuecomment-2413861745

> I'm not sure it should be allowed at all - to have a nested column meta and a regular field meta at the same time....

enhancement

# Bug Report

## Description

`dvc list -R . .` lists `.env` even if it is part of the `.gitignore`.

It's related to https://github.com/iterative/dvc/issues/5712, and we should understand what...

product: VSCode
A: status

Fix resource cleanup, mostly to avoid tons of warnings in tests, but probably also to avoid some leaks.

```
DB::Exception: There is no supertype for types String, Int64 because some of them are String/FixedString/Enum and some of them are not: JOIN INNER JOIN ... ON PKgFxBXUIkWjxKAV.file_id = avtEcmyNKoXggsXh.file_id...
```

A query like this doesn't work without `persist()` on SQLite:

```python
(
    read_dataset("test")
    .distinct("file.path")
    .group_by(
        cnt=func.count(),
        files=func.collect("file.path"),
        partition_by=("session_id", "position"),
    )
    .persist()
    .filter(C("cnt") > 1)
)
```

It raises:

```python
in/data_storage/sqlite.py", line 242, in execute...
```

bug
priority-p1