Quentin Lhoest

416 comments by Quentin Lhoest

Just did a quick test from one of the current prod machines on this [C4 index.duckdb file](https://huggingface.co/datasets/c4/blob/refs%2Fconvert%2Fduckdb/en/partial-train/index.duckdb) (8GB). Download from HF using hf_transfer to local disk: ~40sec. Copy from shared...

Sounds good, this would provide a fast local disk and fix the /search speed issues. Let's check with someone from infra to validate and check the prices.

Is it possible to say that a job should wait e.g. 10min before being run? And if a commit happens in the meantime, the job is deleted and replaced...
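The debounce idea above could be sketched like this — a minimal, hypothetical `DebouncedJobQueue` (all names are mine, not the real job queue API), where scheduling a job for a dataset that already has one pending cancels and replaces it:

```python
import threading

class DebouncedJobQueue:
    """Sketch: a job waits `delay` seconds before running; a new commit
    for the same dataset cancels the pending job and replaces it."""

    def __init__(self, delay: float):
        self.delay = delay
        self._pending = {}  # dataset -> pending threading.Timer
        self._lock = threading.Lock()

    def schedule(self, dataset: str, job) -> None:
        with self._lock:
            # A newer commit deletes and replaces the job still waiting.
            old = self._pending.pop(dataset, None)
            if old is not None:
                old.cancel()
            timer = threading.Timer(self.delay, self._run, args=(dataset, job))
            self._pending[dataset] = timer
            timer.start()

    def _run(self, dataset: str, job) -> None:
        with self._lock:
            self._pending.pop(dataset, None)
        job()
```

For example, with `delay=600` a burst of commits within 10 minutes would result in a single job run for the latest commit.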

(until backfill does its job maybe ?)

Maybe we can just add a `try`/`except` in compatible-libraries and return something empty + the reason why it's empty.
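A minimal sketch of that fallback, assuming a compute function for the compatible-libraries job (both function names here are hypothetical, not the real datasets-server internals):

```python
def compute_compatible_libraries(dataset: str) -> dict:
    """Hypothetical stand-in for the real compatible-libraries computation."""
    raise ValueError("unsupported dataset structure")

def compatible_libraries_response(dataset: str) -> dict:
    # On failure, return an empty result plus the reason it's empty,
    # instead of letting the job error out.
    try:
        return compute_compatible_libraries(dataset)
    except Exception as err:
        return {"libraries": [], "error": f"{type(err).__name__}: {err}"}
```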

What would be the best way to move the duckdb indexes (to a new branch or a new directory)? Shall we increment the version of the duckdb-index job with...

Starting with the monolingual datasets sounds like the best idea, since, as you explained, it can be quite complex to handle multilingual datasets. The list of 26 is a good start...

`UnexpectedApiError` for https://huggingface.co/datasets/danielz01/landmarks

```
libcommon.parquet_utils.TooBigRows: Rows from parquet row groups are too big to be read: 958.13 MiB (max=286.10 MiB)
```

Same `UnexpectedApiError` for https://huggingface.co/datasets/osunlp/Mind2Web; the row group is 564MB for 100 rows.
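The check behind these errors can be sketched as follows — a simplified version of the `TooBigRows` guard, using the 286.10 MiB limit from the error message above (in practice the byte size would come from parquet row-group metadata, e.g. pyarrow's `ParquetFile.metadata.row_group(i).total_byte_size`; this standalone function is my own illustration):

```python
MAX_ROW_GROUP_BYTES = int(286.10 * 1024**2)  # limit quoted in the error above

def check_row_group(total_byte_size: int, max_bytes: int = MAX_ROW_GROUP_BYTES) -> None:
    """Raise when a parquet row group is too big to be read safely."""
    if total_byte_size > max_bytes:
        raise ValueError(
            "Rows from parquet row groups are too big to be read: "
            f"{total_byte_size / 1024**2:.2f} MiB "
            f"(max={max_bytes / 1024**2:.2f} MiB)"
        )
```

A 564MB row group, as in the Mind2Web case, trips this check even though it only holds 100 rows — the limit is on bytes, not row count.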

For the UI, the best option is to truncate, and a bonus would be to let the user click to expand a row.
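A minimal sketch of that truncation, assuming the UI gets cell text plus a flag it can use to render a click-to-expand control (function name and response shape are hypothetical):

```python
def truncate_cell(value: str, max_length: int = 100) -> dict:
    """Truncate long cell text; flag it so the UI can offer click-to-expand."""
    if len(value) <= max_length:
        return {"text": value, "truncated": False}
    return {"text": value[:max_length] + "…", "truncated": True}
```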