Quentin Lhoest
Just did a quick test from one of the current prod machines on this [C4 index.duckdb file](https://huggingface.co/datasets/c4/blob/refs%2Fconvert%2Fduckdb/en/partial-train/index.duckdb) (8GB):
- Download from HF using hf_transfer to local disk: ~40sec
- Copy from shared...
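For reference, a minimal sketch of how such a download could be timed with `huggingface_hub` (the `local_dir` is hypothetical; this assumes the `hf_transfer` package is installed):

```python
import os
import time

# hf_transfer must be enabled before huggingface_hub is imported.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

start = time.perf_counter()
# Fetch the 8GB index.duckdb from the refs/convert/duckdb branch to local disk.
path = hf_hub_download(
    repo_id="c4",
    repo_type="dataset",
    filename="en/partial-train/index.duckdb",
    revision="refs/convert/duckdb",
    local_dir="/tmp/duckdb-index",  # hypothetical local target
)
print(f"downloaded to {path} in {time.perf_counter() - start:.0f}s")
```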
Sounds good, this would provide a fast local disk and fix the /search speed issues. Let's check with someone from infra to validate and check the prices.
Is it possible to say that a job should wait, e.g., 10 min before being run? And if a commit happens in the meantime, the job is deleted and replaced...
(until backfill does its job, maybe?)
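A rough sketch of that debounce idea, with hypothetical names, just to make it concrete: each new commit cancels the pending timer and starts a fresh one, so only the job for the latest commit actually runs.

```python
import threading
from typing import Callable

DELAY_SECONDS = 600  # e.g. wait 10 min before running
PENDING: dict[str, threading.Timer] = {}  # one pending timer per dataset

def schedule_job(dataset: str, run_job: Callable[[str], None]) -> None:
    # A commit arriving in the meantime deletes and replaces the waiting job.
    if (timer := PENDING.pop(dataset, None)) is not None:
        timer.cancel()
    timer = threading.Timer(DELAY_SECONDS, run_job, args=(dataset,))
    PENDING[dataset] = timer
    timer.start()
```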
Maybe we can just add a try/except in compatible-libraries and return something empty + the reason why it's empty
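Something like this, as a rough sketch (the function names and response shape are hypothetical, not the actual step's API):

```python
from typing import Any

def get_compatible_libraries(dataset: str) -> list[str]:
    # Stand-in for the real detection logic of the compatible-libraries step.
    raise NotImplementedError

def compute_compatible_libraries_response(dataset: str) -> dict[str, Any]:
    # On failure, return an empty list plus the reason why it's empty,
    # instead of letting the whole step error out.
    try:
        return {"libraries": get_compatible_libraries(dataset), "error": None}
    except Exception as err:
        return {"libraries": [], "error": str(err)}
```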
What would be the best way to move the duckdb indexes (to a new branch or a new directory)? Shall we increment the version of the duckdb-index job with...
Starting with the monolingual ones sounds like the best idea, since, as you explained, it can be quite complex to handle multilingual datasets. The list of 26 is a good start...
`UnexpectedApiError` for https://huggingface.co/datasets/danielz01/landmarks
```
libcommon.parquet_utils.TooBigRows: Rows from parquet row groups are too big to be read: 958.13 MiB (max=286.10 MiB)
```
Same `UnexpectedApiError` for https://huggingface.co/datasets/osunlp/Mind2Web, where the row group is 564 MB for 100 rows.
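For what it's worth, the oversized row groups can be checked from the parquet metadata with pyarrow, roughly like this (the file path is hypothetical):

```python
import pyarrow.parquet as pq

# Print the number of rows and byte size of each row group in a local parquet file.
meta = pq.ParquetFile("part-00000.parquet").metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 2**20:.1f} MiB")
```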
For the UI, the best option is to truncate, and a bonus would be to let the user click to expand a row.