Quentin Lhoest
Just did a quick test from one of the current prod machines on this [C4 index.duckdb file](https://huggingface.co/datasets/c4/blob/refs%2Fconvert%2Fduckdb/en/partial-train/index.duckdb) (8GB):
- Download from HF using hf_transfer to local disk: ~40sec
- Copy from shared...
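For reference, a minimal sketch of how such a download could be timed with `huggingface_hub` (the `local_dir` is hypothetical; this assumes the `hf_transfer` package is installed):

```python
import os
import time

# hf_transfer must be enabled before huggingface_hub is imported.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

start = time.perf_counter()
# Fetch the 8GB index.duckdb from the refs/convert/duckdb branch to local disk.
path = hf_hub_download(
    repo_id="c4",
    repo_type="dataset",
    filename="en/partial-train/index.duckdb",
    revision="refs/convert/duckdb",
    local_dir="/tmp/duckdb-index",  # hypothetical local target
)
print(f"downloaded to {path} in {time.perf_counter() - start:.0f}s")
```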
Sounds good, this would provide a fast local disk and fix the /search speed issues. Let's check with someone from infra to validate and check the prices.
Is it possible to say that a job should wait, e.g., 10 min before being run? And if a commit happens in the meantime, the job is deleted and replaced...
(until backfill does its job, maybe?)
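A rough sketch of that debounce idea, with hypothetical names, just to make it concrete: each new commit cancels the pending timer and starts a fresh one, so only the job for the latest commit actually runs.

```python
import threading
from typing import Callable

DELAY_SECONDS = 600  # e.g. wait 10 min before running
PENDING: dict[str, threading.Timer] = {}  # one pending timer per dataset

def schedule_job(dataset: str, run_job: Callable[[str], None]) -> None:
    # A commit arriving in the meantime deletes and replaces the waiting job.
    if (timer := PENDING.pop(dataset, None)) is not None:
        timer.cancel()
    timer = threading.Timer(DELAY_SECONDS, run_job, args=(dataset,))
    PENDING[dataset] = timer
    timer.start()
```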
Maybe we can just add a try/except in compatible-libraries and return something empty + the reason why it's empty
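Something like this, as a rough sketch (the function names and response shape are hypothetical, not the actual step's API):

```python
from typing import Any

def get_compatible_libraries(dataset: str) -> list[str]:
    # Stand-in for the real detection logic of the compatible-libraries step.
    raise NotImplementedError

def compute_compatible_libraries_response(dataset: str) -> dict[str, Any]:
    # On failure, return an empty list plus the reason why it's empty,
    # instead of letting the whole step error out.
    try:
        return {"libraries": get_compatible_libraries(dataset), "error": None}
    except Exception as err:
        return {"libraries": [], "error": str(err)}
```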
What would be the best way to move the duckdb indexes (to a new branch or a new directory)? Shall we increment the version of the duckdb-index job with...
Starting with the monolingual ones sounds like the best idea, since, as you explained, it can be quite complex to handle multilingual datasets. The list of 26 is a good start...
`UnexpectedApiError` for https://huggingface.co/datasets/danielz01/landmarks
```
libcommon.parquet_utils.TooBigRows: Rows from parquet row groups are too big to be read: 958.13 MiB (max=286.10 MiB)
```
Same `UnexpectedApiError` for https://huggingface.co/datasets/osunlp/Mind2Web, where the row group is 564 MB for 100 rows.
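For what it's worth, the oversized row groups can be checked from the parquet metadata with pyarrow, roughly like this (the file path is hypothetical):

```python
import pyarrow.parquet as pq

# Print the number of rows and byte size of each row group in a local parquet file.
meta = pq.ParquetFile("part-00000.parquet").metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 2**20:.1f} MiB")
```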
For the UI, the best option is to truncate, and a bonus would be to let the user click to expand a row.