Rui Meng

Results: 10 issues by Rui Meng

**Describe the bug** I use load_dataset() (I tried with [wiki](https://huggingface.co/datasets/wikipedia) and my own JSON data) and use set_transform/with_transform for preprocessing, but it hangs at the end of the 1st...

bug

I usually cite papers when posting things, and I wish to maintain an independent bib file for each post (so they don't get mixed up with each other). However, I didn't find a...

PR welcome

@francoishernandez @pltrdy I ran into an error after updating to this commit. I ran Transformers on V100 and tried with 1, 2, and 4 GPUs. It works well if **I use apex_opt_level=O0**, but...

type:bug

Hi there, thank you for maintaining this excellent repo of keyphrase datasets. I am checking the dataset 500N-KPCrowd-v1.1 and found some files that are improperly truncated (only...

I found that BEIR consumes a huge amount of memory when evaluating on large datasets such as HotpotQA and MSMARCO. [L73-L78](https://github.com/beir-cellar/beir/blob/1a1e6aba14b6a2ba1f261a003e7c1ea46fd61564/beir/retrieval/search/dense/exact_search.py#L73) could be optimized, since `self.results` keeps accumulating new scores, whereas...
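The fix the issue hints at is to bound the accumulated scores: instead of keeping every query-document score, retain only the top-k per query with a small heap. A sketch of that idea (names like `results` and `top_k` are illustrative, not BEIR's exact code):

```python
# Keep at most top_k (score, doc_id) pairs per query in a bounded min-heap,
# so memory stays O(num_queries * top_k) instead of growing with the corpus.
import heapq

def update_topk(results, query_id, doc_scores, top_k):
    """Merge new (doc_id, score) pairs, retaining at most top_k per query."""
    heap = results.setdefault(query_id, [])
    for doc_id, score in doc_scores:
        if len(heap) < top_k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # new score beats the current worst retained score
            heapq.heapreplace(heap, (score, doc_id))
    return results

results = {}
update_topk(results, "q1", [("d1", 0.2), ("d2", 0.9), ("d3", 0.5)], top_k=2)
print(sorted(results["q1"], reverse=True))  # [(0.9, 'd2'), (0.5, 'd3')]
```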

**Describe the bug** I don't know why using Stage 3 triggers this error: `RuntimeError: ProcessGroup nccl does not support _reduce_scatter_base`. Training with Stage 2 is fine. Is this related to my...

bug
compression

Hi there, great tool! I wonder if it is possible to load/dump configs as a list of nested objects/dataclasses, like the data shown below? It's quite common in ML projects...

question
pending author response
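The config shape asked about above can be sketched with plain dataclasses round-tripped through dicts; the class names here are made up for illustration (the actual tool in the issue is not shown):

```python
# Sketch of "configs as a list of nested objects/dataclasses":
# a nested dataclass dumped to plain dicts/lists and loaded back.
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Optimizer:          # hypothetical nested config object
    name: str
    lr: float

@dataclass
class Experiment:         # hypothetical top-level config
    tag: str
    optimizers: List[Optimizer]

cfg = Experiment(tag="run1",
                 optimizers=[Optimizer("adam", 1e-3), Optimizer("sgd", 0.1)])
dumped = asdict(cfg)                       # dump: nested dicts/lists
loaded = Experiment(tag=dumped["tag"],     # load: manual for this sketch
                    optimizers=[Optimizer(**o) for o in dumped["optimizers"]])
assert loaded == cfg
```

The dumped form is plain dicts and lists, so it serializes directly to YAML/JSON; the load side is what config libraries typically automate.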

Hi, I really like this study for its neat idea and very comprehensive comparisons. I intend to reproduce the results in the paper, but I found something necessary is...

Hi there, I observed that MindSmallReranking is extremely slow (also mentioned in https://github.com/embeddings-benchmark/mteb/issues/381). I checked the data and found a lot of wasted encoding: there are 70,938 records...

enhancement
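The wasted encoding described above can be avoided by encoding each unique text once and mapping results back to the original positions. A sketch, where `encode()` is a stand-in for the real embedding model:

```python
# Encode only unique texts, then scatter the embeddings back so the output
# still lines up with the (duplicate-containing) input list.
def encode(texts):
    # placeholder "embedding": text length stands in for a real model call
    return [len(t) for t in texts]

def encode_dedup(texts):
    unique = list(dict.fromkeys(texts))           # dedupe, preserving order
    index = {t: i for i, t in enumerate(unique)}
    embs = encode(unique)                         # model sees each text once
    return [embs[index[t]] for t in texts]

texts = ["apple", "banana", "apple", "apple"]
print(encode_dedup(texts))  # [5, 6, 5, 5]
```

With heavily duplicated corpora, the model call shrinks from one per record to one per distinct text, which is where the speedup comes from.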

**Feature request** interleave_datasets and [RandomlyCyclingMultiSourcesExamplesIterable](https://github.com/huggingface/datasets/blob/3813ce846e52824b38e53895810682f0a496a2e3/src/datasets/iterable_dataset.py#L816) enable us to sample data examples from different sources. But can we also sample batches in a similar manner (each batch only contains data...

enhancement
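The requested behavior, randomly picking a source per step but yielding a whole batch from that single source so batches stay homogeneous, can be sketched in plain Python (this is illustrative, not the datasets-library implementation being requested):

```python
# Sample whole batches from randomly chosen sources: each yielded batch
# contains examples from exactly one source.
import random

def interleave_batches(sources, batch_size, probabilities, seed=0):
    rng = random.Random(seed)
    iters = [iter(src) for src in sources]
    while True:
        # pick one source for the entire batch
        i = rng.choices(range(len(iters)), weights=probabilities)[0]
        batch = []
        for _ in range(batch_size):
            try:
                batch.append(next(iters[i]))
            except StopIteration:
                return  # stop when the chosen source is exhausted
        yield batch

a = [f"a{i}" for i in range(4)]
b = [f"b{i}" for i in range(4)]
for batch in interleave_batches([a, b], batch_size=2, probabilities=[0.5, 0.5]):
    print(batch)  # every batch comes from a single source
```

A real implementation would also need a policy for exhausted sources (stop vs. resample among the remaining ones); this sketch simply stops.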