Teven

Results: 6 issues by Teven

This adds an option to launch preprocessing from an HF dataset (loaded from an Arrow file for now, as that's the use case on JZ) rather than only from JSON Lines.

Currently, parameter counts in `utils.get_parameters_in_billions` are inaccurate when PP > 1. Tied variables, in particular embedding layers, exist in several copies in the first and last PP stage, which causes...
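
If the inaccuracy comes from counting every copy of a tied variable, one possible fix is to count each tied variable only once across pipeline stages. The sketch below is a toy illustration under that assumption, not the code in `utils.get_parameters_in_billions`: `ParamRecord`, its `tie_key` field, and `count_unique_parameters` are hypothetical names introduced here.

```python
from typing import Iterable, NamedTuple, Optional


class ParamRecord(NamedTuple):
    """Hypothetical description of one parameter tensor on one pipeline stage."""
    name: str
    numel: int
    tie_key: Optional[str] = None  # shared by all copies of a tied variable


def count_unique_parameters(stages: Iterable[Iterable[ParamRecord]]) -> int:
    """Sum parameter sizes across stages, counting each tied variable once."""
    seen_ties = set()
    total = 0
    for stage in stages:
        for param in stage:
            if param.tie_key is not None:
                if param.tie_key in seen_ties:
                    continue  # an earlier stage already contributed this tied copy
                seen_ties.add(param.tie_key)
            total += param.numel
    return total


# Toy example: a tied embedding stored on both the first and the last stage.
first_stage = [ParamRecord("embed", 1000, tie_key="embed"), ParamRecord("layer0", 500)]
last_stage = [ParamRecord("layer1", 500), ParamRecord("lm_head", 1000, tie_key="embed")]

# Naive sum over all stages counts the tied embedding twice.
naive_total = sum(p.numel for stage in (first_stage, last_stage) for p in stage)
unique_total = count_unique_parameters([first_stage, last_stage])
```

In a real pipeline-parallel setup the per-stage counts live on different ranks, so the deduplication would have to happen before or during the cross-rank reduction; the sketch leaves that part out.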

## Describe the bug In offline mode, one can still access previously-cached datasets. This fails with datasets created with `push_to_hub`. ## Steps to reproduce the bug in Python: ``` import...

bug

# Summary Building from source fails with `swig error : Unrecognized option -doxygen`. Weirdly, running `swig -doxygen -python -c++ -I.. faiss/python/swigfaiss.swig` works. This is swig 4.0.2. Full traceback: ``` $...

install
unconfirmed-bug

Hey, the new TEKGEN files seem to be JsonLines, much like the new KELM files. Maybe their file extensions (xxx.tsv) should reflect that, and be .jsonl instead?
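
A quick way to check whether a file's content matches its extension is to try parsing the first lines as JSON. This is only a rough heuristic sketch; `looks_like_jsonl` and the sample lines are made up for illustration and are not taken from the TEKGEN or KELM files.

```python
import json


def looks_like_jsonl(lines, max_lines=100):
    """Return True if every non-empty sampled line parses as a JSON object or array."""
    checked = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            return False
        # A bare number or string is valid JSON but is more likely a TSV cell.
        if not isinstance(value, (dict, list)):
            return False
        checked += 1
        if checked >= max_lines:
            break
    return checked > 0


# Made-up sample lines, one JSON Lines style and one tab-separated.
jsonl_sample = ['{"a": 1}', '{"b": 2}']
tsv_sample = ["a\tb", "1\t2"]

jsonl_result = looks_like_jsonl(jsonl_sample)
tsv_result = looks_like_jsonl(tsv_sample)
```

To run it on a real file, pass an open file handle, for example `looks_like_jsonl(open(path, encoding="utf-8"))`, where `path` is whichever file you want to inspect.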

### Describe the bug When I try to load parquet files that were processed with Spark, I get the following issue: `ValueError: Arrow type map does not have a datasets...