srobertjames
Is there a workaround for this? MSVC, instead of linking to `scanf`, seems to emit a simple wrapper around `vfscanf` and include that wrapper in the binary (as if it was...
See https://docs.microsoft.com/en-us/previous-versions/dn727675(v=vs.140), which says that `__stdio_common_vfscanf` is "used to implement the CRT".
I'll add an alternative: `firejail` works this way, but it offers much weaker security and, I believe, worse performance than firecracker.
This all makes sense. Would it be possible to include a sample script to do that? This would be very useful for many, and would help those new to firecracker...
Are you able to reproduce this? My example is small enough that it should be easy to try.
Wow, adding `return_tensors="np"` sped up my example by a **factor of 17x** and completely eliminated the casting! I'd recommend not only documenting it, but making it the default....
@lhoestq I just benchmarked the two edits to `features.py` above, and they appear to solve the problem, bringing my original example to within 20% of the speed of the `"np"` output...
Is time spent casting an issue here? See https://github.com/huggingface/datasets/issues/4676, which shows that Datasets can spend huge amounts of time repeatedly casting to Python objects.
I've built a minimal example that shows this bug without `n_proc`. It seems to be a problem with any combination of **tokenizers, `overflow_to_sample_mapping`, and `Dataset.map` with a small batch size**:...
A larger batch size does _not_ have this behavior:
```
def tok2(d):
    return tok(d['question'], d['context'])

ds = datasets.Dataset.from_dict({'question': questions, 'context': contexts})
tokens = ds.map(tok2, batched=True, batch_size=2)
print(tokens['overflow_to_sample_mapping'])
assert tokens['overflow_to_sample_mapping'] ==...
```
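
The comments above are cut off, so the exact failure isn't shown here. As a rough sketch of one way a batch-size dependence like this could arise, assume the mapping is computed relative to each batch rather than to the whole dataset. `chunk_batch` and `map_batched` below are hypothetical stand-ins written for illustration, not the tokenizers or Datasets API:

```python
from typing import Dict, List


def chunk_batch(texts: List[str], max_len: int = 4) -> Dict[str, list]:
    """Hypothetical stand-in for a tokenizer that returns overflowing chunks.

    Each text is split into windows of at most `max_len` words. For every
    window we record the index of its source text *within this batch*.
    """
    chunks, mapping = [], []
    for i, text in enumerate(texts):
        words = text.split()
        for start in range(0, len(words), max_len):
            chunks.append(words[start:start + max_len])
            mapping.append(i)
    return {"chunks": chunks, "overflow_to_sample_mapping": mapping}


def map_batched(texts: List[str], batch_size: int, with_offset: bool = False) -> List[int]:
    """Hypothetical stand-in for a batched map that concatenates per-batch results."""
    out: List[int] = []
    for start in range(0, len(texts), batch_size):
        result = chunk_batch(texts[start:start + batch_size])
        # Optionally shift batch-relative indices to dataset-wide indices.
        offset = start if with_offset else 0
        out.extend(offset + i for i in result["overflow_to_sample_mapping"])
    return out


texts = ["a b c d e f", "g h i j k l m n o"]
print(map_batched(texts, batch_size=1))                    # [0, 0, 0, 0, 0]
print(map_batched(texts, batch_size=2))                    # [0, 0, 1, 1, 1]
print(map_batched(texts, batch_size=1, with_offset=True))  # [0, 0, 1, 1, 1]
```

Under that assumption, a batch size of 1 makes every chunk point at sample 0, while a batch that covers the whole dataset gives dataset-wide indices. If the real issue has the same cause, something similar could probably be done with `Dataset.map(..., with_indices=True)`, which passes row indices to the mapped function.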