Poor performance for reading NumPy
System Info
I tested a 7B model with fp32 weights, stored in NumPy format. I found that, compared with pickle, safetensors loading is more than 50% slower!
-rw-r--r-- 1 root root 3.6G Apr 2 16:26 checkpoint-12/model-00001-of-00008.pdparams
-rw-r--r-- 1 root root 3.6G Apr 2 16:32 checkpoint-12/model-00001-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr 2 16:26 checkpoint-12/model-00002-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr 2 16:32 checkpoint-12/model-00002-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr 2 16:27 checkpoint-12/model-00003-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr 2 16:32 checkpoint-12/model-00003-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr 2 16:27 checkpoint-12/model-00004-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr 2 16:32 checkpoint-12/model-00004-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr 2 16:27 checkpoint-12/model-00005-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr 2 16:32 checkpoint-12/model-00005-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr 2 16:27 checkpoint-12/model-00006-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr 2 16:33 checkpoint-12/model-00006-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr 2 16:28 checkpoint-12/model-00007-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr 2 16:33 checkpoint-12/model-00007-of-00008.safetensors
-rw-r--r-- 1 root root 3.6G Apr 2 16:28 checkpoint-12/model-00008-of-00008.pdparams
-rw-r--r-- 1 root root 3.6G Apr 2 16:33 checkpoint-12/model-00008-of-00008.safetensors
-rw-r--r-- 1 root root 25K Apr 2 16:10 checkpoint-12/model.safetensors.index.json
Time usage (seconds):
sf model-00001-of-00008.safetensors 3.4121220111846924
pk model-00001-of-00008.safetensors 2.195117473602295
sf model-00002-of-00008.safetensors 3.004627227783203
pk model-00002-of-00008.safetensors 1.9284331798553467
sf model-00003-of-00008.safetensors 2.887206792831421
pk model-00003-of-00008.safetensors 1.8887608051300049
sf model-00004-of-00008.safetensors 2.8507916927337646
pk model-00004-of-00008.safetensors 2.080396890640259
sf model-00005-of-00008.safetensors 2.830484390258789
pk model-00005-of-00008.safetensors 1.8540270328521729
sf model-00006-of-00008.safetensors 2.8113412857055664
pk model-00006-of-00008.safetensors 1.916459321975708
sf model-00007-of-00008.safetensors 2.8474719524383545
pk model-00007-of-00008.safetensors 1.835508108139038
sf model-00008-of-00008.safetensors 3.3513264656066895
pk model-00008-of-00008.safetensors 2.2008140087127686
Information
- [X] The official example scripts
- [X] My own modified scripts
Reproduction
Python 3.10, safetensors 0.4.2
Expected behavior
$ cat test_safetensor.py
import pickle
import time

import numpy as np
from safetensors.numpy import load_file, save_file

w = np.random.randn(256, 1024, 1024)  # ~2 GiB of float64 data
state_dict = {"weight": w}

fp = "test_file.safetensors"
t1 = time.time()
save_file(state_dict, fp)
print("sf save:", time.time() - t1)

t1 = time.time()
state = load_file(fp)
print("sf load:", time.time() - t1)

fp = fp.replace(".safetensors", ".pickle")
t1 = time.time()
with open(fp, "wb") as f:
    pickle.dump(state_dict, f)  # pickle.dump returns None; don't assign it
print("pickle save:", time.time() - t1)

t1 = time.time()
with open(fp, "rb") as f:
    state = pickle.load(f)
print("pickle load:", time.time() - t1)
Results:
sf save: 2.818842887878418
sf load: 1.8608193397521973
pickle save: 2.3684301376342773
pickle load: 1.004188060760498
@Narsil please help
sf save: 2.6092689037323
sf load: 1.7551813125610352
sf load2: 2.1516408920288086
pickle save: 2.2464730739593506
pickle load: 0.9010257720947266
The load API is even slower than load_file:
import pickle
import time

import numpy as np
from safetensors.numpy import load, load_file, save_file

w = np.empty([256, 1024, 1024])  # uninitialized float64, same size as before
state_dict = {"weight": w}

fp = "test_file.safetensors"
t1 = time.time()
save_file(state_dict, fp)
print("sf save:", time.time() - t1)

t1 = time.time()
state = load_file(fp)
print("sf load:", time.time() - t1)

t1 = time.time()
with open(fp, "rb") as f:
    data = f.read()
loaded = load(data)
print("sf load2:", time.time() - t1)

fp = fp.replace(".safetensors", ".pickle")
t1 = time.time()
with open(fp, "wb") as f:
    pickle.dump(state_dict, f)
print("pickle save:", time.time() - t1)

t1 = time.time()
with open(fp, "rb") as f:
    state = pickle.load(f)
print("pickle load:", time.time() - t1)
@mishig25 can you give some help?
@LysandreJik can you give some help?
For the load_file API

The core problem is that memcpy from mmap'd memory is very slow; see:
https://stackoverflow.com/questions/52845387/improving-mmap-memcpy-file-read-performance

In my case, open(filename); f.read() reaches 2 GB/s, while memcpy(mmap(filename)) reaches only 1.3 GB/s, which is much slower than reading the file directly. Can we have a faster way to load?
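For reference, a minimal sketch (not from the original report) that reproduces this comparison: plain f.read() throughput versus copying the same file out of an mmap. It assumes test_file.safetensors exists from the script above; note that after the first run the OS page cache, not the disk, is what gets measured.

import mmap
import time

fp = "test_file.safetensors"  # assumed to exist from the earlier script

t1 = time.time()
with open(fp, "rb") as f:
    data = f.read()  # one buffered read into a fresh bytes object
print("read():        %.2f GB/s" % (len(data) / (time.time() - t1) / 1e9))

t1 = time.time()
with open(fp, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    copied = bytes(mm)  # forces a memcpy of every mapped page
    mm.close()
print("mmap + memcpy: %.2f GB/s" % (len(copied) / (time.time() - t1) / 1e9))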
For the load API

There is an additional MEM->MEM copy! f.read() already copies the file into memory, but the load API uses PyByteArray::new, which causes one more MEM->MEM copy:
https://github.com/huggingface/safetensors/blob/b947b59079a6197d7930dfb535818ac4896113e8/bindings/python/src/lib.rs#L158
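To illustrate that the extra copy is avoidable in principle, here is a minimal sketch (not the library's API, and skipping all the validation the real implementation performs) that parses the safetensors layout by hand and returns NumPy views into the caller's buffer via np.frombuffer. The dtype table is a partial assumption covering common entries only.

import json
import struct

import numpy as np

# Partial dtype map; the real format defines more entries.
_DTYPES = {"F64": np.float64, "F32": np.float32, "F16": np.float16,
           "I64": np.int64, "I32": np.int32, "U8": np.uint8}

def load_zero_copy(data):
    # Layout: 8-byte little-endian header length, JSON header, raw tensor bytes.
    (header_len,) = struct.unpack("<Q", data[:8])
    header = json.loads(data[8:8 + header_len])
    start = 8 + header_len
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        begin, end = info["data_offsets"]  # offsets relative to the data section
        buf = memoryview(data)[start + begin:start + end]
        # frombuffer returns a read-only view; no MEM->MEM copy happens here.
        tensors[name] = np.frombuffer(buf, dtype=_DTYPES[info["dtype"]]).reshape(info["shape"])
    return tensors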
Advice

Can we support loading a file without the additional MEM->MEM copy? If memcpy + mmap is unavoidable, can we have an alternative?
In my case, open(filename); f.read() reaches 2 GB/s, while memcpy(mmap(filename)) reaches only 1.3 GB/s, which is much slower than reading the file directly.
Something is wrong with your system; what are you using? Windows + WSL is a usual culprit for very poor mmap support/performance. HDDs are also a big source of it, although they are much less commonplace these days.
In order to make things "fast" we could always skip a few things, but that makes things unsafe (necessarily, since Python doesn't have ownership semantics). PyO3 0.21 could enable something a bit faster, though, since we could skip the Rust-owned version of the tensors; see https://github.com/PyO3/pyo3/issues/4058#issuecomment-2046471081
https://stackoverflow.com/questions/52845387/improving-mmap-memcpy-file-read-performance

My OS is Ubuntu 18.04; you can test it with the scripts above.

There are some suggestions for using madvise(..., MADV_SEQUENTIAL):
https://github.com/PyO3/pyo3/issues/4058#issuecomment-2048119528
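A minimal sketch of that suggestion from Python (assuming Linux and Python 3.8+, where mmap.madvise and mmap.MADV_SEQUENTIAL are available): hint the kernel about the access pattern before copying the mapping out.

import mmap

fp = "test_file.safetensors"  # assumed from the earlier scripts

with open(fp, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mm.madvise(mmap.MADV_SEQUENTIAL)  # tell the kernel we will read front-to-back
    data = bytes(mm)                  # the memcpy the thread is measuring
    mm.close()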
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Still a big problem.
Is the Rust component handling the file IO using a buffered or unbuffered read? Last I looked, I don't think it was a buffered read.

I am seeing very poor read performance from an enterprise U.2 SSD on Windows 11, possibly triggered by the Rust code doing a lot of small read transactions on a large file (instead of fewer large read transactions), likely made worse by a software encryption layer that handles a large number of very small IOs relatively poorly.
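One way to test that hypothesis (a hypothetical probe, not from this thread): read the same file with unbuffered IO at several request sizes and compare throughput. If small requests are the bottleneck, the small-chunk runs will be dramatically slower. As with the earlier benchmarks, the OS page cache makes repeat runs measure memory rather than disk.

import time

fp = "test_file.safetensors"  # assumed from the earlier scripts

for chunk in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    t1 = time.time()
    total = 0
    with open(fp, "rb", buffering=0) as f:  # buffering=0: each read() is one OS call
        while True:
            block = f.read(chunk)
            if not block:
                break
            total += len(block)
    gbps = total / (time.time() - t1) / 1e9
    print("chunk %8d KiB: %.2f GB/s" % (chunk // 1024, gbps))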