
Poor performance for reading Numpy

Open ZHUI opened this issue 10 months ago • 7 comments

System Info

I tested a 7B model with fp32 weights stored in numpy format, and found that loading is more than 50% slower with safetensors than with pickle!

-rw-r--r-- 1 root root 3.6G Apr  2 16:26 checkpoint-12/model-00001-of-00008.pdparams
-rw-r--r-- 1 root root 3.6G Apr  2 16:32 checkpoint-12/model-00001-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr  2 16:26 checkpoint-12/model-00002-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr  2 16:32 checkpoint-12/model-00002-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr  2 16:27 checkpoint-12/model-00003-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr  2 16:32 checkpoint-12/model-00003-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr  2 16:27 checkpoint-12/model-00004-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr  2 16:32 checkpoint-12/model-00004-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr  2 16:27 checkpoint-12/model-00005-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr  2 16:32 checkpoint-12/model-00005-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr  2 16:27 checkpoint-12/model-00006-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr  2 16:33 checkpoint-12/model-00006-of-00008.safetensors
-rw-r--r-- 1 root root 3.1G Apr  2 16:28 checkpoint-12/model-00007-of-00008.pdparams
-rw-r--r-- 1 root root 3.1G Apr  2 16:33 checkpoint-12/model-00007-of-00008.safetensors
-rw-r--r-- 1 root root 3.6G Apr  2 16:28 checkpoint-12/model-00008-of-00008.pdparams
-rw-r--r-- 1 root root 3.6G Apr  2 16:33 checkpoint-12/model-00008-of-00008.safetensors
-rw-r--r-- 1 root root  25K Apr  2 16:10 checkpoint-12/model.safetensors.index.json

Load times (seconds):

sf model-00001-of-00008.safetensors 3.4121220111846924
pk model-00001-of-00008.safetensors 2.195117473602295
sf model-00002-of-00008.safetensors 3.004627227783203
pk model-00002-of-00008.safetensors 1.9284331798553467
sf model-00003-of-00008.safetensors 2.887206792831421
pk model-00003-of-00008.safetensors 1.8887608051300049
sf model-00004-of-00008.safetensors 2.8507916927337646
pk model-00004-of-00008.safetensors 2.080396890640259
sf model-00005-of-00008.safetensors 2.830484390258789
pk model-00005-of-00008.safetensors 1.8540270328521729
sf model-00006-of-00008.safetensors 2.8113412857055664
pk model-00006-of-00008.safetensors 1.916459321975708
sf model-00007-of-00008.safetensors 2.8474719524383545
pk model-00007-of-00008.safetensors 1.835508108139038
sf model-00008-of-00008.safetensors 3.3513264656066895
pk model-00008-of-00008.safetensors 2.2008140087127686

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Reproduction

Python 3.10, safetensors 0.4.2

Expected behavior

$ cat test_safetensor.py

import pickle
import time

import numpy as np
from safetensors.numpy import load_file, save_file

w = np.random.randn(256, 1024, 1024)
state_dict = {"weight": w}

fp = "test_file.safetensors"
t1 = time.time()
save_file(state_dict, fp)
print("sf save:", time.time() - t1)

t1 = time.time()
state = load_file(fp)
print("sf load:", time.time() - t1)


fp = fp.replace(".safetensors", ".pickle")
t1 = time.time()
with open(fp, "wb") as f:
    pickle.dump(state_dict, f)
print("pickle save:", time.time() - t1)

t1 = time.time()
with open(fp, "rb") as f:
    state = pickle.load(f)
print("pickle load:", time.time() - t1)

results:

sf save: 2.818842887878418
sf load: 1.8608193397521973
pickle save: 2.3684301376342773
pickle load: 1.004188060760498

ZHUI avatar Apr 02 '24 09:04 ZHUI

@Narsil please help

ZHUI avatar Apr 02 '24 09:04 ZHUI

sf save: 2.6092689037323
sf load: 1.7551813125610352
sf load2: 2.1516408920288086
pickle save: 2.2464730739593506
pickle load: 0.9010257720947266

The load API is even slower than load_file:

import pickle
import time

import numpy as np
from safetensors.numpy import load, load_file, save_file

w = np.empty([256, 1024, 1024])
state_dict = {"weight": w}

fp = "test_file.safetensors"
t1 = time.time()
save_file(state_dict, fp)
print("sf save:", time.time() - t1)

t1 = time.time()
state = load_file(fp)
print("sf load:", time.time() - t1)


t1 = time.time()
with open(fp, "rb") as f:
    data = f.read()
loaded = load(data)
print("sf load2:", time.time() - t1)

fp = fp.replace(".safetensors", ".pickle")
t1 = time.time()
with open(fp, "wb") as f:
    pickle.dump(state_dict, f)
print("pickle save:", time.time() - t1)

t1 = time.time()
with open(fp, "rb") as f:
    state = pickle.load(f)
print("pickle load:", time.time() - t1)
    

ZHUI avatar Apr 02 '24 09:04 ZHUI

@mishig25 can you give some help?

ZHUI avatar Apr 02 '24 10:04 ZHUI

@LysandreJik can you give some help?

ZHUI avatar Apr 03 '24 05:04 ZHUI

For the load_file API

The core problem is that memcpy from mmap'd memory is very slow. See:

https://stackoverflow.com/questions/52845387/improving-mmap-memcpy-file-read-performance

In my case, open(filename); f.read() runs at 2 GB/s, while memcpy(mmap(filename)) runs at 1.3 GB/s, which is much slower than a plain file read.

Can we have a faster way to do this?
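The read-vs-mmap gap can be reproduced with a small standalone sketch (the file name and the 512 MiB size are made up for illustration; absolute throughput depends heavily on OS, disk, and page-cache state):

```python
import mmap
import os
import tempfile
import time

# Hypothetical test file (512 MiB, sparse); any large file would do.
fp = os.path.join(tempfile.gettempdir(), "mmap_vs_read_test.bin")
with open(fp, "wb") as f:
    f.truncate(512 * 1024 * 1024)
size = os.path.getsize(fp)

# Plain buffered read: the kernel streams the file into one bytes object.
t1 = time.time()
with open(fp, "rb") as f:
    data = f.read()
print(f"read():      {size / (time.time() - t1) / 1e9:.2f} GB/s")

# mmap + full copy: pages are faulted in on first touch and then copied,
# which is what memcpy(mmap(filename)) amounts to.
t1 = time.time()
with open(fp, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    data2 = m[:]
print(f"mmap + copy: {size / (time.time() - t1) / 1e9:.2f} GB/s")

os.remove(fp)
```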

For the load API

There is an additional MEM->MEM copy!

f.read() already copies the file into memory, but the load API then uses PyByteArray::new, which causes an additional MEM->MEM copy:

https://github.com/huggingface/safetensors/blob/b947b59079a6197d7930dfb535818ac4896113e8/bindings/python/src/lib.rs#L158
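The cost of that extra copy can be estimated with a standalone sketch (the 1 GiB size is illustrative, and bytearray(data) is used here only as a stand-in for the copy PyByteArray::new performs):

```python
import time

# A 1 GiB zero-filled buffer standing in for the bytes that f.read() produced.
size = 1024 * 1024 * 1024
data = bytes(size)

# One full MEM->MEM copy of the buffer, analogous to the extra copy made
# when the bytes are handed to load().
t1 = time.time()
copy = bytearray(data)
dt = time.time() - t1
print(f"extra {size / 1e9:.1f} GB copy: {dt:.3f}s ({size / dt / 1e9:.2f} GB/s)")
```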

Advice

Can we support loading a file without the additional MEM->MEM copy? If memcpy + mmap is unavoidable, can we offer an alternative?

ZHUI avatar Apr 10 '24 03:04 ZHUI

> In my case, open(filename); f.read() runs at 2 GB/s, while memcpy(mmap(filename)) runs at 1.3 GB/s, which is much slower than a plain file read.

Something is wrong in your system. What are you using? Windows + WSL is a usual culprit for very poor mmap support/performance. HDDs are also a big source of it, although they are much less commonplace these days.

In order to make things "fast" we could always skip a few things, but that makes things unsafe (necessarily, since Python doesn't have ownership semantics).

Pyo3 0.21 could enable something a bit faster though since we could skip the rust owned version of the tensors.

Narsil avatar Apr 15 '24 10:04 Narsil

see https://github.com/PyO3/pyo3/issues/4058#issuecomment-2046471081

https://stackoverflow.com/questions/52845387/improving-mmap-memcpy-file-read-performance

My OS is Ubuntu 18.04. You can test it yourself using the scripts above.

There are some suggestions to use madvise(..., MADV_SEQUENTIAL): https://github.com/PyO3/pyo3/issues/4058#issuecomment-2048119528
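For reference, the same hint can be applied from Python when mapping a file manually (Python >= 3.8; MADV_SEQUENTIAL is only exposed on platforms that support it, and the 16 MiB demo file here is a made-up stand-in for a real .safetensors file):

```python
import mmap
import os
import tempfile

# Hypothetical stand-in file (16 MiB of zeros).
fp = os.path.join(tempfile.gettempdir(), "madvise_demo.bin")
with open(fp, "wb") as f:
    f.write(b"\0" * (16 * 1024 * 1024))

with open(fp, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    # Hint the kernel that the mapping will be scanned front-to-back so it
    # can do aggressive readahead; guarded because the constant (and the
    # madvise method) only exist on supporting platforms such as Linux.
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        m.madvise(mmap.MADV_SEQUENTIAL)
    data = m[:]  # sequential copy of the whole mapping

os.remove(fp)
print(len(data))
```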

ZHUI avatar Apr 15 '24 11:04 ZHUI

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 16 '24 01:05 github-actions[bot]

Still a big problem.

ZHUI avatar May 16 '24 03:05 ZHUI

Is the Rust component handling the file IO using a buffered or unbuffered read? Last I looked, I don't think it was buffered.

I am seeing very poor read performance from an enterprise U.2 SSD on Windows 11, possibly triggered by the Rust code issuing a lot of small read transactions on a large file (instead of fewer large ones), likely made worse by a software encryption layer that handles a large number of very small IOs relatively poorly.
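The effect of read transaction size can be sketched with unbuffered chunked reads (file name and sizes are illustrative; real numbers depend on the storage stack and any encryption layer in between):

```python
import os
import tempfile
import time

# Hypothetical 128 MiB test file.
fp = os.path.join(tempfile.gettempdir(), "chunk_read_test.bin")
with open(fp, "wb") as f:
    f.truncate(128 * 1024 * 1024)
size = os.path.getsize(fp)

def timed_read(chunk_size):
    """Read the whole file in fixed-size chunks; buffering=0 gives roughly
    one read syscall per chunk, so small chunks mean many small IOs."""
    total = 0
    t1 = time.time()
    with open(fp, "rb", buffering=0) as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    assert total == size
    return time.time() - t1

# Many small IOs (4 KiB) vs few large IOs (8 MiB).
for chunk_size in (4 * 1024, 8 * 1024 * 1024):
    dt = timed_read(chunk_size)
    print(f"chunk {chunk_size:>8} B: {size / dt / 1e9:.2f} GB/s")

os.remove(fp)
```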

brendanhoar avatar Jun 08 '24 15:06 brendanhoar

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 09 '24 01:07 github-actions[bot]