TensorNVMe
A Python library that provides APIs to move PyTorch tensors between CPU and NVMe.
Dependencies
liburing
libaio
Install
This package is only supported on Linux. liburing and libaio can be installed automatically. liburing requires Linux kernel >= 5.10 and will not be installed if your kernel version is below 5.10.
The installer searches for libaio and liburing in /usr/lib, /usr/lib64 and $LD_LIBRARY_PATH. If they are not found, the backends will be installed in ~/.tensornvme, and ~/.bashrc will be modified to set $LD_LIBRARY_PATH correctly. Please source ~/.bashrc after installation. If you use another shell, make sure $LD_LIBRARY_PATH is set correctly.
You must install pytorch and cmake before installing tensornvme. If you upgrade pytorch, remember to reinstall tensornvme.
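A minimal pre-install sanity check of these prerequisites (a sketch; it only verifies that torch is importable and cmake is on your PATH, not that the build itself will succeed):

import shutil
import torch  # must be installed before building tensornvme

print(torch.__version__)      # tensornvme will be built against this version
print(shutil.which("cmake"))  # should print a path, not None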
From source
git clone https://github.com/hpcaitech/TensorNVMe.git && cd TensorNVMe
First, install requirements:
pip install -r requirements.txt
To install tensornvme with liburing and libaio:
pip install -v --no-cache-dir .
To install tensornvme with only liburing:
DISABLE_AIO=1 pip install -v --no-cache-dir .
To install tensornvme with only libaio:
DISABLE_URING=1 pip install -v --no-cache-dir .
If you want to install libaio or liburing system-wide:
WITH_ROOT=1 sudo pip install -v --no-cache-dir .
Then they will be installed in /usr and ~/.bashrc will not be modified. Make sure you have root access.
From PIP
pip install packaging
pip install tensornvme
The environment variables accepted are the same as when installing from source (e.g. DISABLE_AIO, DISABLE_URING and WITH_ROOT).
Use Docker
git clone https://github.com/hpcaitech/TensorNVMe.git && cd TensorNVMe/docker && docker build -t tensornvme .
CLI
We provide a CLI to check whether the backends work correctly.
tensornvme check
Usage
It provides both synchronous and asynchronous I/O APIs.
Only contiguous tensors on CPU can be offloaded.
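For example, a non-contiguous tensor (and, likewise, a GPU tensor) has to be prepared first; a short sketch using standard PyTorch ops, where the transpose is just one way to obtain a non-contiguous view:

import torch

x = torch.rand(4, 4).t()    # .t() returns a non-contiguous view
assert not x.is_contiguous()
x = x.contiguous()          # now x can be offloaded
# a GPU tensor would additionally need x = x.cpu() first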
Synchronous API:
import torch
from tensornvme import DiskOffloader
x = torch.rand(2, 2)
y = torch.rand(4, 4, 4)
offloader = DiskOffloader('./offload')
offloader.sync_write(x)
# x is saved to a file on disk (in ./offload folder) and the memory of x is freed
offloader.sync_read(x)
# x is restored
offloader.sync_writev([x, y])
# x and y are offloaded
offloader.sync_readv([x, y])
# x and y are restored.
# sync_writev() and sync_readv() are order sensitive
# E.g. sync_writev([x, y]) and sync_writev([y, x]) are different
Asynchronous API:
import torch
from tensornvme import DiskOffloader
x = torch.rand(2, 2)
y = torch.rand(4, 4, 4)
offloader = DiskOffloader('./offload')
offloader.async_write(x)
# x is being offloaded in the background
offloader.sync_write_events()
# x is offloaded and the memory of x is freed
offloader.async_read(x)
# x is being restored in the background
offloader.sync_read_events()
# x is restored
offloader.async_writev([x, y])
# x and y are being offloaded in the background
offloader.synchronize()
# synchronize() will synchronize both write and read events.
offloader.async_readv([x, y])
offloader.synchronize()
# x and y are restored.
# async_writev() and async_readv() are also order sensitive
You can use the asynchronous API to overlap computation and data movement.
tensors = []

for _ in range(10):
    tensor = torch.rand(2, 2)
    tensors.append(tensor)
    offloader.sync_write(tensor)

offloader.sync_read(tensors[0])

# prefetch=1: write tensors[i] and read tensors[i+1] in the background
for i, tensor in enumerate(tensors):
    offloader.sync_read_events()
    if i + 1 < len(tensors):
        offloader.async_read(tensors[i + 1])
    tensor.mul_(2.0)  # compute
    offloader.sync_write_events()
    offloader.async_write(tensor)

offloader.synchronize()
How to test
We have C++ test scripts for the AsyncIO and SpaceManager classes. Make sure you have installed liburing and libaio, and set the environment variables correctly before testing. To run the tests:
mkdir build
cd build
cmake ..
make
./test_asyncio
./test_space_mgr
We also have Python unit tests. Make sure you have installed pytest. To run them:
pytest ./tests
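If you want to add a test of your own, a minimal pytest-style round trip using only the API shown above might look like this (the test name and tensor size are arbitrary):

import torch
from tensornvme import DiskOffloader

def test_sync_round_trip(tmp_path):
    offloader = DiskOffloader(str(tmp_path))
    x = torch.rand(2, 2)
    expected = x.clone()
    offloader.sync_write(x)   # x is offloaded and its memory is freed
    offloader.sync_read(x)    # x is restored from disk
    assert torch.equal(x, expected)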
How to benchmark
We have benchmarks for Adam and CpuAdam with different backends and prefetch depths to validate TensorNVMe's speed. To run the benchmarks:
cd benchmark
python benchmark_adam.py
python benchmark_cpuadam.py
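If you just want a rough, self-contained feel for what the benchmarks measure, the sketch below times the plain synchronous pattern against the prefetching pattern from the usage section. The tensor count and sizes are arbitrary, the toy mul_ compute is far lighter than a real Adam step, and absolute numbers depend heavily on your drive, so treat it as an illustration rather than a benchmark:

import time
import torch
from tensornvme import DiskOffloader

offloader = DiskOffloader('./offload')
tensors = [torch.rand(1024, 1024) for _ in range(16)]
for t in tensors:
    offloader.sync_write(t)

# synchronous: read, compute, write back, one tensor at a time
start = time.time()
for t in tensors:
    offloader.sync_read(t)
    t.mul_(2.0)  # compute
    offloader.sync_write(t)
print('sync:', time.time() - start)

# overlapped: prefetch tensors[i + 1] while computing on tensors[i]
start = time.time()
offloader.sync_read(tensors[0])
for i, t in enumerate(tensors):
    offloader.sync_read_events()
    if i + 1 < len(tensors):
        offloader.async_read(tensors[i + 1])
    t.mul_(2.0)  # compute
    offloader.sync_write_events()
    offloader.async_write(t)
offloader.synchronize()
print('async:', time.time() - start)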