
Use as SWAP

Open snshn opened this issue 11 years ago • 33 comments

I was wondering: would it be possible to host a swap partition within vramfs, or to somehow patch vramfs so it can work as a swap partition?

My drive is encrypted, so I don't use swap partitions... but if this thing could give me 3 GB or so of a swap-like filesystem, we could be onto something...

Do you think it could work natively, without FUSE?

Oh, and great idea behind vramfs, really neat!

snshn avatar Dec 14 '14 16:12 snshn

It's possible to implement a block device with OpenCL backing it. It could probably be developed pretty quickly with something like BUSE.

Overv avatar Dec 14 '14 18:12 Overv

If you can provide a block device, then you can also build RAID-0 on top of multiple such block devices.

ptman avatar Dec 14 '14 19:12 ptman

@ptman That is a great point. I'm going to look into writing a kernel module to do this tomorrow. I've tried BUSE, but it seems to be bottlenecking because it's based on the network block device interface.

Overv avatar Dec 14 '14 20:12 Overv

A kernel module and some kind of analogue to swapon/swapoff would make this thing look very serious.

Both FUSE and BUSE would definitely only slow things down.

Good luck @Overv, thanks for sharing!

snshn avatar Dec 14 '14 20:12 snshn

I've done some preliminary testing with BUSE and trivial OpenCL code. The read speed is 1.1 GB/s and the write speed 1.5 GB/s with ext4. Writing my own kernel module is going to take more time, and it'll still require a userspace daemon to interact with OpenCL.
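
For reference, sequential throughput numbers like these could be gathered with something along the following lines (the device and mount point are placeholders, and this is not necessarily the exact method used above):

# Assumes the BUSE-backed device shows up as /dev/nbd0 and is mounted at /mnt/vramblk.
sudo mkfs.ext4 /dev/nbd0
sudo mount /dev/nbd0 /mnt/vramblk
sudo dd if=/dev/zero of=/mnt/vramblk/bench bs=1M count=1024 oflag=direct   # write throughput
sudo dd if=/mnt/vramblk/bench of=/dev/null bs=1M iflag=direct              # read throughput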

Overv avatar Dec 15 '14 21:12 Overv

Wow, very good news, @Overv!

I think the daemon is needed anyway, to provide proper RAID support across multiple vramfs-based block devices and to control the amount of memory dedicated per adapter... I believe a package named vramfs-tools containing vramfsd and vramfsctl could fit the purpose...

Wondering what @torvalds will think of this project, maybe it'll end up being included in the tree like tmpfs... 4GB of VRAM on my Linux laptop feels like such a waste... bet I'm not the only one who feels that way.

Thanks for your work, once again!

snshn avatar Dec 15 '14 21:12 snshn

If you want a userspace-backed block (SCSI) device, I would encourage you to look at TCMU, which was just added to Linux 3.18. It's part of the LIO kernel target. Using it along with the loopback fabric and https://github.com/agrover/tcmu-runner may fill in some missing pieces. tcmu-runner handles the "you need a daemon" part, so the work would just consist of a VRAM-backed plugin for servicing SCSI commands like READ and WRITE. Then you'd have the basic block device, for swap or a filesystem or whatever.

(tcmu-runner is still alpha, but I think it would save you from writing kernel code and a daemon from scratch. Feedback welcome.)

agrover avatar Dec 15 '14 22:12 agrover

While it is technically possible to create a file on VRAMFS and use it as a swap, this is risky: What happens if VRAMFS itself, or one of the GPU libraries, gets swapped? This can happen in a low-memory situation, i.e. exactly the situation that swap is designed to help with. The kernel cannot possibly know that restoring data from the swap depends on the data that is… swapped in the swap. This is not an issue for kernel-space filesystem/storage drivers because the kernel's own RAM never gets swapped, but it is a conundrum for user-space stuff.

bisqwit avatar Jan 04 '20 10:01 bisqwit

For a kernel-space driver, it would be nice to use TTM/GEM directly to allocate video RAM buffers.

j123b567 avatar Jan 21 '20 11:01 j123b567

What are TTM/GEM?

Note that the slram/phram/mtdblock approach can only access at most something like 256 MB of the memory, which is (I guess) the size of the PCI device's memory window.

bisqwit avatar Jan 21 '20 12:01 bisqwit

I don't know much, but they are interfaces for accessing GPU memory inside the kernel, so they can see all of the GPU memory, not only the part that happens to be directly mapped and accessible. https://www.kernel.org/doc/html/latest/gpu/drm-mm.html

My situation: an NVIDIA dedicated GPU with 4 GB of RAM and the nouveau driver, which has no OpenCL support. This memory is not mapped into the address space, so I can't use it via slram/phram.

j123b567 avatar Jan 21 '20 13:01 j123b567

It's possible to implement a block device with OpenCL backing it. It could probably be developed pretty quickly with something like BUSE.

The easy way to accomplish this is to use vramfs as is: make a file on the vramfs disk, attach a loop device to that file, format the loop device with mkswap, and then swapon. With this method everything seems to work, as far as I tried. Anyway, the big issue with using FUSE or BUSE is that both run in user space, and user space is swappable. I have not tried it, but suppose the memory of the vramfs process itself gets swapped out by the kernel; how would the kernel be able to recover from the page fault when it needs vramfs loaded in the first place? I am curious what would happen then.
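
For illustration, the steps could look roughly like this (the mount point and size are placeholders; dd rather than truncate keeps the file free of holes):

sudo dd if=/dev/zero of=/mnt/vram/swapfile bs=1M count=3072   # 3 GiB backing file on vramfs
LOOP=$(sudo losetup --find --show /mnt/vram/swapfile)         # expose it as a block device
sudo mkswap "$LOOP"
sudo swapon -p 100 "$LOOP"
# Tear down again with: sudo swapoff "$LOOP" && sudo losetup -d "$LOOP"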

Edit: sorry, I had not read the earlier comments; bisqwit already explained this... anyway, I tried using it as swap and after a while the system froze and needed a hard reboot (switching the power off and on, sob)...

dhalsimax avatar Oct 06 '20 20:10 dhalsimax

What happens if VRAMFS itself, or one of the GPU libraries, gets swapped?

Couldn't mlockall be used to prevent vramfs from getting swapped?

LHLaurini avatar Nov 16 '20 17:11 LHLaurini

Wonderful idea! I am running an old headless server with a 1 GB DDR3 AMD card (OpenCL 1.1). I can use all of the video RAM since I only use SSH. Unfortunately vramfs does not let me create swap based on a swap file; I get "swapon: /mnt/vram/swapfile: swapon failed: Invalid argument". Can it be fixed? I see OpenCL 1.2 is merged into Mesa 20.3, so good times ahead for this project.

montvid avatar Dec 15 '20 00:12 montvid

It doesn't work for me, even though I tried to mlockall() the pages of the userspace program. I think the NVIDIA driver allocates some memory that can still be swapped. At some point, the computer gets into a deadlock when memory is low.

I also tried the BUSE / nbd approach. It doesn't work for me either.

I think we need to get into the nvidia driver, carefully develop a block device kernel driver, and call these undocumented APIs:

cat /proc/kallsyms |grep rm_gpu_ops | sort -k 3
0000000000000000 t rm_gpu_ops_address_space_create	[nvidia]
0000000000000000 t rm_gpu_ops_address_space_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_bind_channel_resources	[nvidia]
0000000000000000 t rm_gpu_ops_channel_allocate	[nvidia]
0000000000000000 t rm_gpu_ops_channel_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_create_session	[nvidia]
0000000000000000 t rm_gpu_ops_destroy_access_cntr_info	[nvidia]
0000000000000000 t rm_gpu_ops_destroy_fault_info	[nvidia]
0000000000000000 t rm_gpu_ops_destroy_session	[nvidia]
0000000000000000 t rm_gpu_ops_device_create	[nvidia]
0000000000000000 t rm_gpu_ops_device_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_disable_access_cntr	[nvidia]
0000000000000000 t rm_gpu_ops_dup_address_space	[nvidia]
0000000000000000 t rm_gpu_ops_dup_allocation	[nvidia]
0000000000000000 t rm_gpu_ops_dup_memory	[nvidia]
0000000000000000 t rm_gpu_ops_enable_access_cntr	[nvidia]
0000000000000000 t rm_gpu_ops_free_duped_handle	[nvidia]
0000000000000000 t rm_gpu_ops_get_channel_resource_ptes	[nvidia]
0000000000000000 t rm_gpu_ops_get_ecc_info	[nvidia]
0000000000000000 t rm_gpu_ops_get_external_alloc_ptes	[nvidia]
0000000000000000 t rm_gpu_ops_get_fb_info	[nvidia]
0000000000000000 t rm_gpu_ops_get_gpu_info	[nvidia]
0000000000000000 t rm_gpu_ops_get_non_replayable_faults	[nvidia]
0000000000000000 t rm_gpu_ops_get_p2p_caps	[nvidia]
0000000000000000 t rm_gpu_ops_get_pma_object	[nvidia]
0000000000000000 t rm_gpu_ops_has_pending_non_replayable_faults	[nvidia]
0000000000000000 t rm_gpu_ops_init_access_cntr_info	[nvidia]
0000000000000000 t rm_gpu_ops_init_fault_info	[nvidia]
0000000000000000 t rm_gpu_ops_memory_alloc_fb	[nvidia]
0000000000000000 t rm_gpu_ops_memory_alloc_sys	[nvidia]
0000000000000000 t rm_gpu_ops_memory_cpu_map	[nvidia]
0000000000000000 t rm_gpu_ops_memory_cpu_ummap	[nvidia]
0000000000000000 t rm_gpu_ops_memory_free	[nvidia]
0000000000000000 t rm_gpu_ops_own_page_fault_intr	[nvidia]
0000000000000000 t rm_gpu_ops_p2p_object_create	[nvidia]
0000000000000000 t rm_gpu_ops_p2p_object_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_pma_alloc_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_free_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_pin_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_register_callbacks	[nvidia]
0000000000000000 t rm_gpu_ops_pma_unpin_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_unregister_callbacks	[nvidia]
0000000000000000 t rm_gpu_ops_query_caps	[nvidia]
0000000000000000 t rm_gpu_ops_query_ces_caps	[nvidia]
0000000000000000 t rm_gpu_ops_release_channel	[nvidia]
0000000000000000 t rm_gpu_ops_release_channel_resources	[nvidia]
0000000000000000 t rm_gpu_ops_report_non_replayable_fault	[nvidia]
0000000000000000 t rm_gpu_ops_retain_channel	[nvidia]
0000000000000000 t rm_gpu_ops_retain_channel_resources	[nvidia]
0000000000000000 t rm_gpu_ops_service_device_interrupts_rm	[nvidia]
0000000000000000 t rm_gpu_ops_set_page_directory	[nvidia]
0000000000000000 t rm_gpu_ops_stop_channel	[nvidia]
0000000000000000 t rm_gpu_ops_unset_page_directory	[nvidia]

to create a GPU session and allocate GPU memory in order to make a GPU swap truly possible.

wonghang avatar Jan 20 '21 09:01 wonghang

Hi guys, any update on this? Has anyone been able to reliably use VRAM as swap?

azureblue avatar Dec 03 '21 00:12 azureblue

It only works if the following two conditions are met:

  1. The GPU driver code/data is never put in swap.
  2. The vramfs driver code/data is never put in swap.

If you somehow can guarantee these aspects, then using VRAM as swap will work.

bisqwit avatar Dec 03 '21 00:12 bisqwit

Did not work for me the one time I tried it. Seems the project is abandoned...

montvid avatar Dec 03 '21 01:12 montvid

FUSE should be able to avoid being swapped out itself. But I attempted to add mlockall() to the vramfs code, and it didn't work either. It appears that the GPU driver (nvidia) and the CUDA libraries were swapped out.

In the nvidia driver, there are some undocumented functions for accessing GPU memory (prefixed with rm_; run cat /proc/kallsyms | grep nvidia to see them). I think they are part of GPUDirect RDMA (https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). If we can somehow hack them and write a kernel driver to handle the paging, it may be possible to use the GPU as swap.

wonghang avatar Dec 03 '21 01:12 wonghang

It is possible to achieve this, see https://wiki.archlinux.org/title/Swap_on_video_RAM , section FUSE.

The vramfs driver code/data is never put in swap.

This can be achieved with https://wiki.archlinux.org/title/Swap_on_video_RAM#Complete_system_freeze_under_high_memory_pressure

I tested it under high memory pressure (stress -m 10 --vm-bytes 3G --vm-hang 10 on a 32G system) and it didn't fall over, but only after applying the aforementioned fix.
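
For illustration only, the gist of that fix is keeping the vramfs service itself out of swap. One way to express that is sketched below; the unit name vramfs.service and the MemorySwapMax=0 directive are assumptions on my part, and the wiki section linked above has the authoritative steps:

sudo systemctl edit vramfs.service
# In the drop-in that opens, add:
#   [Service]
#   MemorySwapMax=0
# so that, on cgroup v2, the unit's own pages are never written to swap. Then:
sudo systemctl restart vramfs.service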

Atrate avatar May 29 '22 13:05 Atrate

This looks like a proper solution indeed.

bisqwit avatar May 29 '22 14:05 bisqwit

I've tried implementing mlockall. If you want to, you can test whether it works for you and fixes deadlocks without needing to use a systemd service.

https://github.com/Overv/vramfs/pull/32

Atrate avatar Nov 27 '22 19:11 Atrate

I would like to add to this discussion that the addition of vramfs as a block device would help using vramfs as a dedicated L2ARC ZFS buffer.

We are using very big dedicated NVMe swap RAID arrays for quantum computing and need something faster than 8-16 NVMe sticks in RAID to collect the I/O in a buffer that is not in main memory.

We make use of a lot of (virtual) memory, so an L2ARC buffer in VRAM would be awesome; the GPUs would get a new lease on life, since we moved to CPU-only calculation because of the huge memory requirements for storing the eigenvector (think 8/16 TB).

twobombs avatar Jan 15 '23 12:01 twobombs

I would like to add to this discussion that the addition of vramfs as a block device would help using vramfs as a dedicated L2ARC ZFS buffer.

We are using very big dedicated NVMe swap RAID arrays for quantum computing and need something faster than 8-16 NVMe sticks in RAID to collect the I/O in a buffer that is not in main memory.

We make use of a lot of (virtual) memory, so an L2ARC buffer in VRAM would be awesome; the GPUs would get a new lease on life, since we moved to CPU-only calculation because of the huge memory requirements for storing the eigenvector (think 8/16 TB).

@twobombs

You can make a loop device with losetup, but NVMe RAID will probably be faster than VRAM swap; the performance is still somewhat lacking in certain areas.
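
A rough sketch of that approach (the pool name and size are placeholders):

sudo dd if=/dev/zero of=/mnt/vram/l2arc.img bs=1M count=4096   # backing file on vramfs
LOOP=$(sudo losetup --find --show /mnt/vram/l2arc.img)         # expose it as a block device
sudo zpool add tank cache "$LOOP"                              # attach as an L2ARC cache vdev
# Remove it again with: sudo zpool remove tank "$LOOP"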

Atrate avatar Jan 15 '23 18:01 Atrate

@Atrate thank you very much for the loop solution. I will look into this and into whether ZFS will allow a loop device as cache. The swap I/O usage pattern is random read/write, not streaming. A PCIe VRAM device might offer better speeds, while at the same time making the workload on the NVMe RAID devices more 'stream'-lined when changes are committed to the array.

twobombs avatar Jan 15 '23 19:01 twobombs

I went a step further and added VRAM cache files for ZFS-based swap. It is fairly hilarious to see the I/O come through in nvtop.

(Screenshot: Screenshot_from_2023-02-01_20-10-18, showing the I/O in nvtop)

twobombs avatar Feb 03 '23 12:02 twobombs

It is possible to achieve this, see https://wiki.archlinux.org/title/Swap_on_video_RAM , section FUSE.

The vramfs driver code/data is never put in swap.

This can be achieved with https://wiki.archlinux.org/title/Swap_on_video_RAM#Complete_system_freeze_under_high_memory_pressure

I tested it under high memory pressure (stress -m 10 --vm-bytes 3G --vm-hang 10 on a 32G system) and it didn't fall over, but only after applying the aforementioned fix.

The solution seems to work for me, but when I increase swappiness from 10 to 180, it simply freezes. The same happens without increasing swappiness when running mprime.

I am running vramfs as a service, as the workaround cited above suggests. The only thing I think I am doing differently is using a loopback device, as my swapfile is otherwise created with holes.

Does anyone have an idea of what is happening?

UPDATE: I went through the journal from the last boot, and it reported the following error: kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] ERROR [CRTC:82:crtc-0] hw_done or flip_done timed out

aedalzotto avatar Aug 03 '23 00:08 aedalzotto

In reply to: https://github.com/Overv/vramfs/issues/3#issuecomment-1663130681

As suggested by fanzhuyifan and others above, I think that may be due to other GPU-management processes/libraries getting swapped out. Maaaybe a fix is possible with a lot of systemd unit editing, but that'd require tracking down every single library and process that is required for the operation of a dGPU, and that seems like a chore.

Atrate avatar Aug 12 '23 14:08 Atrate

According to the documentation of mlockall,

mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.

So shared libraries directly used by vramfs being swapped out should not be the reason for the system freezes.

Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.
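
A quick way to watch that under load (assuming the process shows up under the name vramfs):

watch -n1 'grep -E "VmSize|VmRSS|VmSwap" /proc/$(pidof vramfs)/status'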

fanzhuyifan avatar Aug 13 '23 08:08 fanzhuyifan

Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.

Is it? mlockall is called with the MCL_CURRENT | MCL_FUTURE flags, so it should also prevent all future allocations of memory from being swapped, unless I misunderstood the documentation.

Code in vramfs: https://github.com/Overv/vramfs/blob/829b1f2c259da2eb63ed3d4ddef0eeddb08b99e4/src/vramfs.cpp#L534

Documentation:

       MCL_CURRENT
              Lock all pages which are currently mapped into the address
              space of the process.

       MCL_FUTURE
              Lock all pages which will become mapped into the address
              space of the process in the future.  These could be, for
              instance, new pages required by a growing heap and stack
              as well as new memory-mapped files or shared memory
              regions.
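
A rough way to check whether the lock is actually taking effect (again assuming the process is visible as vramfs): with MCL_CURRENT | MCL_FUTURE in place, VmLck should stay close to VmSize and VmSwap should stay at 0 kB.

grep -E "VmLck|VmSize|VmSwap" /proc/$(pidof vramfs)/status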

Atrate avatar Aug 13 '23 15:08 Atrate