
Add mlock option to fix SSD Reads + Slow generation

Belluxx opened this issue 2 years ago · 2 comments

This change is disabled by default just in case it breaks something on some configurations; however, I tested it without experiencing any issues.

Fixes the massive SSD reads (1 TB+ for 100 tokens) and slow generation (around 5 seconds per token) seen on some devices, like the 8 GB M1 MacBook Air, when there is slightly less RAM available than the model needs.

When MODEL_USE_MLOCK is set to true in .env, llama.cpp will lock the model in RAM instead of streaming it from the SSD. This matters because even when llama.cpp needs only ~50 MB more RAM than is available, the OS keeps rereading the model directly from disk rather than moving other programs to swap or compressing memory a little, resulting in extremely degraded performance.
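For reference, this is roughly how the flag maps onto the llama-cpp-python bindings that llama.cpp is driven through. A minimal sketch, assuming a MODEL_PATH variable alongside MODEL_USE_MLOCK in .env; `use_mlock` is the bindings' actual parameter, the surrounding wiring here is illustrative rather than the exact private-gpt code:

```python
import os

from llama_cpp import Llama  # llama-cpp-python bindings

# Read the flag from the environment (populated from .env).
# Defaults to False so existing configurations are unaffected.
use_mlock = os.environ.get("MODEL_USE_MLOCK", "false").lower() == "true"

llm = Llama(
    model_path=os.environ["MODEL_PATH"],  # assumed env var for the model file
    use_mlock=use_mlock,  # pin the model weights in RAM so they are never paged out
)
```

With the flag off, nothing changes; with it on, llama.cpp calls mlock() on the model buffer, so the OS must evict other pages (or compress memory) instead of rereading the weights from disk on every pass.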

Belluxx · May 27 '23 13:05

The GPTJ backend is designed for high-performance tasks, and it is possible that the settings you have provided do not align with the memory constraints of someone's system. use_mlock is a memory-locking option: if it is set to True, it can prevent the system from swapping some of the data to disk when memory is low, which may cause the program to run out of memory.
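As a quick way to see whether locking the model could even succeed, or whether it would push the process toward the out-of-memory situation described above, one can compare the process's memlock limit against the model size. A minimal sketch, where MODEL_PATH is again an assumed variable:

```python
import os
import resource

# RLIMIT_MEMLOCK caps how many bytes a process may pin with mlock().
# If the soft limit is below the model size, llama.cpp's mlock call
# will fail or fall back, and MODEL_USE_MLOCK=true will not help.
soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
model_bytes = os.path.getsize(os.environ["MODEL_PATH"])

if soft != resource.RLIM_INFINITY and soft < model_bytes:
    print(f"memlock limit ({soft} B) is below the model size ({model_bytes} B); "
          "raise it (e.g. with `ulimit -l`) before enabling MODEL_USE_MLOCK.")
```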

sime2408 · May 30 '23 09:05

@sime2408 Yes, however it's set to False by default. This way, the average user with little RAM on their laptop can use this tool at reasonable performance without reading terabytes of data from the SSD just to generate a couple of tokens.

It looks like a win-win scenario to me: it's just another option, disabled by default, that may be very useful to the average user.

Belluxx · May 30 '23 11:05