
Byte deltas

RedmiS22018 opened this issue 1 year ago • 12 comments

Instead of using parameter deltas, this implementation compares each byte of the delta with the corresponding byte of the LLaMA model and outputs the vicuna model.

This offers significantly lower RAM usage than the original implementation.

https://github.com/lm-sys/FastChat/issues/560
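
For illustration only (this is not the PR's code), a streaming bytewise delta apply could look roughly like the sketch below. The file names, chunk size, and combining operation (modular addition here) are placeholders, and the base and delta files are assumed to be the same length.

# Hypothetical sketch of a streaming bytewise delta apply (not the PR code).
# Only one chunk per file is held in memory at a time, which is where the
# RAM saving comes from.
CHUNK = 1 << 20  # 1 MiB

def apply_byte_delta(base_path, delta_path, out_path):
    with open(base_path, "rb") as base, \
         open(delta_path, "rb") as delta, \
         open(out_path, "wb") as out:
        while True:
            d = delta.read(CHUNK)
            if not d:
                break
            b = base.read(len(d))
            # combine byte by byte (modular addition chosen as a placeholder)
            out.write(bytes((x + y) & 0xFF for x, y in zip(b, d)))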

RedmiS22018 avatar May 08 '23 13:05 RedmiS22018

@andy-yang-1 please try and verify the PR

zhisbug avatar May 09 '23 10:05 zhisbug

Nice work, @RedmiS22018! I encountered one issue when running your code and would like to bring it to your attention:

  • On Hugging Face, there are two versions of llama weights. I ran into issues with the file_stream loading when I chose the llama-7b-hf version. Could you please add support for the llama-7b-hf version in your code?

btw, I suspect that using bytewise loading for the files might introduce potential risks, as the bin file is a serialized file generated from a pickle dump. The state dict can be correctly loaded only if every position in the file corresponds to the exact same content. If the serialization order changes during a version update, could this render the apply method ineffective?
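
For context, the .bin checkpoints saved by recent torch versions are zip archives containing a pickle of the metadata plus raw tensor storage, so the exact byte layout is an implementation detail; a quick way to peek at it (the file name is a placeholder):

# Inspect the internal layout of a torch checkpoint (zip-based format);
# legacy checkpoints are plain pickle streams instead of zip archives.
import zipfile

with zipfile.ZipFile("pytorch_model.bin") as z:  # placeholder file name
    print(z.namelist())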

andy-yang-1 avatar May 09 '23 12:05 andy-yang-1

Hi there @andy-yang-1!

On Hugging Face, there are two versions of llama weights. I ran into issues with the file_stream loading when I chose the llama-7b-hf version. Could you please add support for the llama-7b-hf version in your code?

  • I'm not aware of there being two versions of the original LLaMA weights. If you follow the instructions in the Hugging Face documentation, you will always end up with the same model weights. Are you referring to vicuna-7b v0 and vicuna-7b v1.1? Could you please explain what you meant by there being two versions of the weights?

btw, I suspect that using bytewise loading for the files might introduce potential risks, as the bin file is a serialized file generated from a pickle dump. The state dict can be correctly loaded only if every position in the file corresponds to the exact same content. If the serialization order changes during a version update, could this render the apply method ineffective?

  • If an update to vicuna is released, new delta files will have to be created; they will supersede the old serialization order and use the new one. After the delta files are applied, the output model will have exactly the same hash as the vicuna model that was used to generate the delta weights, so the weights don't break and the apply method is not rendered ineffective by an update.
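
For illustration, a hash check of the reconstructed files might look like the sketch below; the shard name and reference digest are placeholders rather than values published by the project.

# Hypothetical verification sketch: hash each reconstructed shard and compare
# it against a known-good digest of the reference vicuna weights.
import hashlib

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# placeholder file name and digest
expected = {"pytorch_model-00001-of-00002.bin": "<published sha256>"}
for name, ref in expected.items():
    print(name, "ok" if sha256_of(name) == ref else "mismatch")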

RedmiS22018 avatar May 09 '23 17:05 RedmiS22018

Hi @RedmiS22018,

Thank you for your prompt response!

I wanted to provide you with the link to the other version of the llama weights.

After the delta files are applied, the output model will have exactly the same hash as the vicuna model that was used to generate the delta weights, so the weights don't break

I get your points, and I think it is safe to use a file stream to apply the delta. Would you be willing to support the llama weight version I linked? It's not a problem if you don't support it. I noticed that the llama weight has been removed from the inference part as well 😂

andy-yang-1 avatar May 10 '23 12:05 andy-yang-1

Is that the same model, just split into 1 GB files instead of 10 GB files? In other words: will combining the PyTorch model files into 10 GB files result in the same files as the first version of the weights?
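
One rough way to check (the paths are placeholders, and this loads everything into memory, so it is only practical as a one-off test):

# Hypothetical check: merge both sets of shards into full state dicts and
# compare tensor by tensor; identical contents mean only the split differs.
import glob
import torch

def merge_shards(pattern):
    state = {}
    for shard in sorted(glob.glob(pattern)):
        state.update(torch.load(shard, map_location="cpu"))
    return state

a = merge_shards("llama-7b-hf-large-shards/pytorch_model-*.bin")  # placeholder path
b = merge_shards("llama-7b-hf-small-shards/pytorch_model-*.bin")  # placeholder path
same = a.keys() == b.keys() and all(torch.equal(a[k], b[k]) for k in a)
print("same tensors:", same)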

RedmiS22018 avatar May 10 '23 12:05 RedmiS22018

Its tokenizer has a different name, LlamaTokenizer @RedmiS22018

andy-yang-1 avatar May 10 '23 12:05 andy-yang-1

I'm not sure if it's possible to support this version without:

  • Creating a version of the vicuna model that is split into 1 GB files, and creating delta files for this version of the vicuna model
  • Converting this version to the other version before applying the delta files, which would require loading the model into RAM, defeating the entire purpose of using byte delta files

As you've mentioned: It's not a problem if this version isn't supported. Users will just follow the instructions in the Hugging Face documentation and end up with that version of the weights, and they'll be able to apply the delta and end up with the vicuna weights.

Please let me know if I'm missing something and there is a way to support this version of the llama weights without needing to load the model into RAM or to create another version of the vicuna weights.

RedmiS22018 avatar May 10 '23 13:05 RedmiS22018

I agree. We just need to support the normal weights.

andy-yang-1 avatar May 10 '23 14:05 andy-yang-1

Sorry, but this is not a good idea.

It multiplies the ways in which the conversion process can go wrong. Also, instead of having a couple of versions of this model floating around, you will have dozens, and mega confusion.

instead,

Just add an instruction on how to increase the swap file on your machine. It's not difficult and way more reliable. Since you only need to apply a delta once, the extra time it takes is well offset by not having to hunt for the just-right version of a 22 GB binary.

I vote to reject this.

jerzydziewierz avatar May 12 '23 09:05 jerzydziewierz

@RedmiS22018 simply increase the swap file size on your machine. Since you only do it once, trading away some extra time is fine.

# === Increase swap available

# create an empty file
echo "reserving by writing to /swapfile2..."
# note: "count" is the count of 1GB blocks to reserve. 320 -> 320GB:
dd if=/dev/zero of=/swapfile2 bs=1G count=320

# bake the swap file
echo "baking /swapfile2 ..."
chmod 0600 /swapfile2
mkswap /swapfile2

# it is safe to skip this if you do not need the swap file in the next boot.
echo "registering..."
echo "/swapfile2 swap swap sw 0 0" >> /etc/fstab

# activate
echo "activating...."
swapon /swapfile2
# verify
cat /proc/swaps

after the process is done, remove that swap file:

echo "turning off the swap file..."
swapoff /swapfile2
echo "de-registering from fstab..."
sed -i "/swapfile2" /etc/fstab
echo "removing the file itself"
rm /swapfile2

jerzydziewierz avatar May 12 '23 10:05 jerzydziewierz

Hi there @jerzydziewierz, Thank you for sharing your suggestion of increasing the swap file size. However, I'd like to point out that this method doesn't work on macOS. Modifying the swap file is not recommended on macOS, as the system handles swap files differently and automatically through the dynamic_pager daemon. Manual modification could potentially lead to unexpected results or stability issues.

Regarding your concerns about bytewise deltas, I'd like to address them as follows:

  1. Bytewise delta files wouldn't introduce any additional risks or errors during the process.
  2. Bytewise delta files wouldn't require users to search for the correct version of the weights. The officially supported HF format, obtained by following the documentation's instructions, should be sufficient.
  3. Bytewise delta files wouldn't increase the number of hosted model versions. We would only need two versions (vicuna-7b and vicuna-13b), in line with the existing setup.

Additionally, bytewise deltas offer several benefits:

  1. They significantly reduce the time needed to apply the delta, thus improving the efficiency of the process.
  2. Bytewise deltas require less RAM, making the process more accessible for systems with lower memory capacity.

I hope this clarifies your concerns about bytewise deltas and highlights their advantages.

RedmiS22018 avatar May 12 '23 11:05 RedmiS22018

@RedmiS22018

The reason this doesn't work on macOS and Windows is that, as long as there is disk space available, these OSes will simply grow the swap to whatever size is needed to complete the task. Hence, no problem: you can process your 60 GB problem just fine on these platforms; it will merely take more patience.

Let me respond to your points:

  1. Granted.
  2. Do you mean that you will spend your own time and money to host the new binaries? What is your budget, in terms of hours per week and terabytes of bandwidth per week, that you are willing to sponsor? And even if so, see the next point.
  3. You have just agreed to increase the number of versions: there will be the original "delta weights" version and the new "bytewise" version, so you have multiplied the count of options by 2x. You will not convince me that the original format will disappear.

Additionally,

  1. That's true. It will reduce the system load, once, assuming that everything goes fine the first time around. In case of any problems, and there will be some, the user will have to download more, read more documentation, and confuse people who are willing to help but have done this the original way.
  2. That's only partially true: it only helps systems that are too small to run it anyway, and even if they really want to spend the time doing that, then they can simply do it the regular way.

TL;DR: this is not needed, and it comes with costs.

Dear @RedmiS22018, I appreciate your effort. Really. I know what you feel. Please use this as a hint from a friend, the kind that does not shy from telling you that there are better ways.

Overall,

I still vote to reject this PR, on the basis that the benefit does not justify the cost.

jerzydziewierz avatar May 13 '23 07:05 jerzydziewierz

We will release merged weights directly later. Thanks for the contribution, but I will close this for now.

merrymercy avatar Jun 18 '23 04:06 merrymercy

Thank you for the update. Releasing the merged weights directly is a better solution.

RedmiS22018 avatar Jun 18 '23 21:06 RedmiS22018