Proposal: Distribute Model Weights as Byte Delta Weights

Open RedmiS22018 opened this issue 2 years ago • 2 comments

Instead of loading the full weights into memory, this approach compares the LLaMA and delta files byte by byte, 4 KB at a time, and adds each delta byte to the corresponding LLaMA byte. This reduces RAM usage from approximately 60 GB for vicuna-13B to a few 4 KB buffers while achieving the same objective. It also significantly reduces the time required to apply the deltas, assuming you use C++; in my implementation, Python was too slow for both creating and applying the deltas. Distributing model weights as byte deltas therefore preserves the same functionality while offering significant performance gains.
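
The byte arithmetic behind this can be sketched in memory as follows. This is only a minimal illustration, not the file-based tool being proposed: the names MakeByteDelta and ApplyByteDelta are hypothetical, and the sketch assumes both inputs have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (name is illustrative, not from the proposal).
// Computes delta[i] = (updated[i] - original[i]) mod 256.
std::vector<std::uint8_t> MakeByteDelta(const std::vector<std::uint8_t>& original,
                                        const std::vector<std::uint8_t>& updated) {
    assert(original.size() == updated.size());  // assumption: equal lengths
    std::vector<std::uint8_t> delta(updated.size());
    for (std::size_t i = 0; i < updated.size(); ++i) {
        // The subtraction is done in int; the cast wraps it modulo 256.
        delta[i] = static_cast<std::uint8_t>(updated[i] - original[i]);
    }
    return delta;
}

// Hypothetical helper (name is illustrative, not from the proposal).
// Computes updated[i] = (original[i] + delta[i]) mod 256.
std::vector<std::uint8_t> ApplyByteDelta(const std::vector<std::uint8_t>& original,
                                         const std::vector<std::uint8_t>& delta) {
    assert(original.size() == delta.size());  // assumption: equal lengths
    std::vector<std::uint8_t> updated(delta.size());
    for (std::size_t i = 0; i < delta.size(); ++i) {
        updated[i] = static_cast<std::uint8_t>(original[i] + delta[i]);
    }
    return updated;
}
```

For example, an original byte of 250 and a new byte of 3 give a delta of 9, since (3 - 250) mod 256 = 9, and (250 + 9) mod 256 recovers 3.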

Here is some example C++ code to create a delta file:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Writes delta[i] = (new[i] - original[i]) mod 256 for every byte of newFile.
// Assumes originalFile is at least as long as newFile.
void CreateDeltas(const std::string &originalFile, const std::string &newFile, const std::string &deltas) {
    const size_t bufferSize = 4096;
    std::ifstream origStream(originalFile, std::ios::binary);
    std::ifstream newStream(newFile, std::ios::binary);
    std::ofstream deltaStream(deltas, std::ios::binary);
    if (!origStream || !newStream || !deltaStream) {
        std::cerr << "CreateDeltas: failed to open a file\n";
        return;
    }

    std::vector<unsigned char> origBuffer(bufferSize);
    std::vector<unsigned char> newBuffer(bufferSize);
    std::vector<unsigned char> deltaBuffer(bufferSize);

    size_t bytesRead;
    // Read the new file in 4 KB chunks; the final chunk may be shorter.
    while (newStream.read(reinterpret_cast<char*>(newBuffer.data()), bufferSize), (bytesRead = newStream.gcount()) > 0) {
        origStream.read(reinterpret_cast<char*>(origBuffer.data()), bytesRead);

        for (size_t i = 0; i < bytesRead; i++) {
            // The cast to unsigned char already wraps the difference modulo 256.
            deltaBuffer[i] = static_cast<unsigned char>(newBuffer[i] - origBuffer[i]);
        }
        deltaStream.write(reinterpret_cast<char*>(deltaBuffer.data()), bytesRead);
    }
}

Here is some example C++ code to apply a delta file:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Rebuilds newFile as new[i] = (original[i] + delta[i]) mod 256 for every byte of the delta file.
// Assumes originalFile is at least as long as the delta file.
void ApplyDeltas(const std::string &originalFile, const std::string &deltas, const std::string &newFile) {
    const size_t bufferSize = 4096;

    std::ifstream origStream(originalFile, std::ios::binary);
    std::ifstream deltaStream(deltas, std::ios::binary);
    std::ofstream newFileStream(newFile, std::ios::binary);
    if (!origStream || !deltaStream || !newFileStream) {
        std::cerr << "ApplyDeltas: failed to open a file\n";
        return;
    }

    std::vector<unsigned char> origBuffer(bufferSize);
    std::vector<unsigned char> deltaBuffer(bufferSize);
    std::vector<unsigned char> newReadBuffer(bufferSize);

    size_t bytesRead;
    // Read the delta file in 4 KB chunks; the final chunk may be shorter.
    while (deltaStream.read(reinterpret_cast<char*>(deltaBuffer.data()), bufferSize), (bytesRead = deltaStream.gcount()) > 0) {
        origStream.read(reinterpret_cast<char*>(origBuffer.data()), bytesRead);

        for (size_t i = 0; i < bytesRead; i++) {
            // The cast to unsigned char already wraps the sum modulo 256.
            newReadBuffer[i] = static_cast<unsigned char>(origBuffer[i] + deltaBuffer[i]);
        }
        newFileStream.write(reinterpret_cast<char*>(newReadBuffer.data()), bytesRead);
    }
}
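
One way to check that a delta round-trips is to compare the rebuilt file against the new file byte for byte. The helper below is a sketch of such a check; FilesEqual is a hypothetical name that is not part of the proposal, and it simply reuses the 4 KB chunk size from the code above.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the proposal): returns true only when
// both files can be opened and contain exactly the same bytes.
bool FilesEqual(const std::string& pathA, const std::string& pathB) {
    const std::size_t bufferSize = 4096;
    std::ifstream a(pathA, std::ios::binary);
    std::ifstream b(pathB, std::ios::binary);
    if (!a || !b) {
        return false;  // a file that cannot be opened never compares equal
    }

    std::vector<char> bufA(bufferSize);
    std::vector<char> bufB(bufferSize);
    while (true) {
        a.read(bufA.data(), bufferSize);
        b.read(bufB.data(), bufferSize);
        const std::streamsize gotA = a.gcount();
        const std::streamsize gotB = b.gcount();
        if (gotA != gotB) {
            return false;  // one file ended before the other
        }
        if (gotA == 0) {
            return true;   // both files ended at the same point
        }
        if (!std::equal(bufA.begin(), bufA.begin() + gotA, bufB.begin())) {
            return false;  // a byte differs inside this chunk
        }
    }
}
```

After creating a delta and applying it to the original file, calling FilesEqual on the rebuilt file and the new file should return true if the round trip preserved every byte.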

RedmiS22018, Apr 22 '23 19:04

@RedmiS22018 looks great. Do you mind submitting an end-to-end PR for this feature? Thanks

zhisbug, May 08 '23 09:05

@zhisbug I've created a pull request here: https://github.com/lm-sys/FastChat/pull/1045

RedmiS22018, May 08 '23 13:05