
Parallel loading of the model tensors

philpax opened this issue • 5 comments

People have reported faster loading of models in upstream llama.cpp when the tensors are loaded in parallel: https://github.com/ggerganov/llama.cpp/issues/85

This should be pretty easy to do in Rust if we convert loading to an iterator and then use rayon's par_iter instead. It seems like this should be I/O-bound, but perhaps the actual loading process has computational overhead?
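As a rough sketch of the shape, with hypothetical `TensorInfo`/`Tensor` types and a `read_tensor` stub standing in for the real per-tensor work:

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for the real loader's types.
struct TensorInfo {
    name: String,
    n_bytes: usize,
}

struct Tensor {
    name: String,
    data: Vec<u8>,
}

#[derive(Debug)]
struct LoadError;

// Stub: the real loader would read `n_bytes` of tensor data from the file.
fn read_tensor(info: &TensorInfo) -> Result<Tensor, LoadError> {
    Ok(Tensor {
        name: info.name.clone(),
        data: vec![0; info.n_bytes],
    })
}

// The only change from a sequential loader is `into_iter` -> `into_par_iter`;
// rayon's `collect` short-circuits on the first `Err`, like std's does.
fn load_tensors(infos: Vec<TensorInfo>) -> Result<Vec<Tensor>, LoadError> {
    infos.into_par_iter().map(|info| read_tensor(&info)).collect()
}
```

Whether this helps in practice depends on whether the per-tensor work (copies, conversion) dominates the raw read time.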

philpax avatar Mar 26 '23 13:03 philpax

Sort of related to speeding up loading: I've been messing around with rewriting it to use an mmap-based approach and nom. I don't know if it's really on the right track.

This is what just loading the header and vocabulary looks like:

```rust
pub mod mmap_loader {
    use mmap_rs::{MmapFlags, MmapOptions};
    #[allow(unused_imports)]
    use nom::{
        branch::alt,
        bytes::complete as nby,
        combinator as ncom,
        error::ParseError,
        multi as nm,
        number::complete::{self as nnum, le_f32, le_i32, le_u32},
        sequence as nseq, IResult, Parser, Slice,
    };
    use std::fs::File;

    use super::*;

    pub struct Flib;

    #[derive(Debug)]
    struct Header {
        legacy: bool,
        hyper: Hyperparameters,
    }

    impl Flib {
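        // Read the i32 magic to distinguish current files from legacy
        // (unversioned) ones, then parse the hyperparameters that follow.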
        fn parse_header(i: &[u8]) -> IResult<&[u8], Header> {
            let (i, magic) = le_i32(i)?;
            let legacy = match magic {
                ggml::FILE_MAGIC => false,
                ggml::FILE_MAGIC_UNVERSIONED => true,
                _ => return nom::error::context("ohno", ncom::fail)(i),
            };
            ncom::map(Flib::parse_hyperparameters, move |hyper| Header {
                legacy,
                hyper,
            })(i)
        }

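        // Seven consecutive little-endian i32 fields; n_ctx isn't stored
        // in the file, so it's filled in with 0.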
        fn parse_hyperparameters(i: &[u8]) -> IResult<&[u8], Hyperparameters> {
            ncom::map(
                nseq::tuple((le_i32, le_i32, le_i32, le_i32, le_i32, le_i32, le_i32)),
                |(n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, f16_)| Hyperparameters {
                    n_vocab,
                    n_ctx: 0,
                    n_embd,
                    n_mult,
                    n_head,
                    n_layer,
                    n_rot,
                    f16_,
                },
            )(i)
        }

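        // Parse n_vocab vocabulary entries into the id/score/lookup tables,
        // tracking the longest valid token along the way.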
        fn parse_vocabulary<'a>(i: &'a [u8], hdr: &Header) -> IResult<&'a [u8], Vocabulary> {
            const TOKEN_PLACEHOLDER: &str = "�";
            let n_vocab = hdr.hyper.n_vocab as usize;
            let legacy = hdr.legacy;
            let mut id_to_token = Vec::with_capacity(n_vocab);
            let mut id_to_token_score = Vec::with_capacity(n_vocab);
            let mut token_to_id = HashMap::with_capacity(n_vocab);
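            // One entry: a u32-length-prefixed byte string, then an f32
            // score unless the file is legacy-format (score defaults to 0).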
            let vocabitem_parser = |i| {
                nseq::tuple((nm::length_data(le_u32), ncom::cond(!legacy, le_f32)))(i)
                    .map(|(i, (sbytes, score))| (i, (sbytes, score.unwrap_or_default())))
            };
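            // Fold step: intern each token, mapping valid UTF-8 tokens to
            // their id and replacing invalid ones with a placeholder, while
            // accumulating the maximum valid token length.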
            let folf = |mut mtl: usize, (sbytes, score)| {
                let tid = id_to_token.len();
                let (ok, token) = std::str::from_utf8(sbytes).map_or_else(
                    |_| (false, TOKEN_PLACEHOLDER.to_string()),
                    |s| (true, s.to_string()),
                );
                if ok {
                    mtl = mtl.max(token.len());
                    token_to_id.insert(token.clone(), tid as TokenId);
                }
                id_to_token.push(token);
                id_to_token_score.push(score);
                mtl
            };
            let (i, max_token_length) =
                nm::fold_many_m_n(n_vocab, n_vocab, vocabitem_parser, || 0, folf)(i)?;
            IResult::Ok((
                i,
                Vocabulary {
                    id_to_token,
                    id_to_token_score,
                    token_to_id,
                    max_token_length,
                },
            ))
        }

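        // mmap the whole file and parse the header and vocabulary straight
        // out of the mapped byte slice.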
        pub fn load(path: impl AsRef<Path>) -> Result<(), LoadError> {
            let path = path.as_ref();
            let fp = File::open(path).map_err(|e| LoadError::OpenFileFailed {
                source: e,
                path: path.to_owned(),
            })?;
            let flen = fp.metadata()?.len();
            let m = unsafe {
                MmapOptions::new(flen as usize).and_then(|mo| {
                    mo.with_file(fp, 0)
                        .with_flags(MmapFlags::NO_CORE_DUMP)
                        .map()
                })
            }
            .map_err(|e| LoadError::MmapFailed { source: e })?;
            let mb = m.as_slice();
            let (i, hdr) = Self::parse_header(mb).unwrap();
            println!("Got: {hdr:?}");
            let (i, vocab) = Self::parse_vocabulary(i, &hdr).unwrap();
            println!(
                "Got: {} - {} - {}",
                vocab.max_token_length,
                vocab.id_to_token.len(),
                vocab.token_to_id.len()
            );
            Ok(())
        }
    }
}
```

I honestly don't really love writing parsers in Rust; it's so much nicer in Haskell. But I guess this is more readable than the current code. A long time ago I experimented with combining nom and monadic do-notation, but it wasn't really practical: https://github.com/KerfuffleV2/mdoexperiments

KerfuffleV2 avatar Mar 26 '23 23:03 KerfuffleV2

Along the lines of programmatic parsing, it might also be interesting to explore the use of https://github.com/jam1garner/binrw.

Not sure how that would impact parallel loading or #93, though.
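For a rough idea of the shape, here's just the hyperparameter block from `parse_hyperparameters` above as a binrw derive; this is only a sketch, with the magic/legacy handling and everything after the header left out:

```rust
use binrw::{io::Cursor, BinRead};

// The same seven little-endian i32 fields the nom version reads;
// n_ctx isn't stored in the file, so it's omitted here.
#[derive(BinRead, Debug)]
#[br(little)]
struct Hyperparameters {
    n_vocab: i32,
    n_embd: i32,
    n_mult: i32,
    n_head: i32,
    n_layer: i32,
    n_rot: i32,
    f16_: i32,
}

fn main() -> binrw::BinResult<()> {
    // Seven fields x 4 bytes of zeroed stand-in data, just to show the call shape.
    let bytes = [0u8; 28];
    let hp = Hyperparameters::read_le(&mut Cursor::new(&bytes[..]))?;
    println!("{hp:?}");
    Ok(())
}
```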

philpax avatar Apr 02 '23 10:04 philpax

Interesting. Weirdly enough, that actually only has limited support for non-stream sources (i.e. mmapped buffers). I don't know if the seek features would be necessary for handling the GGML format, but if so, the mapping would have to be wrapped in something stream-like rather than parsed as a plain slice.
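A minimal sketch of that wrapping, using only std: `Cursor` over the mapped slice provides both `Read` and `Seek`:

```rust
use std::io::Cursor;

// A mmapped file is visible as &[u8], and Cursor<&[u8]> implements both
// Read and Seek, so a stream-oriented parser can run over the mapping.
// Whether that keeps the benefits of mmapping (no up-front copies, lazy
// paging) depends on how much the parser copies out of the buffer.
fn stream_over_mapping(mapped: &[u8]) -> Cursor<&[u8]> {
    Cursor::new(mapped)
}
```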

KerfuffleV2 avatar Apr 02 '23 11:04 KerfuffleV2

Don't really need mmap. smol + nuclei + two fds should be enough.

iacore avatar Apr 08 '23 08:04 iacore

With mmap support, I'm not sure how relevant this is now; setting up the tensors no longer involves much actual work.

philpax avatar Apr 24 '23 01:04 philpax