Comprehensive Refactoring and Enhancement of Codebase

Open teleprint-me opened this issue 1 year ago • 30 comments

Summary:

  1. Refactoring and Code Organization: The codebase has been extensively refactored for better organization and readability. This includes abstracting model loading into a separate ModelLoader class, using package constants to replace hardcoded strings, and abstracting Chroma into ChromaDBLoader.

  2. Documentation and Commenting: Documentation across the codebase has been significantly improved, with docstrings added for better understanding of the code functionality. Comments have been cleaned up for better readability, and notes about handling safetensors have been moved for better visibility.

  3. Command-Line Interface (CLI) Improvements: The CLI has been enhanced for a better user experience. This includes abstracting CLI choices to package constants and adding command-line options for source_directory and persist_directory.

  4. Code Quality and Standards: Measures have been taken to improve code quality and adhere to coding standards. This includes adding mypy rules to settings for improved type checking, applying isort to imports for better code structure, and fixing model identifier references.

  5. Bug Fixes and Enhancements: A significant number of bugs were fixed during the refactoring process. Model constants were fixed to use appropriate identifiers, and a retrieval model was added to ChromaDBLoader for enhanced functionality.

This PR represents a comprehensive effort to improve the codebase's organization, documentation, user interface, code quality, and functionality.

Note: The line width may need to be set to 120. I now understand why it was originally set to 119 before I reset it to the default of 80. This issue will be addressed in future commits.

This PR addresses the following issues and PRs:

  • Issues: #147, #151, #157, #165, #171, #174, #175, #179
  • PR: #173

Furthermore, this PR aims to address the following issues:

  • #92, #108, #111, and others

teleprint-me avatar Jun 25 '23 22:06 teleprint-me

@PromtEngineer @LeafmanZ

It's ready for review and testing.

Let me know of any bugs, exceptions, etcetera.

I'll have the rest ironed out by tonight.

teleprint-me avatar Jun 28 '23 20:06 teleprint-me

Why is flake8 ignoring typing while something else replaces it with default annotations? It causes the YAML check to fail.

teleprint-me avatar Jun 28 '23 20:06 teleprint-me

I fixed the pre-commit hooks, restored the 120 line length limit, and it's passing now.

run.py should operate as expected. There are certain things I won't be able to test because I don't have the hardware specs to match.

teleprint-me avatar Jun 28 '23 23:06 teleprint-me

@teleprint-me thanks for the updates. I will have a look at it tonight. Just noticed one thing that is probably worth looking at: if someone tries to use another model_id that is not in CHOICE_MODEL_REPOSITORIES & CHOICE_MODEL_SAFETENSORS, I think it will run into an error, right?

We probably want to keep it flexible. Will test it out and see what I can find.
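
Something like a warn-and-continue fallback might be enough. A minimal sketch (the helper is hypothetical; it only assumes the curated list is an iterable of repo ids):

# Hypothetical sketch: don't hard-fail on a model_id outside the curated lists.
import logging


def resolve_model_repository(model_id: str, known_repositories) -> str:
    """Warn, rather than error, when model_id is not in the curated list."""
    if model_id not in known_repositories:
        logging.warning(
            "model_id %r is not in CHOICE_MODEL_REPOSITORIES; trying to load it anyway.",
            model_id,
        )
    return model_id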

Thanks,

PromtEngineer avatar Jun 29 '23 00:06 PromtEngineer

It's defaulting to CPU instead of GPU even though GPU is set. I'm looking into it.

As an FYI, the 7B HF models need ~48 GB of memory.

teleprint-me avatar Jun 29 '23 00:06 teleprint-me

@teleprint-me running into this while trying to run localGPT.run; here is the full trace:

(localgpt-dev) prompt@Prompts-MBP localgpt-dev % python -m localGPT.run --device_type cpu
2023-06-28 17:59:58,520 - INFO - model.py:81 - Using AutoModelForCausalLM for full models
2023-06-28 17:59:58,778 - INFO - model.py:84 - Configuration loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-28 18:01:40,540 - INFO - model.py:87 - Tokenizer loaded for TheBloke/vicuna-7B-1.1-HF
Traceback (most recent call last):
  File "/Users/prompt/anaconda3/envs/localgpt-dev/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/prompt/anaconda3/envs/localgpt-dev/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/prompt/Documents/GitHub/localgpt-dev/localGPT/run.py", line 179, in <module>
    main()
  File "/Users/prompt/anaconda3/envs/localgpt-dev/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/prompt/anaconda3/envs/localgpt-dev/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/prompt/anaconda3/envs/localgpt-dev/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/prompt/anaconda3/envs/localgpt-dev/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/prompt/Documents/GitHub/localgpt-dev/localGPT/run.py", line 146, in main
    llm = model_loader.load_model()
  File "/Users/prompt/Documents/GitHub/localgpt-dev/localGPT/model.py", line 199, in load_model
    model, tokenizer = self.load_huggingface_model()
  File "/Users/prompt/Documents/GitHub/localgpt-dev/localGPT/model.py", line 89, in load_huggingface_model
    model = AutoModelForCausalLM.from_pretrained(
TypeError: _BaseAutoModelClass.from_pretrained() missing 1 required positional argument: 'pretrained_model_name_or_path'
(localgpt-dev) prompt@Prompts-MBP localgpt-dev %

Same thing happens when I try mps. For some reason, it takes 2-3 minutes to load the model. I think we also need to add a check on the device_type when we are selecting which model to use.

PromtEngineer avatar Jun 29 '23 01:06 PromtEngineer

@PromtEngineer

It's because of the way I refactored the from_pretrained method call on L89.
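
A minimal reproduction of the failure (illustration only, not the actual model.py code):

# Illustration of the TypeError above: the repo id was dropped from the
# refactored call, so from_pretrained() never receives its required
# positional argument `pretrained_model_name_or_path`.
from transformers import AutoModelForCausalLM

repo = "TheBloke/vicuna-7B-1.1-HF"

# Broken (what the refactor effectively did):
# model = AutoModelForCausalLM.from_pretrained(low_cpu_mem_usage=True)

# Fixed: pass the repository string as the first positional argument.
model = AutoModelForCausalLM.from_pretrained(repo, low_cpu_mem_usage=True)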

20:24:17 | ~/Documents/code/git/localGPT
 git:(dev | Δ) λ python -m localGPT.run
2023-06-28 20:24:28,557 - INFO - model.py:86 - Using AutoModelForCausalLM for full models
2023-06-28 20:24:28,729 - INFO - model.py:89 - Configuration loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-28 20:27:44,412 - INFO - model.py:92 - Tokenizer loaded for TheBloke/vicuna-7B-1.1-HF
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [08:16<00:00, 248.17s/it]
2023-06-28 20:36:32,244 - INFO - model.py:104 - Model loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-28 20:36:32,249 - WARNING - model.py:109 - Model Weights Tied: Effectiveness depends on specific type of model.
2023-06-28 20:36:32,698 - WARNING - _cpp_lib.py:133 - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.0.1)
    Python  3.11.3 (you have 3.11.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
2023-06-28 20:36:34,114 - INFO - model.py:235 - Local LLM Loaded
2023-06-28 20:36:34,802 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-06-28 20:36:39,759 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-06-28 20:36:39,768 - INFO - json_impl.py:45 - Using ujson library for writing JSON byte strings
2023-06-28 20:36:39,936 - INFO - duckdb.py:506 - loaded in 360 embeddings
2023-06-28 20:36:39,938 - INFO - duckdb.py:518 - loaded in 1 collections
2023-06-28 20:36:39,941 - INFO - duckdb.py:107 - collection with name langchain already exists, returning existing collection
2023-06-28 20:36:39,946 - INFO - run.py:152 - Show Sources: False

Enter a query: What rights are protected by the First Amendment?
^C
Aborted!
2023-06-28 21:44:09,889 - INFO - duckdb.py:460 - Persisting DB to disk, putting it in the save folder: /home/austin/Documents/code/git/localGPT/DB

teleprint-me avatar Jun 29 '23 01:06 teleprint-me

@teleprint-me I think we will need to add the device_type check and default to load_huggingface_llama_model if the device_type is mps or cpu. That seems to work. I haven't tested it on cuda yet.

PromtEngineer avatar Jun 29 '23 02:06 PromtEngineer

@PromtEngineer

What arguments did you use? It errors out for me if I use device_type. The params are poorly defined and the source is deeply nested kwargs. It's turtles all the way down. There are modifications at each method/function call too.

21:45:58 | ~/Documents/code/git/localGPT
 git:(dev | Δ) λ python -m localGPT.run
2023-06-28 21:47:10,837 - INFO - model.py:89 - Using AutoModelForCausalLM for full models
2023-06-28 21:47:11,018 - INFO - model.py:92 - Configuration loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-28 21:50:16,740 - INFO - model.py:95 - Tokenizer loaded for TheBloke/vicuna-7B-1.1-HF
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/austin/Documents/code/git/localGPT/localGPT/run.py", line 179, in <module>
    main()
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/Documents/code/git/localGPT/localGPT/run.py", line 146, in main
    llm = model_loader.load_model()
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/Documents/code/git/localGPT/localGPT/model.py", line 219, in load_model
    model, tokenizer = self.load_huggingface_model()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/Documents/code/git/localGPT/localGPT/model.py", line 97, in load_huggingface_model
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2675, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'device_type'
21:50:18 | ~/Documents/code/git/localGPT
 git:(dev | Δ) λ python -m localGPT.run
2023-06-28 21:57:37,620 - INFO - model.py:89 - Using AutoModelForCausalLM for full models
2023-06-28 21:57:37,778 - INFO - model.py:92 - Configuration loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-28 22:00:50,870 - INFO - model.py:95 - Tokenizer loaded for TheBloke/vicuna-7B-1.1-HF
Loading checkpoint shards:   0%|                                                                                                                                                                                        | 0/2 [00:21<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/austin/Documents/code/git/localGPT/localGPT/run.py", line 179, in <module>
    main()
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/Documents/code/git/localGPT/localGPT/run.py", line 146, in main
    llm = model_loader.load_model()
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/Documents/code/git/localGPT/localGPT/model.py", line 221, in load_model
    model, tokenizer = self.load_huggingface_model()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/Documents/code/git/localGPT/localGPT/model.py", line 97, in load_huggingface_model
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3228, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 720, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/austin/.local/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 167, in set_module_tensor_to_device
    new_value = value.to(device)
                ^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 86.00 MiB (GPU 0; 8.00 GiB total capacity; 7.91 GiB already allocated; 92.00 MiB free; 7.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

The second run uses the correct argument, which is device_map, not device_type. I knew the model would fail to load, but I needed to test it to see whether it would register the GPU.
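
For clarity, the two calls stripped down to the essentials (illustration only, not the project code):

# device_type vs device_map, reduced to the essentials.
from transformers import AutoModelForCausalLM

repo = "TheBloke/vicuna-7B-1.1-HF"

# First run: `device_type` is not a recognized kwarg, so it falls through to
# LlamaForCausalLM.__init__() and raises the TypeError above.
# model = AutoModelForCausalLM.from_pretrained(repo, device_type="cuda")

# Second run: `device_map` is the kwarg transformers/accelerate actually accept.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")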

teleprint-me avatar Jun 29 '23 02:06 teleprint-me

I missed what you were originally stating with your reference to load_huggingface_llama_model and confused it with load_huggingface_model. I'll look into it.

teleprint-me avatar Jun 29 '23 03:06 teleprint-me

@teleprint-me we probably want to simplify the implementation a bit more.

I was using the following parameters:

python -m localGPT.run --device_type mps and python -m localGPT.run --device_type cpu

The readme states to use device_type. Not sure I am following the device_map. Where do we have to set this?

PromtEngineer avatar Jun 29 '23 03:06 PromtEngineer

@PromtEngineer

It's the mock-up code from the original run script: auto#transformers.AutoModelForCausalLM.from_pretrained

# run_localGPT.py
def load_model(device_type, model_id, model_basename=None):
# other source
    elif (
        device_type.lower() == "cuda"
    ):  # The code supports all huggingface models that end with -HF or which have a .bin
        # file in their HF repo.
        logging.info("Using AutoModelForCausalLM for full models")
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        logging.info("Tokenizer loaded")

        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
            # max_memory={0: "15GB"} # Uncomment this line if you encounter CUDA out of memory errors
        )
        model.tie_weights()
# rest of source

Which I generalized for Hugging Face models:

# model.py
class ModelLoader:
    def load_huggingface_model(self):
        """
        Loads a full model for text generation.

        Returns:
        - model: The loaded full model.
        - tokenizer: The tokenizer associated with the model.
        """
        logging.info("Using AutoModelForCausalLM for full models")

        config = AutoConfig.from_pretrained(self.model_repository)
        logging.info(f"Configuration loaded for {self.model_repository}")

        tokenizer = AutoTokenizer.from_pretrained(self.model_repository)
        logging.info(f"Tokenizer loaded for {self.model_repository}")

        kwargs: dict[str, object] = {
            "low_cpu_mem_usage": True,
            "resume_download": True,
            "trust_remote_code": False,
            # NOTE: Uncomment this line if you encounter CUDA out of memory errors
            # "max_memory": {0: "7GB"},
            # NOTE: According to the Hugging Face documentation, `output_loading_info` is
            # for when you want to return a tuple with the pretrained model and a dictionary
            # containing the loading information.
            # "output_loading_info": True,
        }

        if self.device_type != "cpu":
            kwargs["device_map"] = self.device_type
            kwargs["torch_dtype"] = torch.float16

        try:
            model = AutoModelForCausalLM.from_pretrained(self.model_repository, config=config, **kwargs)
        except (OutOfMemoryError,) as e:
            logging.error("Encountered CUDA out of memory error while loading the model.")
            logging.error(str(e))
            sys.exit(1)

        logging.info(f"Model loaded for {self.model_repository}")

        if not isinstance(model, tuple):
            model.tie_weights()
            logging.warning("Model Weights Tied: Effectiveness depends on the specific type of model.")

        return model, tokenizer

The model is loaded via a string, not a device type:

    def load_model(self):
        """
        Loads the appropriate model based on the configuration.

        Returns:
        - local_llm: The loaded local language model (LLM).
        """
        # NOTE: This should be replaced with mapping for smooth extensibility
        if self.model_type.lower() == "huggingface":
            model, tokenizer = self.load_huggingface_model()
        elif self.model_type.lower() == "huggingface-llama":
            model, tokenizer = self.load_huggingface_llama_model()
        elif self.model_type.lower() == "gptq":
            model, tokenizer = self.load_gptq_model()
        elif self.model_type.lower() == "ggml":
            raise NotImplementedError("GGML support is in research and development")
        else:
            raise AttributeError(
                "Unsupported model type given. "
                "Expected one of: "
                "huggingface, "
                "huggingface-llama, "
                "ggml, "
                "gptq"
            )

        # Load configuration from the model to avoid warnings
        generation_config = GenerationConfig.from_pretrained(self.model_repository)
        # see here for details:
        # https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.from_pretrained.returns

        # Create a pipeline for text generation
        local_llm = self.create_pipeline(model, tokenizer, generation_config)

        logging.info("Local LLM Loaded")

        return local_llm

The device type depends on the required parameters, so it comes down to the class, the from_pretrained method parameters, and what they accept and how.
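
As an aside on the # NOTE in load_model above, the if/elif chain could be swapped for a lookup table. A rough sketch only, reusing the current ModelLoader method names:

# Sketch: replace the if/elif chain with a mapping for smoother extensibility.
class ModelLoader:
    # ... existing __init__ and loader methods assumed unchanged ...

    def _select_loader(self):
        loaders = {
            "huggingface": self.load_huggingface_model,
            "huggingface-llama": self.load_huggingface_llama_model,
            "gptq": self.load_gptq_model,
        }
        try:
            return loaders[self.model_type.lower()]
        except KeyError:
            raise AttributeError(
                f"Unsupported model type {self.model_type!r}. "
                f"Expected one of: {', '.join(loaders)}"
            ) from None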

teleprint-me avatar Jun 29 '23 03:06 teleprint-me

It should work for HF models now:

00:03:02 | ~/Documents/code/git/localGPT
 git:(dev | θ) λ python -m localGPT.run    
2023-06-29 00:03:09,757 - INFO - model.py:81 - Using AutoModelForCausalLM for full models
2023-06-29 00:03:09,908 - INFO - model.py:84 - Configuration loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-29 00:06:10,480 - INFO - model.py:87 - Tokenizer loaded for TheBloke/vicuna-7B-1.1-HF
Loading checkpoint shards:   0%|                                                                                                                                                                                        | 0/2 [00:03<?, ?it/s]
2023-06-29 00:06:13,861 - ERROR - model.py:108 - Encountered CUDA out of memory error while loading the model.
2023-06-29 00:06:13,861 - ERROR - model.py:109 - HIP out of memory. Tried to allocate 86.00 MiB (GPU 0; 8.00 GiB total capacity; 7.91 GiB already allocated; 92.00 MiB free; 7.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
00:06:15 | ~/Documents/code/git/localGPT
 git:(dev | θ) λ python -m localGPT.run --device_type cpu
2023-06-29 00:06:48,927 - INFO - model.py:81 - Using AutoModelForCausalLM for full models
2023-06-29 00:06:49,080 - INFO - model.py:84 - Configuration loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-29 00:09:55,380 - INFO - model.py:87 - Tokenizer loaded for TheBloke/vicuna-7B-1.1-HF
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:28<00:00, 14.45s/it]
2023-06-29 00:10:24,506 - INFO - model.py:112 - Model loaded for TheBloke/vicuna-7B-1.1-HF
2023-06-29 00:10:24,508 - WARNING - model.py:116 - Model Weights Tied: Effectiveness depends on the specific type of model.
2023-06-29 00:10:24,765 - WARNING - _cpp_lib.py:133 - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.0.1)
    Python  3.11.3 (you have 3.11.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
2023-06-29 00:10:25,872 - INFO - model.py:242 - Local LLM Loaded
2023-06-29 00:10:26,667 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-06-29 00:10:31,503 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-06-29 00:10:31,512 - INFO - json_impl.py:45 - Using ujson library for writing JSON byte strings
2023-06-29 00:10:31,675 - INFO - duckdb.py:506 - loaded in 360 embeddings
2023-06-29 00:10:31,677 - INFO - duckdb.py:518 - loaded in 1 collections
2023-06-29 00:10:31,679 - INFO - duckdb.py:107 - collection with name langchain already exists, returning existing collection
2023-06-29 00:10:31,680 - INFO - run.py:152 - Show Sources: False

Enter a query: ^C
Aborted!
2023-06-29 00:11:01,189 - INFO - duckdb.py:460 - Persisting DB to disk, putting it in the save folder: /home/austin/Documents/code/git/localGPT/DB

teleprint-me avatar Jun 29 '23 04:06 teleprint-me

@teleprint-me thanks, it works for cpu; however, using load_huggingface_model for mps doesn't work. Need to look into why. MPS seems to work with load_huggingface_llama_model. Can we have a check in the load_model function that uses load_huggingface_llama_model if the device_type is mps?
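
Something along these lines, perhaps (only a sketch, reusing the existing method names; just the mps special case is new):

# Hypothetical routing inside ModelLoader.load_model().
def load_model(self):
    if self.model_type.lower() == "huggingface":
        if self.device_type == "mps":
            # load_huggingface_model currently fails on mps; the llama loader works.
            model, tokenizer = self.load_huggingface_llama_model()
        else:
            model, tokenizer = self.load_huggingface_model()
    # ... rest unchanged ...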

PromtEngineer avatar Jun 29 '23 04:06 PromtEngineer

@PromtEngineer

Start here: https://github.com/huggingface/transformers/blob/v4.30.0/src/transformers/models/auto/auto_factory.py#L432

Just follow the references on your local machine starting from source if you want. That's how I did it.

In most cases, you'll find that cuda and cpu are the only real options. For example, I don't need to run it with hip; if I did, it would raise an error complaining that it hadn't been compiled with hip and exit, even though that isn't true at all. So I use the cuda device type instead.

The underlying torch library that's buried in there would normally allow us to do it with .to(device_type), moving the model into the supported device's memory.
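
For reference, the plain torch call in question (a trivial example, not project code):

# Plain PyTorch: place a model's parameters on a specific device explicitly.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(16, 4).to(device)  # copies the weights into device memory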

We don't have access to it because it's so abstracted by wrappers. It's layers and layers of the same code stacked on top of everything else. It's kind of crazy actually.

Like, I'd rather just deal with the raw code myself (numpy, scikit, pytorch) and be done with it. I get the appeal of ggml more and more.

teleprint-me avatar Jun 29 '23 04:06 teleprint-me

@teleprint-me I agree, let me have a look at it. Right now we need the code to work on cuda, cpu, and mps, as users are expected to have one of them.

I will have a look at it and see how we can support it. Can you look at #198 and see if that's something we can integrate for ggml support?

PromtEngineer avatar Jun 29 '23 05:06 PromtEngineer

I tend to stay away from Apple hardware and software, so I'm not much help there, unfortunately. I understand the basics of Mac OS X, but that's pretty much it; it's not my specialty.

As for cuda, it all defaults to cuda, so cuda's not even a consideration here.

CPU is a challenge that can be mitigated. I already found a solution for bitsandbytes; I just haven't had the time or desire to implement it.

ROCm kind of works; I'm still figuring that one out. I did get it to work with the original run script, so I'm a bit bummed that I couldn't get GPTQ to work, because I could fit a 7B 8-bit or 4-bit quant model on my RX 580. I changed something, and it's been bugging me because I can't figure out what I'm missing.

Also, I have other projects I'm eager to work on, like my pygptprompt cli tool.

For the most part, we're actually a bit ahead of where I started originally, so that's some progress.

I'll check out the llama code since I have the weights already. I haven't upgraded my GPU because I'm still researching and need to figure out some other stuff first; then I can go for a 4090.

We can use a CMakeLists.txt and build a Makefile based on the OS to handle different environments. Windows users would be using WSL, so that leaves Linux and Mac. It's a potential solution.

teleprint-me avatar Jun 29 '23 05:06 teleprint-me

@PromtEngineer @LeafmanZ

Off to a great start!

03:06:16 | ~/Documents/code/git/localGPT
 git:(dev | Δ) λ python -m localGPT.ggml
llama.cpp: loading model from MODELS/vicuna-7B-1.1-GPTQ-4bit-128g-GGML/vicuna-7B-1.1-GPTQ-4bit-128g.GGML.bin
error loading model: unexpectedly reached end of file
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/austin/Documents/code/git/localGPT/localGPT/ggml.py", line 3, in <module>
    llm = Llama(model_path="MODELS/vicuna-7B-1.1-GPTQ-4bit-128g-GGML/vicuna-7B-1.1-GPTQ-4bit-128g.GGML.bin", low_vram=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/lib/python3.11/site-packages/llama_cpp/llama.py", line 286, in __init__
    assert self.model is not None
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError
Exception ignored in: <function Llama.__del__ at 0x7f1eda3094e0>
Traceback (most recent call last):
  File "/home/austin/.local/lib/python3.11/site-packages/llama_cpp/llama.py", line 1445, in __del__
    if self.ctx is not None:
       ^^^^^^^^
AttributeError: 'Llama' object has no attribute 'ctx'

teleprint-me avatar Jun 29 '23 07:06 teleprint-me

Success!

03:48:38 | ~/Documents/code/git/localGPT
 git:(dev | Δ) λ python -m localGPT.ggml --low_vram True --text_input "Hello! What is your name?"
llama.cpp: loading model from MODELS/orca_mini_7B-GGML/orca-mini-7b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time = 11254.73 ms
llama_print_timings:      sample time =     7.93 ms /    16 runs   (    0.50 ms per token,  2018.67 tokens per second)
llama_print_timings: prompt eval time = 11254.66 ms /    48 tokens (  234.47 ms per token,     4.26 tokens per second)
llama_print_timings:        eval time =  4107.84 ms /    15 runs   (  273.86 ms per token,     3.65 tokens per second)
llama_print_timings:       total time = 15405.64 ms
{'id': 'cmpl-9caa15ed-635a-4bda-bccb-5afb8b0504fe', 'object': 'text_completion', 'created': 1688024936, 'model': 'MODELS/orca_mini_7B-GGML/orca-mini-7b.ggmlv3.q4_0.bin', 'choices': [{'text': '### System:\n    You are an AI assistant that follows instruction extremely well. Help as much as you can.\n    ### User:\n    Hello! What is your name?### Response: Hello! My name is AI Assistant. How may I assist you today?', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 48, 'completion_tokens': 15, 'total_tokens': 63}}

teleprint-me avatar Jun 29 '23 07:06 teleprint-me

I've been doing some thinking about our ModelLoader class. Using AutoModel with from_pretrained for differing class definitions, dealing with all sorts of wrappers, libraries, and runners is getting complex quickly. Maybe we could streamline things a bit?

Some thoughts:

  • What if we use a Factory pattern? We tweak run.py to be more of a hook and use a module or CLI script for model loading and inferencing. Kind of like rearranging our pipeline for better flow.

  • How about considering a plugin-based architecture? Each model could be its own module that we can develop, maintain, and plug in as needed. This might keep the main app clean and more flexible.

  • I've also been pondering over our CLI. Perhaps the Command pattern could help us simplify it? Each action gets its own command, reducing confusion.

  • And I've been mulling over Dependency Injection. It could be beneficial for testing and might give us more flexibility and less hard-coding.

On the innovation side, I've been experimenting with the llama-cpp-python implementation, which wraps the underlying ggml C/C++ library. I haven't integrated it yet, but it's a good example of something that could potentially break our conventional patterns.

Thinking along these lines, perhaps we could introduce unique scripts like hf.py, gptq.py, llama.py, and ggml.py for different models and types. This could handle the nuances of each type, including specifying the hardware device to load the model to, handling abstractions, or dealing with requirements like pre-compilation in the future.
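
To make that concrete, here is a rough sketch of how those per-format modules could register themselves. All names here are hypothetical; nothing like this exists in the codebase yet:

# Hypothetical plugin registry shared by hf.py, gptq.py, llama.py, ggml.py, ...
from typing import Callable, Dict

MODEL_LOADERS: Dict[str, Callable[..., object]] = {}


def register_loader(name: str) -> Callable:
    """Decorator each plugin module uses to register its loader under a format name."""
    def decorator(func: Callable[..., object]) -> Callable[..., object]:
        MODEL_LOADERS[name] = func
        return func
    return decorator


# Example plugin (e.g. in a future ggml.py):
@register_loader("ggml")
def load_ggml_model(model_path: str, **kwargs):
    from llama_cpp import Llama  # optional dependency kept inside the plugin
    return Llama(model_path=model_path, **kwargs)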

I know we're all keen to merge this branch and keep things moving. It's a bit of pressure, especially for a hobbyist like myself. So how about we consider the current design as a stepping stone? It's a solid base, but it also leaves room for others to contribute, to bring fresh ideas to the table, e.g. #198.

I'm also toying with the idea of a Test-Driven Development (TDD) approach. Might help to make our core code more reliable. What do you guys think?

The current plan is to prepare and commit what we have, then keep refining from there.

I'm eager to hear your thoughts on these suggestions.

teleprint-me avatar Jun 29 '23 20:06 teleprint-me

I'm also working on an hf_hub_download implementation to automate retrieval of GGML models.
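
The gist of it is a sketch like this (repo id and filename match the log below; the rest is assumed):

# Sketch: resolve/download a GGML file from the Hub cache, then load it with
# llama-cpp-python.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/orca_mini_7B-GGML",
    filename="orca-mini-7b.ggmlv3.q4_0.bin",
)
llm = Llama(model_path=model_path, low_vram=True)
print(llm("Hello! What is your name?", max_tokens=32))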

17:16:53 | ~/Documents/code/git/localGPT
 git:(dev | Δ) λ python -m localGPT.ggml --text_input "Hello! What is your name?"   
2023-06-29 17:17:00,028 - INFO - ggml.py:35 - Using /home/austin/.cache/huggingface/hub/models--TheBloke/orca_mini_7B-GGML/orca-mini-7b.ggmlv3.q4_0.bin to load TheBloke/orca_mini_7B-GGML
2023-06-29 17:17:00,028 - INFO - ggml.py:38 - Model not found locally. Downloading TheBloke/orca_mini_7B-GGML from HuggingFace Model Hub.
Downloading (…)i-7b.ggmlv3.q4_0.bin: 100%|███████████████████████████████████████████████████████████████████████████████| 3.79G/3.79G [03:10<00:00, 19.9MB/s]
2023-06-29 17:20:10,947 - INFO - ggml.py:44 - Using /home/austin/.cache/huggingface/hub/models--TheBloke--orca_mini_7B-GGML/snapshots/709dfca2e5523319777fd59fa522ea4e32a33d93/orca-mini-7b.ggmlv3.q4_0.bin to load TheBloke/orca_mini_7B-GGML into memory
llama.cpp: loading model from /home/austin/.cache/huggingface/hub/models--TheBloke--orca_mini_7B-GGML/snapshots/709dfca2e5523319777fd59fa522ea4e32a33d93/orca-mini-7b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time = 11105.36 ms
llama_print_timings:      sample time =    12.29 ms /    26 runs   (    0.47 ms per token,  2115.02 tokens per second)
llama_print_timings: prompt eval time = 11105.31 ms /    49 tokens (  226.64 ms per token,     4.41 tokens per second)
llama_print_timings:        eval time =  6661.96 ms /    25 runs   (  266.48 ms per token,     3.75 tokens per second)
llama_print_timings:       total time = 17836.58 ms
{'id': 'cmpl-cff27ccf-2cef-4ac3-b1a6-2c8dcdca7074', 'object': 'text_completion', 'created': 1688073611, 'model': '/home/austin/.cache/huggingface/hub/models--TheBloke--orca_mini_7B-GGML/snapshots/709dfca2e5523319777fd59fa522ea4e32a33d93/orca-mini-7b.ggmlv3.q4_0.bin', 'choices': [{'text': '### System:\nMy name is Orca. I am an AI assistant that follows instruction extremely well. I am a very helpful assistant.\n\n### User:\nHello! What is your name?\n\n### Response: Hello! My name is Orca. I am an AI assistant designed to help you with various tasks and answer your questions.', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 49, 'completion_tokens': 25, 'total_tokens': 74}}

Still working out the kinks though.

teleprint-me avatar Jun 29 '23 21:06 teleprint-me

@teleprint-me I appreciate all the efforts you are putting into this. Grateful to you and others for that.

I like the idea of plugin-based architecture. That will make things more streamlined. There are so many things we can do but let's do them one step at a time. I agree, let's get this one out and then we can keep improving on top of it.

I like your suggestions and these can be integrated later on. Will be great to see how others contribute to it.

I want to spend some time testing and revising the code over the weekend and will merge it either on Sunday or Monday.

PromtEngineer avatar Jun 30 '23 06:06 PromtEngineer

@teleprint-me I was testing it after your recent changes, and it seems it's not able to create the index. I am getting the following error trace:

Enter a query: What is the term limit of the president?
Traceback (most recent call last):
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/prompt/Documents/GitHub/local_dev/localGPT/run.py", line 180, in <module>
    main()
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/prompt/Documents/GitHub/local_dev/localGPT/run.py", line 161, in main
    res = qa(query)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/chains/base.py", line 140, in __call__
    raise e
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/chains/base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 119, in _call
    docs = self._get_docs(question)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 181, in _get_docs
    return self.retriever.get_relevant_documents(question)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/vectorstores/base.py", line 376, in get_relevant_documents
    docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 182, in similarity_search
    docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 230, in similarity_search_with_score
    results = self.__query_collection(
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/utils.py", line 53, in wrapper
    return func(*args, **kwargs)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 121, in __query_collection
    return self._collection.query(
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 219, in query
    return self._client._query(
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/chromadb/api/local.py", line 408, in _query
    uuids, distances = self._db.get_nearest_neighbors(
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/chromadb/db/clickhouse.py", line 583, in get_nearest_neighbors
    uuids, distances = index.get_nearest_neighbors(embeddings, n_results, ids)
  File "/Users/prompt/anaconda3/envs/local_dev/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py", line 230, in get_nearest_neighbors
    raise NoIndexException(
chromadb.errors.NoIndexException: Index not found, please create an instance before querying

In the DB folder, I don't see the index folder. When you get a chance, please have a look.

PromtEngineer avatar Jul 03 '23 08:07 PromtEngineer

@PromtEngineer Can you provide the command line and options you used? It's difficult to say without that information. I would need to see the full output.

teleprint-me avatar Jul 03 '23 08:07 teleprint-me

I tested it with device_type cpu and mps; I have yet to try it with cuda.


PromtEngineer avatar Jul 03 '23 15:07 PromtEngineer

Have you tried deleting the DB directory and starting fresh?

teleprint-me avatar Jul 03 '23 15:07 teleprint-me

Yes, that didn't work. I was able to run an earlier commit.


PromtEngineer avatar Jul 03 '23 15:07 PromtEngineer

Which commit?

teleprint-me avatar Jul 03 '23 15:07 teleprint-me

That was the one before the restructuring of the DB into a package. Will look it up when I have access to my laptop.


PromtEngineer avatar Jul 03 '23 15:07 PromtEngineer

I'll triple check it.

It worked the last time I tested it, but I knew that wouldn't guarantee functionality in a general sense. That's why I needed people to test it. That, and I'm genuinely bound to lower-end models because my specs are out of date. Personal circumstances restricted my finances, so I won't be upgrading anytime soon. It's why I spent so much time figuring out llama-cpp-python over the past few days.

I'll be mostly focused on GGML and quantized models from this point forward, or a full model with fewer than 1.8B parameters, because that's all my current GPU can handle.

I got GPTQ to work for a day and haven't been able to figure out what went wrong with it. I'll revisit that in the future when I have more time.

One thing is clear: I'll be focused on smaller models for lower-end hardware.

The moral of the story is that the only way I'll be able to test this effectively is to set up the ggml script, plug it into langchain, and pass it as the argument, which I'm currently working on in the pygptprompt project.

I'll port, refactor, and update the source once it's ready. I don't know when that will be. I'm on my fifth cup of coffee and pulled an all-nighter. I'm really motivated to get basic functionality out of it.

I provided the results from my sessions with the Orca Mini model in the docs part of my repo if you're curious.

I'll update you regardless and keep in touch; let me know if anything comes up.

teleprint-me avatar Jul 03 '23 15:07 teleprint-me