FlagAI
[Question]: What's the difference between your model and LLaMA?
Description
I have some questions about AQUILA:
- What's the difference between your model and LLaMA?
- Would you like to share a performance comparison with the original LLaMA?
- Is it a re-implementation of the original LLaMA with a complete training recipe?
- Does it violate the original license of LLaMA?
The configuration seems to be a copy of the Hugging Face LLaMA config:
- Original LLaMA config from the Hugging Face repo: LLaMA config
- AQUILA config https://github.com/FlagAI-Open/FlagAI/blob/0634ab460d4632a4d0cffd3df4c8ceed88846a42/flagai/model/aquila_model.py#L20
class AQUILAConfig(dict):
    r"""
    This is the configuration class to store the configuration of a [`~LLaMAModel`]. It is used to instantiate an LLaMA
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the LLaMA-7B.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`~LLaMAModel`] or [`~TFLLaMAModel`].
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings(`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings

    Example:

    ```python
    >>> from transformers import LLaMAModel, LLaMAConfig
    >>> # Initializing a LLaMA aquila-7b style configuration
    >>> configuration = LLaMAConfig()
    >>> # Initializing a model from the aquila-7b style configuration
    >>> model = LLaMAModel(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
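For reference, every default documented in this docstring has a one-to-one counterpart in Hugging Face's LLaMA configuration. The snippet below is only an illustrative sketch of that correspondence; it assumes a recent `transformers` release that exports the class as `LlamaConfig` (the `LLaMAModel`/`LLaMAConfig` names quoted above come from a pre-release version), and it simply feeds in the values listed in the docstring:

```python
# Illustrative sketch: reproduce the defaults quoted in the AQUILAConfig
# docstring with Hugging Face's LlamaConfig (assumes a transformers release
# that exports the class as `LlamaConfig`).
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32000,         # vocabulary size
    hidden_size=4096,         # dimension of the hidden representations
    intermediate_size=11008,  # dimension of the MLP representations
    num_hidden_layers=32,     # number of hidden layers
    num_attention_heads=32,   # attention heads per layer
    hidden_act="silu",        # non-linear activation
    initializer_range=0.02,   # std of the truncated-normal initializer
    rms_norm_eps=1e-12,       # RMS norm epsilon, as documented above
    use_cache=True,           # return key/value cache
    tie_word_embeddings=False,
)
print(config)
```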
The Aquila language model inherits the architectural design advantages of GPT-3, LLaMA, and others. It replaces a set of underlying operators with more efficient implementations, redesigns and implements a bilingual Chinese-English tokenizer, and upgrades the BMTrain parallel training method, reaching training efficiency nearly 8 times that of Megatron + DeepSpeed ZeRO-2 during Aquila's training. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data quality control and various training optimizations, it achieves better performance than other open-source models with a smaller dataset and shorter training time. It is also the first large-scale open-source language model that supports bilingual Chinese-English knowledge, offers a commercial license agreement, and complies with domestic (Chinese) data compliance requirements.
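For anyone who wants to inspect the bilingual tokenizer or the FlagAI implementation directly, the usual FlagAI loading pattern looks like the sketch below. The `task_name` and `model_name` strings and the `encode` call are assumptions on my part; please check the Aquila examples in the FlagAI repository for the exact identifiers and interfaces.

```python
# Minimal sketch of FlagAI's generic AutoLoader pattern applied to Aquila.
# The task_name / model_name values and the tokenizer.encode call are
# assumptions; see the Aquila examples in the FlagAI repository for the
# exact identifiers.
from flagai.auto_model.auto_loader import AutoLoader

loader = AutoLoader(
    task_name="lm",          # assumed task identifier for causal language modeling
    model_name="aquila-7b",  # assumed checkpoint name for the 7B Aquila model
)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

# Inspect how the bilingual tokenizer splits a mixed Chinese-English prompt.
token_ids = tokenizer.encode("Aquila 是一个中英双语的语言模型。")
print(token_ids)
```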
A performance comparison with LLaMA will be published later on the FlagEval leaderboard (https://flageval.baai.ac.cn/).
What is the difference between Aquila's training data distribution and LLaMA's? We used 40% Chinese data in the pre-training corpus, with an English-to-Chinese ratio of about 2:1.
We did not use any implementation code from LLaMA. Compared to the LLaMA paper, our implementation differs in many respects; the Aquila code is built on the FlagAI model design. The Aquila weights were trained from scratch by BAAI and are released under the "BAAI Aquila Model License", an agreement that is unrelated to the LLaMA license.
Closing this for now; please reopen the issue if there are any further questions. Thanks.