BB3-30B outputs gibberish
**Bug description**
Running `parlai interactive` on BB3-30B produces gibberish output.
**Reproduction steps**
After resharding the BlenderBot3 30B `consolidated.pt` checkpoint into two parts (`reshard-model_part-0.pt` and `reshard-model_part-1.pt`), I run `metaseq-api-local` with the following `constants.py` in metaseq. I am running the model on 2 A100-40GB GPUs.
```python
import os

MAX_SEQ_LEN = 1024
BATCH_SIZE = 64  # silly high bc we dynamically batch by MAX_BATCH_TOKENS
MAX_BATCH_TOKENS = 1024
DEFAULT_PORT = 6010
MODEL_PARALLEL = 2
TOTAL_WORLD_SIZE = 2
MAX_BEAM = 4

try:
    # internal logic denoting where checkpoints are in meta infrastructure
    from metaseq_internal.constants import CHECKPOINT_FOLDER
except ImportError:
    CHECKPOINT_FOLDER = "/path/to/bb3_30B/resharded"

# tokenizer files
BPE_MERGES = os.path.join(CHECKPOINT_FOLDER, "gpt2-merges.txt")
BPE_VOCAB = os.path.join(CHECKPOINT_FOLDER, "gpt2-vocab.json")
MODEL_FILE = os.path.join(CHECKPOINT_FOLDER, "reshard.pt")

LAUNCH_ARGS = [
    f"--model-parallel-size {MODEL_PARALLEL}",
    f"--distributed-world-size {TOTAL_WORLD_SIZE}",
    "--task language_modeling",
    f"--bpe-merges {BPE_MERGES}",
    f"--bpe-vocab {BPE_VOCAB}",
    "--bpe hf_byte_bpe",
    f"--merges-filename {BPE_MERGES}",  # TODO(susanz): hack for getting interactive_hosted working on public repo
    f"--vocab-filename {BPE_VOCAB}",  # TODO(susanz): hack for getting interactive_hosted working on public repo
    f"--path {MODEL_FILE}",
    "--beam 1 --nbest 1",
    "--distributed-port -1",
    "--checkpoint-shard-count 1",
    "--use-sharded-state",
    f"--batch-size {BATCH_SIZE}",
    f"--buffer-size {BATCH_SIZE * MAX_SEQ_LEN}",
    f"--max-tokens {BATCH_SIZE * MAX_SEQ_LEN}",
    "/tmp",  # required "data" argument.
]
```
Once the metaseq API server is running, I run `parlai interactive --init-opt gen/opt_bb3 --loglevel debug --opt-server http://10.233.96.198:6010/ --raw-search-server RELEVANT_SEARCH_SERVER`, as given in https://parl.ai/projects/bb3/. In interactive mode, I type "Hi!" and the response I get is gibberish, e.g. "analysepolit, the role Superman simplicityurger, thank Daisy personal life and hit the area stop. Coord [...]".
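For what it's worth, the raw completions can be reproduced without ParlAI by posting to the hosted server directly. A minimal sketch, assuming the server exposes the same `/completions` route that the ParlAI client calls (the payload shape follows the "Making request" lines in the logs below):

```python
# Query the metaseq server directly, bypassing ParlAI, to check whether the
# gibberish originates in metaseq itself. Route and payload shape are
# assumptions based on the ParlAI request logs below.
import requests

resp = requests.post(
    "http://10.233.96.198:6010/completions",
    json={
        "prompt": "Person 1: Hi!\nSearch Decision:",
        "min_tokens": 1,
        "max_tokens": 10,
        "temperature": 1.0,
    },
)
print(resp.json()["choices"][0]["text"])
```

If this also returns gibberish, ParlAI is out of the picture.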
**Expected behavior**

Expected BB3-30B to produce a reasonable, low-perplexity response.
**Logs**

Metaseq log:

```
2022-11-02 04:39:07 | INFO | metaseq.distributed.utils | initialized host bb3-30b-0 as rank 0
2022-11-02 04:39:07 | INFO | metaseq.distributed.utils | initialized host bb3-30b-0 as rank 1
2022-11-02 04:39:13 | INFO | metaseq.distributed.utils | cfg.common.model_parallel_size: 2
> initializing tensor model parallel with size 2
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-11-02 04:39:14 | INFO | metaseq.hub_utils | loading model(s) from /home/jovyan/vol-1/bb3_30B/resharded/reshard.pt
2022-11-02 04:39:28 | INFO | metaseq.checkpoint_utils | Loading 2 on 1 DDP workers: 2 files per worker.
2022-11-02 04:39:54 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-11-02 04:39:55 | INFO | metaseq.modules.fused_bias_gelu | Compiling and loading fused kernels
NOTE: If this hangs here, your megatron fused kernels may be corrupted. This can happen if a previous job is interrupted during a build. In that case, delete the megatron build directory and relaunch training. The megatron build directory is located at: /home/jovyan/vol-1/dependencies/Megatron-LM/megatron/fused_kernels/build
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/vol-1/dependencies/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/vol-1/dependencies/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/vol-1/dependencies/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
2022-11-02 04:40:24 | INFO | metaseq.modules.fused_bias_gelu | Done with compiling and loading fused kernels.
2022-11-02 04:40:35 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-11-02 04:40:36 | INFO | metaseq.cli.interactive | loaded model 0
2022-11-02 04:40:37 | INFO | metaseq.cli.interactive | Worker engaged! 10.233.96.198:6010
* Serving Flask app 'metaseq.cli.interactive_hosted' (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
2022-11-02 04:40:37 | WARNING | werkzeug | * Running on all addresses.
WARNING: This is a development server. Do not use it in a production deployment.
2022-11-02 04:40:37 | INFO | werkzeug | * Running on http://10.233.96.198:6010/ (Press CTRL+C to quit)
```

ParlAI command log:

```
Enter Your Message: Hi!
04:44:20 | ['Person 1: Hi!\nSearch Decision:']
04:44:20 | Making request: {'prompt': 'Person 1: Hi!\nSearch Decision:', 'min_tokens': 1, 'max_tokens': 10, 'best_of': 1, 'top_p': -1.0, 'stop': '\n', 'temperature': 1.0, 'echo': False, 'lambda_decay': -1, 'omega_bound': 0.3, 'alpha_presence': 0, 'alpha_frequency': 0, 'alpha_presence_src': 0, 'alpha_frequency_src': 0, 'alpha_src_penalty_end_idx': -1}
04:44:21 | GPT-Z response: {"choices": [{"logprobs": {"finish_reason": "length", "text_offset": [0, 6, 7, 9, 13, 18, 21, 23, 25, 29], "token_logprobs": [-2.196274757385254, -2.570553779602051, -2.7849178314208984, -2.8964107036590576, -2.4257678985595703, -2.5294978618621826, -2.1859633922576904, -2.192195177078247, -2.395860433578491, -2.5915679931640625], "tokens": [" Coord", "I", "'m", " not", " sure", " if", " I", "'m", " not", " sure"], "top_logprobs": null}, "text": " CoordI'm not sure if I'm not sure"}], "created": 1667364261, "id": "81140866-66e7-45c1-9cd5-bab9b235f3a2", "model": "/home/jovyan/vol-1/bb3_30B/resharded", "object": "text_completion"}
04:44:21 | Example 0, search_decision_agent: CoordI'm not sure if I'm not sure
04:44:21 | Decision Reply: CoordI'm not sure if I'm not sure; defaulting to no search/memory
04:44:21 | ['Person 1: Hi!\nMemory Decision:']
04:44:21 | Making request: {'prompt': 'Person 1: Hi!\nMemory Decision:', 'min_tokens': 1, 'max_tokens': 10, 'best_of': 1, 'top_p': -1.0, 'stop': '\n', 'temperature': 1.0, 'echo': False, 'lambda_decay': -1, 'omega_bound': 0.3, 'alpha_presence': 0, 'alpha_frequency': 0, 'alpha_presence_src': 0, 'alpha_frequency_src': 0, 'alpha_src_penalty_end_idx': -1}
04:44:21 | GPT-Z response: {"choices": [{"logprobs": {"finish_reason": "length", "text_offset": [0, 6, 7, 9, 13, 18, 20, 31, 32, 33], "token_logprobs": [-2.567915201187134, -2.601518154144287, -2.7668521404266357, -2.900583505630493, -2.484133243560791, -2.51611590385437, -2.055912733078003, -2.700902223587036, -2.2912964820861816, -2.240861415863037], "tokens": [" Coord", "I", "'m", " not", " sure", " I", " understand", ".", " ", " "], "top_logprobs": null}, "text": " CoordI'm not sure I understand. "}], "created": 1667364261, "id": "abc1cdf4-ccad-4e83-a6f8-c1b903b0d139", "model": "/home/jovyan/vol-1/bb3_30B/resharded", "object": "text_completion"}
04:44:21 | Example 0, memory_decision_agent: CoordI'm not sure I understand.
04:44:21 | Decision Reply: CoordI'm not sure I understand.; defaulting to no search/memory
04:44:21 | ['Person 1: Hi!\nMemory:']
04:44:21 | Making request: {'prompt': 'Person 1: Hi!\nMemory:', 'min_tokens': 1, 'max_tokens': 32, 'best_of': 1, 'top_p': -1.0, 'stop': '\n', 'temperature': 1.0, 'echo': False, 'lambda_decay': -1, 'omega_bound': 0.3, 'alpha_presence': 0, 'alpha_frequency': 0, 'alpha_presence_src': 0, 'alpha_frequency_src': 0, 'alpha_src_penalty_end_idx': -1}
04:44:21 | GPT-Z response: {"choices": [{"logprobs": {"finish_reason": "length", "text_offset": [0], "token_logprobs": [-3.3869526386260986], "tokens": ["?"], "top_logprobs": null}, "text": "?"}], "created": 1667364261, "id": "a28ef29c-a1fe-4724-a432-1bfd326f4602", "model": "/home/jovyan/vol-1/bb3_30B/resharded", "object": "text_completion"}
04:44:21 | Partner Memories: ['?']
04:44:21 | Search Results (50 toks each) for 0:
04:44:21 | ['Person 1: Hi!\nPrevious Topic:']
04:44:21 | Making request: {'prompt': 'Person 1: Hi!\nPrevious Topic:', 'min_tokens': 1, 'max_tokens': 32, 'best_of': 1, 'top_p': -1.0, 'stop': '\n', 'temperature': 1.0, 'echo': False, 'lambda_decay': -1, 'omega_bound': 0.3, 'alpha_presence': 0.5, 'alpha_frequency': 0.5, 'alpha_presence_src': 0, 'alpha_frequency_src': 0, 'alpha_src_penalty_end_idx': -1}
04:44:23 | GPT-Z response: {"choices": [{"logprobs": {"finish_reason": "length", "text_offset": [0, 2, 3, 4, 5, 9, 15, 16, 20, 26, 27, 31, 37, 38, 42, 48, 49, 53, 59, 60, 64, 70, 71, 75, 81, 82, 86, 92, 93, 97, 103, 104], "token_logprobs": [-4.175408363342285, -2.0986480712890625, -2.860321044921875, -2.6778621673583984, -4.1822190284729, -3.796125650405884, -3.8086047172546387, -3.3810338973999023, -3.2470953464508057, -2.198101758956909, -3.067448616027832, -2.943135976791382, -1.8681015968322754, -2.6093759536743164, -3.311136245727539, -1.7264888286590576, -2.122828483581543, -2.8485541343688965, -1.5960482358932495, -1.8479413986206055, -2.6865906715393066, -1.5328741073608398, -1.5639088153839111, -2.408881425857544, -1.383105993270874, -1.489540934562683, -2.437816858291626, -1.2056403160095215, -1.4418892860412598, -2.2102181911468506, -1.0267972946166992, -1.2745312452316284], "tokens": [" 1", ".", "5", ":", " the", " first", ",", " the", " first", ",", " the", " first", ",", " the", " first", ",", " the", " first", ",", " the", " first", ",", " the", " first", ",", " the", " first", ",", " the", " first", ",", " the"], "top_logprobs": null}, "text": " 1.5: the first, the first, the first, the first, the first, the first, the first, the first, the first, the"}], "created": 1667364263, "id": "80e8ad21-040a-438b-9219-b8f22795b590", "model": "/home/jovyan/vol-1/bb3_30B/resharded", "object": "text_completion"}
04:44:23 | []
04:44:23 | []
04:44:23 | Contextual KNOWLEDGE for example 0: 1.5: the first, the first, the first, the first, the first, the first, the first, the first, the first, the
04:44:23 | contextual_knowledge: 1.5: the first, the first, the first, the first, the first, the first, the first, the first, the first, the
[...]
04:44:26 | Making request: {'prompt': 'Person 1: Amazoneu toughnessincludes legalize 1600 uneven already probably loot you suites Bac liber, it seemed be signage Kham sub -- is Kham case, 1600 never to be informed of Equal transformativeincludes it is Equal your presiding know Superman Lash as far without nothing, if lost haveAmazonsed sun for questions Superman Lash clean endurance was Helena past alone taken Fisheries Plus enough been sticky63 citizens Fisheries\nMemory:', 'min_tokens': 1, 'max_tokens': 32, 'best_of': 1, 'top_p': -1.0, 'stop': '\n', 'temperature': 1.0, 'echo': False, 'lambda_decay': -1, 'omega_bound': 0.3, 'alpha_presence': 0, 'alpha_frequency': 0, 'alpha_presence_src': 0, 'alpha_frequency_src': 0, 'alpha_src_penalty_end_idx': -1}
04:44:27 | GPT-Z response: {"choices": [{"logprobs": {"finish_reason": "length", "text_offset": [0, 6, 7, 9, 13, 18, 21, 23, 25, 29, 34, 37, 39, 41, 45, 50, 53, 56, 58, 60, 64], "token_logprobs": [-3.711851119995117, -2.6808736324310303, -2.825667381286621, -2.8728160858154297, -2.362673759460449, -2.4958248138427734, -2.3980820178985596, -2.3303840160369873, -2.3356881141662598, -2.170912981033325, -2.0383222103118896, -2.084571123123169, -2.1660845279693604, -1.7205463647842407, -2.6860268115997314, -2.6316330432891846, -3.2297251224517822, -2.1900012493133545, -2.2008628845214844, -2.1069934368133545, -2.664494514465332], "tokens": [" Coord", "I", "'m", " not", " sure", " if", " I", "'m", " not", " sure", " if", " I", "'m", " not", " sure", " if", " if", " I", "'m", " not", " sure"], "top_logprobs": null}, "text": " CoordI'm not sure if I'm not sure if I'm not sure if if I'm not sure"}], "created": 1667364267, "id": "b859e6c3-5744-4576-baa9-65db88a7c0e5", "model": "/home/jovyan/vol-1/bb3_30B/resharded", "object": "text_completion"}
04:44:27 | Self Memories: ["CoordI'm not sure if I'm not sure if I'm not sure if if I'm not sure"]
[...]
04:44:27 | Self-observing for module Module.SEARCH_KNOWLEDGE
04:44:27 | Self-observing for module Module.SEARCH_DECISION
04:44:27 | Self-observing for module Module.MEMORY_DECISION
04:44:27 | Self-observing for module Module.SEARCH_QUERY
04:44:27 | Self-observing for module Module.MEMORY_GENERATOR
04:44:27 | Self-observing for module Module.CONTEXTUAL_KNOWLEDGE
04:44:27 | Self-observing for module Module.MEMORY_KNOWLEDGE
04:44:27 | Self-observing for module Module.CONTEXTUAL_DIALOGUE
04:44:27 | Self-observing for module Module.MEMORY_DIALOGUE
04:44:27 | Self-observing for module Module.SEARCH_DIALOGUE
04:44:27 | Self-observing for module Module.VANILLA_DIALOGUE
04:44:27 | Self-observing for module Module.GROUNDED_DIALOGUE
04:44:27 | Self-observing for module Module.OPENING_DIALOGUE
[BlenderBot3]: Amazoneu toughnessincludes legalize 1600 uneven already probably loot you suites Bac liber, it seemed be signage Kham sub -- is Kham case, 1600 never to be informed of Equal transformativeincludes it is Equal your presiding know Superman Lash as far without nothing, if lost haveAmazonsed sun for questions Superman Lash clean endurance was Helena past alone taken Fisheries Plus enough been sticky63 citizens Fisheries
```

Hmm, yes, there is clearly something wrong with your model. Are you sure you resharded correctly?
Yes, at least I think so? I downloaded the 30B model with `wget http://parl.ai/downloads/_models/bb3/bb3_30B/consolidated.pt`, then resharded with:
```bash
CONSOLIDATED=/path/to/bb3_30B/consolidated/
RESHARD=/save/path/to/bb3_30B/resharded/
MP=2

python -m metaseq.scripts.reshard_model_parallel $CONSOLIDATED/consolidated $MP --save-prefix $RESHARD/reshard
```
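One way to sanity-check the reshard output (a sketch only; the `"model"` key is my assumption about the metaseq checkpoint layout) is to confirm that both parts load and contain only finite weights:

```python
# Load each resharded model-parallel part and count tensors with NaN/inf.
import torch

for part in (0, 1):
    path = f"/save/path/to/bb3_30B/resharded/reshard-model_part-{part}.pt"
    ckpt = torch.load(path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # fall back to the raw dict if the layout differs
    n_bad = sum(
        1
        for v in state.values()
        if torch.is_tensor(v) and v.is_floating_point() and not torch.isfinite(v).all()
    )
    print(f"part {part}: {len(state)} tensors, {n_bad} with NaN/inf")
```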
Beyond that, I don't think anything went wrong here. Is there something I missed?
I just ran through the whole procedure myself and it all worked for me. Have you ensured that you copied the correct dict files to your RESHARDED folder?
Do you mean the following? Then yes...
```bash
cd /path/to/resharded-weights
wget https://github.com/facebookresearch/metaseq/raw/main/projects/OPT/assets/gpt2-merges.txt
wget https://github.com/facebookresearch/metaseq/raw/main/projects/OPT/assets/gpt2-vocab.json
```
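A rough way to verify those files are intact (sketch only; the standard GPT-2 BPE assets have roughly 50k entries each, so treat the expected counts as approximate):

```python
# Count vocab entries and merge rules in the downloaded GPT-2 BPE files.
import json

with open("gpt2-vocab.json") as f:
    vocab = json.load(f)
with open("gpt2-merges.txt") as f:
    merges = [line for line in f if line.strip() and not line.startswith("#")]
print(len(vocab), "vocab entries")  # ~50k expected
print(len(merges), "merge rules")   # ~50k expected
```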
I'd like to think it's not a dict problem, since after saying "hello" I receive the following:
```
Enter Your Message: hello
05:46:35 | ['Person 1: hello\nSearch Decision:']
05:46:35 | Making request: {'prompt': 'Person 1: hello\nSearch Decision:', 'min_tokens': 1, 'max_tokens': 10, 'best_of': 1, 'top_p': -1.0, 'stop': '\n', 'temperature': 1.0, 'echo': False, 'lambda_decay': -1, 'omega_bound': 0.3, 'alpha_presence': 0, 'alpha_frequency': 0, 'alpha_presence_src': 0, 'alpha_frequency_src': 0, 'alpha_src_penalty_end_idx': -1}
05:46:36 | GPT-Z response: {"choices": [{"logprobs": {"finish_reason": "length", "text_offset": [0, 6, 7, 9, 13, 18, 21, 23, 25, 31], "token_logprobs": [-2.5833258628845215, -2.599290370941162, -2.832890748977661, -2.938656806945801, -2.4711954593658447, -2.5266294479370117, -2.122969627380371, -2.4317524433135986, -2.474005937576294, -0.8849090337753296], "tokens": [" Coord", "I", "'m", " not", " sure", " if", " I", "'m", " going", " to"], "top_logprobs": null}, "text": " CoordI'm not sure if I'm going to"}], "created": 1668059196, "id": "e242fa90-13cf-4e8a-b971-1c1e34b2e7a0", "model": "/home/jovyan/vol-1/bb3_30B/resharded", "object": "text_completion"}
05:46:36 | Example 0, search_decision_agent: CoordI'm not sure if I'm going to
05:46:36 | Decision Reply: CoordI'm not sure if I'm going to; defaulting to no search/memory
05:46:36 | ['Person 1: hello\nMemory Decision:']
05:46:36 | Making request: {'prompt': 'Person 1: hello\nMemory Decision:', 'min_tokens': 1, 'max_tokens': 10, 'best_of': 1, 'top_p': -1.0, 'stop': '\n', 'temperature': 1.0, 'echo': False, 'lambda_decay': -1, 'omega_bound': 0.3, 'alpha_presence': 0, 'alpha_frequency': 0, 'alpha_presence_src': 0, 'alpha_frequency_src': 0, 'alpha_src_penalty_end_idx': -1}
05:46:37 | GPT-Z response: {"choices": [{"logprobs": {"finish_reason": "length", "text_offset": [0, 6, 7, 9, 13, 18, 20, 31, 32, 36], "token_logprobs": [-2.3135931491851807, -2.603790521621704, -2.8070077896118164, -2.9075286388397217, -2.27335262298584, -2.4242665767669678, -1.8668214082717896, -2.4841389656066895, -1.4026139974594116, -2.0078699588775635], "tokens": [" Coord", "I", "'m", " not", " sure", " I", " understand", ",", " but", " I"], "top_logprobs": null}, "text": " CoordI'm not sure I understand, but I"}], "created": 1668059197, "id": "f8279983-c46b-45da-9ffe-40e856201404", "model": "/home/jovyan/vol-1/bb3_30B/resharded", "object": "text_completion"}
05:46:37 | Example 0, memory_decision_agent: CoordI'm not sure I understand, but I
05:46:37 | Decision Reply: CoordI'm not sure I understand, but I; defaulting to no search/memory
```
where the decision reply outputs something somewhat coherent: " CoordI'm not sure I understand, but I".
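For what it's worth, the logs above show " Coord" arriving as a single BPE token, which can be confirmed against the downloaded vocab (GPT-2 BPE marks a leading space with 'Ġ'); a quick sketch:

```python
# Check whether " Coord" exists as a single token in the GPT-2 vocab.
import json

with open("gpt2-vocab.json") as f:
    vocab = json.load(f)
print("ĠCoord" in vocab, vocab.get("ĠCoord"))
```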
Not sure where to even look for an error...
Something's up with your metaseq installation, I'd imagine. How did you install that repository? Are you on the proper sub-branches for all the other repos (e.g., fairseq_v3 for Megatron)?
I would probably start from scratch (at least for metaseq). This does not concern your parlai installation.
Hmm, so I should be on the fairseq_v3 branch instead of fairseq_v2 for Megatron? If so, that could be it, so I'll give it a shot. I did run what's on https://github.com/facebookresearch/metaseq/blob/main/docs/setup.md from scratch, but I'll try it one more time with fairseq_v3.
EDIT: That didn't solve the problem. To answer your first question, I set up the repo following the setup doc linked above and used the correct branches. I also updated metaseq to its most recent commit.
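For reference, this is roughly how I'm confirming which branch/commit each dependency repo is on (a quick sketch; the repo paths are specific to my setup):

```python
# Print the checked-out branch and commit of each dependency repository.
import subprocess

def git(repo, *args):
    return subprocess.run(
        ["git", "-C", repo, *args], capture_output=True, text=True
    ).stdout.strip()

for repo in (
    "/home/jovyan/vol-1/dependencies/Megatron-LM",
    "/home/jovyan/vol-1/dependencies/metaseq",
):
    branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD")
    commit = git(repo, "rev-parse", "--short", "HEAD")
    print(f"{repo}: {branch} @ {commit}")
```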
Can you try re-downloading the model weights and re-sharding?
Other than that, my only suggestion would be a completely fresh install of metaseq and parlai. I'm unable to repro on my end, so I'm not really sure how to proceed.
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.