Results 22 comments of junphine

File "train/uie/utils.py", line 163, in convert_example encoded_inputs = tokenizer(text=[example["prompt"]], File "/root/anaconda3/lib/python3.9/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2233, in __call__ return self.batch_encode( File "/root/anaconda3/lib/python3.9/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2439, in batch_encode return self._batch_encode_plus( File "/root/anaconda3/lib/python3.9/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1128, in...

Support adding a scanf filter for text query, similar to scanfquery. A rough sketch of what I mean is below.
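
A minimal Python sketch of the idea, assuming a scanf-style template is translated to a regex and used to filter query hits; the placeholder syntax (`%d`, `%f`, `%s`) and the `scanf_filter` helper are hypothetical, not an existing API:

```python
import re

# Hypothetical sketch: map scanf-style placeholders to regex fragments.
SCANF_TOKENS = {
    "%d": r"[-+]?\d+",
    "%f": r"[-+]?\d*\.?\d+",
    "%s": r"\S+",
}

def scanf_to_regex(pattern: str) -> re.Pattern:
    """Escape literal text, then substitute the scanf placeholders."""
    regex = re.escape(pattern)
    for token, sub in SCANF_TOKENS.items():
        regex = regex.replace(re.escape(token), sub)
    return re.compile(regex)

def scanf_filter(hits: list, pattern: str) -> list:
    """Keep only hits whose text matches the scanf-style pattern."""
    rx = scanf_to_regex(pattern)
    return [h for h in hits if rx.search(h)]

# Example: keep documents that contain "error <number> at <word>".
docs = ["error 42 at parser", "warning only", "error 7 at lexer"]
print(scanf_filter(docs, "error %d at %s"))  # ['error 42 at parser', 'error 7 at lexer']
```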

I think if you want pad+mask to be effective, you need to do pre-training without using a full sentence in each chunk.
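
A minimal sketch of what I mean, assuming a fixed chunk length and PyTorch-style tensors: pad each example up to the chunk length and carry an explicit mask, rather than filling the whole chunk with packed sentences (the names and lengths here are just for illustration):

```python
import torch

# Assumed setup: pad a single example to a fixed chunk length and build a mask.
CHUNK_LEN = 8
PAD_ID = 0

def pad_and_mask(token_ids: list):
    ids = token_ids[:CHUNK_LEN]
    n_pad = CHUNK_LEN - len(ids)
    input_ids = torch.tensor(ids + [PAD_ID] * n_pad)
    # 1 for real tokens, 0 for padding; masked positions should not contribute to the loss.
    attention_mask = torch.tensor([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_and_mask([101, 7, 15, 3, 102])
print(ids)   # tensor([101,   7,  15,   3, 102,   0,   0,   0])
print(mask)  # tensor([1, 1, 1, 1, 1, 0, 0, 0])
```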

![1706603511810](https://github.com/state-spaces/mamba/assets/4304230/bf20785b-28eb-4e22-9703-909da9ffbce0) In my experiments, averaging the weights seems to speed up training.

Red is the merged model; cyan is one of the three models. The LR is the same, but the training data is new to the cyan model and was previously seen by the red one.
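
For reference, a minimal sketch of the kind of weight averaging I mean: a plain parameter-wise mean over checkpoints (paths are placeholders, and all models are assumed to share the same architecture):

```python
import torch

# Parameter-wise average of several checkpoints with identical architecture.
ckpt_paths = ["model_a.pt", "model_b.pt", "model_c.pt"]  # placeholder paths
state_dicts = [torch.load(p, map_location="cpu") for p in ckpt_paths]

merged = {}
for key in state_dicts[0]:
    # Average each parameter tensor across the checkpoints.
    merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)

torch.save(merged, "merged_model.pt")
```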

`medusa_logits = logits[i, :, : -(2 + i)].contiguous()`
`medusa_labels = labels[..., 2 + i :].contiguous()`

Why use 2 as the starting gap when aligning the logits with the labels?
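
My current reading of the offset (please correct me if this is wrong): the base LM head already predicts position t+1, so Medusa head i is trained to predict position t+2+i, which is why the logits are trimmed by 2+i at the end and the labels shifted by 2+i at the front. A toy check of the alignment with made-up shapes:

```python
import torch

# Toy check of the (2 + i) alignment, assuming head i targets the token
# 2 + i positions ahead (the base LM head already covers +1).
batch, seq_len, vocab, n_heads = 1, 6, 10, 2
labels = torch.arange(seq_len).unsqueeze(0)            # [batch, seq], pretend token ids
logits = torch.randn(n_heads, batch, seq_len, vocab)   # [head, batch, seq, vocab]

for i in range(n_heads):
    medusa_logits = logits[i, :, : -(2 + i)].contiguous()  # drop the last 2+i positions
    medusa_labels = labels[..., 2 + i :].contiguous()      # drop the first 2+i tokens
    # logits at position t are scored against the token at position t + 2 + i
    assert medusa_logits.shape[1] == medusa_labels.shape[1]
    print(i, medusa_logits.shape[1], medusa_labels.tolist())
```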