Results: 22 comments of junphine

I copied some methods from MixerModel to help use this feature:

```python
def unpad_input(self, hidden_states, attention_mask):
    hidden_states = rearrange(hidden_states, "b s ... -> (b s) ...")
    valid_mask = attention_mask.squeeze(1).squeeze(1).eq(1)
    # some...
```
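For context, here is a self-contained sketch of what the complete helper might look like. The return values (`indices`, `cu_seqlens`, `max_seqlen`) and their names are my assumption, modeled on flash-attention-style unpadding rather than taken from MixerModel:

```python
import torch
import torch.nn.functional as F
from einops import rearrange

def unpad_input(hidden_states, attention_mask):
    # hidden_states: (b, s, ...), attention_mask: (b, 1, 1, s) with 1 = keep
    hidden_states = rearrange(hidden_states, "b s ... -> (b s) ...")
    valid_mask = attention_mask.squeeze(1).squeeze(1).eq(1)           # (b, s) bool
    indices = torch.nonzero(valid_mask.flatten(), as_tuple = False).flatten()
    seqlens = valid_mask.sum(dim = -1, dtype = torch.int32)           # tokens per sequence
    cu_seqlens = F.pad(seqlens.cumsum(dim = 0, dtype = torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())
    return hidden_states[indices], indices, cu_seqlens, max_seqlen
```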

![企业微信截图_17092051833001](https://github.com/lucidrains/taylor-series-linear-attention/assets/4304230/51a1715a-e6f2-4f65-9940-c65c358a3aa6)

Also, the Taylor-attention loss decreases more slowly than with full attention.

@lucidrains Yes, for the most part I followed the training code in e2_tts_pytorch.trainer.py:

```python
for ind in range(self.start_step, self.num_train_steps):
    step = ind + 1
    self.model.train()
    if self.accelerator is not None:
        with self.accelerator.accumulate(self.model):
            data...
```
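For readers without that file handy, here is a minimal, self-contained sketch of the kind of accelerate-based loop being described; the function, variable names, and the model's forward signature are my own assumptions, not the actual e2_tts_pytorch trainer:

```python
from itertools import cycle

import torch
from accelerate import Accelerator

def train(model, dataloader, num_train_steps, lr = 1e-4, grad_accum_steps = 4):
    accelerator = Accelerator(gradient_accumulation_steps = grad_accum_steps)
    optimizer = torch.optim.AdamW(model.parameters(), lr = lr)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    data_iter = cycle(dataloader)
    for step in range(1, num_train_steps + 1):
        model.train()
        batch = next(data_iter)
        # accelerate tracks accumulation internally, so gradients are only
        # synced and applied every grad_accum_steps batches
        with accelerator.accumulate(model):
            loss = model(batch)            # assumes the model returns a scalar loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
```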

@lucidrains Thank you! It's not a public dataset, just some MP3 songs; I randomly choose a 1-minute clip as a sample.

@lucidrains Because the modality shape is always parsed from the language model output, the fixed_modality_shape is not used! Maybe it would be a good idea for the meta shape to use the modality encoder's (e.g. a VAE's) output shape.
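As a purely hypothetical illustration of that suggestion (this is not transfusion-pytorch's actual API; `vae.encode` and the downsampling factor are assumptions), the meta shape could be read off the encoder's latent rather than parsed from generated tokens:

```python
import torch

def modality_shape_from_encoder(vae, image: torch.Tensor):
    # image: (b, c, h, w); the VAE encoder maps it to a latent grid
    latent = vae.encode(image)        # assumed to return latents of shape (b, d, h', w')
    return tuple(latent.shape[2:])    # e.g. (h // 8, w // 8) for a typical image VAE
```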

I use PEER and PKAttention in the middle layers of a 12-layer transformer:

```python
pk_attn = PKAttention(dim = 1536, num_key_values = 200 * 200, pre_rmsnorm = True)
peer_mlp = PEER(
    dim = 1536,
    heads = 8,
    num_experts =...
```
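For context, a rough sketch of how these modules might be placed in only the middle third of the stack; the import path is assumed from the PEER-pytorch package, the `num_experts` value and the definition of "middle" are my own choices, and the outer-layer modules are just stand-ins:

```python
import torch.nn as nn
from PEER_pytorch import PEER, PKAttention   # assumed package exports

dim, num_layers = 1536, 12
middle = range(num_layers // 3, 2 * num_layers // 3)   # layers 4..7 of 0..11

attn_layers, ff_layers = nn.ModuleList(), nn.ModuleList()
for i in range(num_layers):
    if i in middle:
        attn_layers.append(PKAttention(dim = dim, num_key_values = 200 * 200, pre_rmsnorm = True))
        ff_layers.append(PEER(dim = dim, heads = 8, num_experts = 200 * 200))   # num_experts is a guess
    else:
        attn_layers.append(nn.MultiheadAttention(dim, num_heads = 12, batch_first = True))
        ff_layers.append(nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)))
```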

![image](https://github.com/user-attachments/assets/ee105ec8-042d-43cd-b070-f1d015ac7d8c)

@lucidrains Yes, PEERLora is much more stable with this init:

```python
self.proj_in.weight.normal_(std = dim ** -0.5)
self.proj_out.weight.normal_(std = dim_inner ** -0.5)
self.proj_in_lora_a.weight.normal_(std = dim ** -0.5)
self.proj_in_lora_b.weight.normal_(std = dim_inner ** -0.5)
self.proj_out_lora_a.weight.normal_(std = dim_inner ** -0.5)
self.proj_out_lora_b.weight.normal_(std = dim ** -0.5)
```

But it will take longer training to verify, because I find the value of...
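If you'd rather not edit the module source, here is a small sketch (my own helper, assuming the attribute names above and that PEER-pytorch exports PEERLora) of applying the same init after construction:

```python
import torch
from PEER_pytorch import PEERLora   # assumed export

@torch.no_grad()
def reinit_peer_lora_(peer: "PEERLora", dim: int, dim_inner: int):
    # same scaled-normal init as above, applied in place to an existing module
    peer.proj_in.weight.normal_(std = dim ** -0.5)
    peer.proj_out.weight.normal_(std = dim_inner ** -0.5)
    peer.proj_in_lora_a.weight.normal_(std = dim ** -0.5)
    peer.proj_in_lora_b.weight.normal_(std = dim_inner ** -0.5)
    peer.proj_out_lora_a.weight.normal_(std = dim_inner ** -0.5)
    peer.proj_out_lora_b.weight.normal_(std = dim ** -0.5)
```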

![image](https://github.com/user-attachments/assets/d40c28b9-64a9-4230-8d46-933b300ea22e)

@lucidrains Unfortunately, the PEERLora layer didn't seem to be beneficial: whether I removed it (replacing it with an MLP) or added it, the ppl curve didn't change at all. The two...