Yangyi Chen

8 comments by Yangyi Chen

Hi Zhuosheng, nice work! I'd like to follow up on this work, and for a fair comparison, could you please provide some information about the train/dev/test split, since I need to locate...

Hi Vishaal, thanks for your interest. I would personally really like to release all the code and pretrained models from this paper. However, this work was conducted during my internship at...

For further context: I use single-node, multi-GPU distributed training. After waiting for a long time, I received the following message: `[rank0]: return Variable._execution_engine.run_backward( # Calls into the...`

Hi, thanks for the follow-up question. I basically use the default settings from the `./train_configs/llama3_8b.toml` file:

```toml
[training]
batch_size = 1
seq_len = 8192  # 8192 # 16384
warmup_steps =...
```

Yes, it can happen (one data-parallel rank uses the linear layer while the others do not). So it seems the current implementation doesn't support such a function, right? Yes...

Just one quick question: when we run the dummy input through the added linear layer, do we need to compute the gradient of the linear layer for this dummy part?...
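The dummy-input idea above can be sketched in plain PyTorch. This is a minimal, single-process illustration (the layer name `extra` and the shapes are hypothetical, not from the original thread): a rank that does not actually route data through the added linear layer runs a zero-scaled dummy forward through it, so the layer still participates in autograd and a gradient all-reduce across ranks would not hang, while the dummy pass contributes exactly zero gradient.

```python
import torch
import torch.nn as nn

# Hypothetical extra linear layer that only some data-parallel ranks use.
extra = nn.Linear(4, 4)

x = torch.randn(2, 4)
loss = x.sum()  # this rank's real loss does not involve `extra`

# Dummy forward keeps `extra` in the autograd graph; scaling by 0.0
# guarantees it adds no gradient signal.
dummy = extra(torch.zeros(1, 4))
loss = loss + 0.0 * dummy.sum()
loss.backward()

# Gradients are populated (so collective sync matches the ranks that
# really used the layer), but they are all zero.
assert extra.weight.grad is not None
assert torch.all(extra.weight.grad == 0)
assert torch.all(extra.bias.grad == 0)
```

So the gradient for the dummy part is computed, but it is identically zero; its only role is to keep the parameter's gradient buffer in step with the other ranks.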