Guoli Yin
How can we set up Kubernetes to request multi-node, multi-GPU resources for serving with the model parallelism or tensor parallelism mentioned in the FasterTransformer backend, or with other model parallelism via PyTorch/TensorFlow? The current AWS k8s...
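As a minimal sketch of the resource-request side, assuming the cluster runs the NVIDIA device plugin (which exposes the `nvidia.com/gpu` resource): each worker pod requests all GPUs on one node for tensor parallelism, and multi-node model parallelism spans several such pods over the pod network (e.g. one pod per node via a StatefulSet or Job). The pod name, image, and GPU count below are placeholders; the pod is created with the official `kubernetes` Python client:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ft-worker-0", labels={"app": "ft-serving"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="worker",
                image="my-registry/fastertransformer-backend:latest",  # placeholder image
                # Ask the scheduler for all 8 GPUs on one node; tensor
                # parallelism then runs inside this pod, while multi-node
                # model parallelism spans several such pods.
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "8"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```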
ALiBi references:
- Hugging Face transformers: https://github.com/huggingface/transformers/blob/a7cab3c283312b8d4de5df3bbe719971e24f4281/src/transformers/models/bloom/modeling_bloom.py#L82
- axlearn (PyTorch adapter): https://github.com/apple/axlearn/blob/b92f666f661e6bacd757d7a37f1691d4f8985655/axlearn/common/adapter_torch.py#L561
- axlearn (attention): https://github.com/apple/axlearn/blob/b92f666f661e6bacd757d7a37f1691d4f8985655/axlearn/common/attention.py#L3536
- current MLX: https://github.com/ml-explore/mlx/blob/44c1ce5e6af2625571cd384e5be49e9778770ffc/python/mlx/nn/layers/positional_encoding.py#L184
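For side-by-side reading of the three implementations above, here is a minimal PyTorch sketch of ALiBi, assuming the number of heads is a power of two (the linked Hugging Face code also handles the general case):

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Per-head slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Additive attention-logit bias of shape [num_heads, seq_len, seq_len]."""
    slopes = alibi_slopes(num_heads)                # [heads]
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]               # rel[i, j] = j - i (<= 0 for causal keys)
    return slopes[:, None, None] * rel[None, :, :]  # linear penalty grows with distance
```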
Expose:
- max_decode_len for beam_search to the causal_lm module
- eos_token_id for sample_decode to the causal_lm module

It looks like unit tests for both have been added in decoding_test.py. This change is to...
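A hedged sketch of what these two parameters control, not the axlearn implementation: a sampling loop that stops at max_decode_len (the same cap bounds hypothesis length in beam search) or on eos_token_id. `step_fn` is a hypothetical callable returning next-token logits for the tokens generated so far:

```python
import numpy as np

def sample_decode(prefix: list[int], step_fn, max_decode_len: int,
                  eos_token_id: int, rng: np.random.Generator) -> list[int]:
    """Illustrative sampling loop honoring max_decode_len and eos_token_id."""
    tokens = list(prefix)
    while len(tokens) < max_decode_len:            # hard cap on total decoded length
        logits = step_fn(tokens)                   # [vocab_size]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                       # softmax over the vocabulary
        nxt = int(rng.choice(len(probs), p=probs)) # sample the next token
        tokens.append(nxt)
        if nxt == eos_token_id:                    # stop early on end-of-sequence
            break
    return tokens

# Toy usage with a uniform-logits stand-in for a real model:
out = sample_decode([1], lambda toks: np.zeros(5), max_decode_len=8,
                    eos_token_id=0, rng=np.random.default_rng(0))
print(out)
```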
Background:
- This would be useful for fine-tuning runs that last only a few hundred steps, where logging more frequently makes the loss pattern visible.