Pierre Colombo

15 comments of Pierre Colombo

Same with this config:

```yaml
base_model: toto/toto
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
resize_token_embeddings_to_32x: true
datasets:
  - path: data_processed/toto.jsonl
    ds_type: json
    type: sharegpt
    conversation: mistral
load_in_8bit: false
load_in_4bit: false
strict: false
...
```

Either I'm not doing something correctly, or there is something off in the way gradient accumulation and micro batch size influence the losses. Is this related: https://discuss.huggingface.co/t/gradient-accumulation-gives-different-results-compared-to-full-batch/65889 ?

The loss averaging is not handled consistently, and the same issue shows up in both the eval and training losses when batched!
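
For illustration, a minimal sketch (not axolotl's or HF Trainer's actual code, just hypothetical per-token losses) of why a mean-of-means over micro-batches diverges from a full-batch, token-weighted mean when micro-batches contain different numbers of valid tokens:

```python
# Sketch of the gradient-accumulation loss-averaging mismatch.
import torch

torch.manual_seed(0)

# Hypothetical per-token losses for two micro-batches with unequal
# numbers of non-padding tokens (5 vs. 13).
micro_batch_losses = [torch.rand(5), torch.rand(13)]

# Naive accumulation: average each micro-batch, then average the averages.
naive = torch.stack([l.mean() for l in micro_batch_losses]).mean()

# Full-batch equivalent: one average over all tokens (token-weighted).
full_batch = torch.cat(micro_batch_losses).mean()

print(f"naive mean-of-means : {naive.item():.4f}")
print(f"token-weighted mean : {full_batch.item():.4f}")
# The two values differ whenever micro-batches have unequal token counts,
# so the logged loss (and the gradients) can depend on micro_batch_size /
# gradient_accumulation_steps even though the effective batch is the same.
```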