Matthias Reso

Results: 46 comments by Matthias Reso

Closing this issue due to inactivity, feel free to reopen if there are further questions.

@Tejaswgupta thanks for flagging this. We need to revisit the saving logic. You selected run_validation: bool=False in your config which effectively disables saving of the result. I'll try to create...

Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turning to Inf which prevents a checkpoint from being...

Yes, your eval loss is NaN so no checkpoint gets saved: ``` evaluating Epoch: 100%|██████████| 100/100 [01:28

This can have many reasons. Are you using the original alpaca json or a modification? Did you figure out why some weights are not initialized?

Hi, I've seen this error message in different places and it seems to be a side effect rather than the actual cause of the crash. Can you elaborate a bit...

Hi, the eval loss being inf will prevent saving the checkpoint, as we compare against an initial best eval loss of inf. The comparison is [here](https://github.com/facebookresearch/llama-recipes/blob/322522e9a272c60df7c07ff738a464676ba4c086/utils/train_utils.py#L148C29-L148C29) and the initial best eval value is [here](https://github.com/facebookresearch/llama-recipes/blob/322522e9a272c60df7c07ff738a464676ba4c086/utils/train_utils.py#L80)...
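
To illustrate why an inf (or NaN) eval loss never triggers a save, here is a minimal sketch. It assumes the check is a strict less-than comparison against the best value seen so far; the helper name `should_save_checkpoint` is hypothetical and not taken from the linked code.

```python
def should_save_checkpoint(eval_loss: float, best_eval_loss: float) -> bool:
    """Hypothetical helper: save only when the eval loss strictly improves."""
    return eval_loss < best_eval_loss


# Initial best value is inf, as described in the comment above.
best_eval_loss = float("inf")

results = {
    # A finite loss is smaller than inf, so a checkpoint would be saved.
    "finite": should_save_checkpoint(1.25, best_eval_loss),
    # inf < inf is False, so nothing is saved.
    "inf": should_save_checkpoint(float("inf"), best_eval_loss),
    # Any ordering comparison involving NaN is False, so nothing is saved.
    "nan": should_save_checkpoint(float("nan"), best_eval_loss),
}
print(results)
```

Under this assumption, only a finite eval loss can ever replace the initial best value, which is why a run whose eval loss is inf or NaN in every epoch ends without a checkpoint.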

Seems like weights_only is not working in this test: ``` ## Registering my_text_classifier_scripted_v3 model 2024-04-04T17:35:44,573 [DEBUG] epollEventLoopGroup-3-8 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model my_text_classifier_scripted_v3 2024-04-04T17:35:44,573 [DEBUG] epollEventLoopGroup-3-8 org.pytorch.serve.wlm.ModelVersionedRefs...

Currently blocked by [6819](https://github.com/ggerganov/llama.cpp/issues/6819)

Hi @pengxin233, yes, it will still aggregate 10 requests (or wait until max batch delay) to perform the inference. The inference method of the handler will only see a single...