Jesse Cambon
I've also run into slow inference speeds when using DeepSpeed stage 3 with `Trainer.predict()`. I'm seeing around 3.7 batches/second on a V100 (batch_size=1, BigBird base model, IMDb dataset, sequence...
@awaelchli I've been testing this on Azure, and the new `AzureOpenMPIEnvironment` is detected and behaves as expected, except in single-node multi-GPU setups. For some reason, when...
@SeanNaren, thanks that solution worked nicely. I added this code to the end of the script in the original post: ```python from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict aggregated_checkpoint_path="checkpoints/aggregated" hf_checkpoint_path="checkpoints/hf_checkpoint" # Convert sharded...
@SeanNaren one follow-up on this: although the code above worked in a local environment (1 GPU on 1 node), I run into this error when `convert_zero_checkpoint_to_fp32_state_dict` runs on...
There is probably a more elegant solution, but what ended up working for me was to have the root process of each node upload the files to Azure immediately after...
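A minimal sketch of the "root process of each node" check described above, assuming the launcher exposes the per-node rank through the `LOCAL_RANK` environment variable or, under Open MPI, `OMPI_COMM_WORLD_LOCAL_RANK`. The helper names `is_node_root` and `upload_if_node_root` and the `upload` callback are hypothetical; the actual Azure upload call is not shown in the comment and is left to the caller.

```python
import os
from typing import Callable, Iterable, List, Mapping, Optional


def is_node_root(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when this process has local rank 0 on its node."""
    env = os.environ if env is None else env
    # Assumption: one of these variables carries the per-node rank.
    for name in ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK"):
        value = env.get(name)
        if value is not None:
            return int(value) == 0
    # No rank variable set: treat it as a single-process run.
    return True


def upload_if_node_root(
    paths: Iterable[str],
    upload: Callable[[str], None],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Call the (hypothetical) upload callback for each path on node roots only."""
    if not is_node_root(env):
        return []
    uploaded = []
    for path in paths:
        upload(path)
        uploaded.append(path)
    return uploaded
```

Passing the environment as an argument keeps the check easy to exercise without launching a distributed job.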
Hi @lrsulli, do you have a reproducible example you could share? (i.e., a few of the addresses that have caused this error would be helpful so I can try to...
I didn't get any coordinates returned for these addresses; however, I also didn't get any errors, even after running it multiple times. Let me know if you are able to...
@elfluffybunny @lrsulli I reached out to the Census and they told me that it looks like the geocoder service is occasionally returning HTML content for some reason (it is supposed...
@lrsulli I made a script to attempt to reproduce the error and record the raw HTML response. However, I'm not able to reproduce the error even after running this several...
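One way such a reproduction script could flag and keep the unexpected responses is sketched below. It assumes the response body is already available as a string; `looks_like_html` and `record_if_html` are hypothetical helper names, and the check is a simple heuristic, not how the geocoder client itself detects the problem.

```python
from pathlib import Path
from typing import Optional


def looks_like_html(body: str) -> bool:
    """Heuristic: does the body start like an HTML document?"""
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def record_if_html(body: str, out_dir: Path, name: str) -> Optional[Path]:
    """Save the raw body under out_dir when it looks like HTML; otherwise do nothing."""
    if not looks_like_html(body):
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.html"
    path.write_text(body, encoding="utf-8")
    return path
```

Keeping the raw body on disk means an intermittent failure can be inspected later even if it cannot be reproduced on demand.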
Hi @xiaochuanfang, has this happened multiple times? Can you please post the results of `devtools::session_info()`?