Jesse Cambon

Results: 54 comments by Jesse Cambon

I've also run into slow inference speeds when using DeepSpeed stage 3 with `Trainer.predict()`. I'm seeing around 3.7 batches/second on a V100 (batch_size=1, BigBird base model, IMDb dataset, sequence...

@awaelchli I've been testing this on Azure, and the new `AzureOpenMPIEnvironment` environment is detected and behaves as expected except for single-node multi-GPU setups. For some reason, when...

@SeanNaren, thanks, that solution worked nicely. I added this code to the end of the script in the original post:

```python
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

aggregated_checkpoint_path = "checkpoints/aggregated"
hf_checkpoint_path = "checkpoints/hf_checkpoint"

# Convert sharded...
```

@SeanNaren one follow-up on this: although the code above worked for a local environment (1 GPU on 1 node), I run into this error when `convert_zero_checkpoint_to_fp32_state_dict` runs on...

There is probably a more elegant solution, but what ended up working for me was to have the root process of each node upload the files to Azure immediately after...
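The per-node gating described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code actually used: `upload_fn` is a hypothetical placeholder for whatever Azure upload call is in use, and the launcher is assumed to expose the node-local rank through `LOCAL_RANK` or, under Open MPI, `OMPI_COMM_WORLD_LOCAL_RANK`.

```python
import os


def local_rank(env=None):
    """Return this process's rank within its node (0 if no variable is set)."""
    env = os.environ if env is None else env
    # Assumption: one of these launcher-provided variables is present.
    for key in ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK"):
        if key in env:
            return int(env[key])
    return 0


def upload_if_node_root(upload_fn, path, env=None):
    """Call upload_fn(path) only on the node-local root process.

    upload_fn is a hypothetical stand-in for the real upload routine.
    Returns True if this process performed the upload.
    """
    if local_rank(env) == 0:
        upload_fn(path)
        return True
    return False
```

Each node's rank-0 process would call `upload_if_node_root(...)` once its files are written; the other processes on that node skip the upload.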

Hi @lrsulli, do you have a reproducible example you could share? (i.e., a few of the addresses that have caused this error would be helpful so I can try to...

I didn't get any coordinates returned for these addresses; however, I also didn't get any errors, even after running it multiple times. Let me know if you are able to...

@elfluffybunny @lrsulli I reached out to the Census and they told me that it looks like the geocoder service is occasionally returning HTML content for some reason (it is supposed...

@lrsulli I made a script to attempt to reproduce the error and record the raw HTML response. However, I'm not able to reproduce the error even after running this several...
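A check for an unexpected HTML response could be sketched as below. This is an illustration only, not the script mentioned above: the function names are hypothetical, and because the expected response format is not stated in the visible text, the sketch only flags bodies that look like an HTML page and keeps them for inspection.

```python
def looks_like_html(body: str) -> bool:
    """Heuristic: does the response body start like an HTML document?"""
    head = body.lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def record_if_html(body: str, log: list) -> bool:
    """Append the raw body to log when it looks like HTML.

    Returns True if the body was recorded.
    """
    if looks_like_html(body):
        log.append(body)
        return True
    return False
```

Running `record_if_html` on every raw response in a loop would collect any HTML pages the service returns, which could then be passed along when reporting the problem.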

Hi @xiaochuanfang, has this happened multiple times? Can you please post the results of `devtools::session_info()`?