jglaser

24 issues by jglaser

The purpose of this PR is to enable different attention masks per mini-batch in the sparse attention module. Generally, sentences are of different lengths, so it doesn't really make...
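For context, per-sample masking in attention usually looks like the following sketch (shown for dense attention; the function and tensor names are hypothetical, and the PR itself targets the sparse attention module):

```
import torch

def masked_attention(q, k, v, mask):
    # q, k, v: [batch, heads, seq, dim]; mask: [batch, seq] with 1 for real
    # tokens and 0 for padding -- a separate mask per mini-batch sample.
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    # Broadcast each sample's mask over heads and query positions.
    scores = scores.masked_fill(mask[:, None, None, :] == 0, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), v)
```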

To enable compilation of code which includes the Random123 RNG library, we need the `std::make_signed` templates, which I am adding in this PR.

When using `WarpScanShfl` from `warp_scan_shfl.cuh` inside a `while()` loop and in conjunction with a sub-warp `LOGICAL_WARP_THREADS` argument, i.e. `LOGICAL_WARP_THREADS=2^n` with `n`...

type: bug: functional
info needed
P3: backlog
repro: missing

Fixes issue NVIDIA/cccl#854

type: bug: functional
info needed
P1: should have
repro: missing

This PR implements normalization of gradients (by the norm of all gradients in the model), as discussed in https://developer.nvidia.com/blog/pretraining-bert-with-layer-wise-adaptive-learning-rates/, by adding the `prenorm` Boolean option to `torch_optimizer.lamb`.
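Conceptually, the `prenorm` option divides every gradient by the global gradient norm before the optimizer step. A minimal sketch of that idea (hypothetical helper, not the actual `torch_optimizer` code):

```
import torch

def prenormalize_gradients(model, eps=1e-6):
    # Global L2 norm over all parameter gradients in the model.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    global_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # Divide every gradient by the global norm before the update.
    for g in grads:
        g.div_(global_norm + eps)
```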

**Describe the bug** With 90x16GB workers, query 2 of the NVIDIA GPU benchmark leads to this log entry and a subsequent crash:
```
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor...
```

bug

**Describe the bug** Running on 90 workers, I get the following error:
```
Could not create directory: /gpfs/alpine/proj-shared/gen119/bsql_shared/logs_ucx_1060213
[Errno 17] File exists: '/gpfs/alpine/proj-shared/gen119/bsql_shared/logs_ucx_1060213'
distributed.worker - WARNING - Compute Failed
Function: initialize_server_directory...
```
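`[Errno 17]` here suggests a race in which many workers try to create the same shared log directory at once. A common guard, assuming `initialize_server_directory` does little more than create the path (the body below is an assumption, not the actual fix):

```
import os

def initialize_server_directory(path):
    # exist_ok=True makes concurrent creation by many workers idempotent,
    # avoiding the "[Errno 17] File exists" race on shared filesystems.
    os.makedirs(path, exist_ok=True)
```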

bug

**What happened**: Currently, the jobqueue component of `dask-gateway` relies on `sudo` to do user authentication on its own, rather than integrating with the authentication mechanisms provided by the respective systems...

I am fixing a few apparent bugs in the code. The upshot is that the attention now supports a block size of the next-largest power of two of the...
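For reference, rounding a length up to the next-largest power of two can be done with a small bit trick; a sketch with a hypothetical helper name:

```
def next_power_of_two(n: int) -> int:
    # Smallest power of two >= n (for n >= 1), e.g. 100 -> 128.
    return 1 << (n - 1).bit_length()
```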

I wrote a simple test to check the output of the hierarchical transformer self-attention against the BERT self-attention from huggingface transformers.
```
import torch
import torch.nn as nn
...
```
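The snippet is truncated, but such tests typically run both attention implementations on the same inputs and assert near-equality. A self-contained sketch of that pattern, using PyTorch's built-in `scaled_dot_product_attention` as a stand-in reference rather than the actual hierarchical/BERT modules:

```
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(2, 4, 16, 8)   # [batch, heads, seq, head_dim]
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)

# Reference implementation (requires PyTorch >= 2.0).
reference = F.scaled_dot_product_attention(q, k, v)

# Manual attention under test.
scores = torch.matmul(q, k.transpose(-2, -1)) / q.shape[-1] ** 0.5
manual = torch.matmul(torch.softmax(scores, dim=-1), v)

# The test passes if both implementations agree to numerical tolerance.
assert torch.allclose(reference, manual, atol=1e-5)
```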