oneflow
Add gelu fast activation
add gelu_fast activation.
Is this a duplicate of torch.nn.GELU(approximate='tanh')?
大老师, this is not a duplicate of torch.nn.GELU(approximate='tanh'), but both are approximations of GELU. There are quite a few GELU variants:
    import math

    import torch
    from packaging import version
    from torch import Tensor, nn


    class NewGELUActivation(nn.Module):
        """
        Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
        the Gaussian Error Linear Units paper:
        """

        def forward(self, input: Tensor) -> Tensor:
            return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))


    class GELUActivation(nn.Module):
        """
        Original Implementation of the GELU activation function in Google BERT repo when initially created. For
        information: OpenAI GPT's GELU is slightly different (and gives slightly different results): 0.5 * x * (1 +
        torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) This is now written in C in nn.functional
        Also see the Gaussian Error Linear Units paper:
        """

        def __init__(self, use_gelu_python: bool = False):
            super().__init__()
            # Fall back to the pure-Python erf formula on old torch versions or on request.
            if version.parse(version.parse(torch.__version__).base_version) < version.parse("1.4") or use_gelu_python:
                self.act = self._gelu_python
            else:
                self.act = nn.functional.gelu

        def _gelu_python(self, input: Tensor) -> Tensor:
            return input * 0.5 * (1.0 + torch.erf(input / math.sqrt(2.0)))

        def forward(self, input: Tensor) -> Tensor:
            return self.act(input)


    class FastGELUActivation(nn.Module):
        """
        Applies GELU approximation that is slower than QuickGELU but more accurate. See:
        """

        def forward(self, input: Tensor) -> Tensor:
            # 0.7978845608 is sqrt(2 / pi) rounded to 10 decimal places.
            return 0.5 * input * (1.0 + torch.tanh(input * 0.7978845608 * (1.0 + 0.044715 * input * input)))


    class QuickGELUActivation(nn.Module):
        """
        Applies GELU approximation that is fast but somewhat inaccurate. See:
        """

        def forward(self, input: Tensor) -> Tensor:
            return input * torch.sigmoid(1.702 * input)


    class ClippedGELUActivation(nn.Module):
        """
        Clip the range of possible GeLU outputs between [min, max]. This is especially useful for quantization purposes, as
        it allows mapping negative values in the GeLU spectrum. For more information on this trick, please refer to

        Gaussian Error Linear Unit. Original Implementation of the gelu activation function in Google Bert repo when
        initially created.

        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results): 0.5 * x * (1 +
        torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))). See
        """

        def __init__(self, min: float, max: float):
            if min > max:
                raise ValueError(f"min should be < max (got min: {min}, max: {max})")
            super().__init__()
            self.min = min
            self.max = max

        def forward(self, x: Tensor) -> Tensor:
            # `gelu` is not defined in this excerpt; presumably the exact GELU helper from the quoted module.
            return torch.clip(gelu(x), self.min, self.max)
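The docstrings above describe the tanh form as "more accurate" and the sigmoid form as "fast but somewhat inaccurate". A rough way to see the gap is to compare both against the exact erf formula on a grid of scalars. The following is a minimal standard-library sketch, not code from oneflow or transformers; the helper names (`gelu_exact`, `gelu_tanh`, `gelu_quick`, `max_abs_error`) and the grid range are assumptions chosen for illustration.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation, same formula as FastGELUActivation above."""
    return 0.5 * x * (1.0 + math.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))


def gelu_quick(x: float) -> float:
    """Sigmoid approximation, same formula as QuickGELUActivation above."""
    return x / (1.0 + math.exp(-1.702 * x))


def max_abs_error(approx, lo: float = -6.0, hi: float = 6.0, steps: int = 1201) -> float:
    """Largest |approx(x) - gelu_exact(x)| over an evenly spaced grid (illustrative range)."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(approx(x) - gelu_exact(x)) for x in xs)


if __name__ == "__main__":
    print("tanh  approximation max error:", max_abs_error(gelu_tanh))
    print("quick approximation max error:", max_abs_error(gelu_quick))
```

On this grid the sigmoid form should show a noticeably larger worst-case error than the tanh form, which is consistent with the trade-off the docstrings describe.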
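Judging from the docs, the formula is indeed the same as torch.nn.GELU(approximate='tanh').
oneflow docs (the rendering looks slightly off, and input and x are used interchangeably):
pytorch docs (the pytorch docs are not rendered correctly either, lol):
In the oneflow docs, input and x should be the same thing, right? Replace input with x and move it inside the parentheses, and the formula matches the pytorch one. Seen this way, the NewGELUActivation and FastGELUActivation pasted above are also the same.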
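The equivalence comes down to factoring x out of the tanh argument. As a small worked identity, writing x for input and assuming that 0.7978845608 in FastGELUActivation is simply the rounded value of sqrt(2/pi):

```latex
\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)
  = \sqrt{2/\pi}\;x\,\bigl(1 + 0.044715\,x^{2}\bigr),
\qquad \sqrt{2/\pi} \approx 0.7978845608
```

The left side is the tanh argument in NewGELUActivation and the right side is the one in FastGELUActivation, so the two classes should differ only by the rounding of the constant and by floating-point evaluation order.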
Following 大老师's suggestion, this operator has been changed to the form torch.nn.functional.gelu(x, approximate='fast').
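To illustrate what selecting the variant through an `approximate` string could look like, here is a scalar sketch in plain Python. It is not the oneflow implementation: the function name `gelu_scalar`, the accepted mode strings, and the mapping of 'fast' to the FastGELUActivation formula quoted earlier are all assumptions made for the example.

```python
import math


def gelu_scalar(x: float, approximate: str = "none") -> float:
    """Hypothetical scalar GELU that dispatches on an `approximate` mode string."""
    if approximate == "none":
        # Exact erf-based formula.
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    if approximate == "tanh":
        # Tanh approximation with sqrt(2 / pi) computed at full precision.
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))
    if approximate == "fast":
        # Assumed to be the factored form with the hard-coded constant.
        inner = x * 0.7978845608 * (1.0 + 0.044715 * x * x)
        return 0.5 * x * (1.0 + math.tanh(inner))
    raise ValueError(f"unknown approximate mode: {approximate!r}")


if __name__ == "__main__":
    for mode in ("none", "tanh", "fast"):
        print(mode, gelu_scalar(1.0, approximate=mode))
```

Keeping the variants behind one keyword argument means callers switch approximations without changing the function they call, which is presumably the appeal of the functional form over a separate activation class.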