7 activation functions and their implementation
Implementation of 7 activation functions in C and Zig
Machine Learning has been gaining in popularity recently, and I dare say it has coerced me into jumping down this rabbit hole.
Vast and wild as this topic may be, I thought it worth noting down my meagre understanding in a blog.
In a neural network, activation functions are - as the name suggests - functions that determine whether a particular node/neuron should be activated.
The output of each node in a neural network is determined by its individual inputs and weights.
In this post we take a look at implementations in multiple languages: C, Zig, and WGSL - mostly because I find it easier to understand a topic with a strongly-typed language.
Today’s lil’ stars are:
- Binary Step
- Sigmoid
- ReLU
- SiLU/Swish
- Softmax
- SwiGLU
- GEGLU
Binary Step
The most basic activation function. It returns 1 if the input is greater than 0 and returns 0 otherwise.
Implementation:
// In C
float step(float x) {
    return x > 0 ? 1.0f : 0.0f;
}
// In Zig
fn step(comptime T: type, x: T) T {
    return @floatFromInt(@intFromBool(x > 0));
}
Sigmoid
This function produces a smooth continuous curve in an ‘S’ shape.
It is best used for predicting the probability of an output; questions such as
- chance of rain tomorrow?
- how spammy is that post?
Implementation:
// In C
#include <math.h> // for expf

float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}
// In WGSL
fn sigmoid(x: f32) -> f32 {
    return 1.0 / (1.0 + exp(-x));
}
// In Zig
fn sigmoid(comptime T: type, x: T) T {
    return 1.0 / (1.0 + @exp(-x));
}
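As a quick sanity check (a toy example of my own, not from the implementations above), the C version squashes any input into the (0, 1) range:

// Usage example in C
#include <stdio.h>
#include <math.h>

float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}

int main(void) {
    printf("%f\n", sigmoid(-2.0f)); // ~0.119
    printf("%f\n", sigmoid(0.0f));  // 0.5
    printf("%f\n", sigmoid(2.0f));  // ~0.881
    return 0;
}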
Rectified Linear Unit (ReLU)
A simple function that returns 0 if x is below 0; otherwise it returns x.
Due to this, only a select number of neurons/nodes are activated, which makes the model computationally efficient.
However, just as you would expect, if you don’t use all your brain cells, there are some things you won’t be able to do very well.
Implementation:
// In C
float relu(float x) {
    if (x < 0) return 0;
    return x;
}
// In WGSL
fn relu(x: f32) -> f32 {
    return max(0.0, x);
}
// In Zig
fn relu(comptime T: type, x: T) T {
    return @max(0, x);
}
Sigmoid Linear Unit (SiLU)
Also known as Sigmoid shrinkage (SiL) or the Swish_1 function, it is a smoother approximation of ReLU which uses the sigmoid function.
It is most popularly used for object detection, such as in YOLO.
Expanding, we get: silu(x) = x * sigmoid(x) = x / (1 + e^(-x))
Implementation:
// In C
float silu(float x) {
    return x / (1.0f + expf(-x));
}
// In WGSL
fn silu(x: f32) -> f32 {
    return x / (1.0 + exp(-x));
}
// In Zig
fn silu(comptime T: type, x: T) T {
    return x / (1.0 + @exp(-x));
}
Softmax
This form of activation is useful when you want to classify into multiple cases.
For example, if you were predicting the weather today, you could have multiple classes like Sunny, Cloudy, Rainy, etc…
It is also what turns the raw scores inside the Attention mechanism into weights.
Each element is exponentiated and divided by the sum of all the exponents: softmax(x_i) = e^(x_i) / sum_j e^(x_j). Let’s break it down for our implementation,
// In C
float* softmax(float* x, int length) {
    // Pre-calculate the sum of e^x_j
    float sum_ex = 0.0f;
    for (int j = 0; j < length; j++) sum_ex += expf(x[j]);
    // Perform softmax on each element in x (in place)
    for (int i = 0; i < length; i++) x[i] = expf(x[i]) / sum_ex;
    return x;
}
For our Zig implementation we’ll use Vectors to take advantage of SIMD.
// In Zig
const Vec = @Vector(3, f32); // replace '3' with any length
fn softmax(x: Vec) Vec {
    const e = @exp(x);
    const sum_ex: Vec = @splat(@reduce(.Add, e)); // broadcast the scalar sum back to a vector
    return e / sum_ex;
}
If you were hoping to clap your hands and move on to the next function - I have bad news for you.
This implementation is unfortunately prone to overflow (rounding off to infinity) and underflow (rounding off to zero). A trick to fix these issues is to subtract the maximum value among x from all the elements of x.
// Stable Version in Zig
fn stable_softmax(x: Vec) Vec {
    const max: Vec = @splat(@reduce(.Max, x)); // find the max value
    const z = x - max; // subtract the max from all elements
    const e = @exp(z);
    const sum: Vec = @splat(@reduce(.Add, e)); // sum of exponents
    return e / sum; // normalize
}
Swish-Gated Linear Unit (SwiGLU)
Built on top of Swish as a variant of the Gated Linear Unit (GLU), it is the go-to activation function for modern models such as LLaMA.
According to experiments done on T5, SwiGLU outperforms both ReLU and Swish.
It is worth noting that, unlike a classic FFN, a SwiGLU layer needs three weight matrices (a sketch of such a layer follows the implementations below).
Expanding, SwiGLU(a, b) = Swish_1(a) ⊗ b, where b comes from the extra (third) linear projection; in the implementation below it is passed in as w3.
// In C
float swiglu(float x, float w3) {
    x *= 1.0f / (1.0f + expf(-x)); // here we do, x * sigmoid(x)
    return w3 * x;
}
// In Zig - Using Vectors
const Vec = @Vector(3, f32);
fn swiglu(x: Vec, w3: Vec) Vec {
    const one: Vec = @splat(1.0);
    const silu = x / (one + @exp(-x)); // x * sigmoid(x)
    return w3 * silu;
}
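To make the three-weight-matrices point concrete, here is a rough sketch of how a SwiGLU feed-forward layer is typically wired (LLaMA-style): out = W2 * (silu(W1*x) ⊙ (W3*x)). The naive matvec helper, the buffer layout, and the names are my own illustration, not taken from any particular codebase.

// Sketch of a SwiGLU feed-forward layer in C
#include <math.h>

// naive matrix-vector product: out[rows] = W[rows x cols] * x[cols]
static void matvec(const float* W, const float* x, float* out, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        out[r] = 0.0f;
        for (int c = 0; c < cols; c++) out[r] += W[r * cols + c] * x[c];
    }
}

// x: [dim], W1/W3: [hidden x dim], W2: [dim x hidden], h1/h3: [hidden] scratch, out: [dim]
void ffn_swiglu(const float* x, const float* W1, const float* W2, const float* W3,
                float* h1, float* h3, float* out, int dim, int hidden) {
    matvec(W1, x, h1, hidden, dim);   // up projection
    matvec(W3, x, h3, hidden, dim);   // gate projection
    for (int i = 0; i < hidden; i++)
        h1[i] = (h1[i] / (1.0f + expf(-h1[i]))) * h3[i]; // silu(W1*x) ⊙ (W3*x)
    matvec(W2, h1, out, dim, hidden); // down projection
}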
Gaussian Error Gated Linear Unit (GEGLU)
Introduced in the same paper as SwiGLU, GEGLU slightly outperforms SwiGLU.
The main difference between SwiGLU and GEGLU is that instead of the SiLU/Swish_1 function we use Gaussian Error Linear Units (GELU).
Approximating GELU with the sigmoid function, GELU(x) ≈ x * sigmoid(1.702 * x), we can create an approximate implementation.
NOTE: this is an approximation of GELU and hence GEGLU; it is not an exact implementation.
// In C
float geglu(float x, float w3) {
    x *= 1.0f / (1.0f + expf(-1.702f * x)); // here we do, x * sigmoid(1.702 * x)
    return w3 * x;
}
// In Zig
const Vec = @Vector(3, f32);
fn geglu(x: Vec, w3: Vec) Vec {
    const one: Vec = @splat(1.0);
    const k: Vec = @splat(1.702);
    const gelu = x / (one + @exp(-k * x)); // x * sigmoid(1.702 * x)
    return w3 * gelu;
}
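For reference, the exact (non-approximate) GELU gate can be written with the error function from <math.h>; this sketch is only for comparison with the approximation above, and the function names are my own.

// Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
#include <math.h>

float gelu_exact(float x) {
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}

// Exact GEGLU gate, for comparison with the approximation above
float geglu_exact(float x, float w3) {
    return w3 * gelu_exact(x);
}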