7 activation functions and their implementation

Implementation of 7 activation functions in C, Zig, and WGSL


Machine Learning has been gaining popularity recently, and I dare say it has coerced me into jumping down this rabbit hole.

Vast and wild as this topic may be, I thought it was worth jotting down my meagre understanding in a blog post.

In a neural network, activation functions are, as the name suggests, functions that determine whether a particular node/neuron should be activated.

The output of each node in a neural network is determined by its individual inputs and weights.
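
To make that concrete: a node computes a weighted sum of its inputs plus a bias, and that sum is passed through an activation function to produce the node’s output. Here is a minimal C sketch of the idea (the name node_output and its parameters are just illustrative, and any of the activation functions below can be plugged in):

// In C

// output of one node: activation(w · x + b)
float node_output(float (*activation)(float),
                  const float* inputs, const float* weights,
                  int length, float bias) {
    float sum = bias; // start from the bias, then add the weighted inputs
    for (int i = 0; i < length; i++) sum += weights[i] * inputs[i];
    return activation(sum); // any of the activation functions below
}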

In this post we take a look at implementations in multiple languages: C, Zig, and WGSL - mostly because I find it easier to understand a topic in a strongly-typed language.

Today’s lil’ stars are

  • Binary Step
  • Sigmoid
  • ReLU
  • SiLU/Swish
  • Softmax
  • SwiGLU
  • GEGLU


Binary Step

The most basic activation function. It returns 1 if the input is greater than 0 and returns 0 otherwise.

step(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}

Implementation:

// In C
float step(float x) {
    return x > 0 ? 1.0f : 0.0f;
}
// In Zig
fn step(T: type, x: T) T {
  return @floatFromInt(@intFromBool(x > 0));
}


Sigmoid

This function produces a smooth continuous curve in an ‘S’ shape.

It is best used for predicting the probability of an outcome, for questions such as:

  • chance of rain tomorrow?
  • how spammy is that post?

sigmoid(x) = \frac{1}{1 + e^{-x}}

Implementation:

// In C
#include <math.h> // for expf, used by the C snippets below

float sigmoid(float x) {
    return 1.0 / (1.0 + expf(-x));
}
// In WGSL
fn sigmoid(x: f32) -> f32 {
  return 1.0 / (1.0 + exp(-x));
}
// In Zig
fn sigmoid(T: type, x: T) T {
  return 1.0 / (1.0 + @exp(-x));
}
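
As a quick sanity check of the C version: sigmoid(0) is exactly 0.5, and large positive or negative inputs saturate towards 1 and 0. A small usage sketch (the sample inputs are arbitrary):

// In C
#include <stdio.h>

float sigmoid(float x); // defined above

int main(void) {
    printf("%f\n", sigmoid(0.0f));  // 0.500000
    printf("%f\n", sigmoid(6.0f));  // ~0.997527, close to 1
    printf("%f\n", sigmoid(-6.0f)); // ~0.002473, close to 0
    return 0;
}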


Rectified Linear Unit (ReLU)

A simple function that returns 0 if x is below 0, and returns x otherwise.

Because of this, only a subset of neurons/nodes are activated at any one time, which makes the model computationally efficient.

However, just as you would expect, if you don’t use all of your brain cells, there are some things you won’t be able to do very well.

ReLU(x) = \max(0, x)

Implementation:

// In C
float relu(float x) {
    if (x < 0) return 0;
    return x;
}
// In WGSL
fn relu(x: f32) -> f32 {
  return max(0, x);
}
// In Zig
fn relu(T: type, x: T) T {
    return @max(0, x);
}


Sigmoid Linear Unit (SiLU)

Also called Sigmoid shrinkage (SiL) or the Swish-1 function, it is a smoother approximation of ReLU which uses the sigmoid function.

It is popularly used in object detection models such as YOLO.

silu(x) = x \cdot sigmoid(x)

expanding, we get

silu(x) = x \cdot \left( \frac{1}{1 + e^{-x}} \right)
silu(x) = \frac{x}{1 + e^{-x}}

Implementation:

// In C
float silu(float x) {
    return x / (1.0 + expf(-x));
}
// In WGSL
fn silu(x: f32) -> f32 {
  return x / (1.0 + exp(-x));
}
// In Zig
fn silu(T: type, x: T) T {
  return x / (1.0 + @exp(-x));
}

Softmax

This form of activation is useful when you want to classify an input into multiple classes.

For example, if you were predicting the weather today, you could have multiple classes like Sunny, Cloudy, Rainy, etc…

It is also commonly used as the final step of the attention mechanism, turning raw attention scores into weights.

softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}

Let’s break it down for our implementation,

// In C
float* softmax(float* x, int length) {
    // Pre-calculate the sum of e^x_j
    float sum_ex = 0.0f;
    for (int j = 0; j < length; j++) sum_ex += expf(x[j]);

    // Perform softmax on each element in x, in place
    for (int i = 0; i < length; i++) x[i] = expf(x[i]) / sum_ex;

    return x;
}

For our Zig implementation we’ll use Vectors to take advantage of SIMD.

// In Zig

const Vec = @Vector(3, f32); // replace '3' with any length

fn softmax(x: Vec) Vec {
    const e = @exp(x);
    const sum_ex: Vec = @splat(@reduce(.Add, e)); // broadcast the scalar sum back to a vector
    return e / sum_ex;
}

If you were hoping to clap your hands and move on to the next function - I have bad news for you.

This implementation is unfortunately prone to overflow (rounding off to infinity) and underflow (rounding off to zero). A trick to fix these issues is to subtract the maximum value of x from every element of x.

// Stable Version in Zig
fn stable_softmax(x: Vec) Vec {
    const max: Vec = @splat(@reduce(.Max, x)); // find the max value and broadcast it
    const z = x - max; // subtract the max from every element
    const e = @exp(z);
    const sum: Vec = @splat(@reduce(.Add, e)); // sum of exponents
    return e / sum; // normalize
}
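
The same trick carries over to the C version from earlier. A minimal sketch, assuming length >= 1 (the name stable_softmax is mine):

// In C
#include <math.h>

float* stable_softmax(float* x, int length) {
    // find the maximum element
    float max = x[0];
    for (int i = 1; i < length; i++) if (x[i] > max) max = x[i];

    // sum of e^(x_j - max)
    float sum_ex = 0.0f;
    for (int j = 0; j < length; j++) sum_ex += expf(x[j] - max);

    // normalize in place
    for (int i = 0; i < length; i++) x[i] = expf(x[i] - max) / sum_ex;

    return x;
}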


Swish-Gated Linear Unit (SwiGLU)

Built on top of Swish as a variant of the Gated Linear Unit (GLU), it is the go-to activation function for modern models such as LLaMA.

According to experiments done on T5, SwiGLU outperforms both ReLU and Swish.

It is worth noting that, unlike the classic FFN, a SwiGLU layer needs three weight matrices (a sketch of how they fit together follows the snippets below).

SwiGLU(x) = W_3 \cdot silu(x)

expanding,

SwiGLU(x) = W_3 \cdot \left( \frac{x}{1 + e^{-x}} \right)
SwiGLU(x) = \frac{W_3 \cdot x}{1 + e^{-x}}

Implementation:

// In C
float swiglu(float x, float w3) {
    x *= 1.0f / (1.0f + expf(-x)); // here we do x * sigmoid(x), i.e. silu(x)
    return w3 * x;
}
// In Zig - Using Vectors
const Vec = @Vector(3, f32);
fn swiglu(x: Vec, w3: Vec) Vec {
    const one: Vec = @splat(1.0);
    const silu = x / (one + @exp(-x));
    return w3 * silu;
}
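
To see where the three weight matrices mentioned earlier actually live, here is a rough C sketch of a SwiGLU feed-forward block in the style used by LLaMA-like models. The names W1, W2, W3, the helper matvec, and the sizes D_MODEL and D_FF are illustrative assumptions, not lifted from any particular codebase:

// In C
#include <math.h>

#define D_MODEL 4 // input/output width (illustrative)
#define D_FF    8 // hidden width (illustrative)

// y = W * x, where W is 'rows x cols' stored row-major
static void matvec(const float* W, const float* x, float* y, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        y[r] = 0.0f;
        for (int c = 0; c < cols; c++) y[r] += W[r * cols + c] * x[c];
    }
}

// out = W2 * ( silu(W1 * x) elementwise-times (W3 * x) )
void swiglu_ffn(const float* W1, const float* W2, const float* W3,
                const float* x, float* out) {
    float gate[D_FF], lin[D_FF], h[D_FF];
    matvec(W1, x, gate, D_FF, D_MODEL); // gate path
    matvec(W3, x, lin, D_FF, D_MODEL);  // linear path
    for (int i = 0; i < D_FF; i++)
        h[i] = (gate[i] / (1.0f + expf(-gate[i]))) * lin[i]; // silu(gate) * lin
    matvec(W2, h, out, D_MODEL, D_FF); // project back down to D_MODEL
}

Read this way, the scalar swiglu above corresponds to the inner silu(...) * w3 gating step, applied element by element.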


Gaussian Error Gated Linear Unit (GEGLU)

Introduced in the same paper as SwiGLU, GEGLU slightly outperforms SwiGLU.

The main difference between SwiGLU and GEGLU is that instead of the SiLU/Swish-1 function, we use the Gaussian Error Linear Unit (GELU).

GEGLU(x) = W_3 \cdot gelu(x)

Approximating GELU with the sigmoid function, we can create an approximate implementation:

GELU(x) \approx x \cdot sigmoid(1.702 \cdot x)

NOTE: this is an approximation of GELU (and hence of GEGLU), not an exact implementation.

// In C
float geglu(float x, float w3) {
    x *= 1.0f / (1.0f + expf(-1.702f * x)); // x * sigmoid(1.702 * x)
    return w3 * x;
}
// In Zig
const Vec = @Vector(3, f32);

fn geglu(x: Vec, w3: Vec) Vec {
    const one: Vec = @splat(1.0);
    const k: Vec = @splat(1.702);
    const gelu = x / (one + @exp(k * -x));
    return w3 * gelu;
}