7 activation functions and their implementation
Implementation of 7 activation functions in C and Zig
Machine Learning has been gaining in popularity recently, and I dare say it has coerced me into jumping down this rabbit hole.
Vast and wild as this topic may be, I thought it worth noting down my meagre understanding in a blog.
In a neural network, activation functions are - as the name suggests - functions that determine whether a particular node/neuron should be activated.
The output of each node in a neural network is determined by its individual inputs and weights.
In this post we take a look at implementations in multiple languages: C, Zig, and WGSL - mostly because I find it easier to understand a topic with a strongly-typed language.
Today’s lil’ stars are:
- Binary Step
- Sigmoid
- ReLU
- SiLU/Swish
- Softmax
- SwiGLU
- GEGLU
Binary Step
The most basic activation function. It returns 1 if the input is greater than 0 and returns 0 otherwise.
Implementation:
// In C
float step(float x) {
    return x > 0 ? 1.0f : 0.0f;
}
// In Zig
fn step(comptime T: type, x: T) T {
    return @floatFromInt(@intFromBool(x > 0));
}
Sigmoid
This function produces a smooth continuous curve in an ‘S’ shape.
It is best used for predicting the probability of an output; questions such as
- chance of rain tomorrow?
- how spammy is that post?
Implementation:
// In C
#include <math.h> // for expf

float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}
// In WGSL
fn sigmoid(x: f32) -> f32 {
    return 1.0 / (1.0 + exp(-x));
}
// In Zig
fn sigmoid(comptime T: type, x: T) T {
    return 1.0 / (1.0 + @exp(-x));
}
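As a quick sanity check (a toy example of my own, not from the implementations above), the C version squashes any input into the (0, 1) range:

// Usage example in C
#include <stdio.h>
#include <math.h>

float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}

int main(void) {
    printf("%f\n", sigmoid(-2.0f)); // ~0.119
    printf("%f\n", sigmoid(0.0f));  // 0.5
    printf("%f\n", sigmoid(2.0f));  // ~0.881
    return 0;
}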
Rectified Linear Unit (ReLU)
A simple function that returns 0 if x is below 0; otherwise it returns x.
Due to this, only a select number of neurons/nodes are activated, which makes the model computationally efficient.
However, just as you would expect, if you don’t use all your brain cells, there are some things you won’t be able to do very well.
Implementation:
// In C
float relu(float x) {
    if (x < 0) return 0;
    return x;
}
// In WGSL
fn relu(x: f32) -> f32 {
    return max(0.0, x);
}
// In Zig
fn relu(comptime T: type, x: T) T {
    return @max(0, x);
}
Sigmoid Linear Unit (SiLU)
Also known as Sigmoid shrinkage (SiL) or the Swish_1 function, it is a smoother approximation of ReLU which uses the sigmoid function.
It is most popularly used for object detection, such as in YOLO.
Expanding, we get: silu(x) = x * sigmoid(x) = x / (1 + e^(-x))
Implementation:
// In C
float silu(float x) {
    return x / (1.0f + expf(-x));
}
// In WGSL
fn silu(x: f32) -> f32 {
    return x / (1.0 + exp(-x));
}
// In Zig
fn silu(comptime T: type, x: T) T {
    return x / (1.0 + @exp(-x));
}
Softmax
This form of activation is useful when you want to classify into multiple cases.
For example, if you were predicting the weather today, you could have multiple classes like Sunny, Cloudy, Rainy, etc…
It is also what turns the raw scores inside the Attention mechanism into weights.
Each element is exponentiated and divided by the sum of all the exponents: softmax(x_i) = e^(x_i) / sum_j e^(x_j). Let’s break it down for our implementation,
// In C
float* softmax(float* x, int length) {
    // Pre-calculate the sum of e^x_j
    float sum_ex = 0.0f;
    for (int j = 0; j < length; j++) sum_ex += expf(x[j]);
    // Perform softmax on each element in x (in place)
    for (int i = 0; i < length; i++) x[i] = expf(x[i]) / sum_ex;
    return x;
}
For our Zig implementation we’ll use Vectors to take advantage of SIMD.
// In Zig
const Vec = @Vector(3, f32); // replace '3' with any length
fn softmax(x: Vec) Vec {
    const e = @exp(x);
    const sum_ex: Vec = @splat(@reduce(.Add, e)); // broadcast the scalar sum back to a vector
    return e / sum_ex;
}
If you were hoping to clap your hands and move on to the next function - I have bad news for you.
This implementation is unfortunately prone to overflow (rounding off to infinity) and underflow (rounding off to zero). A trick to fix these issues is to subtract the maximum value among x from all the elements of x.
// Stable Version in Zig
fn stable_softmax(x: Vec) Vec {
    const max: Vec = @splat(@reduce(.Max, x)); // find the max value
    const z = x - max; // subtract the max from all elements
    const e = @exp(z);
    const sum: Vec = @splat(@reduce(.Add, e)); // sum of exponents
    return e / sum; // normalize
}
Swish-Gated Linear Unit (SwiGLU)
Built on top of Swish as a variant of the Gated Linear Unit (GLU), it is the go-to activation function for modern models such as LLaMA.
According to experiments done on T5, SwiGLU outperforms both ReLU and Swish.
It is worth noting that, unlike a classic FFN, a SwiGLU layer needs three weight matrices (a sketch of such a layer follows the implementations below).
Expanding, SwiGLU(a, b) = Swish_1(a) ⊗ b, where b comes from the extra (third) linear projection; in the implementation below it is passed in as w3.
// In C
float swiglu(float x, float w3) {
    x *= 1.0f / (1.0f + expf(-x)); // here we do, x * sigmoid(x)
    return w3 * x;
}
// In Zig - Using Vectors
const Vec = @Vector(3, f32);
fn swiglu(x: Vec, w3: Vec) Vec {
    const one: Vec = @splat(1.0);
    const silu = x / (one + @exp(-x)); // x * sigmoid(x)
    return w3 * silu;
}
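To make the three-weight-matrices point concrete, here is a rough sketch of how a SwiGLU feed-forward layer is typically wired (LLaMA-style): out = W2 * (silu(W1*x) ⊙ (W3*x)). The naive matvec helper, the buffer layout, and the names are my own illustration, not taken from any particular codebase.

// Sketch of a SwiGLU feed-forward layer in C
#include <math.h>

// naive matrix-vector product: out[rows] = W[rows x cols] * x[cols]
static void matvec(const float* W, const float* x, float* out, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        out[r] = 0.0f;
        for (int c = 0; c < cols; c++) out[r] += W[r * cols + c] * x[c];
    }
}

// x: [dim], W1/W3: [hidden x dim], W2: [dim x hidden], h1/h3: [hidden] scratch, out: [dim]
void ffn_swiglu(const float* x, const float* W1, const float* W2, const float* W3,
                float* h1, float* h3, float* out, int dim, int hidden) {
    matvec(W1, x, h1, hidden, dim);   // up projection
    matvec(W3, x, h3, hidden, dim);   // gate projection
    for (int i = 0; i < hidden; i++)
        h1[i] = (h1[i] / (1.0f + expf(-h1[i]))) * h3[i]; // silu(W1*x) ⊙ (W3*x)
    matvec(W2, h1, out, dim, hidden); // down projection
}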
Gaussian Error Gated Linear Unit (GEGLU)
Introduced in the same paper as SwiGLU, GEGLU slightly outperforms SwiGLU.
The main difference between SwiGLU and GEGLU is that instead of the SiLU/Swish_1 function we use Gaussian Error Linear Units (GELU).
Approximating GELU with the sigmoid function, GELU(x) ≈ x * sigmoid(1.702 * x), we can create an approximate implementation.
NOTE: this is an approximation of GELU and hence GEGLU; it is not an exact implementation.
// In C
float geglu(float x, float w3) {
    x *= 1.0f / (1.0f + expf(-1.702f * x)); // here we do, x * sigmoid(1.702 * x)
    return w3 * x;
}
// In Zig
const Vec = @Vector(3, f32);
fn geglu(x: Vec, w3: Vec) Vec {
    const one: Vec = @splat(1.0);
    const k: Vec = @splat(1.702);
    const gelu = x / (one + @exp(-k * x)); // x * sigmoid(1.702 * x)
    return w3 * gelu;
}
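For reference, the exact (non-approximate) GELU gate can be written with the error function from <math.h>; this sketch is only for comparison with the approximation above, and the function names are my own.

// Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
#include <math.h>

float gelu_exact(float x) {
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}

// Exact GEGLU gate, for comparison with the approximation above
float geglu_exact(float x, float w3) {
    return w3 * gelu_exact(x);
}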