Introduction
This is meant as an educational reference for anyone trying to understand neural networks at the most basic level. Sometimes it helps to just see some raw for loops and arrays to understand what is going on under the hood.
From scratch?
The best way to understand how something works is to understand it at every level. When I go looking for how things work, the last thing I want to see is someone's code that does an import giga-library and then calls a function from that library that hides all the underlying logic. So everything here, from parsing the MNIST dataset to performing the linear algebra, will be done with the Zig standard library and plain for loops.
If you want to understand the linear algebra behind a simple neural network, I highly recommend 3blue1brown's video series on neural networks.
Want to just see the code? Github
Loading the MNIST dataset
First we need to get the MNIST dataset into memory. The easiest way is to use Zig's built-in JSON parser, which means we need the MNIST dataset as JSON. I pulled my data from this Github repo. Unzip it and store it locally as raw JSON files.
We can now create a Zig struct representing each one of our images and parse them into memory.
// main.zig
const std = @import("std");
const Image = struct {
image: [784]f64,
label: u8,
};
pub fn main() !void {
// Parse MNIST Training and Test Data
// Initialize allocator
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const alloc = gpa.allocator();
const cwd = std.fs.cwd();
const trainFileContents = try cwd.readFileAlloc(alloc, "./mnist/mnist_handwritten_train.json", 500_000_000); // Roughly 500 MB
defer _ = alloc.free(trainFileContents);
const testFileContents = try cwd.readFileAlloc(alloc, "./mnist/mnist_handwritten_test.json", 74_000_000); // Roughly 74 MB
defer _ = alloc.free(testFileContents);
std.debug.print("Parsing MNIST Training Data...", .{});
const data_train = try std.json.parseFromSlice([]Image, alloc, trainFileContents, .{});
defer data_train.deinit();
std.debug.print("Done!\n", .{});
std.debug.print("Parsing MNIST Test Data...", .{});
const data_test = try std.json.parseFromSlice([]Image, alloc, testFileContents, .{});
defer data_test.deinit();
std.debug.print("Done!\n", .{});
}
Initialize our weights and biases
We will write some helper functions here for initializing our weight and bias matrices. The goal is to generate random numbers close to 0. The expression prng.random().float(f64) * 0.1 - 0.05 generates a random number between -0.05 and 0.05. Other examples may show random values between -1 and 1, but with f64 values I found the network would collapse, or fail to train, when the initial weights and biases were not close enough to zero.
Our network will have the shape 784 -> 20 -> 10: the input is a 28x28 pixel image (784 pixels), followed by a hidden layer of 20 neurons and an output layer of 10 neurons (one per label in our dataset). A hidden layer of 20 was picked because it is small enough to run all of this code on a single thread while keeping the matrix sizes easy to track throughout the code.
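Written out, the forward pass we will build is:

z_1 = W_1 x + b_1, \qquad a_1 = \mathrm{ReLU}(z_1)
z_2 = W_2 a_1 + b_2, \qquad a_2 = \mathrm{softmax}(z_2)

where x is the 784-element input image, W_1 is 20 x 784, b_1 has 20 elements, W_2 is 10 x 20, and b_2 has 10 elements, so a_2 is the 10-element output.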
// main.zig
//...
// simple random number generator
var prng = std.Random.DefaultPrng.init(42);
fn random_vector(comptime m: usize) [m]f64 {
    var output: [m]f64 = undefined;
    for (&output) |*val| {
        val.* = prng.random().float(f64) * 0.1 - 0.05;
    }
    return output;
}
fn random_matrix(comptime m: usize, comptime n: usize) [m][n]f64 {
    var output: [m][n]f64 = undefined;
    for (&output) |*row| {
        for (row) |*val| {
            val.* = prng.random().float(f64) * 0.1 - 0.05;
        }
    }
    return output;
}
pub fn main() !void {
// ...
// Hidden layer 1
// weights: 20 x 784
// neurons: 20
// bias: 20
var w1: [20][784]f64 = random_matrix(20, 784);
var b1: [20]f64 = random_vector(20);
// Hidden layer 2 (output layer)
// weights: 10 x 20
// neurons: 10
// bias: 10
var w2: [10][20]f64 = random_matrix(10, 20);
var b2: [10]f64 = random_vector(10);
// Define vectors for the outputs
var z1: [20]f64 = undefined;
var a1: [20]f64 = undefined;
var z2: [10]f64 = undefined;
var a2: [10]f64 = undefined;
}
We have also created our outputs z1, a1, z2, and a2 for use later.
Forward Propagation
To keep things simple, we only take one training sample at a time. In a real neural network you would "vectorize" the computation, pushing the entire dataset (or large batches of it) through the network at once.
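For reference (we will not implement this), vectorizing means stacking N samples as the columns of a single 784 x N input matrix X, so one layer's activations for the whole batch come from a single matrix product:

Z_1 = W_1 X + b_1 \mathbf{1}_N^\top, \qquad A_1 = \mathrm{ReLU}(Z_1)

where b_1 \mathbf{1}_N^\top simply broadcasts the bias across all N columns.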
The activation function
For each neuron's output we need to apply an activation function. There are many different types and you can really just pick whatever you would like, but ReLU is one of the most common. All the ReLU function does is, for each value in an m x 1 matrix, output 0 if the value is negative and output the value itself if it is positive. We can easily define a helper function for this using the builtin @max() function.
// main.zig
// ...
pub fn ReLU(comptime T: type, comptime m: usize, x: [m]T) [m]T {
var output: [m]T = undefined;
for (&output, x) |*out, in| {
out.* = @max(0, in);
}
return output;
}
// ...
The network's non-normalized output
The output of our network is a 10 x 1 matrix. Each row maps to a digit guess: row 0 is the guess for 0, row 1 the guess for 1, and so on. Right now the raw values are not very useful; it would be more helpful if each row held a value from 0 to 1 with all rows summing to 1. To do this we perform a "softmax" on the output. In Zig, @exp() is the builtin for e^x.
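In equation form, the numerically stable softmax implemented below subtracts the largest element before exponentiating (which does not change the result but prevents overflow):

\mathrm{softmax}(v)_i = \frac{e^{\,v_i - \max_j v_j}}{\sum_k e^{\,v_k - \max_j v_j}}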
// main.zig
//...
pub fn softmax(comptime T: type, comptime m: usize, v: [m]T) [m]T {
var max: T = v[0];
var output: [m]T = undefined;
var sum: T = 0.0;
// Get max value
for (v[1..]) |val| {
if (val > max) {
max = val;
}
}
// Find exponentials relative to max value (this prevents overflow)
for (&output, v) |*out, val| {
out.* = @exp(val - max);
sum += out.*;
}
// Normalize
for (&output) |*x| {
x.* /= sum;
}
return output;
}
// ...
The feed-forward
This is the core operation of a neural network. When we "input" data into the network we are performing a forward propagation, or feed-forward. The network outputs a 10 x 1 matrix that represents what the network "thinks" the data is: if the value in row 8 is the largest, the network has guessed that the image is an 8. We also write some helper functions for adding two vectors and for multiplying a matrix by a vector.
// main.zig
// ...
pub fn add(comptime T: type, comptime m: usize, v1: [m]T, v2: [m]T) [m]T {
var output: [m]T = undefined;
for (&output, v1, v2) |*out, x, y| {
out.* = x + y;
}
return output;
}
pub fn dot(comptime T: type, comptime m: usize, comptime n: usize, mat1: [m][n]T, mat2: [n]T) [m]T {
    var output: [m]T = undefined;
    for (mat1, 0..) |row, row_idx| {
        // Accumulate into a local sum so we never add to an undefined value.
        var sum: T = 0;
        for (row, mat2) |col1, col2| {
            sum += col1 * col2;
        }
        output[row_idx] = sum;
    }
    return output;
}
// ...
pub fn main() !void {
// ...
// Forward propagation
z1 = dot(f64, 20, 784, w1, input);
z1 = add(f64, 20, z1, b1);
a1 = ReLU(f64, 20, z1);
z2 = dot(f64, 10, 20, w2, a1);
z2 = add(f64, 10, z2, b2);
a2 = softmax(f64, 10, z2);
}
Back-propagation
Back-propagation is how we determine how much to adjust each of our weights and biases. It is where the majority of our computation time goes.
For the back-propagation, the first thing we need to do is take the label for our training sample and create a 10 x 1 matrix that contains a 1 in the correct digit's row and 0's everywhere else. We will call this a one_hot() function. We then subtract the one-hot encoded label from our network's output.
// main.zig
// ...
fn one_hot(label: u8) [10]f64 {
var output: [10]f64 = .{0.0} ** 10;
output[label] = 1.0;
return output;
}
// element wise subtract function for vectors
pub fn sub(comptime T: type, comptime m: usize, v1: [m]T, v2: [m]T) [m]T {
var output: [m]T = undefined;
for (&output, v1, v2) |*val, x, y| {
val.* = x - y;
}
return output;
}
// ...
pub fn main() !void {
// ...
const y = one_hot(label);
const dz2: [10]f64 = sub(f64, 10, a2, y);
}
dz2 represents our back-propagation through the softmax that was performed during the forward propagation.
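For reference, the reason this step is so simple is that when softmax is paired with a cross-entropy loss (the standard pairing, and the one assumed here), the gradient of the loss with respect to z2 collapses to a plain difference:

dz_2 = a_2 - y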
Now that we have dz2 we can calculate the change in weights and biases for the second (output) layer. We iterate over the raw values in our matrices, but what we are computing is dz2 dot a1 (transposed). We also scale each entry by 1/60,000, since that is the number of samples we iterate over each epoch. The change in the second layer's bias is the scaled sum of all the elements in dz2, so we also write a small helper function for summing a vector.
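In equation form, with N = 60{,}000 training samples, the quantities computed below are:

dW_2 = \frac{1}{N}\, dz_2\, a_1^\top, \qquad db_2 = \frac{1}{N} \sum_{i} (dz_2)_i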
// main.zig
// ...
pub fn vsum(comptime T: type, comptime m: usize, v: [m]T) T {
var sum: T = 0.0;
for (v) |val| {
sum += val;
}
return sum;
}
// ...
pub fn main() !void {
// ...
// dw2 = dot(dz2, a1.transpose())
var dw2: [10][20]f64 = .{.{0.0} ** 20} ** 10;
for (dz2, 0..) |row, m| {
for (a1, 0..) |col, n| {
dw2[m][n] = (1.0 / 60000.0) * row * col;
}
}
const db2: f64 = (1.0 / 60000.0) * vsum(f64, 10, dz2);
}
Back-propagating through our ReLU function requires taking the derivative of the ReLU. This is incredibly simple: the derivative is 0 wherever the input was negative (the flat part) and 1 wherever it was positive (the part with slope 1). So here is our helper function.
// main.zig
// ...
pub fn ReLU_deriv(comptime T: type, x: T) T {
return if (x < 0) 0.0 else 1.0;
}
// ...
We can now perform the same kind of step to get dz1 by calculating w2 (transposed) dot dz2, multiplied element-wise by ReLU_deriv(z1).
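In equation form, where \odot denotes element-wise multiplication:

dz_1 = \left(W_2^\top dz_2\right) \odot \mathrm{ReLU}'(z_1)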
// main.zig
// ...
pub fn main() !void {
// ...
var dz1: [20]f64 = .{0.0} ** 20;
// dot( w2.T , dz2) * ReLU_deriv(z1);
for (0..20) |i| {
for (0..10) |j| {
dz1[i] += w2[j][i] * dz2[j];
}
dz1[i] *= ReLU_deriv(f64, z1[i]);
}
}
Using dz1 we can now find our change in weights and bias for the first layer.
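The formulas mirror the second layer, with the raw 784-element input x taking the place of a1:

dW_1 = \frac{1}{N}\, dz_1\, x^\top, \qquad db_1 = \frac{1}{N} \sum_{i} (dz_1)_i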
// main.zig
// ...
pub fn main() !void {
// ...
// dw1 = dot(dZ1, input.T)
var dw1: [20][784]f64 = .{.{0.0} ** 784} ** 20;
for (dz1, 0..) |row, m| {
for (input, 0..) |col, n| {
dw1[m][n] = (1.0 / 60000.0) * row * col;
}
}
const db1: f64 = (1.0 / 60000.0) * vsum(f64, 20, dz1);
}
Update the network
Now that we have found dw2, db2, dw1, and db1 we can apply each of these to our weight and bias matrices to update the network with what it has learned. We need an alpha value here, the learning rate, which determines how fast we train the network. It is usually chosen by trial and error; in this network alpha = 0.002 produced fairly good results.
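Each parameter is nudged a small step against its gradient:

W \leftarrow W - \alpha\, dW, \qquad b \leftarrow b - \alpha\, db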
// main.zig
// ...
const alpha: f64 = 0.002;
pub fn main() !void {
// ...
// Update w1
for (&w1, 0..) |*row, m| {
for (row, 0..) |*val, n| {
val.* -= alpha * dw1[m][n];
}
}
// Update b1
for (&b1) |*val| {
val.* -= alpha * db1;
}
// Update w2
for (&w2, 0..) |*row, m| {
for (row, 0..) |*val, n| {
val.* -= alpha * dw2[m][n];
}
}
// Update b2
for (&b2) |*val| {
val.* -= alpha * db2;
}
}
Looping over the entire training set
We can now take our three steps of training the neural network (forward propagation, back-propagation, update) and loop over the entire dataset. Each loop over the dataset is called an "epoch", and we will need multiple epochs to see the network learn. We will also add logic for checking the network against the test dataset to determine how accurate it is. In more advanced neural network applications you would typically monitor this accuracy and save the weights, so training can stop automatically once the test accuracy reaches a maximum and the model begins to overfit the training data (a minimal sketch of that idea follows the training loop below).
We will also write another helper function called vmax that returns the index of the output vector's largest value so we can check whether the network's guess matches our label.
// main.zig
// ...
pub fn vmax(comptime T: type, v: [10]T) usize {
var output: usize = 0;
var max: T = 0.0;
for (v, 0..) |i, idx| {
if (i > max) {
max = i;
output = idx;
}
}
return output;
}
pub fn main() !void {
// ...
var right: f32 = 0.0;
for (0..50) |epoch| {
std.debug.print("[{d}] ", .{epoch});
std.debug.print("Accuracy: {d:.2}%\n", .{100 * (right / 10_000.0)});
right = 0;
for (data_train.value) |data| {
// Input layer (Data)
const input: [784]f64 = data.image;
const label: u8 = data.label;
// Forward propagation
// ...
// Backwards propagation
// ...
// Update network
// ...
}
// Check the test set
for (data_test.value) |data| {
// Input layer (Data)
const input: [784]f64 = data.image;
const label: u8 = data.label;
////////////////////////////////////////////////
// Image Classification /////////////////////////
////////////////////////////////////////////////
z1 = dot(f64, 20, 784, w1, input);
z1 = add(f64, 20, z1, b1);
a1 = ReLU(f64, 20, z1);
z2 = dot(f64, 10, 20, w2, a1);
z2 = add(f64, 10, z2, b2);
a2 = softmax(f64, 10, z2);
// Check if guess is right
if (vmax(f64, a2) == @as(usize, label)) {
right += 1;
}
}
} // end of epoch loop
}
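As a rough sketch of the early-stopping idea mentioned above (this is not part of the program shown here; best_accuracy, epochs_since_best, and the best_* parameter copies are hypothetical variables you would declare before the epoch loop, and the patience of 5 epochs is arbitrary), the end of each epoch could look something like this:

// Sketch only: runs at the end of each epoch, after the test-set pass
// has counted this epoch's correct guesses into `right`.
const accuracy = 100.0 * (right / 10_000.0);
if (accuracy > best_accuracy) {
    // New best result: remember it and snapshot the parameters.
    best_accuracy = accuracy;
    epochs_since_best = 0;
    best_w1 = w1;
    best_b1 = b1;
    best_w2 = w2;
    best_b2 = b2;
} else {
    epochs_since_best += 1;
    if (epochs_since_best >= 5) {
        // Test accuracy has stopped improving; keep the saved best
        // parameters and stop training.
        std.debug.print("Stopping early at epoch {d}\n", .{epoch});
        break;
    }
}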
Code output
When we run this program we can see the accuracy climb with each epoch as the network makes more and more correct guesses. With a 784 -> 20 -> 10 network we should not expect anything extraordinary in terms of accuracy. Depending on how the weights and biases are initialized and on the value of alpha, the network may train faster (or slower) and settle at a more or less accurate state.
$ zig build run
////////// Parsing MNIST Training Data ///////////
/////////////////// Complete /////////////////////
/////////// Parsing MNIST Test Data //////////////
/////////////////// Complete /////////////////////
Epoch
[0] Accuracy: 0.00%
[1] Accuracy: 22.18%
[2] Accuracy: 29.55%
[3] Accuracy: 35.50%
[4] Accuracy: 40.12%
[5] Accuracy: 43.94%
[6] Accuracy: 47.35%
[7] Accuracy: 50.07%
[8] Accuracy: 52.29%
[9] Accuracy: 54.30%
...
[44] Accuracy: 75.85%
[45] Accuracy: 76.04%
[46] Accuracy: 76.25%
[47] Accuracy: 76.46%
[48] Accuracy: 76.63%
[49] Accuracy: 76.82%
Github link to the full code? Github