Bias trick in neural networks
Turning two tensors (weights and biases) into one (weights with biases) for simpler computation in neural networks
The bias trick simplifies a linear operation y = W * xi + b so it doesn't require adding a separate bias term b: the bias is folded into the weights matrix W, and we perform only a multiplication instead of a multiplication and an addition.
Instead of keeping one tensor with weights (W) and another with biases (b), we can append the biases as an extra column at the tail of the weights tensor and add a constant 1 (the bias dimension) to the vector with the training data (xi).
We can visualize this in Python:
import numpy as np
# Define the matrix W
W = np.array([
[0.2, -0.5, 0.1, 2.0],
[1.5, 1.3, 2.1, 0.0],
[0.0, 0.25, 0.2, -0.3]
])
# Define the vector xi
xi = np.array([56, 231, 24, 2])
# Define the bias vector b
b = np.array([1.1, 3.2, -1.2])
# Combine W and b into a new matrix W_new
W_new = np.hstack((W, b.reshape(-1, 1)))
# Define the augmented xi vector with an extra element for the bias
xi_augmented = np.append(xi, 1)
# Print the results
print("W:\n", W)
print("\nxi:\n", xi)
print("\nb:\n", b)
print("\nW_new:\n", W_new)
print("\nxi_augmented:\n", xi_augmented)
The output is:
W:
[[ 0.2 -0.5 0.1 2. ]
[ 1.5 1.3 2.1 0. ]
[ 0. 0.25 0.2 -0.3 ]]
xi:
[ 56 231 24 2]
b:
[ 1.1 3.2 -1.2]
W_new:
[[ 0.2 -0.5 0.1 2. 1.1 ]
[ 1.5 1.3 2.1 0. 3.2 ]
[ 0. 0.25 0.2 -0.3 -1.2 ]]
xi_augmented:
[ 56 231 24 2 1]
Does this trick really work?
>>> original = np.dot(W, xi) + b
>>> original
array([-96.8 , 437.9 , 60.75])
>>> tricky = np.dot(W_new, xi_augmented)
>>> tricky
array([-96.8 , 437.9 , 60.75])
>>> original == tricky
array([ True, True, True])
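As a side note, the same trick scales beyond a single input vector. Here is a minimal sketch (the batch values are made up for illustration): instead of appending a single 1 to xi, we append a row of ones to a whole batch of inputs stacked as columns, and one matrix multiplication handles every bias at once. Comparison is done with np.allclose, since the order of floating-point additions can differ between the two formulas.

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])
W_new = np.hstack((W, b.reshape(-1, 1)))

# A hypothetical batch of three inputs, one per column, shape (4, 3)
X = np.array([[56, 231, 24, 2],
              [13, 45, 12, 7],
              [4, 8, 15, 16]], dtype=float).T

# Append a row of ones: one bias dimension per input in the batch
X_augmented = np.vstack((X, np.ones((1, X.shape[1]))))  # shape (5, 3)

batched_plain = W @ X + b.reshape(-1, 1)  # bias broadcast over columns
batched_trick = W_new @ X_augmented       # single multiplication

print(np.allclose(batched_plain, batched_trick))
```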
Okay, but why does it work?
What's your intuition, before you read the answer (unless you already know)?
…
…
The bias trick works because of how the dot product works
In a matrix-vector product, each output element is the dot product of one row of W with xi: elements at corresponding positions are multiplied, and the products are summed
So when we append the bias vector as the last column of the weights matrix, each bias element becomes the last element of a row of the new matrix
That last element is then multiplied by the last element of the augmented xi vector, which is 1, so the bias simply gets added to the sum
Essentially we do the same computations, but without having to store the biases in a separate tensor
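We can make this mechanism concrete by expanding the dot product for one output element by hand, using the same tensors as above:

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])
xi = np.array([56, 231, 24, 2])

W_new = np.hstack((W, b.reshape(-1, 1)))
xi_augmented = np.append(xi, 1)

# Expand the dot product for the first output element term by term:
# the first four products are the original W[0] . xi,
# and the fifth product is b[0] * 1 -- the bias sneaks back into the sum.
manual = sum(w * x for w, x in zip(W_new[0], xi_augmented))

print(manual)
print(W[0] @ xi + b[0])  # same value, computed the classic way
```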
And that's it!
Btw I was too lazy to type the sample data for the tensors used here, so I took sample data from https://cs231n.github.io/linear-classify/ and asked ChatGPT to transform the tensors into Python code. CS231n is a great resource on neural networks, worth checking out!
Thanks for reading