
Understanding the Mathematics of Multi-Head Attention in Transformers

1: Introduction

1.1: Overview of Transformers

The Transformer model, as outlined by Vaswani et al. in "Attention is All You Need," has significantly influenced deep learning, particularly in the field of natural language processing (NLP). Its self-attention mechanism allows it to process entire input sequences simultaneously, enhancing both computational speed and the ability to manage long-range dependencies. If this concept seems new to you, don’t worry; we’ll clarify it by the end of this article. First, let’s examine the basic structure of a Transformer.

A Transformer is composed of two primary components: the encoder and the decoder. The encoder transforms the input sequence into a continuous representation, while the decoder generates the output sequence based on this representation. Both components consist of multiple layers, each featuring two key elements: a multi-head self-attention mechanism and a position-wise feed-forward network. While this article will primarily focus on the multi-head attention aspect, we will delve into the complete Transformer architecture in future discussions.

1.2: Introduction to Multi-Head Attention

Multi-head attention allows the model to simultaneously concentrate on different segments of the input sequence, capturing various characteristics of the data. Imagine it as multiple spotlights illuminating different performers on a stage. Each spotlight (or "head") can highlight distinct features of the data, enabling the audience (or model) to perceive the entire scene more clearly. By dividing the input into numerous subspaces, each with its own attention mechanism, multi-head attention provides the model with multiple perspectives of the input data, improving its understanding of complex interrelations.

This method empowers the Transformer to discern diverse relationships within the data by focusing on different segments of the sequence. It enhances the learning process by providing various viewpoints of the input, improving the model's ability to generalize. Additionally, it increases the model's expressiveness by allowing it to learn different aspects of the input data concurrently.

These features make multi-head attention a vital part of the success of Transformer models across diverse applications, from language translation to image analysis.

2: Mathematical Foundations

2.1: The Attention Mechanism

The attention mechanism in neural networks is designed to emulate the human capacity to focus on certain information while processing data. When reading, for instance, your eyes do not concentrate equally on every word; they prioritize significant terms that aid in understanding the narrative. Similarly, attention in neural networks enables the model to dynamically assess the significance of different input elements, allowing it to focus on parts of the input sequence that are more relevant for output generation, enhancing performance in tasks like language translation and text summarization.

Mathematically, the attention mechanism can be articulated using a set of queries, keys, and values. Let’s represent the input as a set of queries Q, keys K, and values V, which are typically linear transformations of the input data.

The attention scores are derived by computing the dot product of the query with each key, yielding a measure of similarity. For a query q and keys k_1, k_2, …, k_n, the attention scores can be expressed as:

score(q, k_i) = q · k_i,  for i = 1, …, n

This process can be likened to assessing how similar each word (key) in a sentence is to the word (query) you’re concentrating on, where higher scores indicate greater similarity.

To avoid excessively large dot products, particularly with high-dimensional vectors, we scale the scores by the square root of the dimension of the keys, d_k:

score(q, k_i) = (q · k_i) / √d_k

This adjustment is akin to modulating the spotlight's intensity based on the stage size, ensuring that the scores remain manageable and aiding in maintaining stable gradients during training. By scaling, we ensure that the values sent to the softmax function have a standard deviation close to 1, which is crucial for stable gradient behavior.

To understand the necessity of scaling, consider the properties of dot products of high-dimensional vectors. If the components of two vectors q and k_i of dimension d_k are independent with zero mean and unit variance, their dot product has zero mean but a variance proportional to d_k. Without scaling, as d_k increases the dot products become very large in magnitude, which can cause the softmax function to yield near-binary outputs (i.e., probabilities close to 0 or 1). Such sharpness diminishes the model's learning capability, as it leads to very small gradients.

Dividing the dot product by √d_k brings its standard deviation back to roughly 1, keeping the inputs to the softmax function within a reasonable range, which is essential for effective and stable learning.
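
To see this effect numerically, here is a small illustrative sketch (the sizes and seed are arbitrary, and the exact numbers will vary) comparing the softmax of raw and scaled scores:

import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)
K = rng.standard_normal((10, d_k))   # 10 keys

raw_scores = K @ q                   # variance grows with d_k
scaled_scores = raw_scores / np.sqrt(d_k)

print(np.round(softmax(raw_scores), 3))     # typically near one-hot: almost all weight on a single key
print(np.round(softmax(scaled_scores), 3))  # a much smoother distribution over the keys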

These scaled scores are subsequently passed through a softmax function to derive the attention weights. The softmax function converts the scores into probabilities that indicate the relative significance of each key concerning the query:

α_i = exp(score(q, k_i)) / Σ_j exp(score(q, k_j))

This stage is akin to transforming the adjusted spotlight intensities into a clear ranking, accentuating the most relevant aspects of the scene.

The final attention output is achieved by computing a weighted sum of the values using the attention weights:

Attention(q, K, V) = Σ_i α_i v_i

In this equation, v_i represents the value corresponding to key k_i. This weighted sum integrates the most pertinent information from the values, akin to focusing on the key elements of a book to enhance comprehension.
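
Putting the four steps together, here is a minimal NumPy sketch of attention for a single query against a set of keys and values (the function name and sizes are illustrative, not part of the implementation we build later):

import numpy as np

def single_query_attention(q, K, V):
    # q: (d_k,), K: (n, d_k), V: (n, d_v)
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)             # scaled dot products (steps 1-2)
    weights = np.exp(scores - scores.max())   # softmax over the keys (step 3)
    weights /= weights.sum()
    return weights @ V                        # weighted sum of the values (step 4)

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 8))
print(single_query_attention(q, K, V).shape)  # (8,)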

2.2: Exploring Multi-Head Attention

Multi-head attention is a sophisticated variant of the attention mechanism, allowing a model to simultaneously focus on different segments of the input sequence, capturing various relationships within the data. Instead of utilizing a single attention mechanism, multi-head attention divides the input into multiple "heads," each possessing its own set of queries, keys, and values. Each head independently conducts the attention operation, and their outputs are subsequently combined. This approach amplifies the model's ability to comprehend intricate patterns and dependencies in the data.

Consider trying to comprehend a complex scene with numerous elements. If you had several pairs of eyes, each observing different parts, you would gain a more comprehensive understanding. Similarly, multi-head attention enables the model to examine different sections of the input data concurrently, yielding a richer and more detailed representation.

Given an input sequence X, we project it into queries Q, keys K, and values V through learned linear transformations. For each head i, we have distinct weight matrices W_Q^i, W_K^i, and W_V^i:

Q_i = X W_Q^i,  K_i = X W_K^i,  V_i = X W_V^i

These projections enable each head to focus on different facets of the input data. For each head i, we calculate the attention scores using the scaled dot-product attention mechanism. The attention output for head i is expressed as:

head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i

Here, d_k is the dimension of the key vectors, ensuring proper scaling of the scores.

Once we compute the attention outputs for all heads, we concatenate them along the feature dimension. If there are h heads, each yielding an output of dimension d_v, the concatenated output will have a dimension of h × d_v:

Concat(head_1, head_2, …, head_h)

The concatenated output is then projected back to the original input dimension d using a learned weight matrix W_O:

MultiHead(X) = Concat(head_1, head_2, …, head_h) W_O

The central concept behind combining multiple attention heads is to enable the model to capture different types of information from the input sequence simultaneously. By employing multiple heads, each can learn to focus on distinct parts of the input or various features, leading to a richer and more nuanced representation of the data.
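
As a quick shape sketch with hypothetical sizes (a model dimension of 512 split across h = 8 heads, so d_v = 64, over a sequence of 10 positions):

import numpy as np

d_model, h = 512, 8
d_v = d_model // h                                    # 64 dimensions per head
n = 10                                                # sequence length
heads = [np.random.rand(n, d_v) for _ in range(h)]    # stand-ins for head_1, ..., head_h
concat = np.concatenate(heads, axis=-1)               # (10, 512): heads joined along the feature axis
W_O = np.random.rand(h * d_v, d_model)
output = concat @ W_O                                 # (10, 512): projected back to the model dimension
print(concat.shape, output.shape)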

2.3: Position-wise Feed-Forward Networks

In the Transformer architecture, each layer comprises a multi-head attention mechanism followed by a position-wise feed-forward network. These feed-forward layers are applied independently to each position in the sequence, hence the term “position-wise.” They are essentially straightforward fully connected neural networks applied uniformly to each position of the input sequence.

Picture a factory where every product on a conveyor belt passes through the same set of machines. Each machine processes the product in a specific manner, adding or refining something. Likewise, each position in the sequence is processed independently by the feed-forward layers, transforming and enhancing the representation.

The purpose of these feed-forward layers is to introduce non-linearity and increased learning capacity to the model. After the attention mechanism aggregates information from various parts of the sequence, the feed-forward network processes this data to further transform and refine the representation.

Mathematically, a position-wise feed-forward network consists of two linear transformations with a ReLU activation function in between. For an input x at a specific position, the feed-forward network can be represented as:

FFN(x) = max(0, xW1 + b1)W2 + b2

Here:
  • W1 and W2 are learned weight matrices.
  • b1 and b2 are learned bias vectors.
  • max(0, xW1 + b1) signifies the ReLU activation function applied element-wise.

The input x is first linearly transformed using the weight matrix W1 and bias b1:

z = xW1 + b1

Think of this step as the input passing through the first machine in the factory, which applies initial modifications based on learned weights and biases.

Following the linear transformation is a ReLU activation function that introduces non-linearity:

a = ReLU(z) = max(0, z)

ReLU (Rectified Linear Unit) sets all negative values to zero, enabling the model to capture non-linear relationships within the data. This step ensures that only positive contributions from the first machine are passed forward.

Subsequently, the activated output undergoes a second linear transformation using weight matrix W2 and bias b2:

FFN(x) = aW2 + b2

This final step further refines the output, similar to the second machine in the factory making additional adjustments to yield a finished product.

The position-wise feed-forward network in the Transformer architecture further processes the information captured by the multi-head attention mechanism. While the attention mechanism helps the model focus on various parts of the sequence and aggregate context-specific information, the feed-forward network refines and transforms this information at each position, thus enhancing the model’s capability to capture intricate patterns and dependencies.
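
Here is a minimal NumPy sketch of a position-wise feed-forward network (the sizes are hypothetical; in practice the hidden layer is usually wider than the model dimension):

import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (sequence_length, d_model); the same weights are applied at every position
    hidden = np.maximum(0, x @ W1 + b1)   # first linear transformation followed by ReLU
    return hidden @ W2 + b2               # second linear transformation

d_model, d_ff, n = 8, 32, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (5, 8)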

3: Implementing Multi-Head Attention from Scratch

In this section, we will walk through the implementation of a multi-head attention mechanism from the ground up using Python and NumPy. The objective is to understand how the input is transformed at each step of the process. Before diving into the details, take a moment to skim the code we will cover in this section to get a general overview; we will examine each line thoroughly afterward.

The full code for this section is available at models-from-scratch-python/Multi-Head Attention/demo.py, a repo where I recreate some popular machine learning models from scratch in Python.

To start, we define the MultiHeadAttention class, which is responsible for managing the parameters required for the multi-head attention mechanism. Let's break down the setup step by step.

import numpy as np

class MultiHeadAttention:
    def __init__(self, num_hiddens, num_heads, dropout=0.0, bias=False):
        self.num_heads = num_heads
        self.num_hiddens = num_hiddens
        self.d_k = self.d_v = num_hiddens // num_heads

In the initialization method, we first set the number of attention heads and the total number of hidden units in the model. These values are provided as arguments when the class is instantiated.

  • num_hiddens: The total number of hidden units in the model, crucial for determining the size of linear transformations applied to the input data.
  • num_heads: The number of attention heads, enabling the model to learn to focus on different parts of the input concurrently.
  • dropout: The dropout rate, included here for completeness but not utilized in this implementation.
  • bias: A boolean flag indicating whether to include bias terms in the linear transformations.

Next, we compute the dimensions for the queries and values for each head. Since the total number of hidden units (num_hiddens) is divided among all heads (num_heads), each head will have a query and value dimension of num_hiddens // num_heads.
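
For example, with the sizes we will use in the demo at the end of this article:

num_hiddens, num_heads = 100, 5
d_k = d_v = num_hiddens // num_heads   # 20: each head works with 20-dimensional queries, keys, and values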

        self.W_q = np.random.rand(num_hiddens, num_hiddens)
        self.W_k = np.random.rand(num_hiddens, num_hiddens)
        self.W_v = np.random.rand(num_hiddens, num_hiddens)
        self.W_o = np.random.rand(num_hiddens, num_hiddens)

We then create the weight matrices for the query, key, value, and output transformations, all randomly initialized:

  • W_q: Transforms input data into queries with dimensions num_hiddens x num_hiddens.

  • W_k: Transforms input data into keys, also num_hiddens x num_hiddens.

  • W_v: Transforms input data into values, maintaining the same dimensions.

  • W_o: Transforms the concatenated output of all heads back to the original input dimensions.

        if bias:
            self.b_q = np.random.rand(num_hiddens)
            self.b_k = np.random.rand(num_hiddens)
            self.b_v = np.random.rand(num_hiddens)
            self.b_o = np.random.rand(num_hiddens)
        else:
            self.b_q = self.b_k = self.b_v = self.b_o = np.zeros(num_hiddens)

Finally, we initialize the bias vectors for the queries, keys, values, and output transformations. If the bias parameter is True, these biases are randomly initialized; otherwise, they are set to zero:

  • b_q: Bias for the query transformation.
  • b_k: Bias for the key transformation.
  • b_v: Bias for the value transformation.
  • b_o: Bias for the output transformation.

The biases have dimensions equal to the number of hidden units, num_hiddens.

By establishing these weights and biases, we ensure that each attention head can independently learn to focus on different segments of the input data.

Next, we define methods to prepare and transform the data for multi-head attention. First, let’s examine the transpose_qkv method:

    def transpose_qkv(self, X):
        X = X.reshape(X.shape[0], X.shape[1], self.num_heads, -1)
        X = X.transpose(0, 2, 1, 3)
        return X.reshape(-1, X.shape[2], X.shape[3])

This method reshapes and transposes the input data in preparation for multi-head attention. Specifically:

X = X.reshape(X.shape[0], X.shape[1], self.num_heads, -1)

This line reshapes the input tensor X to have four dimensions: (batch_size, sequence_length, num_heads, depth_per_head).

  • X.shape[0]: The batch size.
  • X.shape[1]: The sequence length (number of positions in the input sequence).
  • self.num_heads: The number of attention heads.
  • -1: Automatically infers the size of the last dimension (depth per head) to keep the total element count constant.

X = X.transpose(0, 2, 1, 3)

This line transposes the tensor, reordering it to (batch_size, num_heads, sequence_length, depth_per_head).

This rearrangement ensures that each attention head can independently process its portion of the input sequence.

return X.reshape(-1, X.shape[2], X.shape[3])

This final reshape flattens the batch and head dimensions into a single dimension, resulting in a tensor of shape (batch_size * num_heads, sequence_length, depth_per_head).

By utilizing transpose_qkv, we ensure that the input data is effectively divided among multiple heads, with each head having the appropriate dimensions for processing its segment of the data.
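
To make the reshaping concrete, here is a quick shape walk-through mirroring the three lines above (assuming NumPy imported as np, with a batch of 2 sequences, 4 positions, 100 hidden units, and 5 heads):

X = np.random.rand(2, 4, 100)
step1 = X.reshape(2, 4, 5, -1)         # (2, 4, 5, 20): split the features across the heads
step2 = step1.transpose(0, 2, 1, 3)    # (2, 5, 4, 20): bring the head axis forward
step3 = step2.reshape(-1, 4, 20)       # (10, 4, 20): merge the batch and head axes
print(step1.shape, step2.shape, step3.shape)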

Next, we define the transpose_output method:

    def transpose_output(self, X):
        X = X.reshape(-1, self.num_heads, X.shape[1], X.shape[2])
        X = X.transpose(0, 2, 1, 3)
        return X.reshape(X.shape[0], X.shape[1], -1)

This method reverses the transformation performed by transpose_qkv, combining the outputs from all heads back into the original shape.
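
Because transpose_output undoes transpose_qkv, a round trip should reproduce the original tensor exactly. A quick check, assuming NumPy imported as np and the class assembled as above:

attn = MultiHeadAttention(num_hiddens=100, num_heads=5)
X = np.random.rand(2, 4, 100)
split = attn.transpose_qkv(X)          # (10, 4, 20)
merged = attn.transpose_output(split)  # (2, 4, 100)
print(np.allclose(X, merged))          # True: the two methods are exact inverses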

After we transpose our matrices, we can proceed with the scaled dot-product attention mechanism, which allows the model to concentrate on different segments of the input sequence with varying degrees of importance.

    def scaled_dot_product_attention(self, Q, K, V, valid_lens):
        d_k = Q.shape[-1]
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
        if valid_lens is not None:
            mask = np.arange(scores.shape[-1]) < valid_lens[:, None]
            scores = np.where(mask[:, None, :], scores, -np.inf)
        attention_weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        attention_weights /= attention_weights.sum(axis=-1, keepdims=True)
        return np.matmul(attention_weights, V)

The inputs to this method are the query (Q), key (K), and value (V) matrices, which are derived from the input data through linear transformations.

d_k = Q.shape[-1]

Here, we extract the dimension of the key vectors, d_k, from the last dimension of the query matrix Q. This value is used for scaling the attention scores.

scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)

We compute the attention scores by performing a matrix multiplication of Q with the transpose of K and scaling the result by the square root of d_k. This scaling is essential to prevent the scores from becoming excessively large, which can create issues during the softmax calculation. If valid_lens is provided, positions beyond each sequence's valid length are masked by setting their scores to negative infinity, so they receive zero attention weight after the softmax. The final three lines implement a numerically stable softmax: we subtract the row-wise maximum before exponentiating, normalize so the weights along each row sum to 1, and multiply the attention weights with V to obtain the weighted sum of the values.
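
To sanity-check the masking, here is a small sketch (again assuming the class as assembled above): with a valid length of 3, the result should equal attention computed over only the first three key-value pairs.

attn = MultiHeadAttention(num_hiddens=100, num_heads=5)
Q = np.random.rand(1, 2, 20)   # (batch * heads, num_queries, d_k) for a single head slice
K = np.random.rand(1, 6, 20)
V = np.random.rand(1, 6, 20)
masked = attn.scaled_dot_product_attention(Q, K, V, np.array([3]))
truncated = attn.scaled_dot_product_attention(Q, K[:, :3], V[:, :3], None)
print(np.allclose(masked, truncated))  # True: positions beyond the valid length receive zero weight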

Next, we define the forward pass method to process the input data through the multi-head attention mechanism. This method orchestrates the entire multi-head attention process, from transforming the input data to combining the outputs from multiple heads.

    def forward(self, queries, keys, values, valid_lens):
        queries = self.transpose_qkv(np.dot(queries, self.W_q) + self.b_q)
        keys = self.transpose_qkv(np.dot(keys, self.W_k) + self.b_k)
        values = self.transpose_qkv(np.dot(values, self.W_v) + self.b_v)
        if valid_lens is not None:
            valid_lens = np.repeat(valid_lens, self.num_heads, axis=0)
        output = self.scaled_dot_product_attention(queries, keys, values, valid_lens)
        output_concat = self.transpose_output(output)
        return np.dot(output_concat, self.W_o) + self.b_o

Let’s break down the forward method:

queries = self.transpose_qkv(np.dot(queries, self.W_q) + self.b_q)
keys = self.transpose_qkv(np.dot(keys, self.W_k) + self.b_k)
values = self.transpose_qkv(np.dot(values, self.W_v) + self.b_v)

First, the input queries, keys, and values are projected into their respective subspaces using the learned weight matrices (W_q, W_k, W_v) and biases (b_q, b_k, b_v). This is achieved through matrix multiplication with the weight matrices, followed by adding the biases. The results are then transformed for multi-head attention using the transpose_qkv method, ensuring each head processes its input independently.

  • Queries, keys, and values are the transformed inputs, now ready for multi-head attention.

if valid_lens is not None:
    valid_lens = np.repeat(valid_lens, self.num_heads, axis=0)

If valid_lens are provided, they are repeated for each head. This guarantees the creation of the appropriate mask for each attention head, allowing the model to focus solely on valid positions within the sequences.

output = self.scaled_dot_product_attention(queries, keys, values, valid_lens)

The method calls scaled_dot_product_attention with the transformed queries, keys, values, and repeated valid lengths. This function calculates the attention scores, applies the softmax function to acquire attention weights, and computes the weighted sum of the values to produce the attention output for each head.

output_concat = self.transpose_output(output)
return np.dot(output_concat, self.W_o) + self.b_o

After obtaining the attention outputs from all heads, the method concatenates these outputs along the feature dimension using transpose_output. This method reverses the initial transformation, merging the outputs from all heads into a single representation. The concatenated output is then transformed back to the original input dimension using a final linear transformation with weight matrix W_o and bias b_o.

Lastly, we can test the class with some sample data. Here’s how to do it:

# Define dimensions and initialize multi-head attention
num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_heads, dropout=0.5, bias=False)

We initialize the MultiHeadAttention class with 100 hidden units and 5 attention heads. This sets up the necessary parameters and weight matrices for the multi-head attention mechanism.

# Define sample data
batch_size, num_queries, num_kvpairs = 2, 4, 6
valid_lens = np.array([3, 2])
X = np.random.rand(batch_size, num_queries, num_hiddens)  # Simulated input queries
Y = np.random.rand(batch_size, num_kvpairs, num_hiddens)  # Simulated key-value pairs

We create random data to simulate input queries (X) and key-value pairs (Y). The batch size is 2, the number of queries is 4, and the number of key-value pairs is 6. We also define valid lengths (valid_lens) to indicate the valid positions within the sequences.

# Apply multi-head attention
output = attention.forward(X, Y, Y, valid_lens)

We pass the sample data through the multi-head attention mechanism using the forward method. This processes the input queries, keys, and values while applying the multi-head attention calculations.

print("Output shape:", output.shape) # Expected output: (2, 4, 100) print("Output data:", output)

We print the shape and contents of the output. The expected shape, (2, 4, 100), confirms that the output dimensions match those of the original input.

Now that you have a grasp of how the multi-head attention mechanism operates, feel free to experiment with it. For example, try changing the number of heads, adding feed-forward networks before and after the attention block, or using it in a machine translation task to observe it in action. Let me know if you'd like to explore this in a future article.

Conclusion

Transformers have revolutionized deep learning, particularly in NLP, by utilizing self-attention mechanisms that enable parallel processing of input sequences. This approach accelerates computation and manages long-range dependencies more effectively than traditional recurrent neural networks.

In this article, we’ve developed a thorough understanding of multi-head attention in Transformers, from its mathematical theory to practical code implementation. While the concepts may initially seem abstract, as the outputs of Multi-Head Attention alone do not directly yield actionable results, we will soon see how they play a crucial role in the Transformer architecture, the foundation of well-known LLMs like Claude and ChatGPT. Stay tuned for future articles, where we will delve deeper into the remaining components of the Transformer architecture, providing further insights into this powerful model.

References

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. In Advances in Neural Information Processing Systems (NeurIPS).
  • Alammar, J. (2018). The Illustrated Transformer. jalammar.github.io.
