
输入 X: [B, L, d_model]
Q/K/V 权重:[d_model, d_model] (合头写法,拆开后每头是 [d_model, d_k])
多头时:先全量 linear 得 [B, L, d_model],再 view/reshape 成 [B, L, num_heads, d_k],再 permute 成 [B, num_heads, L, d_k]
先用简单的 Self-Attention 捋一遍数据流动的过程:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention.

    Projects the input into query/key/value spaces of width ``d_k``,
    computes attention weights with softmax over the key dimension,
    and returns the weighted sum of values.
    """

    def __init__(self, embed_dim, d_k):
        """
        Args:
            embed_dim: input feature size (d_model of the incoming tokens).
            d_k: projection width for Q/K/V (also the output feature size).
        """
        super().__init__()
        self.embed_dim = embed_dim
        self.d_k = d_k
        # Each projection maps embed_dim -> d_k (single head; no head split here).
        self.W_Q = nn.Linear(embed_dim, d_k)
        self.W_K = nn.Linear(embed_dim, d_k)
        self.W_V = nn.Linear(embed_dim, d_k)

    def forward(self, x):
        """
        Args:
            x: input tensor of shape [batch_size, seq_len, embed_dim].

        Returns:
            Attention output of shape [batch_size, seq_len, d_k].
        """
        # NOTE: projections go embed_dim -> d_k, so Q/K/V are [B, L, d_k]
        # (not [B, L, d_model] unless d_k == embed_dim).
        Q = self.W_Q(x)  # [B, L, d_k]
        K = self.W_K(x)  # [B, L, d_k]
        V = self.W_V(x)  # [B, L, d_k]
        # Scaled dot-product scores: [B, L, L]; divide by sqrt(d_k) to keep
        # logits at a scale where softmax gradients stay healthy.
        score = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        attn_weights = F.softmax(score, dim=-1)  # [B, L, L], rows sum to 1
        att_output = torch.matmul(attn_weights, V)  # [B, L, d_k]
        # BUG FIX: the original ended with a bare `att_output` expression,
        # so forward() implicitly returned None.
        return att_output


