Introduction to Prefix Tuning

This post introduces Prefix Tuning by walking through the original paper together with the PEFT source code.

To follow this post you should:

  • Understand NLP and the Transformer architecture
  • Be familiar with PyTorch and Python 3

Prefix Tuning offers a lighter-weight alternative to full fine-tuning for natural language generation (NLG) tasks.

In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation (NLG) tasks, inspired by prompting.

Where the trainable parameters live

As shown in the figure below, the trainable parameters sit to the left of the Transformer activations: Prefix Tuning inserts trainable parameters into every hidden layer.

Meanwhile, this is less expressive than intervening all layers of the activations (§7.2), which avoids long-range dependencies and includes more tunable parameters. Prefix-tuning, therefore, optimizes all layers of the prefix.

Parameter structure

# model.py
# File Path: https://github.com/huggingface/peft/blob/main/src/peft/tuners/prefix_tuning/model.py
# Based on https://github.com/THUDM/P-tuning-v2/blob/main/model/prefix_encoder.py
# line 21
import torch


class PrefixEncoder(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.prefix_projection = config.prefix_projection
        token_dim = config.token_dim
        num_layers = config.num_layers
        encoder_hidden_size = config.encoder_hidden_size
        num_virtual_tokens = config.num_virtual_tokens
        if self.prefix_projection and not config.inference_mode:
            # Use a two-layer MLP to encode the prefix
            self.embedding = torch.nn.Embedding(num_virtual_tokens, token_dim)
            self.transform = torch.nn.Sequential(
                torch.nn.Linear(token_dim, encoder_hidden_size),
                torch.nn.Tanh(),
                torch.nn.Linear(encoder_hidden_size, num_layers * 2 * token_dim),
            )
        else:
            self.embedding = torch.nn.Embedding(num_virtual_tokens, num_layers * 2 * token_dim)
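For completeness, the forward method of PrefixEncoder (a few lines further down in the same file) does roughly the following: look up the embedding, and push it through the MLP only when the projection is enabled.

# PrefixEncoder.forward, lightly paraphrased from the same file
def forward(self, prefix: torch.Tensor):
    if self.prefix_projection:
        prefix_tokens = self.embedding(prefix)           # (batch, num_virtual_tokens, token_dim)
        past_key_values = self.transform(prefix_tokens)  # (batch, num_virtual_tokens, num_layers * 2 * token_dim)
    else:
        past_key_values = self.embedding(prefix)         # already (batch, num_virtual_tokens, num_layers * 2 * token_dim)
    return past_key_values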

In self.embedding = torch.nn.Embedding(num_virtual_tokens, num_layers * 2 * token_dim) at the end of the initializer:

  • num_virtual_tokens: the length of the inserted prefix, i.e. prefix_length in the figure above
  • num_layers: the number of layers of the transformer backbone
  • token_dim: the dimension of a token after embedding; in Transformers it usually equals encoder_hidden_size, the hidden dimension in the figure

In num_layers * 2 * token_dim, the factor 2 provides parameters for both key and value. The part to the left of the dashed line in the figure above shows the structure and number of the trainable parameters.

$h_i^{(n)}$ is composed of a key-value pair. In GPT-2, the dimension of each key and value is 1024.
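As a quick sanity check on these sizes, here is the parameter count without the projection MLP (a minimal sketch; the GPT-2-medium-style numbers are illustrative, not PEFT defaults):

# sketch: how many parameters Prefix Tuning trains without the projection
num_virtual_tokens = 20   # prefix length, illustrative
num_layers = 24           # e.g. GPT-2 medium has 24 layers
token_dim = 1024          # the key/value dimension quoted above

prefix_params = num_virtual_tokens * num_layers * 2 * token_dim
print(prefix_params)      # 983040 -- roughly 1M trainable parameters, vs. ~355M for full fine-tuning of GPT-2 medium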

prefix_projection

The conditional branch if self.prefix_projection and not config.inference_mode: in the code above exists for the following reason:

The paper's authors found that optimizing the prefix embedding directly (a single embedding layer, with no reparametrization) does not converge stably and is sensitive to the learning rate and initialization. They therefore add a fully connected network that projects the prefix into another space. After training, the pre-projection prefix vectors and the projection network can be discarded; only the projected prefix is kept and used as the prepended activations during the forward pass.

In the PEFT framework, prefix_projection is optional: it controls whether this projection is applied.

Empirically, directly updating the $P_{\theta}$ parameters leads to unstable optimization and a slight drop in performance. So we reparametrize the matrix $P_{\theta}[i,:] = \mathrm{MLP}_{\theta}(P'_{\theta}[i,:])$ by a smaller matrix ($P'_{\theta}$) composed with a large feedforward neural network ($\mathrm{MLP}_{\theta}$). Note that $P_{\theta}$ and $P'_{\theta}$ have the same rows dimension (i.e. the prefix length), but different columns dimension. Once training is complete, these reparametrization parameters can be dropped, and only the prefix ($P_{\theta}$) needs to be saved.

We find in preliminary experiments that directly optimizing the prefix is very sensitive to the learning rate and initialization.

$P_{\theta}$ has a dimension of $|P_{\text{idx}}| \times \dim(h_i)$ while $P'_{\theta}$ has a dimension of $|P_{\text{idx}}| \times k$, where we choose $k = 512$ for table-to-text and 800 for summarization. $\mathrm{MLP}_{\theta}$ maps from dimension $k$ to $\dim(h_i)$.
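A minimal sketch of this reparametrization, with illustrative sizes and my own variable names (the PEFT version of the same idea is the self.embedding + self.transform pair shown earlier):

# sketch: train a small P'_theta plus an MLP, keep only the projected prefix afterwards
import torch

prefix_length = 20                  # |P_idx|
k = 512                             # bottleneck size (the paper uses 512 for table-to-text)
num_layers, token_dim = 24, 1024    # the projected prefix covers key+value for every layer

P_prime = torch.nn.Embedding(prefix_length, k)        # P'_theta, the smaller matrix
mlp = torch.nn.Sequential(                            # MLP_theta
    torch.nn.Linear(k, k),
    torch.nn.Tanh(),
    torch.nn.Linear(k, num_layers * 2 * token_dim),
)

P_theta = mlp(P_prime(torch.arange(prefix_length)))   # P_theta[i, :] = MLP_theta(P'_theta[i, :])

# after training, drop P_prime and mlp; only P_theta is needed at inference time
P_theta = P_theta.detach()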

Forward pass

Before introducing the prefix, the model's forward pass is:

$$h_i = LM_{\phi}(z_i, h_{<i})$$

After introducing the prefix (see the figure below), the forward pass becomes ($P_{\theta}$ is the newly added trainable parameter matrix, $P_{\theta} \in \mathbb{R}^{|P_{\text{idx}}| \times \dim(h_i)}$):

$$h_i = \begin{cases} P_{\theta}[i, :] & \text{if } i \in P_{\text{idx}} \\ LM_{\phi}(z_i, h_{<i}) & \text{otherwise} \end{cases}$$

When $i \in P_{\text{idx}}$, $h_i$ depends directly on the trainable parameters $\theta$; when $i \notin P_{\text{idx}}$, $h_i$ depends on all the preceding activations $h_{<i}$ and is therefore still influenced by $\theta$.

Here, $h_i$ (for all $i$) is a function of the trainable $P_{\theta}$. When $i \in P_{\text{idx}}$, this is clear because $h_i$ copies directly from $P_{\theta}$. When $i \notin P_{\text{idx}}$, $h_i$ still depends on $P_{\theta}$, because the prefix activations are always in the left context and will therefore affect any activations to its right.
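To make the "otherwise" case concrete, here is a toy single-head attention computation (not the PEFT implementation, just an illustration): because the prefix keys and values sit in the left context, every later position mixes them into its activation.

# sketch: a token outside P_idx still attends to the trainable prefix keys/values
import torch
import torch.nn.functional as F

d = 8                                               # head dimension, illustrative
prefix_k = torch.randn(5, d, requires_grad=True)    # trainable prefix keys (part of P_theta)
prefix_v = torch.randn(5, d, requires_grad=True)    # trainable prefix values (part of P_theta)

q = torch.randn(1, d)                               # query of some token with i not in P_idx
k = torch.cat([prefix_k, torch.randn(3, d)])        # prefix sits to the left of the real keys
v = torch.cat([prefix_v, torch.randn(3, d)])

attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)        # attention over prefix + real positions
h_i = attn @ v                                      # h_i depends on prefix_k / prefix_v, hence on theta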

In the figure below, the top half is an autoregressive LM and the bottom half an encoder-decoder model. For the prefix activations, $\forall i \in P_{\text{idx}}$, $h_i$ is drawn from the trainable matrix $P_{\theta}$; the remaining activations (hidden states) are produced by the usual Transformer computation.

An annotated example of prefix-tuning using an autoregressive LM (top) and an encoder-decoder model (bottom). The prefix activations $\forall i \in P_{\text{idx}}$, $h_i$ are drawn from a trainable matrix $P_{\theta}$. The remaining activations are computed by the Transformer.

talk is cheap, show me your code.

# peft_model.py
# https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py
# line 1116
class PeftModelForCausalLM(PeftModel):
    # example file is peft_prefix_tuning_clm.ipynb
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        task_ids=None,
        **kwargs,
    ):
        # (excerpt: peft_config and batch_size are set earlier in the real forward)
        if peft_config.peft_type == PeftType.PREFIX_TUNING:
            past_key_values = self.get_prompt(batch_size)
            return self.base_model(
                input_ids=input_ids, inputs_embeds=inputs_embeds, past_key_values=past_key_values, **kwargs
            )

As the end of this code path shows, when Prefix Tuning runs the forward pass it prepares input_ids and past_key_values in advance according to the prefix length, then lets the existing model do the rest. This way of attaching the prefix requires no change to the original model structure.
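In user code, attaching Prefix Tuning to an existing causal LM is just a wrapper call, roughly as in the sketch below (the checkpoint name and num_virtual_tokens are placeholders in the spirit of peft_prefix_tuning_clm.ipynb):

# sketch: wrapping an existing model for Prefix Tuning with peft
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")   # placeholder checkpoint
peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=30)
model = get_peft_model(model, peft_config)    # the base model's structure is left untouched
model.print_trainable_parameters()

Back to the forward signature above: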

  • input_ids: the token indices prepared ahead of time in the data-preparation (prepare_process) stage; see peft_prefix_tuning_clm.ipynb
  • inputs_embeds is None
  • the useful entries in kwargs are attention_mask and labels
  • past_key_values is prepared as follows
# peft_model.py
# https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py
# line 447
def get_prompt(self, batch_size: int, task_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Returns the virtual prompts to use for Peft. Only applicable when using a prompt learning method.
    """
    peft_config = self.active_peft_config
    prompt_encoder = self.prompt_encoder[self.active_adapter]
    prompt_tokens = (
        self.prompt_tokens[self.active_adapter]
        .unsqueeze(0)
        .expand(batch_size, -1)
        .to(prompt_encoder.embedding.weight.device)
    )
    # ... ...
    return past_key_values
  • prompt_encoder: the module whose self.embedding was built in the Parameter structure section above
  • .expand(batch_size, -1): gives every example in the batch the same prefix
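A quick shape trace for the expand step (illustrative sizes):

# sketch: .unsqueeze(0).expand(batch_size, -1) replicates the prefix indices per example
import torch

num_virtual_tokens, batch_size = 30, 4
prompt_tokens = torch.arange(num_virtual_tokens)              # shape (30,)
batched = prompt_tokens.unsqueeze(0).expand(batch_size, -1)   # shape (4, 30): same prefix for every example
print(batched.shape)                                          # torch.Size([4, 30])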

The procedure that then unpacks self.embedding (a num_virtual_tokens × (num_layers * 2 * token_dim) matrix) is not covered in detail here.
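In spirit, though, the omitted part reshapes the flat PrefixEncoder output into one (key, value) pair per layer. Here is a shape-only sketch with illustrative sizes (the real get_prompt also handles encoder-decoder models and multiple adapters):

# sketch: splitting the flat prefix embedding into per-layer key/value tensors
import torch

batch_size, num_virtual_tokens = 4, 30
num_layers, num_heads, token_dim = 24, 16, 1024   # illustrative backbone sizes
head_dim = token_dim // num_heads

flat = torch.randn(batch_size, num_virtual_tokens, num_layers * 2 * token_dim)  # PrefixEncoder output
kv = flat.view(batch_size, num_virtual_tokens, num_layers * 2, num_heads, head_dim)
kv = kv.permute(2, 0, 3, 1, 4)            # (num_layers * 2, batch, heads, num_virtual_tokens, head_dim)
past_key_values = kv.split(2)             # 24 chunks, each holding a (key, value) pair for one layer
print(len(past_key_values), past_key_values[0].shape)   # 24 torch.Size([2, 4, 16, 30, 64])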

References

Prefix-Tuning: Optimizing Continuous Prompts for Generation (Li & Liang, 2021), the paper quoted throughout this post

The Power of Scale for Parameter-Efficient Prompt Tuning (Lester et al., 2021)

huggingface/peft

Transformer-based Bloom

