Introduction to Prefix Tuning

This post introduces Prefix Tuning by walking through the original paper together with the PEFT source code.

To follow this post you should:

  • Understand NLP and the Transformer architecture
  • Be familiar with PyTorch and Python 3

Prefix Tuning offers a lighter-weight alternative to full fine-tuning for natural language generation (NLG) tasks.

In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation (NLG) tasks, inspired by prompting.

Where the trainable parameters live

As shown in the figure below, the trainable parameters sit to the left of the Transformer activations: Prefix Tuning inserts trainable parameters into every hidden layer.

Meanwhile, this is less expressive than intervening all layers of the activations (§7.2), which avoids long-range dependencies and includes more tunable parameters. Prefix-tuning, therefore, optimizes all layers of the prefix.

Parameter structure

# model.py
# File Path: https://github.com/huggingface/peft/blob/main/src/peft/tuners/prefix_tuning/model.py
# Based on https://github.com/THUDM/P-tuning-v2/blob/main/model/prefix_encoder.py
# line 21
import torch


class PrefixEncoder(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.prefix_projection = config.prefix_projection
        token_dim = config.token_dim
        num_layers = config.num_layers
        encoder_hidden_size = config.encoder_hidden_size
        num_virtual_tokens = config.num_virtual_tokens
        if self.prefix_projection and not config.inference_mode:
            # Use a two-layer MLP to encode the prefix
            self.embedding = torch.nn.Embedding(num_virtual_tokens, token_dim)
            self.transform = torch.nn.Sequential(
                torch.nn.Linear(token_dim, encoder_hidden_size),
                torch.nn.Tanh(),
                torch.nn.Linear(encoder_hidden_size, num_layers * 2 * token_dim),
            )
        else:
            self.embedding = torch.nn.Embedding(num_virtual_tokens, num_layers * 2 * token_dim)
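For completeness, the forward method of PrefixEncoder (a few lines further down in the same file) does roughly the following: look up the embedding, and push it through the MLP only when the projection is enabled.

# PrefixEncoder.forward, lightly paraphrased from the same file
def forward(self, prefix: torch.Tensor):
    if self.prefix_projection:
        prefix_tokens = self.embedding(prefix)           # (batch, num_virtual_tokens, token_dim)
        past_key_values = self.transform(prefix_tokens)  # (batch, num_virtual_tokens, num_layers * 2 * token_dim)
    else:
        past_key_values = self.embedding(prefix)         # already (batch, num_virtual_tokens, num_layers * 2 * token_dim)
    return past_key_values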

In self.embedding = torch.nn.Embedding(num_virtual_tokens, num_layers * 2 * token_dim) at the end of the initializer:

  • num_virtual_tokens: the length of the inserted prefix, i.e. prefix_length in the figure above
  • num_layers: the number of layers of the transformer backbone
  • token_dim: the dimension of a token after embedding; in Transformers it usually equals encoder_hidden_size, the hidden dimension in the figure

In num_layers * 2 * token_dim, the factor 2 provides parameters for both key and value. The part to the left of the dashed line in the figure above shows the structure and number of the trainable parameters.

$h_i^{(n)}$ is composed of a key-value pair. In GPT-2, the dimension of each key and value is 1024.
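As a quick sanity check on these sizes, here is the parameter count without the projection MLP (a minimal sketch; the GPT-2-medium-style numbers are illustrative, not PEFT defaults):

# sketch: how many parameters Prefix Tuning trains without the projection
num_virtual_tokens = 20   # prefix length, illustrative
num_layers = 24           # e.g. GPT-2 medium has 24 layers
token_dim = 1024          # the key/value dimension quoted above

prefix_params = num_virtual_tokens * num_layers * 2 * token_dim
print(prefix_params)      # 983040 -- roughly 1M trainable parameters, vs. ~355M for full fine-tuning of GPT-2 medium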

prefix_projection

The conditional branch if self.prefix_projection and not config.inference_mode: in the code above exists for the following reason:

The paper's authors found that optimizing the prefix embedding directly (a single embedding layer, with no reparametrization) does not converge stably and is sensitive to the learning rate and initialization. They therefore add a fully connected network that projects the prefix into another space. After training, the pre-projection prefix vectors and the projection network can be discarded; only the projected prefix is kept and used as the prepended activations during the forward pass.

In the PEFT framework, prefix_projection is optional: it controls whether this projection is applied.

Empirically, directly updating the $P_{\theta}$ parameters leads to unstable optimization and a slight drop in performance. So we reparametrize the matrix $P_{\theta}[i,:] = \mathrm{MLP}_{\theta}(P'_{\theta}[i,:])$ by a smaller matrix ($P'_{\theta}$) composed with a large feedforward neural network ($\mathrm{MLP}_{\theta}$). Note that $P_{\theta}$ and $P'_{\theta}$ have the same rows dimension (i.e. the prefix length), but different columns dimension. Once training is complete, these reparametrization parameters can be dropped, and only the prefix ($P_{\theta}$) needs to be saved.

We find in preliminary experiments that directly optimizing the prefix is very sensitive to the learning rate and initialization.

$P_{\theta}$ has a dimension of $|P_{\text{idx}}| \times \dim(h_i)$ while $P'_{\theta}$ has a dimension of $|P_{\text{idx}}| \times k$, where we choose $k = 512$ for table-to-text and 800 for summarization. $\mathrm{MLP}_{\theta}$ maps from dimension $k$ to $\dim(h_i)$.
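A minimal sketch of this reparametrization, with illustrative sizes and my own variable names (the PEFT version of the same idea is the self.embedding + self.transform pair shown earlier):

# sketch: train a small P'_theta plus an MLP, keep only the projected prefix afterwards
import torch

prefix_length = 20                  # |P_idx|
k = 512                             # bottleneck size (the paper uses 512 for table-to-text)
num_layers, token_dim = 24, 1024    # the projected prefix covers key+value for every layer

P_prime = torch.nn.Embedding(prefix_length, k)        # P'_theta, the smaller matrix
mlp = torch.nn.Sequential(                            # MLP_theta
    torch.nn.Linear(k, k),
    torch.nn.Tanh(),
    torch.nn.Linear(k, num_layers * 2 * token_dim),
)

P_theta = mlp(P_prime(torch.arange(prefix_length)))   # P_theta[i, :] = MLP_theta(P'_theta[i, :])

# after training, drop P_prime and mlp; only P_theta is needed at inference time
P_theta = P_theta.detach()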

Forward pass

Before introducing the prefix, the model's forward pass is:

$$h_i = LM_{\phi}(z_i, h_{<i})$$

After introducing the prefix (see the figure below), the forward pass becomes ($P_{\theta}$ is the newly added trainable parameter matrix, $P_{\theta} \in \mathbb{R}^{|P_{\text{idx}}| \times \dim(h_i)}$):

$$h_i = \begin{cases} P_{\theta}[i, :] & \text{if } i \in P_{\text{idx}} \\ LM_{\phi}(z_i, h_{<i}) & \text{otherwise} \end{cases}$$

When $i \in P_{\text{idx}}$, $h_i$ depends directly on the trainable parameters $\theta$; when $i \notin P_{\text{idx}}$, $h_i$ depends on all the preceding activations $h_{<i}$ and is therefore still influenced by $\theta$.

Here, $h_i$ (for all $i$) is a function of the trainable $P_{\theta}$. When $i \in P_{\text{idx}}$, this is clear because $h_i$ copies directly from $P_{\theta}$. When $i \notin P_{\text{idx}}$, $h_i$ still depends on $P_{\theta}$, because the prefix activations are always in the left context and will therefore affect any activations to its right.
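To make the "otherwise" case concrete, here is a toy single-head attention computation (not the PEFT implementation, just an illustration): because the prefix keys and values sit in the left context, every later position mixes them into its activation.

# sketch: a token outside P_idx still attends to the trainable prefix keys/values
import torch
import torch.nn.functional as F

d = 8                                               # head dimension, illustrative
prefix_k = torch.randn(5, d, requires_grad=True)    # trainable prefix keys (part of P_theta)
prefix_v = torch.randn(5, d, requires_grad=True)    # trainable prefix values (part of P_theta)

q = torch.randn(1, d)                               # query of some token with i not in P_idx
k = torch.cat([prefix_k, torch.randn(3, d)])        # prefix sits to the left of the real keys
v = torch.cat([prefix_v, torch.randn(3, d)])

attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)        # attention over prefix + real positions
h_i = attn @ v                                      # h_i depends on prefix_k / prefix_v, hence on theta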

In the figure below, the top half is an autoregressive LM and the bottom half an encoder-decoder model. For the prefix activations, $\forall i \in P_{\text{idx}}$, $h_i$ is drawn from the trainable matrix $P_{\theta}$; the remaining activations (hidden states) are produced by the usual Transformer computation.

An annotated example of prefix-tuning using an autoregressive LM (top) and an encoder-decoder model (bottom). The prefix activations $\forall i \in P_{\text{idx}}$, $h_i$ are drawn from a trainable matrix $P_{\theta}$. The remaining activations are computed by the Transformer.

talk is cheap, show me your code.

# peft_model.py
# https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py
# line 1116
class PeftModelForCausalLM(PeftModel):
    # example file is peft_prefix_tuning_clm.ipynb
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        task_ids=None,
        **kwargs,
    ):
        # (excerpt: peft_config and batch_size are set earlier in the real forward)
        if peft_config.peft_type == PeftType.PREFIX_TUNING:
            past_key_values = self.get_prompt(batch_size)
            return self.base_model(
                input_ids=input_ids, inputs_embeds=inputs_embeds, past_key_values=past_key_values, **kwargs
            )

As the end of this code path shows, when Prefix Tuning runs the forward pass it prepares input_ids and past_key_values in advance according to the prefix length, then lets the existing model do the rest. This way of attaching the prefix requires no change to the original model structure.
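In user code, attaching Prefix Tuning to an existing causal LM is just a wrapper call, roughly as in the sketch below (the checkpoint name and num_virtual_tokens are placeholders in the spirit of peft_prefix_tuning_clm.ipynb):

# sketch: wrapping an existing model for Prefix Tuning with peft
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")   # placeholder checkpoint
peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=30)
model = get_peft_model(model, peft_config)    # the base model's structure is left untouched
model.print_trainable_parameters()

Back to the forward signature above: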

  • input_ids: the token indices prepared ahead of time in the data-preparation (prepare_process) stage; see peft_prefix_tuning_clm.ipynb
  • inputs_embeds is None
  • the useful entries in kwargs are attention_mask and labels
  • past_key_values is prepared as follows
# peft_model.py
# https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py
# line 447
def get_prompt(self, batch_size: int, task_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Returns the virtual prompts to use for Peft. Only applicable when using a prompt learning method.
    """
    peft_config = self.active_peft_config
    prompt_encoder = self.prompt_encoder[self.active_adapter]
    prompt_tokens = (
        self.prompt_tokens[self.active_adapter]
        .unsqueeze(0)
        .expand(batch_size, -1)
        .to(prompt_encoder.embedding.weight.device)
    )
    # ... ...
    return past_key_values
  • prompt_encoder: the module whose self.embedding was built in the Parameter structure section above
  • .expand(batch_size, -1): gives every example in the batch the same prefix
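A quick shape trace for the expand step (illustrative sizes):

# sketch: .unsqueeze(0).expand(batch_size, -1) replicates the prefix indices per example
import torch

num_virtual_tokens, batch_size = 30, 4
prompt_tokens = torch.arange(num_virtual_tokens)              # shape (30,)
batched = prompt_tokens.unsqueeze(0).expand(batch_size, -1)   # shape (4, 30): same prefix for every example
print(batched.shape)                                          # torch.Size([4, 30])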

The procedure that then unpacks self.embedding (a num_virtual_tokens × (num_layers * 2 * token_dim) matrix) is not covered in detail here.
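In spirit, though, the omitted part reshapes the flat PrefixEncoder output into one (key, value) pair per layer. Here is a shape-only sketch with illustrative sizes (the real get_prompt also handles encoder-decoder models and multiple adapters):

# sketch: splitting the flat prefix embedding into per-layer key/value tensors
import torch

batch_size, num_virtual_tokens = 4, 30
num_layers, num_heads, token_dim = 24, 16, 1024   # illustrative backbone sizes
head_dim = token_dim // num_heads

flat = torch.randn(batch_size, num_virtual_tokens, num_layers * 2 * token_dim)  # PrefixEncoder output
kv = flat.view(batch_size, num_virtual_tokens, num_layers * 2, num_heads, head_dim)
kv = kv.permute(2, 0, 3, 1, 4)            # (num_layers * 2, batch, heads, num_virtual_tokens, head_dim)
past_key_values = kv.split(2)             # 24 chunks, each holding a (key, value) pair for one layer
print(len(past_key_values), past_key_values[0].shape)   # 24 torch.Size([2, 4, 16, 30, 64])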

References

Prefix-Tuning: Optimizing Continuous Prompts for Generation (Li & Liang, 2021), the paper quoted throughout this post

The Power of Scale for Parameter-Efficient Prompt Tuning (Lester et al., 2021)

huggingface/peft

Transformer-based Bloom

