Introduction to Prefix Tuning
This post introduces Prefix Tuning by walking through the original paper together with the PEFT source code.
To follow along you should:
- understand NLP and the Transformer architecture
- be familiar with PyTorch and Python 3
Prefix Tuning offers a lightweight alternative to full fine-tuning for natural language generation (NLG) tasks.
In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation (NLG) tasks, inspired by prompting.
Where the Trainable Parameters Sit
As shown in the figure below, the trainable parameters sit to the left of the Transformer activations: Prefix Tuning inserts trainable parameters at every hidden layer.
Meanwhile, this is less expressive than intervening all layers of the activations (§7.2), which avoids long-range dependencies and includes more tunable parameters. Prefix-tuning, therefore, optimizes all layers of the prefix.
Parameter Structure
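The code below is a condensed sketch that closely follows peft's `PrefixEncoder` (in `peft/tuners/prefix_tuning`); the attribute names come from the PEFT prompt-learning config and may differ slightly across peft versions.

```python
import torch


class PrefixEncoder(torch.nn.Module):
    """Maps prefix token indices to past key/value activations (sketch of peft's PrefixEncoder)."""

    def __init__(self, config):
        super().__init__()
        self.prefix_projection = config.prefix_projection
        token_dim = config.token_dim
        num_layers = config.num_layers
        encoder_hidden_size = config.encoder_hidden_size
        num_virtual_tokens = config.num_virtual_tokens
        if self.prefix_projection and not config.inference_mode:
            # Reparametrization: a small embedding table projected up by a two-layer MLP
            self.embedding = torch.nn.Embedding(num_virtual_tokens, token_dim)
            self.transform = torch.nn.Sequential(
                torch.nn.Linear(token_dim, encoder_hidden_size),
                torch.nn.Tanh(),
                torch.nn.Linear(encoder_hidden_size, num_layers * 2 * token_dim),
            )
        else:
            # Learn the per-layer key/value activations directly
            self.embedding = torch.nn.Embedding(num_virtual_tokens, num_layers * 2 * token_dim)

    def forward(self, prefix: torch.Tensor):
        if self.prefix_projection:
            prefix_tokens = self.embedding(prefix)
            past_key_values = self.transform(prefix_tokens)
        else:
            past_key_values = self.embedding(prefix)
        return past_key_values
```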
The initialization function ends with `self.embedding = torch.nn.Embedding(num_virtual_tokens, num_layers * 2 * token_dim)`, where:
- `num_virtual_tokens`: the length of the inserted prefix, the prefix_length in the figure above
- `num_layers`: the number of layers of the transformer_backbone
- `token_dim`: the dimension of a token after embedding, usually equal to `encoder_hidden_size` in Transformers; the hidden dimension in the figure
- `num_layers * 2 * token_dim`: the parameters provide both the key and the value for every layer; the part to the left of the dashed line in the figure above shows the structure and count of the trainable parameters
Each activation is composed of a key-value pair. In GPT-2, the dimension of each key and value is 1024.
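As a quick sanity check on the parameter count, here is a toy calculation; the GPT-2-medium-like numbers below are illustrative assumptions, not values taken from the post.

```python
import torch

# Illustrative, GPT-2-medium-like dimensions
num_virtual_tokens = 20   # prefix length
num_layers = 24           # number of Transformer layers
token_dim = 1024          # hidden size; each per-layer key and value has this dimension

embedding = torch.nn.Embedding(num_virtual_tokens, num_layers * 2 * token_dim)
print(embedding.weight.shape)    # torch.Size([20, 49152])
print(embedding.weight.numel())  # 983040 trainable parameters for the whole prefix
```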
prefix_projection
The purpose of the conditional branch `if self.prefix_projection and not config.inference_mode:` in the code above:
The paper's authors found that directly training a single prefix embedding does not converge stably and is sensitive to the learning rate and initialization. They therefore added a fully connected network that maps the prefix into another linear space. After training, the pre-projection prefix vectors and the fully connected network can be discarded; only the prefix's projection in the new space is kept and used as the prepended activations during the forward pass.
The `prefix_projection` parameter is optional in the PEFT framework; it controls whether this projection is applied.
Empirically, directly updating the $P_\theta$ parameters leads to unstable optimization and a slight drop in performance. So we reparametrize the matrix $P_\theta[i,:] = \mathrm{MLP}_\theta(P'_\theta[i,:])$ by a smaller matrix ($P'_\theta$) composed with a large feedforward neural network ($\mathrm{MLP}_\theta$). Note that $P_\theta$ and $P'_\theta$ have the same rows dimension (i.e. the prefix length), but different columns dimension. Once training is complete, these reparametrization parameters can be dropped, and only the prefix ($P_\theta$) needs to be saved.
We find in preliminary experiments that directly optimizing the prefix is very sensitive to the learning rate and initialization.
$P_\theta$ has a dimension of $|P_{\mathrm{idx}}| \times \dim(h_i)$ while $P'_\theta$ has a dimension of $|P_{\mathrm{idx}}| \times k$, where we choose $k = 512$ for table-to-text and 800 for summarization. $\mathrm{MLP}_\theta$ maps from dimension $k$ to $\dim(h_i)$.
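As a minimal sketch of how this choice is exposed in PEFT (field names follow recent peft releases; check the version you have installed):

```python
from peft import PrefixTuningConfig, TaskType

# prefix_projection=False: learn the (num_virtual_tokens, num_layers * 2 * token_dim) embedding directly
# prefix_projection=True:  learn a smaller embedding plus an MLP that projects it up
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    prefix_projection=True,
    encoder_hidden_size=1024,  # hidden size of the reparametrization MLP
)
```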
Forward Pass
Before the prefix is introduced, the model's forward pass is:

$$h_i = \mathrm{LM}_\phi(z_i, h_{<i})$$
After the prefix is introduced (see the figure below), the forward pass becomes, where $P_\theta \in \mathbb{R}^{|P_{\mathrm{idx}}| \times \dim(h_i)}$ is the newly added trainable parameter matrix:

$$h_i = \begin{cases} P_\theta[i,:] & \text{if } i \in P_{\mathrm{idx}} \\ \mathrm{LM}_\phi(z_i, h_{<i}) & \text{otherwise} \end{cases}$$
When $i \in P_{\mathrm{idx}}$, $h_i$ depends directly on the trainable parameters; when $i \notin P_{\mathrm{idx}}$, $h_i$ depends on all earlier activations $h_{<i}$ and is therefore still influenced by the prefix.
Here, $h_i$ (for all $i$) is a function of the trainable $P_\theta$. When $i \in P_{\mathrm{idx}}$, this is clear because $h_i$ copies directly from $P_\theta$. When $i \notin P_{\mathrm{idx}}$, $h_i$ still depends on $P_\theta$, because the prefix activations are always in the left context and will therefore affect any activations to its right.
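To make the "left context" argument concrete, here is a small self-contained check with Hugging Face GPT-2 (this is my own illustration, not code from the paper or from peft; the shapes assume GPT-2 small): feeding key/value tensors as `past_key_values` changes the hidden states of every real token to the right of the prefix.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

inputs = tokenizer("Prefix tuning is", return_tensors="pt")
prefix_len, n_layers, n_heads, head_dim = 5, 12, 12, 64  # GPT-2 small

# A random "prefix": one (key, value) pair per layer, each of shape (batch, heads, prefix_len, head_dim)
past = tuple(
    (torch.randn(1, n_heads, prefix_len, head_dim), torch.randn(1, n_heads, prefix_len, head_dim))
    for _ in range(n_layers)
)
# The attention mask has to cover the prefix positions as well
mask = torch.cat([torch.ones(1, prefix_len, dtype=torch.long), inputs["attention_mask"]], dim=1)

with torch.no_grad():
    h_plain = model(**inputs, output_hidden_states=True).hidden_states[-1]
    h_prefixed = model(
        input_ids=inputs["input_ids"],
        attention_mask=mask,
        past_key_values=past,
        output_hidden_states=True,
    ).hidden_states[-1]

print(torch.allclose(h_plain, h_prefixed))  # False: the prefix affected the activations to its right
```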
The top half of the figure below shows an autoregressive LM, and the bottom half an encoder-decoder model. The prefix activations $h_i$, $i \in P_{\mathrm{idx}}$, are drawn from the trainable parameter matrix $P_\theta$; the remaining activations (hidden states) are produced by the regular Transformer forward pass.
An annotated example of prefix-tuning using an autoregressive LM (top) and an encoder-decoder model (bottom). The prefix activations $h_i$, $i \in P_{\mathrm{idx}}$, are drawn from a trainable matrix $P_\theta$. The remaining activations are computed by the Transformer.
talk is cheap, show me your code.
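Below is a heavily condensed sketch of the prefix-tuning branch in peft's `PeftModelForCausalLM.forward` (other peft types, prompt-tuning branches, and error handling are omitted, and names are simplified):

```python
import torch

# Condensed sketch of the prefix-tuning path in PeftModelForCausalLM.forward
def forward(self, input_ids=None, attention_mask=None, inputs_embeds=None, labels=None, **kwargs):
    peft_config = self.active_peft_config
    batch_size = input_ids.shape[0]

    if attention_mask is not None:
        # Let the real tokens attend to the virtual prefix tokens as well
        prefix_attention_mask = torch.ones(batch_size, peft_config.num_virtual_tokens).to(attention_mask.device)
        attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)

    # Build per-layer (key, value) tensors from the prefix embedding (see get_prompt below)
    past_key_values = self.get_prompt(batch_size)

    # Delegate to the unmodified base model: the prefix simply rides along as past_key_values
    return self.base_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        labels=labels,
        **kwargs,
    )
```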
As the end of the code shows, when Prefix Tuning runs the forward pass it prepares `input_ids` and `past_key_values` ahead of time according to the prefix length, and then lets the existing model complete the forward pass. Attaching the prefix this way requires no changes to the original model architecture.
- `input_ids`: the token indices and embeddings prepared in advance during the prepare_process stage; see peft_prefix_tuning_clm.ipynb for an example
- `inputs_embeds` is None
- `kwargs`: the relevant entries are `attention_mask` and `labels`
- `past_key_values` is prepared as follows (see the sketch below):
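Here is a condensed sketch of the prefix-tuning branch of peft's `get_prompt`, which turns the flat embedding into per-layer key/value tensors (inference-mode handling and the encoder-decoder case are omitted, and names are simplified):

```python
import torch

# Condensed sketch of PeftModel.get_prompt for prefix tuning (decoder-only case)
def get_prompt(self, batch_size):
    config = self.active_peft_config
    # Indices 0 .. num_virtual_tokens-1, repeated for every example in the batch
    prefix_tokens = torch.arange(config.num_virtual_tokens).unsqueeze(0).expand(batch_size, -1)
    # (batch, num_virtual_tokens, num_layers * 2 * token_dim)
    past_key_values = self.prompt_encoder(prefix_tokens)
    # Split the flat dimension into per-layer, per-head key/value slices
    past_key_values = past_key_values.view(
        batch_size,
        config.num_virtual_tokens,
        config.num_layers * 2,
        config.num_attention_heads,
        config.token_dim // config.num_attention_heads,
    )
    # -> (num_layers * 2, batch, heads, num_virtual_tokens, head_dim), then one (key, value) pair per layer
    return past_key_values.permute([2, 0, 3, 1, 4]).split(2)
```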
- `prompt_encoder`: the `self.embedding` built in the code of the Parameter Structure section
- `.expand(batch_size, -1)`: gives every example in the batch the same prefix
- The step-by-step reshaping of `self.embedding` (a `num_virtual_tokens × (num_layers * 2 * token_dim)` matrix) into per-layer key/value tensors is not covered in detail here.
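Putting the pieces together, here is a minimal end-to-end sketch (the backbone name and hyperparameters are placeholders; see peft_prefix_tuning_clm.ipynb for the full recipe):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PrefixTuningConfig, TaskType, get_peft_model

model_name = "gpt2"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # only the prefix embedding is trainable

# One forward/backward step; the base model's code is untouched
batch = tokenizer("Prefix tuning is a lightweight alternative to fine-tuning.", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
```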