LLaVA Fine-Tuning Tutorial

Published 2024-05-12, updated 2024-07-02

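After finishing this, I found that he did not seem to need it all that much, so I am posting it on my own blog for now. Since I am not submitting it to the front page or the candidate area, it probably will not get much traffic anyway, so it should be fine. At worst, if I am later told it cannot stay online, I will just delete this post, hehe.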

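LLaVA is one of the earlier open-source Vision Language Models. This tutorial shows how to fine-tune llava-v1.5-7b. Unlike the xtuner-based fine-tuning tutorials already on this blog, this one uses deepspeed, in order to drop the dependency on the InternLM ecosystem.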

Setting up the environment

The official guide to setting up the environment is the project README.

First, download the LLaVA source code:

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pwd

Then set up the Python environment. If you are running this on your own machine, do not forget to create a conda virtual environment first:

# conda create -n llava python=3.10 -y
# conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

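Finally, download the models. You can download them directly with huggingface-cli. If Hugging Face is not directly reachable from your region, download through a mirror site instead: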

# If you cannot reach Hugging Face, uncomment the next line to download through the hf-mirror mirror site
# export HF_ENDPOINT=https://hf-mirror.com

# Download the llava-v1.5-7b model weights
huggingface-cli download "liuhaotian/llava-v1.5-7b" --local-dir "./checkpoints/llava-v1.5-7b"

# Download the clip-vit-large-patch14-336 model weights
huggingface-cli download "openai/clip-vit-large-patch14-336" --local-dir "./checkpoints/clip-vit-large-patch14-336"

Preparing the training data

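The official pretraining stage (which trains the projection layer) uses the LAION-CC-SBU dataset, and the visual instruction tuning stage uses llava_v1_5_mix665k.json together with several other datasets; all of this is spelled out clearly in the project README. I am not going to cover those here or retrain a new model, though. Instead, we will build a minimal dataset consisting of a single image.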

The format required for a custom training dataset is described in the LLaVA project documentation.

First, download the image:

mkdir -p ./playground/data/yuanshen

# Download the image
wget -O ./playground/data/yuanshen/1.jpg https://avatars.githubusercontent.com/u/86307756

Then prepare the image-text pairs. Only one is prepared here:

import json

# Build the dataset as Python objects and let json.dump do the escaping.
# Writing a hand-escaped triple-quoted string would turn "\n" and "\"" into
# raw characters and leave an invalid JSON file on disk.
dataset = [
    {
        "id": "yuanshen-628d-4724-b370-b84de974a19f",
        "image": "yuanshen/1.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWho is in the picture?"
            },
            {
                "from": "gpt",
                "value": "The person in the picture is Nathida, who is a character in the Original God and its derivative works produced by Mihoyo. Her real name is Buyel, the grass god in the \"Earthly Seven rulers\", and is given the nickname of \"Little Lucky Grass King\" by the XuMi people, the youngest of the seven gods today. "
            }
        ]
    }
]

with open("./playground/data/yuanshen.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=4, ensure_ascii=False)
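
Before training on a hand-written dataset file, it can help to sanity-check its structure. The sketch below is a hypothetical helper, not part of LLaVA; the rules it enforces (the three keys, alternating human/gpt turns, an `<image>` placeholder in the first turn) are assumptions inferred from the sample record in this post rather than an official specification.

```python
import json

REQUIRED_KEYS = {"id", "image", "conversations"}


def validate_records(records):
    """Return a list of problems found; an empty list means the data looks fine."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
            continue
        turns = rec["conversations"]
        if not turns:
            problems.append(f"record {i}: empty conversation")
            continue
        # Turns are assumed to alternate human, gpt, human, gpt, ...
        for j, turn in enumerate(turns):
            expected = "human" if j % 2 == 0 else "gpt"
            if turn.get("from") != expected:
                problems.append(
                    f"record {i}, turn {j}: expected '{expected}', got {turn.get('from')!r}"
                )
        # The image placeholder is assumed to sit in the first human turn.
        if "<image>" not in turns[0].get("value", ""):
            problems.append(f"record {i}: first turn has no <image> placeholder")
    return problems


# A raw string keeps the JSON escape "\n" intact so json.loads can parse it.
sample = json.loads(r'''
[{"id": "demo-0", "image": "yuanshen/1.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\nWho is in the picture?"},
    {"from": "gpt", "value": "A short answer."}]}]
''')
print(validate_records(sample))
```

To check the real file, load it with `json.load` and pass the resulting list to `validate_records`.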

The dataset image is:

(Image: Nahida from Genshin Impact)

Fine-tuning the model

In this step we use deepspeed ZeRO-2 to fine-tune the model with LoRA. The resulting fine-tuned model is saved in ./checkpoints/llava-v1.5-7b-lora.

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/yuanshen.json \
    --image_folder ./playground/data \
    --vision_tower ./checkpoints/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-lora \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 10 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb
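
To get a feel for what `--num_train_epochs 10`, `--warmup_ratio 0.03`, `--learning_rate 2e-4` and `--lr_scheduler_type "cosine"` amount to on a one-sample dataset, here is a rough sketch of linear warmup followed by cosine decay. It assumes a single GPU and uses the commonly used formula; the scheduler that actually runs lives inside transformers, so treat the numbers as an approximation.

```python
import math


def lr_at(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at optimizer step `step`: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: one GPU, so 1 sample / (batch size 1 * grad accumulation 1) = 1 step per epoch.
samples, batch_size, grad_accum, epochs = 1, 1, 1, 10
total_steps = math.ceil(samples / (batch_size * grad_accum)) * epochs

for step in range(total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

Under these assumptions the run has 10 optimizer steps in total, so `--save_steps 10` would write a checkpoint only at the very last step.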

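The training script has wandb built in; pick whichever option suits you when prompted. If you do not want to use wandb, choose option 3.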

(Screenshot: wandb prompt)

Then wait patiently for training to finish.

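If you hit an error at this step, check the GitHub issues to see whether anyone has run into the same problem. If you have confirmed that nobody has, you can try opening a new issue.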

Selected reading of the fine-tuning source code

The command above uses deepspeed to run the training script llava/train/train_mem.py, and train_mem.py in fact only calls train(attn_implementation="flash_attention_2") from llava/train/train.py. The train function does the following:

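First, it parses the command-line arguments with the transformers.HfArgumentParser class, whose job is to turn command-line arguments into dataclass objects. dataclass is a feature introduced in Python 3.7 that makes it easy to define a class and automatically generates methods such as __init__ and __repr__.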

parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))

Then parser.parse_args_into_dataclasses() parses the command-line arguments and stores the result in the three variables model_args, data_args and training_args.

model_args, data_args, training_args = parser.parse_args_into_dataclasses()
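
To make the "command line into dataclasses" idea concrete without installing transformers, here is a standard-library sketch built on argparse and dataclasses. It is not HfArgumentParser itself; the `Demo*` classes, their fields and defaults, and `parse_into_dataclasses` are hypothetical names invented for illustration.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class DemoModelArguments:  # hypothetical stand-in, not LLaVA's ModelArguments
    model_name_or_path: str = "./checkpoints/llava-v1.5-7b"
    version: str = "v0"


@dataclass
class DemoTrainingArguments:  # hypothetical stand-in, not LLaVA's TrainingArguments
    lora_r: int = 64
    learning_rate: float = 2e-4


def parse_into_dataclasses(dataclass_types, argv):
    """Build one argparse parser from several dataclasses, then split the result back."""
    parser = argparse.ArgumentParser()
    for dc in dataclass_types:
        for f in fields(dc):
            # Each dataclass field becomes a --flag with the field's type and default.
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    namespace = parser.parse_args(argv)
    return tuple(
        dc(**{f.name: getattr(namespace, f.name) for f in fields(dc)})
        for dc in dataclass_types
    )


model_args, training_args = parse_into_dataclasses(
    (DemoModelArguments, DemoTrainingArguments),
    ["--version", "v1", "--lora_r", "128"],
)
print(model_args)      # the auto-generated __repr__ shows every field
print(training_args)
```

The dataclass decorator supplies `__init__` and `__repr__`, which is why the two objects can be constructed from keyword arguments and printed without any hand-written methods.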

Training precision configuration and the `BitsAndBytesConfig` class

Next, the training precision is configured:

compute_dtype = (torch.float16 if training_args.fp16 
                    else (torch.bfloat16 
                            if training_args.bf16 else torch.float32))

bnb_model_from_pretrained_args = {}

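# 4-bit or 8-bit quantization (the QLoRA path) needs the corresponding arguments
if training_args.bits in [4, 8]:
    from transformers import BitsAndBytesConfig
    bnb_model_from_pretrained_args.update(dict(
        device_map={"": training_args.device},
        load_in_4bit=training_args.bits == 4,  # whether to load the model in 4-bit
        load_in_8bit=training_args.bits == 8,  # whether to load the model in 8-bit
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=training_args.bits == 4,
            load_in_8bit=training_args.bits == 8,
            llm_int8_skip_modules=["mm_projector"],  # the `mm_projector` module is not quantized

            # Outlier threshold for LLM.int8().
            # Hidden-state values whose magnitude stays below llm_int8_threshold are handled in int8 to save memory.
            # Values whose magnitude exceeds llm_int8_threshold are treated as outliers and processed in fp16 to keep more precision.
            llm_int8_threshold=6.0,

            # llm_int8_has_fp16_weight sets whether LLM.int8() keeps 16-bit main weights.
            # With 16-bit main weights, the weights do not have to be converted back and forth for the backward pass.
            llm_int8_has_fp16_weight=False,

            # bnb_4bit_compute_dtype sets the compute dtype used by the quantized model
            bnb_4bit_compute_dtype=compute_dtype,

            # bnb_4bit_use_double_quant sets whether nested quantization is used.
            # The constants from the first round of quantization are quantized again, saving roughly 0.4 extra bits per parameter.
            bnb_4bit_use_double_quant=training_args.double_quant,

            # bnb_4bit_quant_type sets the quantization data type: 'fp4' or 'nf4'.
            bnb_4bit_quant_type=training_args.quant_type # {'fp4', 'nf4'}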
        )
    ))
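
The "roughly 0.4 bits per parameter" figure for nested quantization can be reproduced with a little arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per first-level block with a 32-bit constant, and 256 constants per second-level block stored in 8 bits); these numbers are assumptions for illustration and are not read from bitsandbytes itself.

```python
def overhead_bits_per_param(block_size=64, const_bits=32, double_quant=False,
                            second_block=256, second_const_bits=8):
    """Extra bits per parameter spent on storing quantization constants."""
    if not double_quant:
        # One const_bits-wide constant for every block of block_size weights.
        return const_bits / block_size
    # First-level constants are stored in second_const_bits each, plus one
    # const_bits-wide second-level constant per second_block of them.
    return second_const_bits / block_size + const_bits / (block_size * second_block)


plain = overhead_bits_per_param()
nested = overhead_bits_per_param(double_quant=True)
print(f"without double quant: {plain:.3f} bits/param")
print(f"with double quant:    {nested:.3f} bits/param")
print(f"saved:                {plain - nested:.3f} bits/param")
```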

For the BitsAndBytesConfig class, the official documentation is reproduced here for reference.

This is a wrapper class about all possible attributes and features that you can play with a model that has been
loaded using `bitsandbytes`.

This replaces `load_in_8bit` or `load_in_4bit`therefore both options are mutually exclusive.

Currently only supports `LLM.int8()`, `FP4`, and `NF4` quantization. If more methods are added to `bitsandbytes`,
then more arguments will be added to this class.

Args:
    load_in_8bit (`bool`, *optional*, defaults to `False`):
        This flag is used to enable 8-bit quantization with LLM.int8().
    load_in_4bit (`bool`, *optional*, defaults to `False`):
        This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from
        `bitsandbytes`.
    llm_int8_threshold (`float`, *optional*, defaults to 6.0):
        This corresponds to the outlier threshold for outlier detection as described in `LLM.int8() : 8-bit Matrix
        Multiplication for Transformers at Scale` paper: https://arxiv.org/abs/2208.07339 Any hidden states value
        that is above this threshold will be considered an outlier and the operation on those values will be done
        in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but
        there are some exceptional systematic outliers that are very differently distributed for large models.
        These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of
        magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6,
        but a lower threshold might be needed for more unstable models (small models, fine-tuning).
    llm_int8_skip_modules (`List[str]`, *optional*):
        An explicit list of the modules that we do not want to convert in 8-bit. This is useful for models such as
        Jukebox that has several heads in different places and not necessarily at the last position. For example
        for `CausalLM` models, the last `lm_head` is kept in its original `dtype`.
    llm_int8_enable_fp32_cpu_offload (`bool`, *optional*, defaults to `False`):
        This flag is used for advanced use cases and users that are aware of this feature. If you want to split
        your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use
        this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8
        operations will not be run on CPU.
    llm_int8_has_fp16_weight (`bool`, *optional*, defaults to `False`):
        This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not
        have to be converted back and forth for the backward pass.
    bnb_4bit_compute_dtype (`torch.dtype` or str, *optional*, defaults to `torch.float32`):
        This sets the computational type which might be different than the input time. For example, inputs might be
        fp32, but computation can be set to bf16 for speedups.
    bnb_4bit_quant_type (`str`,  *optional*, defaults to `"fp4"`):
        This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types
        which are specified by `fp4` or `nf4`.
    bnb_4bit_use_double_quant (`bool`, *optional*, defaults to `False`):
        This flag is used for nested quantization where the quantization constants from the first quantization are
        quantized again.
    kwargs (`Dict[str, Any]`, *optional*):
        Additional parameters from which to initialize the configuration object.
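
The outlier handling described for `llm_int8_threshold` can be illustrated with a toy, pure-Python matrix-vector product: feature dimensions whose input magnitude exceeds the threshold stay in floating point, and the rest go through int8. This is only a sketch of the idea; the per-column absmax scaling and the function names are assumptions, and the real bitsandbytes kernels work differently in detail.

```python
def absmax_quantize(values):
    """Symmetric int8 quantization: returns (ints, scale) with value ~= int * scale."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127 if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale


def exact_matvec(x, W):
    """Reference result of x @ W in plain floating point."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def mixed_matvec(x, W, threshold=6.0):
    """x @ W with int8 for regular feature dimensions and float for outlier ones."""
    outlier = [i for i, v in enumerate(x) if abs(v) > threshold]
    regular = [i for i, v in enumerate(x) if abs(v) <= threshold]
    n_out = len(W[0])
    result = [0.0] * n_out
    if regular:
        xq, x_scale = absmax_quantize([x[i] for i in regular])
        for j in range(n_out):
            wq, w_scale = absmax_quantize([W[i][j] for i in regular])
            acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
            result[j] += acc * x_scale * w_scale      # dequantize
    for j in range(n_out):
        result[j] += sum(x[i] * W[i][j] for i in outlier)  # outliers stay in float
    return result, outlier


x = [0.5, -1.2, 40.0, 2.0]  # made-up hidden state with one large outlier
W = [[0.1, -0.2], [0.3, 0.4], [0.05, -0.07], [-0.6, 0.2]]
exact = exact_matvec(x, W)
mixed, outliers = mixed_matvec(x, W, threshold=6.0)
all_int8, _ = mixed_matvec(x, W, threshold=float("inf"))


def max_err(approx):
    return max(abs(a - e) for a, e in zip(approx, exact))


print("outlier dims:", outliers)
print("max error, mixed:   ", max_err(mixed))
print("max error, all int8:", max_err(all_int8))
```

With the outlier included in the int8 path, the single large value stretches the quantization scale and the small values lose most of their resolution, which is the effect the threshold is meant to avoid.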

After the model weights are loaded, the model is also prepared for k-bit training:

if training_args.bits in [4, 8]:
    from peft import prepare_model_for_kbit_training
    model.config.torch_dtype=(torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32))
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)

Loading the model weights

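Next, the model weights are loaded. Since this is fine-tuning, the idea is to take an existing model and train it on new data with a small learning rate.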

The weight-loading logic is simple:

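model_args.vision_tower is set and 'mpt' not in model_args.model_name_or_path: the weights are loaded into the LlavaLlamaForCausalLM class. This is the normal case.
model_args.vision_tower is set and 'mpt' in model_args.model_name_or_path: attn_impl is set manually and the weights are loaded into the LlavaMptForCausalLM class.
model_args.vision_tower is not set: the model is a plain llama model and the weights are loaded directly into the LlamaForCausalLM class.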

The actual code is omitted here.
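
The three-way branching above can be sketched as a small dispatch function. It returns class names as strings so that it runs without LLaVA installed; `pick_model_class` is a hypothetical helper written for this post, not a function from the repository.

```python
def pick_model_class(vision_tower, model_name_or_path):
    """Return the name of the class the weights would be loaded into."""
    if vision_tower is not None:
        if "mpt" in model_name_or_path:
            return "LlavaMptForCausalLM"
        return "LlavaLlamaForCausalLM"  # the normal case
    return "LlamaForCausalLM"           # no vision tower: plain llama


print(pick_model_class("./checkpoints/clip-vit-large-patch14-336", "./checkpoints/llava-v1.5-7b"))
print(pick_model_class(None, "./checkpoints/llava-v1.5-7b"))
```

Because the check is a plain substring test on the path, any directory name that happens to contain "mpt" (for example one containing the word "prompt") would take the MPT branch, so it is worth keeping checkpoint paths free of that substring.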

Gradient settings

Freeze what should be frozen and keep what is needed:

if model_args.freeze_backbone:  # freeze the backbone
    model.model.requires_grad_(False)

if training_args.gradient_checkpointing:  # keep gradients flowing through the input embeddings
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()
    else:
        def make_inputs_require_grad(module, input, output):
            output.requires_grad_(True)
        model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

LoRA

if training_args.lora_enable:  # LoRA
    from peft import LoraConfig, get_peft_model
    lora_config = LoraConfig(
        r=training_args.lora_r,
        lora_alpha=training_args.lora_alpha,
        target_modules=find_all_linear_names(model),
        lora_dropout=training_args.lora_dropout,
        bias=training_args.lora_bias,
        task_type="CAUSAL_LM",
    )
    if training_args.bits == 16:
        if training_args.bf16:
            model.to(torch.bfloat16)
        if training_args.fp16:
            model.to(torch.float16)
    rank0_print("Adding LoRA adapters...")
    model = get_peft_model(model, lora_config)

if training_args.bits in [4, 8]:
    from peft.tuners.lora import LoraLayer
    for name, module in model.named_modules():
        if isinstance(module, LoraLayer):
            if training_args.bf16:
                module = module.to(torch.bfloat16)
        if 'norm' in name:
            module = module.to(torch.float32)
        if 'lm_head' in name or 'embed_tokens' in name:
            if hasattr(module, 'weight'):
                if training_args.bf16 and module.weight.dtype == torch.float32:
                    module = module.to(torch.bfloat16)
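
For a sense of scale, the arithmetic below estimates how many parameters one LoRA adapter adds to a single linear layer and what scaling factor the `--lora_r 128 --lora_alpha 256` flags imply. The hidden size of 4096 is an assumption used only for illustration, and the helper names are hypothetical.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


d = 4096  # assumed hidden size, for illustration only
full = d * d
extra = lora_extra_params(d, d, r=128)
print(f"full layer: {full:,} params, LoRA adds {extra:,} ({extra / full:.2%})")
print("scaling:", lora_scaling(256, 128))
```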

After that, the rest of the model configuration is carried out.

Training the model

Finally, the model is trained with the trainer:

data_module = make_supervised_data_module(tokenizer=tokenizer,
                                            data_args=data_args)
trainer = LLaVATrainer(model=model,
                tokenizer=tokenizer,
                args=training_args,
                **data_module)

if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()
trainer.save_state()
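
The resume check at the end of train boils down to "does the output directory already contain a checkpoint-* entry". A standard-library sketch of that test, with `should_resume` as a hypothetical name:

```python
import pathlib
import tempfile


def should_resume(output_dir):
    """True if output_dir already contains at least one checkpoint-* entry."""
    return bool(list(pathlib.Path(output_dir).glob("checkpoint-*")))


with tempfile.TemporaryDirectory() as tmp:
    print(should_resume(tmp))                      # nothing saved yet
    (pathlib.Path(tmp) / "checkpoint-10").mkdir()
    print(should_resume(tmp))                      # a checkpoint directory exists
```

One practical consequence: rerunning the training command with the same `--output_dir` picks up from the existing checkpoint instead of starting over, so delete the old checkpoint directories or change the output directory when a fresh run is wanted.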

Merging the LoRA weights

Once training is done, the LoRA weights need to be merged with the original model weights:

python scripts/merge_lora_weights.py --model-path "./checkpoints/llava-v1.5-7b-lora" \
       --model-base "./checkpoints/llava-v1.5-7b" \
       --save-model-path "./checkpoints/llava-v1.5-7b-merged"

This gives a model that can be used for inference directly; it is now stored in the ./checkpoints/llava-v1.5-7b-merged folder.

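Merging the weights simply means loading the base weights first, then loading the LoRA weights, and finally saving the weights of the whole model again.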
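
Why merging is safe can be seen on tiny matrices: folding the scaled low-rank update into the base weight gives the same output as applying the base weight and the adapter separately. The sketch below uses made-up numbers and plain Python lists; `merge_lora_matrix` is a hypothetical helper and only illustrates the usual LoRA formula W + (lora_alpha / r) * B @ A, not the peft implementation.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def merge_lora_matrix(W, A, B, lora_alpha, r):
    """Fold the low-rank update into W: W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Tiny made-up example: W is 2x3 (out x in), rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]        # r x in
B = [[0.5], [-1.0]]          # out x r
x = [[1.0], [1.0], [1.0]]    # in x 1 column vector
lora_alpha, r = 2, 1

# Unmerged: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (lora_alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged: a single matrix product with the folded weight.
merged = matmul(merge_lora_matrix(W, A, B, lora_alpha, r), x)
print(unmerged, merged)
```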

Reading the weight-merging source code

import os
import warnings

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import torch
from llava.model import *
from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.mm_utils import get_model_name_from_path  # used by merge_lora below

def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", device="cuda", use_flash_attn=False, **kwargs):
    kwargs = {"device_map": device_map, **kwargs}
    # Quantized-loading options; not covered in detail
    if device != "cuda":
        kwargs['device_map'] = {"": device}
    if load_8bit:
        kwargs['load_in_8bit'] = True
    elif load_4bit:
        kwargs['load_in_4bit'] = True
        kwargs['quantization_config'] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4'
        )
    else:
        kwargs['torch_dtype'] = torch.float16
    if use_flash_attn:
        kwargs['attn_implementation'] = 'flash_attention_2'

    if 'llava' in model_name.lower():
        # Load LLaVA model. Warn if the model name contains 'lora' but no `model_base` was given.
        if 'lora' in model_name.lower() and model_base is None:
            warnings.warn('There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged.')

        if 'lora' in model_name.lower() and model_base is not None:
            from llava.model.language_model.llava_llama import LlavaConfig
            lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path)  # load the LLaVA config (`LlavaConfig`) stored with the LoRA checkpoint

            # First load the base model's pretrained weights, exactly as in the non-LoRA case
            tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
            print('Loading LLaVA from base model...')
            model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)

            # If the first dimension of the lm_head weight does not match the expected vocabulary size, re-allocate the lm_head and embedding weights
            token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features
            if model.lm_head.weight.shape[0] != token_num:
                model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
                model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))

            print('Loading additional LLaVA weights...')
            # If non_lora_trainables.bin exists locally under model_path, load it from disk into `non_lora_trainables`;
            # otherwise treat model_path as a Hugging Face Hub repo name and download the file from there.
            if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
                non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
            else:
                # this is probably from HF Hub
                from huggingface_hub import hf_hub_download
                def load_from_hf(repo_id, filename, subfolder=None):
                    cache_file = hf_hub_download(
                        repo_id=repo_id,
                        filename=filename,
                        subfolder=subfolder)
                    return torch.load(cache_file, map_location='cpu')
                non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
            non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}

            # Strip the extra prefixes from the non-LoRA parameter keys, then load those parameters into the model.
            if any(k.startswith('model.model.') for k in non_lora_trainables):
                non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
            model.load_state_dict(non_lora_trainables, strict=False)

            # Everything below is done with the peft library
            from peft import PeftModel
            print('Loading LoRA weights...')
            model = PeftModel.from_pretrained(model, model_path)  # load the LoRA weights from model_path
            print('Merging LoRA weights...')
            model = model.merge_and_unload()  # merge the LoRA weights into the model
            print('Model is loaded...')

        # The case below has no LoRA, only a model_base; not covered in detail
        elif model_base is not None:
            ...
    image_processor = None

    # For a llava model, add a few special tokens to the tokenizer and load the vision tower
    if 'llava' in model_name.lower():
        mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
        mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
        if mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
        if mm_use_im_start_end:
            tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
        model.resize_token_embeddings(len(tokenizer))

        vision_tower = model.get_vision_tower()
        if not vision_tower.is_loaded:
            vision_tower.load_model(device_map=device_map)
        if device_map != 'auto':
            vision_tower.to(device=device_map, dtype=torch.float16)
        image_processor = vision_tower.image_processor

    # context_len is model.config.max_sequence_length if that attribute exists; otherwise it simply falls back to 2048
    if hasattr(model.config, "max_sequence_length"):
        context_len = model.config.max_sequence_length
    else:
        context_len = 2048

    return tokenizer, model, image_processor, context_len

def merge_lora(args):
    model_name = get_model_name_from_path(args.model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, device_map='cpu')

    model.save_pretrained(args.save_model_path)
    tokenizer.save_pretrained(args.save_model_path)
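
The key renaming applied to `non_lora_trainables` in `load_pretrained_model` can be tried in isolation on a plain dict. In the sketch below the key names and values are made up for illustration, and `strip_prefixes` is a hypothetical helper that mirrors the two dictionary comprehensions from the listing.

```python
def strip_prefixes(state_dict):
    """Mimic the key clean-up applied to non_lora_trainables."""
    # Step 1: drop a leading "base_model." from every key that has it.
    cleaned = {(k[len("base_model."):] if k.startswith("base_model.") else k): v
               for k, v in state_dict.items()}
    # Step 2: if keys are still doubly nested, drop one leading "model.".
    if any(k.startswith("model.model.") for k in cleaned):
        cleaned = {(k[len("model."):] if k.startswith("model.") else k): v
                   for k, v in cleaned.items()}
    return cleaned


# Hypothetical keys; strings stand in for tensors.
demo = {
    "base_model.model.model.mm_projector.0.weight": "tensor-0",
    "base_model.model.model.mm_projector.0.bias": "tensor-1",
}
print(strip_prefixes(demo))
```

Using `len("base_model.")` and `len("model.")` instead of the literal offsets 11 and 6 makes the intent of the slicing easier to read; the behavior is the same.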

Testing the model

Testing the model shows that the fine-tuning took effect:

from llava.eval.run_llava import eval_model

model_path = "liuhaotian/llava-v1.5-7b"
prompt = "Who is in the picture?"
image_file = "https://avatars.githubusercontent.com/u/86307756"

args = type('Args', (), {
    "model_path": "./checkpoints/llava-v1.5-7b",
    "model_base": None,
    "model_name": "liuhaotian/llava-v1.5-7b",
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

print("Original model output:")
eval_model(args)

args = type('Args', (), {
    "model_path": "./checkpoints/llava-v1.5-7b-merged",
    "model_base": None,
    "model_name": "liuhaotian/llava-v1.5-7b",
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

print("Fine-tuned model output:")
eval_model(args)
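
A side note on the `type('Args', (), {...})()` idiom used in the script above: it creates a throwaway class whose attributes are the dict entries and then instantiates it, which gives an object with attribute access. A standard-library alternative with the same attribute access is `types.SimpleNamespace`. The short comparison below uses a trimmed-down, made-up settings dict.

```python
from types import SimpleNamespace

settings = {"query": "Who is in the picture?", "temperature": 0, "max_new_tokens": 512}

# The idiom from the script: build a class from the dict, then instantiate it.
args_a = type("Args", (), settings)()

# Standard-library alternative: the values become instance attributes.
args_b = SimpleNamespace(**settings)

print(args_a.query == args_b.query)
print(vars(args_a))  # empty: the values live on the class, not the instance
print(vars(args_b))  # the full settings dict
```

Either form should work wherever only attribute lookups such as `args.query` are needed; the difference matters mainly for code that inspects the instance with `vars()`.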

After fine-tuning, the model reproduces the label for our training sample:

Output of the fine-tuned model:

The person in the picture is Nathida, who is a character in the Original God and its derivative works produced by Mihoyo. Her real name is Buyel, the grass god in the "Earthly Seven rulers", and is given the nickname of "Little Lucky Grass King" by the XuMi people, the youngest of the seven gods today.

Without fine-tuning, the model only tells you that there is a little girl in the picture.