AI That Writes Product Copy Automatically: An End-to-End LLaMA Vision Fine-Tuning Walkthrough
In this project, we walk through how to use Meta's Llama 3.2 Vision model, together with an open-source dataset and the efficient fine-tuning framework Unsloth, to build an intelligent product image → text description system. The project focuses on how to fine-tune a multimodal large model for a concrete task, addressing the key technical challenge of turning image content into natural language.
1. Environment Setup
First, open a terminal and install the dependencies:
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
pip install triton==3.0.0
pip install "unsloth[torch]" --upgrade
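After installation, it is worth confirming that the resolved versions match the pins above. A minimal pure-Python sketch of the comparison logic (the package names and pinned versions are taken from the commands above; in practice you would read the installed versions from your environment):

```python
# Sanity-check that installed versions satisfy the pins from the install commands.
# Pure-Python sketch: parse "major.minor.patch" strings into tuples and compare.

def parse_version(v: str) -> tuple:
    """Turn '2.4.0' into (2, 4, 0) for tuple comparison."""
    return tuple(int(part) for part in v.split(".")[:3])

pinned = {"torch": "2.4.0", "torchvision": "0.19.0", "triton": "3.0.0"}

def satisfies_pin(installed: str, pin: str) -> bool:
    """Exact-match check against a pinned version."""
    return parse_version(installed) == parse_version(pin)

print(satisfies_pin("2.4.0", pinned["torch"]))  # exact match -> True
print(satisfies_pin("2.7.0", pinned["torch"]))  # mismatch -> False
```

A mismatch here (for example, a different CUDA wheel pulling in another torch build) is a common cause of Triton or Unsloth import errors.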
2. Load the Llama 3.2 Vision Model
LLaMA 3.2 Vision is a multimodal large model that Meta built on top of LLaMA 3.1, capable of processing images and text jointly. By adding a vision encoder and cross-modal attention, it can understand image content and generate natural-language descriptions, and it is widely used for image captioning, visual question answering, document analysis, and visual grounding. LLaMA 3.2 Vision comes in 11B and 90B parameter sizes, performs strongly on standard vision-language benchmarks, and is one of the leading open-source vision-language models today.

In this project we load the Llama-3.2-11B-Vision-Instruct model provided by Unsloth, a version optimized for fine-tuning and inference. To reduce memory usage and compute requirements, we load the model with 4-bit quantization.
from unsloth import FastVisionModel
import torch
# Path to the local model folder; adjust as needed
local_model_path = "/model-202507/Llama-3.2-11B-Vision-Instruct"
model, tokenizer = FastVisionModel.from_pretrained(
local_model_path,
load_in_4bit=True,
use_gradient_checkpointing="unsloth"
)

Output (trimmed):

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.7.1: Fast Mllama patching. Transformers: 4.53.1.
NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.546 GB. Platform: Linux.
Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
Bfloat16 = TRUE. [Xformers = 0.0.30. FA2 = False]
Loading checkpoint shards: 100%|██████████| 5/5 [00:42<00:00, 8.59s/it]
3. Configure the LoRA Fine-Tuning Modules
To train the model with LoRA, we selectively fine-tune specific components: the vision layers, language layers, attention modules, and MLP modules. This lets us adapt the model to the task with minimal architectural changes.
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers = True,
finetune_language_layers = True,
finetune_attention_modules = True,
finetune_mlp_modules = True,
r = 16,
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
random_state = 3443,
use_rslora = False,
loftq_config = None,
)

Output:

Unsloth: Making `model.base_model.model.model.vision_model.transformer` require gradients
4. Data Loading
Load the amazon-product-descriptions-vlm dataset from local storage and take the first 500 samples. The dataset contains a large number of Amazon product images with their matching descriptions; the goal is to build a system that generates a description from a product image, helping e-commerce platforms improve user experience and operational efficiency.
from datasets import load_dataset
dataset = load_dataset("/Dataset/amazon-product-descriptions-vlm/",
split = "train[0:500]")
dataset

Output:

Dataset({
    features: ['image', 'Uniq Id', 'Product Name', 'Category', 'Selling Price', 'Model Number', 'About Product', 'Product Specification', 'Technical Details', 'Shipping Weight', 'Variants', 'Product Url', 'Is Amazon Seller', 'description'],
    num_rows: 500
})
Inspect the image and corresponding description of one product in the dataset:
dataset[100]["image"]
dataset[100]["description"]

Output:

'Unleash the power of the iconic Hot Wheels Monster Trucks Twin Mill! This 1:24 scale die-cast vehicle features incredible detail and is perfect for thrilling stunts and imaginative play. Collect them all!'
5. Data Preprocessing: Building Conversation-Style Training Samples
When instruction-tuning a multimodal large model, we typically need to combine the image and text into a conversation format the model can understand, mimicking a human–AI interaction. The following code converts each raw image + product description sample into an OpenAI-style chat structure:
instruction = """
You are an expert Amazon worker who is good at writing product descriptions.
Write the product description accurately by looking at the image.
"""
def convert_to_conversation(sample):
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": instruction},
{"type": "image", "image": sample["image"]},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": sample["description"]}],
},
]
return {"messages": conversation}
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
converted_dataset[100]

Output:

{'messages': [{'role': 'user',
               'content': [{'type': 'text', 'text': '\nYou are an expert Amazon worker who is good at writing product descriptions. \nWrite the product description accurately by looking at the image.\n'},
                           {'type': 'image', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500>}]},
              {'role': 'assistant',
               'content': [{'type': 'text', 'text': 'Unleash the power of the iconic Hot Wheels Monster Trucks Twin Mill! This 1:24 scale die-cast vehicle features incredible detail and is perfect for thrilling stunts and imaginative play. Collect them all!'}]}]}
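Before training on hundreds of converted samples, a small structural check can catch formatting mistakes early. A minimal pure-Python sketch, where the validation rules simply mirror the structure produced by convert_to_conversation above:

```python
# Validate that a converted sample has the chat structure expected for
# vision instruction tuning: a user turn with text + image, followed by
# an assistant turn carrying only the target description text.

def is_valid_conversation(sample: dict) -> bool:
    messages = sample.get("messages", [])
    if len(messages) != 2:
        return False
    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False
    user_types = {part.get("type") for part in user.get("content", [])}
    assistant_types = {part.get("type") for part in assistant.get("content", [])}
    # The user turn must carry both the instruction text and the image;
    # the assistant turn must carry only text.
    return {"text", "image"} <= user_types and assistant_types == {"text"}

good = {"messages": [
    {"role": "user", "content": [{"type": "text", "text": "describe"},
                                 {"type": "image", "image": object()}]},
    {"role": "assistant", "content": [{"type": "text", "text": "A toy truck."}]},
]}
print(is_valid_conversation(good))  # True
```

Running `all(is_valid_conversation(s) for s in converted_dataset)` before training is a cheap way to avoid collator errors mid-run.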
6. Testing the Model Before Fine-Tuning
We select sample 100 from the dataset and run inference on it to evaluate how well the model writes product descriptions out of the box, before any fine-tuning.
FastVisionModel.for_inference(model)
image = dataset[100]["image"]
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": instruction},
],
}
]
input_text = tokenizer.apply_chat_template(
messages, add_generation_prompt=True
)
inputs = tokenizer(
image,
input_text,
add_special_tokens=False,
return_tensors="pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
**inputs,
streamer=text_streamer,
max_new_tokens=128,
use_cache=True,
temperature=1.5,
min_p=0.1
)

Output:

The image depicts a toy monster truck, showcasing its vibrant orange body with silver accents. The vehicle features three large, black tires with orange rims, giving it a rugged and off-road-ready appearance. In addition to its impressive wheel arrangement, the truck is equipped with two silver engines visible under the hood, suggesting a dynamic and powerful design. A skull decal on the hood adds a touch of edginess, while a yellow and black logo on the front adds a pop of color. The toy truck appears to be part of a "Hot Wheels" series, as suggested by the "Hot Wheels" logo on the side. The overall design
The description is long but loosely structured, with redundant brand mentions and irrelevant visual details; it needs further refinement.
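Redundancy of this kind can also be flagged mechanically. A rough pure-Python sketch that counts repeated bigrams in a generated description (the example sentence and the "more than once" threshold are arbitrary choices for illustration):

```python
from collections import Counter

def repeated_bigrams(text: str) -> list:
    """Return bigrams that occur more than once: a crude redundancy signal."""
    words = text.lower().split()
    counts = Counter(zip(words, words[1:]))
    return [bg for bg, n in counts.items() if n > 1]

sample = "the truck has orange rims and the truck has silver engines"
print(repeated_bigrams(sample))
```

A check like this can be run over a batch of generations to compare redundancy before and after fine-tuning.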
7. Fine-Tuning the Model
Switch the model into training mode and initialize a supervised fine-tuning (SFT) trainer with a custom vision data collator, the converted dataset, and an optimized training configuration for efficient fine-tuning.
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
FastVisionModel.for_training(model)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
data_collator=UnslothVisionDataCollator(model, tokenizer),
train_dataset=converted_dataset,
args=SFTConfig(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
max_steps=30,
learning_rate=2e-4,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
logging_steps=5,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
report_to="none",
remove_unused_columns=False,
dataset_text_field="",
dataset_kwargs={"skip_prepare_dataset": True},
dataset_num_proc=4,
max_seq_length=2048,
),
)
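The configuration above implies a fixed data budget worth noting before training. A quick sketch of the arithmetic (all numbers are taken from the SFTConfig above):

```python
# Effective batch size and data coverage implied by the SFTConfig above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 30
num_train_samples = 500

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch_size * max_steps

print(effective_batch_size)              # 8 samples per optimizer step
print(samples_seen)                      # 240 of the 500 training samples
print(samples_seen / num_train_samples)  # under half an epoch
```

So this quick demo run sees fewer than half the samples once; raising max_steps (or switching to num_train_epochs) covers the full dataset.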
Start training:
trainer_stats = trainer.train()

Output (trimmed):

Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 500 | Num Epochs = 1 | Total steps = 30
Batch size per device = 2 | Gradient accumulation steps = 4
Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
Trainable parameters = 67,174,400 of 10,737,395,235 (0.63% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Unsloth: Will smartly offload gradients to save VRAM!
[30/30 05:42, Epoch 0/1]
| Step | Training Loss |
|---|---|
| 5 | 3.354900 |
| 10 | 2.174600 |
| 15 | 1.310900 |
| 20 | 1.113800 |
| 25 | 1.023700 |
| 30 | 1.047700 |
Training loss falls sharply and then plateaus around 1.0, indicating the model is learning the mapping between images and descriptions.
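The drop in the logged losses can be quantified directly. A small sketch computing the relative decrease from the table above:

```python
# Logged training losses from the table above, as (step, loss) pairs.
logged_losses = [(5, 3.3549), (10, 2.1746), (15, 1.3109),
                 (20, 1.1138), (25, 1.0237), (30, 1.0477)]

first_loss = logged_losses[0][1]
last_loss = logged_losses[-1][1]
relative_drop = (first_loss - last_loss) / first_loss

print(f"loss fell from {first_loss} to {last_loss} "
      f"({relative_drop:.0%} decrease over 30 steps)")
```

The roughly 69% drop in 30 steps, with a slight uptick at the final log point, is consistent with a model that has absorbed the description style quickly and would benefit from more data rather than more steps on the same 240 samples.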
8. Comparing Results After Fine-Tuning
Run the test again, this time on sample 45 from the dataset:
FastVisionModel.for_inference(model)
image = dataset[45]["image"]
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": instruction},
],
}
]
input_text = tokenizer.apply_chat_template(
messages, add_generation_prompt=True
)
inputs = tokenizer(
image,
input_text,
add_special_tokens=False,
return_tensors="pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
**inputs,
streamer=text_streamer,
max_new_tokens=128,
use_cache=True,
temperature=1.5,
min_p=0.1
)

Output:

Add the iconic Alex Bowman 88 1:24 die-cast car to your collection. This 1:24 scale NASCAR 88 88 NATIONWIDE CHEVY SS is an Official Lionel Collectibles product, capturing the sleek Chevrolet Camaro design. Perfect for collectors and enthusiasts!<|eot_id|>
The result is clearly better: the generated text is more precise and reads like a real product description, though some redundancy remains (e.g., the repeated "88"). For best results, train on the full dataset for 3–5 epochs.