My Obsidian Blog

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

June 22, 2026 · 1 min read

paper-reading
vision-language

Paper Card

Problem:

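Existing vision-language pre-training methods use task-specific architectures, so a single model struggles to support both understanding and generation.
Web-scale image-text data is noisy, which degrades the quality of the supervision signal.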

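Key Idea: Proposes CapFilt (Captioning and Filtering), a data bootstrapping method that generates captions and then filters them to improve training data quality, combined with a unified multimodal encoder-decoder (MED) architecture that supports multiple tasks.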
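To make the CapFilt data flow concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: `capfilt`, `captioner`, and `keep` are hypothetical names, and the models are stand-in callables. It also assumes that the filter is applied to both the original web captions and the synthetic ones, and that human-annotated pairs are kept unfiltered, which is my reading of the method rather than something stated in this card.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption); hypothetical representation


def capfilt(
    web_pairs: List[Pair],
    human_pairs: List[Pair],
    captioner: Callable[[str], str],   # stand-in for the trained Captioner
    keep: Callable[[str, str], bool],  # stand-in for the trained Filter
) -> List[Pair]:
    """Return a bootstrapped dataset built from noisy web pairs."""
    candidates = list(web_pairs)
    # The captioner proposes one synthetic caption per web image.
    candidates += [(img, captioner(img)) for img, _ in web_pairs]
    # The filter drops pairs it judges to be mismatched (assumed to apply
    # to both original and synthetic captions).
    kept = [(img, cap) for img, cap in candidates if keep(img, cap)]
    # Human-annotated pairs are assumed clean and are added unfiltered.
    return kept + list(human_pairs)


# Toy usage with fake models, just to show the data flow.
web = [("img1", "a dog on grass"), ("img2", "click here to buy")]
human = [("img3", "a red bus")]
toy_captioner = lambda img: f"a photo of {img}"
toy_keep = lambda img, cap: "click" not in cap
result = capfilt(web, human, toy_captioner, toy_keep)
```

In this toy run the noisy caption for `img2` is dropped while its synthetic caption survives, which is the kind of replacement the bootstrapping step is meant to produce.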

Key Trick:

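MED architecture: by switching the type of the self-attention (SA) layers (bidirectional or causal) and enabling or disabling the cross-attention (CA) layers, a single model acts as a unimodal encoder, an image-grounded text encoder, and an image-grounded text decoder. These three modes correspond to the ITC (image-text contrastive), ITM (image-text matching), and LM (language modeling) losses.
CapFilt strategy: train a dedicated Captioner to generate synthetic captions and a Filter to screen out noisy ones, thereby bootstrapping the dataset.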
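The three MED modes can be summarized as a small configuration table. The sketch below is illustrative only: `MedMode`, `MED_MODES`, and `attention_mask` are hypothetical names, not identifiers from the BLIP codebase, and the masks are plain nested lists rather than tensors.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MedMode:
    self_attention: str    # "bidirectional" or "causal"
    cross_attention: bool  # whether text layers attend to image features
    loss: str              # pre-training objective tied to this mode


# Hypothetical summary of the three modes described above.
MED_MODES: Dict[str, MedMode] = {
    "unimodal_encoder": MedMode("bidirectional", False, "ITC"),
    "image_grounded_text_encoder": MedMode("bidirectional", True, "ITM"),
    "image_grounded_text_decoder": MedMode("causal", True, "LM"),
}


def attention_mask(n: int, kind: str) -> List[List[int]]:
    """Build an n x n mask where 1 means token i may attend to token j."""
    if kind == "causal":
        # Each token sees itself and earlier tokens only.
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if kind == "bidirectional":
        # Every token sees every other token.
        return [[1] * n for _ in range(n)]
    raise ValueError(f"unknown self-attention type: {kind}")


decoder = MED_MODES["image_grounded_text_decoder"]
decoder_mask = attention_mask(3, decoder.self_attention)
```

Only the self-attention mask and the cross-attention switch differ between modes here, which is why one set of text layers can serve all three objectives.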

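Limitation: The quality of CapFilt depends on the captioner and the filter, and the result is still fundamentally constrained by the distribution of the web data.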

Paper Notes
