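QY's Personal Blog

Some Configuration Issues

Posted on 2024-11-21, updated on 2025-03-13, category: Configuration, length 638, 2 min read

Notes on how I solved the assorted configuration problems I ran into while studying and developing.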

transformer

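Posted on 2024-10-22, updated on 2025-02-18, category: Papers, length 1.2k, 4 min read

To improve the performance of sequence-to-sequence models on machine translation, the paper proposes a new network architecture, the Transformer. It is built entirely on attention mechanisms and drops the recurrence and convolution used in recurrent neural networks (RNNs) and convolutional neural networks (CNNs). This allows far more parallelization, which markedly reduces training time, while the translation quality also surpasses the previous best results.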
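As a rough illustration of the attention mechanism the architecture is built on, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes no masking and no learned projections, and the function names are my own, so treat it as a toy rather than the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (single head, no mask)."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return outputs
```

Each query is processed independently of the others, with no state carried from one position to the next. That independence is what makes the computation easy to parallelize compared with a recurrent model.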


LooseControl

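Posted on 2024-10-14, updated on 2025-02-18, category: Papers, length 1.7k, 6 min read

LOOSECONTROL provides generalized depth conditioning for diffusion-based image generation. Combined with text guidance, it lets users create complex environments (e.g., rooms, street views) by specifying only the scene boundaries and the positions of the main objects.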


diffusionGPT

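Posted on 2024-10-07, updated on 2025-02-18, category: Papers, length 950, 3 min read

Proposes a system that uses a large language model to select a suitable diffusion model, thereby improving image generation quality.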


Compositional 3D-aware Video Generation with LLM Director

Posted on 2024-09-26, updated on 2025-02-18, category: Papers, length 678, 2 min read

"we propose a novel paradigm that generates each concept in 3D representation separately and then composes them with priors from Large Language Models (LLM) and 2D diffusion models."


2404.07178v1_move_anything

Posted on 2024-09-23, updated on 2025-02-18, category: Papers, length 1k, 4 min read

Move Anything with Layered Scene Diffusion

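The related work discussed in this paper falls into two main directions: controllable scene generation and diffusion-based image editing.

Background

Earlier methods such as GANs offer some control over scene layout but are limited in the quality of the generated images.
Diffusion models show unprecedented performance on text-to-image generation but lack fine-grained spatial control over the generation process.
The paper proposes that spatial disentanglement of content can be achieved by optimizing a layered scene representation during diffusion sampling.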

Contributions

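Layered scene representation: decomposes a scene into multiple layers, each representing one object and holding the object's mask, position offset, and feature map.
SceneDiffusion sampling strategy: a new sampling strategy that randomly samples multiple layouts at a diffusion step and denoises each layout with a pretrained text-to-image diffusion model to optimize the layered scene representation.
Image diffusion: standard image diffusion steps applied on top of the optimized layered scene representation to further improve image quality.
Layer appearance editing: objects in a layer can be restyled or replaced by modifying the local text prompt, enabling interactive scene manipulation.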

Method

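Objects are spatially disentangled, so they can be freely moved, resized, and cloned within the scene.

Layered scene representation:
- The scene is decomposed into multiple layers, each representing a different object.
- Each layer holds the object's mask, position offset, and feature map.
SceneDiffusion sampling strategy:
- At a diffusion step, multiple layouts are sampled at random and a different view is rendered for each layout.
- A pretrained text-to-image diffusion model runs a single denoising step on each view to produce the denoised views.
Optimizing the layered scene representation:
- The feature maps are optimized by minimizing the difference between the denoised views and the original views, which updates the layered scene representation.
Image diffusion:
- After the SceneDiffusion optimization steps, standard image diffusion steps further improve image quality.
- The intermediate layered representation is mapped to image space, improving sharpness and realism.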
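To make the layered representation more concrete, here is a minimal sketch of rendering one view from a set of layers and of sampling random layouts. It is a toy under my own assumptions: masks and feature maps are small 2D lists, layers are composited in list order, and `Layer`, `render`, and `sample_layouts` are hypothetical names, not the paper's code. The denoising and feature-map optimization steps are not modeled.

```python
import random
from dataclasses import dataclass

@dataclass
class Layer:
    mask: list               # H x W grid of 0/1: the object's footprint
    feature: list            # H x W grid of floats: stand-in for a feature map
    offset: tuple = (0, 0)   # (dx, dy) position offset in grid cells

def render(layers, height, width, offsets=None, background=0.0):
    """Composite layers in list order; later layers cover earlier ones."""
    if offsets is None:
        offsets = [layer.offset for layer in layers]
    view = [[background] * width for _ in range(height)]
    for layer, (dx, dy) in zip(layers, offsets):
        for y in range(height):
            for x in range(width):
                sy, sx = y - dy, x - dx  # cell this pixel comes from before the shift
                if 0 <= sy < height and 0 <= sx < width and layer.mask[sy][sx]:
                    view[y][x] = layer.feature[sy][sx]
    return view

def sample_layouts(layers, num_layouts, max_shift, rng=random):
    """Draw one random (dx, dy) per layer for each of num_layouts layouts."""
    return [
        [(rng.randint(-max_shift, max_shift), rng.randint(-max_shift, max_shift))
         for _ in layers]
        for _ in range(num_layouts)
    ]
```

In this toy, moving an object amounts to changing its offset and rendering again. The layer's mask and features stay untouched, which is the sense in which content and layout are disentangled.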

Experiments

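Experimental setup:
- A dataset of 1000 text prompts and more than 5000 images is built; the images come with image captions, local descriptions, and mask annotations.
- Several evaluation metrics are used, including Mask IoU, Consistency, Visual Consistency, LPIPS, SSIM, and FID (for image quality).
Controllable scene generation:
- Images are generated on random target layouts while content stays consistent across layouts.
- Compared with baselines such as MultiDiffusion, SceneDiffusion is better on all evaluation metrics.
Object moving:
- Objects are moved in a given image while the rest of the image stays similar.
- Compared with inpainting-based methods, SceneDiffusion performs better on all evaluation metrics.
Layer appearance editing:
- Shows results of restyling or replacing objects in the scene by modifying local text prompts.
- Shows that object appearance can be transferred across scenes.
Ablation study:
- Analyzes the importance of the components of SceneDiffusion, including joint multi-layout denoising, random layout sampling, and the image diffusion stage.
- Examines how the number of views and the number of image diffusion steps affect performance.
Quantitative results:
- Reports quantitative comparisons on controllable scene generation and object moving, covering FID, Mask IoU, Consistency, LPIPS, and SSIM.
Qualitative results:
- Shows qualitative results for generated scenes, object moving, real image editing, and compatibility with different denoisers.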
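Of the metrics listed in the experimental setup, Mask IoU is the simplest to write down. The sketch below computes intersection over union for two binary masks stored as nested lists. Returning 1.0 when both masks are empty is my own convention, not something the paper states.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```

For example, `mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])` has one overlapping cell out of two covered cells, so it evaluates to 0.5.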

2311.17126v1

Posted on 2024-09-18, updated on 2025-02-18, category: Papers, length 335, 1 min read

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

core idea

text -> layout -> img

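CoT: Chain-of-Thought

Our method leverages the Chain-of-Thought (CoT) prompting of LLMs to interpret text and generate spatially reasonable object layouts.

CoT prompting guides the LLM to break a problem down into logical steps, and it is especially effective when example-based learning is limited. It prompts the LLM to process information sequentially, either with a simple guiding sentence or with a series of examples that spell out the reasoning step by step. The model produces a continuous chain of reasoning before it gives the answer.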

Two key factors enable LLMs to generate object layouts:

defining precise task instructions
providing sufficient in-context examples.

The three prompting strategies in the figure above: (1) task instructions only; (2) task instructions followed by eight in-context examples, but without CoT reasoning; (3) task instructions followed by eight in-context examples, each accompanied by CoT reasoning.
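The three strategies differ only in what is concatenated into the prompt, which a small builder can show. This is a sketch under my own assumptions: the `Caption:`, `Reasoning:`, and `Layout:` labels and the dictionary keys are hypothetical and do not reproduce the paper's actual prompt format.

```python
def build_layout_prompt(instructions, examples, caption, use_cot=True):
    """Assemble a prompt: task instructions, in-context examples, then the query."""
    parts = [instructions.strip()]
    for ex in examples:
        block = [f"Caption: {ex['caption']}"]
        # Strategy (3) includes the reasoning chain; strategy (2) leaves it out.
        if use_cot and ex.get("reasoning"):
            block.append(f"Reasoning: {ex['reasoning']}")
        block.append(f"Layout: {ex['layout']}")
        parts.append("\n".join(block))
    # Leave the last slot open so the model continues from here.
    tail = "Reasoning:" if use_cot else "Layout:"
    parts.append(f"Caption: {caption}\n{tail}")
    return "\n\n".join(parts)
```

Strategy (1) corresponds to `examples=[]` with `use_cot=False`, strategy (2) to a list of examples with `use_cot=False`, and strategy (3) to the same list with `use_cot=True`.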

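LACA: Layout-Aware Cross-Attention

LACA is an adapter designed to incorporate spatial information from the given layouts explicitly into the Stable Diffusion models.

When paired with ground-truth (GT) layouts on the COCO2017 dataset, LACA scores lower than GLIGEN on the YOLO score. The authors attribute this to LACA's adapted captions being less coherent than the original captions.
On FID, LLM-generated layouts trail the GT layouts. The authors suggest the gap may stem from text-layout consistency in the dataset: consistent text and layouts ease the diffusion model's burden of integrating objects from multiple modalities.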

pytorch

Posted on 2024-09-16, updated on 2024-09-23, category: Study notes, length 275, 1 min read

Learning PyTorch (in progress)


2305.15393v2_LayoutGPT

Posted on 2024-09-13, updated on 2025-02-18, category: Papers, length 765, 3 min read

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

Introduction

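Achieving a high degree of user controllability in visual generation usually requires complex, fine-grained inputs such as layouts. Compared with plain text, however, such inputs place a heavy burden on users. The authors propose generating the layouts with an LLM, which gives LayoutGPT.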

LayoutGPT can convert challenging textual concepts to 2D layouts and generate free-form, detailed descriptions for each region. In contrast, existing methods are restricted to class labels and fail to reason over numerical and spatial concepts in text conditions.

Method

LayoutGPT Prompt Construction

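CSS Structures. Image layouts closely resemble the way CSS formats web page layouts and defines the attributes of the img tag in HTML, so attribute values are mapped using CSS-style declarations.
Task Instructions. The authors prepend task instructions to the prompt to specify the task goal, the standard format, the units of the values, and so on.
Normalization. Because the common CSS length unit is the pixel (px), they normalize each attribute value by a fixed scalar and rescale the values to a maximum of 256px.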
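The CSS structure and the normalization step can be sketched together as a small formatter. The property names (`left`, `top`, `width`, `height`), the rule layout, and the choice to scale by the longer image side are my assumptions for illustration; the paper's exact serialization may differ.

```python
def box_to_css(label, box, image_size, max_px=256):
    """Format one box as a CSS-style rule, rescaled so the longer side maps to max_px."""
    x, y, w, h = box
    scale = max_px / max(image_size)  # one fixed scalar for all four values
    left, top, width, height = (round(v * scale) for v in (x, y, w, h))
    return (f"{label} {{left: {left}px; top: {top}px; "
            f"width: {width}px; height: {height}px}}")

def layout_to_css(objects, image_size, max_px=256):
    """Serialize a list of (label, box) pairs, one rule per line."""
    return "\n".join(box_to_css(label, box, image_size, max_px) for label, box in objects)
```

For a 512x512 image, `box_to_css("dog", (64, 128, 256, 192), (512, 512))` halves every value and yields `dog {left: 32px; top: 64px; width: 128px; height: 96px}`.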

Experiment Setup

Datasets & Benchmarks: We propose NSR-1K, a benchmark that includes template-based and human-written (natural) prompts from MSCOCO.
Evaluation Metrics: precision, recall, and accuracy.
Baselines: As we consider both layout evaluation and image evaluation, we compare LayoutGPT with end-to-end T2I models (Stable Diffusion, Attend-and-Excite) and two-stage systems that generate layouts first and then apply GLIGEN as the layout-to-image model. Human-drawn layouts are also included in the comparison.

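Meaning of the metrics in the figure:

Numerical Reasoning:
- Precision: the percentage of predicted objects that actually exist in the ground truth.
- Recall: the percentage of ground-truth objects that are predicted.
- Accuracy: computed from the counts and spatial positions of the generated bounding boxes.
- Acc. (GLIP): average accuracy computed from GLIP detection results, based on the bounding-box counts or spatial relations in the generated images.
- CLIP Sim.: CLIP cosine similarity between the text prompt and the generated image, reported for reference.
Spatial Reasoning:
- Accuracy: using the LLM-generated layouts and the GLIP layouts, measures how accurate the spatial relations in the predicted layouts are, as the percentage of predicted layouts with the correct spatial relation.
- Acc. (GLIP): same meaning as in numerical reasoning; accuracy computed from GLIP detection results.
- CLIP Sim.: same meaning as in numerical reasoning; CLIP cosine similarity between the text prompt and the generated image.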
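The precision and recall definitions for numerical reasoning can be illustrated by counting object labels. This sketch matches predicted and ground-truth objects per category by count, which is my reading of the definitions above rather than the benchmark's reference implementation; `count_precision_recall` is a hypothetical name.

```python
from collections import Counter

def count_precision_recall(predicted_labels, true_labels):
    """Precision and recall over object instances, matched per category by count."""
    pred = Counter(predicted_labels)
    true = Counter(true_labels)
    # Counter intersection keeps the per-category minimum of the two counts.
    matched = sum((pred & true).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(true.values()) if true else 0.0
    return precision, recall
```

If the layout predicts one cat where the prompt asks for two, every predicted object is correct (precision 1.0) but only half of the required objects are covered (recall 0.5).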

Evaluation Results

GPT-3.5 achieves the best performance in numerical reasoning while GPT-4 performs the best in generating correct spatial positions.
The discrepancy between layout accuracy and GLIP-based accuracy suggests that the bottleneck mainly stems from layout-guided image generation and GLIP grounding results.
LayoutGPT can understand visual commonsense.

Application Scenarios

Dense Layout Planning: Though only a few objects are mentioned in the prompts, LayoutGPT predicts layouts for the whole scene and imagines common objects that are usually visible in each scene.
Text-based Inpainting: LayoutGPT can enrich the description of each object with details that are not mentioned in the prompt.
Counterfactual Scenarios: LayoutGPT manages to generate reasonable layouts on these challenging prompts and handles the relationship between objects well.

Indoor Scene Synthesis

While living rooms are challenging for both methods, LayoutGPT shows more significant improvement over ATISS as supervised methods tend to overfit in early epochs.

2305.13655v3_LMD

Posted on 2024-09-13, updated on 2025-02-18, category: Papers, length 483, 2 min read

Core idea:

Text->Layout->Image

two stages:

text-grounded layout generation. Two key principles:
1. Each object instance is represented by a single bounding box
2. We leave no foreground objects specified in the boxes to the background caption to ensure all foreground objects are controlled by our layout-grounded image generator.
layout-grounded image generation. Two steps:
1. generating masked latents for each box specified in the layout, with attention control ensuring that the object is placed in the designated box;
2. composing the masked latents as priors to guide the image generation to adhere to the specified layout.
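The output of the first stage can be pictured as a small data structure: one captioned bounding box per object instance plus a background caption. The sketch below also checks the second principle with a crude substring test. The class names, the `(x, y, width, height)` box format, and the check itself are my assumptions, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    caption: str   # description of a single object instance
    xywh: tuple    # (x, y, width, height) in pixels

@dataclass
class Layout:
    boxes: list = field(default_factory=list)
    background: str = ""

def foreground_leaks(layout):
    """Return box captions that also appear verbatim in the background caption."""
    bg = layout.background.lower()
    return [box.caption for box in layout.boxes if box.caption.lower() in bg]
```

An empty result from `foreground_leaks` suggests the background caption does not mention any boxed object, so placement of those objects is left to the layout-grounded generator. A substring match will miss paraphrases, so this is only a sanity check.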

LMD+

LMD+ integrates our training-free controller with a training-based method (GLIGEN).

Additional Capabilities of LMD

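Instruction-based scene specification. Users can reach the image they want through multiple rounds of instructions.
1. The image does not drift from the original after multiple rounds of requests.
2. LMD can handle requests that involve spatial reasoning.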
Supporting more languages.

Evaluation

Our method is applicable to various diffusion models without additional training. The experiments start with Stable Diffusion; some later figures use SDXL, the largest Stable Diffusion model.

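Qualitative comparison with VisualChatGPT and GILL: of the four images that LMD gets right, they generate only one correctly.
- VisualChatGPT and GILL both bring LLMs into the image generation pipeline, and both use SD as the underlying image generation model.
- Both condition SD through text embeddings, which overlooks how little control text embeddings give over SD. LMD applies spatial control directly and sidesteps the problem.
On the four benchmark tasks, LMD significantly outperforms Stable Diffusion (SD).
Using Stable Diffusion v1.5 as the base model and GPT-4 for layout generation greatly improves prompt following relative to the base model.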

Quantitative evaluation

We propose a text-to-image evaluation benchmark that includes four tasks:

negation
generative numeracy
attribute binding
spatial reasoning

...

Discussions

Since we use models off-the-shelf, the LLM may generate layouts that are ambiguous to the diffusion model. Each object may be rendered from a different viewpoint, so the composed image can end up logically inconsistent.