Introduction to diffusion models course
Exported on 6/19/2025 at 08:08:05 GMT+8 from Cursor (1.0.0)
User
Introduction to Flow Matching and Diffusion Models MIT Computer Science Class 6.S184: Generative AI with Stochastic Differential Equations Diffusion and flow-based models have become the state of the art for generative AI across a wide range of data modalities, including images, videos, shapes, molecules, music, and more! This course aims to build up the mathematical framework underlying these models from first principles. At the end of the class, students will have built a toy image diffusion model from scratch, and along the way, will have gained hands-on experience with the mathematical toolbox of stochastic differential equations that is useful in many other fields. This course is ideal for students who want to develop a principled understanding of the theory and practice of generative AI.
Course Notes The course notes serve as the backbone of the course and provide a self-contained explanation of all material in the class. In contrast, lecture slides will generally not be self-contained and are intended to provide accompanying visualizations during the lecture. You may view the notes by clicking on the colored link below.
View the course notes here!

Lectures

1. Flow and Diffusion Models: Introduction to generative models; ordinary and stochastic differential equations; sampling from flow and diffusion models [slides 1]
2. Constructing a Training Target: Conditional and marginal probability paths; continuity and Fokker-Planck equations; marginal vector field and marginal score function [slides 2]
3. Training Flow and Diffusion Models: Flow matching; score matching; various approaches to diffusion models [slides 3]
4. Building an Image Generator: Guidance and conditional generation; neural network architectures; survey of state-of-the-art models [slides 4]
5. Generative Robotics: Guest lecture by Benjamin Burchfiel (Toyota Research); large behavior models; diffusion models for robotics (slides: N/A)
6. Generative Protein Design: Guest lecture by Jason Yim (MIT); designing new proteins with AI; flow matching for protein structure generation [slides 6]

Labs

There are 3 labs given as exercises accompanying the class to give you hands-on practical experience. The labs guide you through building a flow matching and diffusion model from scratch, step by step. To do the exercises, perform the following steps:

1. Click on the "Open in Colab" link to open the lab in Google Drive.
2. Click on the "Open in Google Colaboratory" link at the center top of the page. A Jupyter notebook should appear.
3. Click on "File" → "Save a copy in Drive" to save a copy of the lab to your own Google Drive.
4. Follow the instructions in the lab to complete the exercises.

Lab 1: Working with SDEs (Open in Colab)
Lab 2: Flow Matching and Score Matching (Open in Colab)
Lab 3: Conditional Image Generation (Open in Colab)

Stuck? Solutions can be found here.
Instructors This class was co-taught by Peter and Ezra. We are fortunate to have Tommi Jaakkola as our sponsor and advisor.
Peter Holderrieth, PhD Student
Ezra Erives, MEng Student
Prerequisites: Linear algebra, real analysis, and basic probability theory. Students should be familiar with Python and have some experience with PyTorch.
Questions? Email either Peter or Ezra!
Remark about LLMs: This course does not cover large language models (LLMs). LLMs involve discrete data such as text, while this course focuses on data lying in continuous spaces such as images, videos, and protein structures.
Acknowledgements

We would like to thank the following individuals and organizations, without whose support this course would not be possible:

- Professor Tommi Jaakkola, without whose support this class would not be possible
- Lisa Bella, Ellen Reid, and everyone else at MIT EECS for their generous support
- Christian Fiedler, Tim Griesbach, Benedikt Geiger, and Albrecht Holderrieth for invaluable feedback on the lecture notes
- Elaine Mello from MIT Open Learning for support with lecture recordings
- Ashay Athalye from Students for Open and Universal Learning for helping to edit and publish lecture recordings
- Cameron Diao, Tally Portnoi, Andi Qu, Roger Trullo, Ádám Burián, Zewen Yang, and many others for their invaluable contributions to the labs
- The Missing Semester of Your CS Education, upon whose website this one was inspired
- Participants in the original course offering (MIT 6.S184/6.S975, taught over IAP 2025), as well as readers like you, for your interest in this course

Thank you!
To cite this course, please use:

@misc{flowsanddiffusions2025,
  author = {Peter Holderrieth and Ezra Erives},
  title = {Introduction to Flow Matching and Diffusion Models},
  year = {2025},
  url = {https://diffusion.csail.mit.edu/}
}

Source code.
Licensed under CC BY-NC-SA.
Does this course cover diffusion?
Cursor
Yes, this course does cover diffusion models.

MIT 6.S184, "Introduction to Flow Matching and Diffusion Models", is specifically about flow matching and diffusion models for generative AI. The course covers:

- The fundamentals of generative models
- Ordinary differential equations (ODEs) and stochastic differential equations (SDEs)
- How to sample from flow and diffusion models (see the sketch below)
- Training methods for diffusion models (score matching, flow matching, and so on)
- Building an image generator in practice
- Applications of diffusion models in robotics, protein design, and other areas

The course also includes hands-on labs that walk you through implementing diffusion and flow matching models from scratch.
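To make the sampling bullet above concrete, here is a minimal sketch (not taken from the course materials) of Euler integration of a learned vector field; the network `v_theta(x, t)` and the 2D data shape are assumptions for illustration:

```python
import torch

@torch.no_grad()
def sample_with_flow(v_theta, shape=(16, 2), n_steps=100):
    """Integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = torch.randn(shape)                     # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0], 1), i * dt)  # current time, one entry per sample
        x = x + v_theta(x, t) * dt             # one Euler step along the learned vector field
    return x
```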
Summary: the course systematically covers the theory, mathematical foundations, and practical implementation of diffusion models, making it a great fit for anyone who wants a deep understanding of them.
If you are interested in diffusion models, this course is an excellent learning resource!
User
Walk me through the syllabus in detail.
Cursor
Of course! Here is a detailed breakdown of the syllabus for MIT 6.S184, "Introduction to Flow Matching and Diffusion Models":

Course Outline (Syllabus)

1. Course Introduction

- Basic concepts of generative models
- Course goals and learning outcomes
- Target audience and prerequisites

2. Course Notes

- The backbone of the course, with a systematic treatment of all theory and methods
- Self-contained and suitable for self-study
- Includes mathematical derivations, formulas, examples, and code snippets
3. Lectures

| Lecture | Topic | Main content |
| --- | --- | --- |
| 1 | Flow and Diffusion Models | Introduction to generative models; ordinary differential equations (ODEs) and stochastic differential equations (SDEs); sampling from flow and diffusion models |
| 2 | Constructing a Training Target | Conditional and marginal probability paths; continuity and Fokker-Planck equations; marginal vector field and marginal score function |
| 3 | Training Flow and Diffusion Models | Flow matching; score matching; various approaches to training diffusion models (see the sketch after this table) |
| 4 | Building an Image Generator | Guidance and conditional generation; neural network architectures; survey of state-of-the-art models |
| 5 | Generative Robotics | Guest lecture (Toyota Research); large behavior models; diffusion models for robotics |
| 6 | Generative Protein Design | Guest lecture (MIT); AI-assisted protein design; flow matching for protein structure generation |
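As a companion to the Lecture 3 row above, here is a minimal, hedged sketch of a conditional flow matching loss with a simple linear (Gaussian-to-data) path. It is not the course's reference implementation; the network `v_theta(x_t, t)` and the assumption that `x1` is a `(batch, dim)` tensor are illustrative choices:

```python
import torch
import torch.nn as nn

def conditional_flow_matching_loss(v_theta: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Regress a network onto the conditional vector field of the linear path x_t = (1-t) x0 + t x1."""
    x0 = torch.randn_like(x1)              # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)         # t ~ Uniform[0, 1], one per sample
    xt = (1.0 - t) * x0 + t * x1           # point on the conditional probability path
    target = x1 - x0                       # conditional vector field for this path
    return ((v_theta(xt, t) - target) ** 2).mean()
```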
4. Labs

| Lab | Topic | Main content |
| --- | --- | --- |
| 1 | Working with SDEs | SDE basics; numerical simulation of SDEs; the role of SDEs in generative models (see the sketch after this table) |
| 2 | Flow Matching and Score Matching | Implementing the flow matching algorithm; implementing the score matching algorithm; connecting theory and practice |
| 3 | Conditional Image Generation | Conditional generation methods; generating images with diffusion models; further experiments and applications |

Each lab comes with a Google Colab notebook so you can practice hands-on.
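For Lab 1, here is a minimal sketch of the kind of SDE simulation involved (an Euler-Maruyama discretization). This is an illustration under assumed drift/diffusion functions, not the lab's solution code:

```python
import torch

def euler_maruyama(drift, diffusion, x0, n_steps=500, t_end=1.0):
    """Simulate dX_t = drift(X_t, t) dt + diffusion(t) dW_t on a uniform time grid."""
    dt = t_end / n_steps
    x = x0.clone()
    for i in range(n_steps):
        t = i * dt
        x = x + drift(x, t) * dt + diffusion(t) * (dt ** 0.5) * torch.randn_like(x)
    return x

# Example: an Ornstein-Uhlenbeck process dX_t = -X_t dt + sqrt(2) dW_t
samples = euler_maruyama(lambda x, t: -x, lambda t: 2.0 ** 0.5, torch.randn(1000, 2))
```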
5. Instructors

- Peter Holderrieth (PhD student)
- Ezra Erives (MEng student)
- Advisor: Professor Tommi Jaakkola

6. Prerequisites

- Linear algebra
- Real analysis
- Basic probability theory
- Programming experience with Python and PyTorch

7. Other Notes

- The course does not cover large language models (LLMs); it focuses on generative models for data in continuous spaces such as images, videos, and protein structures
- Course materials and code are open source under the CC BY-NC-SA license

Summary:

From theory to practice, the course systematically covers the principles, mathematical foundations, training methods, and real-world applications of flow matching and diffusion models, with plenty of labs and code resources. It is ideal for anyone who wants a deep understanding of generative AI.

If you would like a more detailed breakdown of any individual lecture or lab, just ask!
User
leejet/stable-diffusion.cpp (public GitHub repository)
Latest commit: vmobilis, "fix missing argument in prototype of stbi_write_jpg (#613)", 10c6501, Mar 9, 2025.

| Name | Last commit message | Last commit date |
| --- | --- | --- |
| .github/workflows | chore: fix cuda on github action (#580) | Feb 5, 2025 |
| assets | feat: add sd3.5 support (#445) | Oct 24, 2024 |
| docs | feat: add PhotoMaker Version 2 support (#358) | Nov 23, 2024 |
| examples | chore: silence some warnings about precision loss (#620) | Mar 9, 2025 |
| ggml @ ff90529 | sync: update ggml | Mar 1, 2025 |
| thirdparty | fix missing argument in prototype of stbi_write_jpg (#613) | Mar 9, 2025 |
| .clang-format | style: format code | Jan 29, 2024 |
| .dockerignore | chore: add sd Dockerfile | Aug 22, 2023 |
| .gitignore | feat: Control Net support + Textual Inversion (embeddings) (#131) | Jan 29, 2024 |
| .gitmodules | sync: update ggml submodule url | Aug 3, 2024 |
| CMakeLists.txt | chore: add global SYCL compile flags (#597) | Feb 22, 2025 |
| Dockerfile | chore: add sd Dockerfile | Aug 22, 2023 |
| Dockerfile.musa | chore: bump MUSA SDK version to rc3.1.1 (#599) | Feb 22, 2025 |
| LICENSE | Initial commit | Aug 13, 2023 |
| README.md | docs: fix typo in readme (#574) | Feb 22, 2025 |
| clip.hpp | fix: fix embeddings with quantized models (#601) | Mar 1, 2025 |
| common.hpp | feat: flexible model architecture for dit models (Flux & SD3) (#490) | Nov 30, 2024 |
| conditioner.hpp | fix: suport sdxl embedddings (#621) | Mar 9, 2025 |
| control.hpp | feat: support Inpaint models (#511) | Dec 28, 2024 |
| denoiser.hpp | feat: implement DDIM with the “trailing” timestep spacing and TCD (#568) | Feb 22, 2025 |
| diffusion_model.hpp | feat: support Inpaint models (#511) | Dec 28, 2024 |
| esrgan.hpp | feat: flexible model architecture for dit models (Flux & SD3) (#490) | Nov 30, 2024 |
| face_detect.py | feat: add PhotoMaker Version 2 support (#358) | Nov 23, 2024 |
| flux.hpp | feat: support Inpaint models (#511) | Dec 28, 2024 |
| format-code.sh | refactor: reorganize code and use c api (#133) | Jan 1, 2024 |
| ggml_extend.hpp | fix: force binary mask for inpaint models (#589) | Feb 22, 2025 |
| gits_noise.inl | fix: clang complains about needless braces (#618) | Mar 9, 2025 |
| lora.hpp | feat: partial LyCORIS support (tucker decomposition for LoCon + LoHa … | Feb 22, 2025 |
| mmdit.hpp | feat: flexible model architecture for dit models (Flux & SD3) (#490) | Nov 30, 2024 |
| model.cpp | fix: ModelLoader::load_tensors duplicated check (#623) | Mar 9, 2025 |
| model.h | sync: update ggml | Mar 1, 2025 |
| pmid.hpp | feat: flexible model architecture for dit models (Flux & SD3) (#490) | Nov 30, 2024 |
| preprocessing.hpp | fix: fix the issue with dynamic linking | Feb 25, 2024 |
| rng.hpp | refactor: reorganize code and use c api (#133) | Jan 1, 2024 |
| rng_philox.hpp | refactor: reorganize code and use c api (#133) | Jan 1, 2024 |
| stable-diffusion.cpp | fix: correct img2img time (#616) | Mar 9, 2025 |
| stable-diffusion.h | sync: update ggml | Mar 1, 2025 |
| t5.hpp | feat: flexible model architecture for dit models (Flux & SD3) (#490) | Nov 30, 2024 |
| tae.hpp | fix: fix race condition causing inconsistent value for decoder_only (… | Mar 1, 2025 |
| unet.hpp | feat: support Inpaint models (#511) | Dec 28, 2024 |
| upscaler.cpp | chore: change SD_CUBLAS/SD_USE_CUBLAS to SD_CUDA/SD_USE_CUDA | Dec 28, 2024 |
| util.cpp | fix: make get_files_from_dir works with absolute path (#598) | Feb 22, 2025 |
| util.h | feat: add PhotoMaker Version 2 support (#358) | Nov 23, 2024 |
| vae.hpp | feat: flexible model architecture for dit models (Flux & SD3) (#490) | Nov 30, 2024 |
| vocab.hpp | feat: add sd3 support (#298) | Jul 28, 2024 |
stable-diffusion.cpp
Inference of Stable Diffusion and Flux in pure C/C++

Features
Plain C/C++ implementation based on ggml, working in the same way as llama.cpp
Super lightweight and without external dependencies
SD1.x, SD2.x, SDXL and SD3/SD3.5 support
!!!The VAE in SDXL encounters NaN issues under FP16, but unfortunately, the ggml_conv_2d only operates under FP16. Hence, a parameter is needed to specify the VAE that has fixed the FP16 NaN issue. You can find it here: SDXL VAE FP16 Fix.
Flux-dev/Flux-schnell Support
SD-Turbo and SDXL-Turbo support
PhotoMaker support.
16-bit, 32-bit float support
2-bit, 3-bit, 4-bit, 5-bit and 8-bit integer quantization support
Accelerated memory-efficient CPU inference
Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image, enabling Flash Attention just requires ~1.8GB.
AVX, AVX2 and AVX512 support for x86 architectures
Full CUDA, Metal, Vulkan and SYCL backend for GPU acceleration.
Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAE models
No need to convert to .ggml or .gguf anymore!
Flash Attention for memory usage optimization
Original txt2img and img2img mode
Negative prompt
stable-diffusion-webui style tokenizer (not all the features, only token weighting for now)
LoRA support, same as stable-diffusion-webui
Latent Consistency Models support (LCM/LCM-LoRA)
Faster and memory efficient latent decoding with TAESD
Upscale images generated with ESRGAN
VAE tiling processing to reduce memory usage
Control Net support with SD 1.5
Sampling methods
Euler A
Euler
Heun
DPM2
DPM++ 2M
DPM++ 2M v2
DPM++ 2S a
LCM
Cross-platform reproducibility (--rng cuda, consistent with the stable-diffusion-webui GPU RNG)
Embeds generation parameters into png output as a webui-compatible text string
Supported platforms
Linux
Mac OS
Windows
Android (via Termux)
TODO
More sampling methods
Make inference faster
The current implementation of ggml_conv_2d is slow and has high memory usage
Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
Implement Inpainting support
Usage
For most users, you can download the built executable program from the latest release. If the built product does not meet your requirements, you can choose to build it manually.

Get the Code

git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
If you have already cloned the repository, you can use the following command to update the repository to the latest code.
cd stable-diffusion.cpp
git pull origin master
git submodule init
git submodule update
Download weights
Download the original weights (.ckpt or .safetensors). For example:
Stable Diffusion v1.4 from https://huggingface.co/CompVis/stable-diffusion-v-1-4-original
Stable Diffusion v1.5 from https://huggingface.co/runwayml/stable-diffusion-v1-5
Stable Diffusion v2.1 from https://huggingface.co/stabilityai/stable-diffusion-2-1
Stable Diffusion 3 2B from https://huggingface.co/stabilityai/stable-diffusion-3-medium
curl -L -O https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
curl -L -O https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors
curl -L -O https://huggingface.co/stabilityai/stable-diffusion-2-1/resolve/main/v2-1_768-nonema-pruned.safetensors
curl -L -O https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/sd3_medium_incl_clips_t5xxlfp16.safetensors
Build

Build from scratch

mkdir build
cd build
cmake ..
cmake --build . --config Release
Using OpenBLAS
cmake .. -DGGML_OPENBLAS=ON
cmake --build . --config Release
Using CUDA
This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro’s package manager (e.g. apt install nvidia-cuda-toolkit) or from here: CUDA Toolkit. Recommended to have at least 4 GB of VRAM.
cmake .. -DSD_CUDA=ON
cmake --build . --config Release
Using HipBLAS
This provides BLAS acceleration using the ROCm cores of your AMD GPU. Make sure to have the ROCm toolkit installed.
Windows users: refer to docs/hipBLAS_on_Windows.md for a comprehensive guide.

cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1100
cmake --build . --config Release
Using MUSA
This provides BLAS acceleration using the MUSA cores of your Moore Threads GPU. Make sure to have the MUSA toolkit installed.
cmake .. -DCMAKE_C_COMPILER=/usr/local/musa/bin/clang -DCMAKE_CXX_COMPILER=/usr/local/musa/bin/clang++ -DSD_MUSA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
Using Metal
Using Metal makes the computation run on the GPU. Currently, there are some issues with Metal when performing operations on very large matrices, making it highly inefficient at the moment. Performance improvements are expected in the near future.
cmake .. -DSD_METAL=ON
cmake --build . --config Release
Using Vulkan
Install Vulkan SDK from https://www.lunarg.com/vulkan-sdk/.
cmake .. -DSD_VULKAN=ON
cmake --build . --config Release
Using SYCL
Using SYCL makes the computation run on an Intel GPU. Please make sure you have installed the related driver and the Intel® oneAPI Base Toolkit before starting. For more details and steps, refer to the llama.cpp SYCL backend documentation.
Export relevant ENV variables
source /opt/intel/oneapi/setvars.sh
Option 1: Use FP32 (recommended for better performance in most cases)
cmake .. -DSD_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
Option 2: Use FP16
cmake .. -DSD_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON
cmake --build . --config Release
Example of text2img by using SYCL backend:
download stable-diffusion model weight, refer to download-weight.
run ./bin/sd -m ../models/sd3_medium_incl_clips_t5xxlfp16.safetensors --cfg-scale 5 --steps 30 --sampling-method euler -H 1024 -W 1024 --seed 42 -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"
Using Flash Attention
Enabling flash attention for the diffusion model reduces memory usage by varying amounts, e.g.:
flux 768x768 ~600mb
SD2 768x768 ~1400mb
For most backends, it slows things down, but for cuda it generally speeds it up too. At the moment, it is only supported for some models and some backends (like cpu, cuda/rocm, metal).
Run by adding --diffusion-fa to the arguments and watch for:
[INFO ] stable-diffusion.cpp:312 - Using flash attention in the diffusion model
and the compute buffer shrink in the debug log:
[DEBUG] ggml_extend.hpp:1004 - flux compute buffer size: 650.00 MB(VRAM)
Run
usage: ./bin/sd [arguments]

arguments:
  -h, --help                         show this help message and exit
  -M, --mode [MODEL]                 run mode (txt2img or img2img or convert, default: txt2img)
  -t, --threads N                    number of threads to use during computation (default: -1)
                                     If threads <= 0, then threads will be set to the number of CPU physical cores
  -m, --model [MODEL]                path to full model
  --diffusion-model                  path to the standalone diffusion model
  --clip_l                           path to the clip-l text encoder
  --clip_g                           path to the clip-g text encoder
  --t5xxl                            path to the t5xxl text encoder
  --vae [VAE]                        path to vae
  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
  --control-net [CONTROL_PATH]       path to control net model
  --embd-dir [EMBEDDING_PATH]        path to embeddings
  --stacked-id-embd-dir [DIR]        path to PHOTOMAKER stacked id embeddings
  --input-id-images-dir [DIR]        path to PHOTOMAKER input id images dir
  --normalize-input                  normalize PHOTOMAKER input id images
  --upscale-model [ESRGAN_PATH]      path to esrgan model. Upscale images after generate, just RealESRGAN_x4plus_anime_6B supported by now
  --upscale-repeats                  Run the ESRGAN upscaler this many times (default 1)
  --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_k, q3_k, q4_k)
                                     If not specified, the default is the type of the weight file
  --lora-model-dir [DIR]             lora model directory
  -i, --init-img [IMAGE]             path to the input image, required by img2img
  --control-image [IMAGE]            path to image condition, control net
  -o, --output OUTPUT                path to write result image to (default: ./output.png)
  -p, --prompt [PROMPT]              the prompt to render
  -n, --negative-prompt PROMPT       the negative prompt (default: "")
  --cfg-scale SCALE                  unconditional guidance scale: (default: 7.0)
  --skip-layers LAYERS               Layers to skip for SLG steps: (default: [7,8,9])
  --skip-layer-start START           SLG enabling point: (default: 0.01)
  --skip-layer-end END               SLG disabling point: (default: 0.2)
                                     SLG will be enabled at step int([STEPS]*[START]) and disabled at int([STEPS]*[END])
  --strength STRENGTH                strength for noising/unnoising (default: 0.75)
  --style-ratio STYLE-RATIO          strength for keeping input identity (default: 20%)
  --control-strength STRENGTH        strength to apply Control Net (default: 0.9)
                                     1.0 corresponds to full destruction of information in init image
  -H, --height H                     image height, in pixel space (default: 512)
  -W, --width W                      image width, in pixel space (default: 512)
  --sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm}
                                     sampling method (default: "euler_a")
  --steps STEPS                      number of sample steps (default: 20)
  --rng {std_default, cuda}          RNG (default: cuda)
  -s SEED, --seed SEED               RNG seed (default: 42, use random seed for < 0)
  -b, --batch-count COUNT            number of images to generate
  --schedule {discrete, karras, exponential, ays, gits}
                                     Denoiser sigma schedule (default: discrete)
  --clip-skip N                      ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer (default: -1)
                                     <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
  --vae-tiling                       process vae in tiles to reduce memory usage
  --vae-on-cpu                       keep vae in cpu (for low vram)
  --clip-on-cpu                      keep clip in cpu (for low vram)
  --diffusion-fa                     use flash attention in the diffusion model (for low vram)
                                     Might lower quality, since it implies converting k and v to f16. This might crash if it is not supported by the backend.
  --control-net-cpu                  keep controlnet in cpu (for low vram)
  --canny                            apply canny preprocessor (edge detection)
  --color                            Colors the logging tags according to level
  -v, --verbose                      print extra info
txt2img example
./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat"
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat"
./bin/sd -m ../models/sd_xl_base_1.0.safetensors --vae ../models/sdxl_vae-fp16-fix.safetensors -H 1024 -W 1024 -p "a lovely cat" -v
./bin/sd -m ../models/sd3_medium_incl_clips_t5xxlfp16.safetensors -H 1024 -W 1024 -p 'a lovely cat holding a sign says "Stable Diffusion CPP"' --cfg-scale 4.5 --sampling-method euler -v
./bin/sd --diffusion-model ../models/flux1-dev-q3_k.gguf --vae ../models/ae.sft --clip_l ../models/clip_l.safetensors --t5xxl ../models/t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v
./bin/sd -m ..\models\sd3.5_large.safetensors --clip_l ..\models\clip_l.safetensors --clip_g ..\models\clip_g.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -H 1024 -W 1024 -p 'a lovely cat holding a sign says "Stable diffusion 3.5 Large"' --cfg-scale 4.5 --sampling-method euler -v

Using formats of different precisions will yield results of varying quality (f32, f16, q8_0, q5_0, q5_1, q4_0, q4_1).
img2img example
./output.png is the image generated from the above txt2img pipeline
./bin/sd --mode img2img -m ../models/sd-v1-4.ckpt -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4
More Guides
LoRA
LCM/LCM-LoRA
Using PhotoMaker to personalize image generation
Using ESRGAN to upscale results
Using TAESD for faster decoding
Docker
Quantization and GGUF
Bindings
These projects wrap stable-diffusion.cpp for easier use in other languages/frameworks.
Golang: seasonjs/stable-diffusion
C#: DarthAffe/StableDiffusion.NET
Python: william-murray1204/stable-diffusion-cpp-python
Rust: newfla/diffusion-rs
UIs
These projects use stable-diffusion.cpp as a backend for their image generation.
Jellybox
Stable Diffusion GUI
Stable Diffusion CLI-GUI
Contributors
Thank you to all the people who have already contributed to stable-diffusion.cpp!
Contributors Star History
Star History Chart References
ggml
stable-diffusion
sd3-ref
stable-diffusion-stability-ai
stable-diffusion-webui
ComfyUI
k-diffusion
latent-consistency-model
generative-models
PhotoMaker
About

Stable Diffusion and Flux in pure C/C++

Topics: flux, ai, cplusplus, image-generation, diffusion, text2image, image2image, img2img, txt2img, latent-diffusion, stable-diffusion, ggml, flux-dev, flux-schnell

License: MIT. 4.2k stars, 59 watching, 388 forks. 138 releases (latest: master-10feacf, Mar 9, 2025). 42 contributors. Languages: C++ 100.0%.
Apart from CUDA, what are all these other options?
Cursor
Your question, "apart from CUDA, what are all these?", refers to the various acceleration/backend options in the build instructions for the stable-diffusion.cpp project. Here is what each option means and when it applies:

1. OpenBLAS

- What it does: uses the OpenBLAS library for BLAS (Basic Linear Algebra Subprograms) acceleration, speeding up matrix operations on the CPU.
- Who it is for: users without a GPU, or who only want to run on the CPU.
- Platforms: Linux, Windows, and macOS.

2. HipBLAS

- What it does: uses AMD's ROCm platform and the HipBLAS library to accelerate on AMD GPUs.
- Who it is for: users with AMD graphics cards.
- Platforms: mainly Linux, with partial Windows support (requires extra setup).

3. MUSA

- What it does: acceleration for Moore Threads GPUs using the MUSA toolchain.
- Who it is for: users with Moore Threads graphics cards.
- Platforms: mainly Linux.

4. Metal

- What it does: Apple's own GPU acceleration framework, for Macs (especially M1/M2 chips).
- Who it is for: Mac users, especially on Apple Silicon.
- Platforms: macOS.

5. Vulkan

- What it does: a cross-platform, high-performance graphics and compute API that can accelerate on many GPUs (Nvidia, AMD, Intel).
- Who it is for: users with a Vulkan-capable GPU and driver.
- Platforms: Windows, Linux, and some Macs.

6. SYCL

- What it does: a cross-platform parallel programming framework led by Intel, mainly for Intel GPUs (e.g. Arc and integrated graphics), with partial support for other platforms.
- Who it is for: users with an Intel GPU or who want to use the oneAPI ecosystem.
- Platforms: mainly Linux, with partial Windows support.
7. Other related options

- FP16/FP32: floating-point precision. FP16 (half precision) saves memory; FP32 (single precision) is more accurate.
- Quantization: storing weights at lower bit widths (such as 4-bit or 8-bit) to further reduce memory usage (see the rough estimate below).
- Flash Attention: an optimization that reduces the memory use and improves the speed of the attention computation; supported by some backends.
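As a rough illustration of the precision and quantization bullets above, the sketch below estimates weight memory for a UNet of roughly the size of SD 1.5 (~860M parameters, an assumed figure); the q8_0/q4_0 bits-per-weight values are approximate ggml block-quantization numbers including block scales:

```python
# Approximate weight-memory estimate; the parameter count and bits-per-weight are rough assumptions.
params = 860_000_000  # roughly the SD 1.5 UNet size
bits_per_weight = {"f32": 32, "f16": 16, "q8_0": 8.5, "q4_0": 4.5}

for fmt, bits in bits_per_weight.items():
    gib = params * bits / 8 / 1024**3
    print(f"{fmt:>5}: ~{gib:.2f} GiB of weights")
```

Activations, the text encoder, and the VAE add to these figures, which is why end-to-end numbers (like the ~2.3GB fp16 txt2img figure in the README) are larger than the weight footprint alone.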
Summary

- CUDA: acceleration for Nvidia GPUs.
- OpenBLAS: CPU acceleration.
- HipBLAS: AMD GPU acceleration.
- MUSA: Moore Threads GPU acceleration.
- Metal: Apple GPU acceleration on Macs.
- Vulkan: cross-platform GPU acceleration.
- SYCL: Intel GPU / oneAPI acceleration.

Pick the backend that matches your hardware: CUDA for Nvidia GPUs, HipBLAS for AMD GPUs, Metal on a Mac, SYCL for Intel GPUs, and OpenBLAS (CPU) if you have no discrete GPU.
If any specific option is still unclear, feel free to ask!
User
Google search results for "Gradio":

Gradio, https://gradio.app: "Present and share. Gradio can be embedded in Python notebooks or presented as a webpage. A Gradio interface can automatically generate a public link you can …" Linked sections: Quickstart (Creating A Chatbot Fast, Introduction, Sharing Your App), Documentation (Interface, Blocks, ChatInterface, Image, Textbox), Playground ("return "Hello " + name + "!" demo = gr.Interface(fn=greet, inputs …"), Interface ("Interface is Gradio's main high-level class, and allows you to …"), Gradio Lite ("@gradio/lite is a JavaScript library that enables you to run Gradio …").

gradio-app/gradio: Build and share delightful machine …, https://github.com/gradio-app/gradio: "Gradio is an open-source Python package that allows you to quickly build a demo or web application for your machine learning model, API, or any arbitrary …"

Related questions: What is Gradio used for? Which is better, Gradio or Streamlit? Is Gradio owned by Huggingface? Is Gradio free to use?

Gradio on X (over 53,000 followers): "Build and share machine learning apps in 3 lines of Python. Part of the @Huggingface family. DMs are open for sharing your gradio app with us!"

Videos: "Introduction to Gradio Web UI development | Gradio | Python" (YouTube, Raj Kapadia, Jul 26, 2023); "[Launch] How to Build Machine Learning APIs Using Gradio" (YouTube, HuggingFace, Jun 7, 2024); "Building an LLM Application with Gradio" (YouTube, HuggingFace, Jul 6, 2024).

Gradio, https://huggingface.co/gradio: "Making machine learning more accessible through our open-source Python library. Rapidly create UIs to demo & deploy your model. Share with anyone, anywhere."

Gradio (GitHub organization), https://github.com/gradio-app: "Delightfully easy-to-use open-source tools that make machine learning easier and more accessible."

Related searches: gradio tutorial, gradio introduction, what is gradio, gradio installation, gradio framework, gradio Chinese documentation.
What is Gradio?
Cursor
Gradio is an open-source Python library that lets you very quickly build an interactive web interface for a machine learning model, an API, or any Python function. With just a few lines of code you can turn your model into an app that runs in the browser and share it with others through a single link.

Main features of Gradio

- Build interfaces fast: a few lines of code produce a visual web UI for your model or function.
- Many input/output types: text, images, audio, video, sliders, dropdowns, and more.
- Easy sharing: Gradio can automatically generate a public link, so others can try your model in their browser without installing anything.
- Jupyter Notebook integration: Gradio interfaces can run and render directly inside a notebook.
- Hugging Face integration: Gradio is part of the Hugging Face ecosystem, and many model demos on Hugging Face are built with it.

Typical usage example
import gradio as gr
def greet(name):
return "Hello " + name + "!"
demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch()
When you run this, a web page opens where you can type a name, click the button, and see the output.
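To illustrate the non-text components mentioned above, here is another small sketch that wires an image-to-image function into an interface; the `flip` function is just an illustrative placeholder, not an official Gradio example:

```python
import gradio as gr
import numpy as np

def flip(image: np.ndarray) -> np.ndarray:
    # Gradio hands the uploaded image to the function as a numpy array by default
    return np.fliplr(image)

demo = gr.Interface(fn=flip, inputs=gr.Image(), outputs=gr.Image(), title="Image flip demo")

if __name__ == "__main__":
    demo.launch()  # pass share=True to get a temporary public link
```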
Typical use cases

- Demoing and testing machine learning models
- Quickly building prototypes and small tools
- Letting non-technical users try out your model
- Serving as a visual front end for an API

Summary

Gradio is the simplest way to turn Python code into an interactive web app, and it is especially well suited to demos and sharing in machine learning and AI.
If you want others to try your model easily, Gradio is a highly recommended tool!
User
What is the difference between conditional and non-conditional (unconditional)?
Cursor
Great question! Here is an intuitive explanation of the difference between **conditional** and **unconditional** generation:

1. Unconditional generation

- Definition: the model samples directly from the overall data distribution \( p_{\text{data}} \), without any extra information or condition.
- Example: you ask a generative model to "draw any picture", and it might produce a dog, a cat, a landscape, a car, anything at all.
- Formula: \( z \sim p_{\text{data}} \)
- Key point: there is no input prompt; what gets generated is determined entirely by the data distribution.

2. Conditional generation

- Definition: given some condition (a text prompt, a class label, etc.), the model samples from the conditional data distribution \( p_{\text{data}}(\cdot \mid y) \).
- Example: you tell the model "draw a dog" and it only draws dogs; you say "draw a cat" and it only draws cats.
- Formula: \( z \sim p_{\text{data}}(\cdot \mid y) \), where \( y \) is the condition (e.g. "Dog", "Cat", "Landscape").
- Key point: there is an input prompt; the output is controlled by the condition, so it is more controllable and better matches what you asked for.

What the figure shows

- Left (unconditional): only "Dog" images are sampled; there is no way to ask for anything else.
- Right (conditional): images can be generated for different conditions \( y \) (such as "Dog", "Cat", "Landscape").

One-sentence summary

- Unconditional: you say nothing, and the model decides what to generate.
- Conditional: you give a condition (a sentence, a label, etc.), and the model generates according to your request.

If you want the AI to "follow your instructions", you need conditional generation; if you just want the AI to "improvise", that is unconditional generation.
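In practice the two modes are often combined at sampling time with classifier-free guidance. A minimal sketch, assuming a hypothetical noise-prediction network `model(x_t, t, y)` where `y=None` means "no condition":

```python
import torch

def guided_noise_prediction(model, x_t, t, y, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction toward the conditional direction."""
    eps_uncond = model(x_t, t, y=None)   # unconditional: the model "improvises"
    eps_cond = model(x_t, t, y=y)        # conditional: steered by the label/prompt y
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With `guidance_scale = 1` this reduces to ordinary conditional prediction; larger values trade diversity for closer adherence to the condition.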
If you would like a deeper mathematical or code-level explanation, feel free to ask!
User
Noise2Music: Text-conditioned Music Generation with Diffusion Models

Qingqing Huang*, Daniel S. Park*, Tao Wang†, Timo I. Denk†, Andy Ly†, Nanxin Chen†, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han† (*equal contribution, †core contributors; Google Research; William Chan and Zhifeng Chen: work done while at Google. Correspondence to: Qingqing Huang, qqhuang@google.com. Preprint, arXiv:2302.03917v2 [cs.SD], 6 Mar 2023.)

Abstract: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood, and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story: they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music
1. Introduction

Deep neural networks have been shown to have remarkable generative ability. In this work, we explore the generative capability of deep models for audio. We introduce Noise2Music, a diffusion-based (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) method of generating music from text prompts, and demonstrate its capability by generating 30-second long 24kHz music clips.

Modeling: We train a series of cascading diffusion models (Ho et al., 2022b), where the first model learns the generative task of producing a compressed representation of a 30-second waveform from a text prompt, and the second model learns to generate a 16kHz waveform conditioned on the compressed representation and optionally the text prompt. We have investigated two options for the intermediate representation: a log-mel spectrogram, or a 3.2kHz waveform. 1D U-Nets are used for learning the noise vectors for the diffusion model. The diffusion models are conditioned on user prompts in the format of free-form text, which are encoded by a pre-trained language model (LM) and ingested by the 1D U-Net layers via cross attention. A final super-resolution cascader is used to generate the 24kHz audio from the 16kHz waveform.

Data mining: A large amount of training data is crucial for producing high-quality samples from a deep generative model. We employ a data mining pipeline to construct a large-scale training dataset of diverse music audio clips, each paired with multiple descriptive text labels. The text labels for the audio are generated by employing a pair of pretrained deep models: first, we use a large language model to generate a large set of generic music descriptive sentences as caption candidates; we then use a pre-trained music-text joint embedding model to score each unlabeled music clip against all the caption candidates and select the captions with the highest similarity score as pseudo labels for the audio clip. We are able to annotate O(150K) hours of audio sources this way to constitute our training data.

MuLaMCap: As a by-product of this work, we introduce the MuLan-LaMDA Music Caption dataset (MuLaMCap), consisting of O(400K) music-text pairs obtained by annotating the music content from AudioSet by the process described above. Compared to the original AudioSet ontology, where 141 out of 632 label classes are music related, the captions in MuLaMCap come from a large vocabulary consisting of 4 million music descriptive sentences and phrases, which have a much higher degree of diversity and granularity. We expect this dataset to be utilized for applications beyond sound classification, e.g., music captioning, retrieval or generation.

Evaluation: We measure the quality of our text conditioned music generation model with two metrics: the Fréchet Audio Distance (FAD) (Kilgour et al., 2018), which measures how the quality of generated audio clips compares to that of two benchmark datasets, e.g. the music split of AudioSet (Gemmeke et al., 2017) and MagnaTagATune (Law et al., 2009); and the MuLan similarity score (Huang et al., 2022), which measures the semantic alignment between text prompts and the corresponding generated audio clips.

Generative ability: Our models demonstrate that they can go beyond simple music attribute conditioning, e.g., genre, instrument, era, and are able to handle complex and fine-grained semantics which can reflect soft attributes such as atmosphere, feeling or activity.
This is achieved by constructing a training dataset that not only relies on the metadata tags, but also leverages the pretrained music-text joint embedding model to ground the semantics to audio features. Cherry-picked examples of music generated from text prompts can be found at google-research.github.io/noise2music#table-1 and google-research.github.io/noise2music#table-4.
2. Related Work

Generative models: Deep generative models have a long and successful history in a wide range of domains. More recently, a significant amount of effort has been focused toward scaling up the dataset size for training models that can produce extremely high quality samples. Here we compile an incomplete list of such recent developments in text (Brown et al., 2020; Thoppilan et al., 2022), speech (Wang et al., 2018; Chen et al., 2021; Borsos et al., 2022), images (Ramesh et al., 2022; Saharia et al., 2022b; Yu et al., 2022), and audio (Briot, 2021; Dhariwal et al., 2020; MubertAI, 2022; Kreuk et al., 2022).

Diffusion models: Diffusion models, introduced in (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020), have shown the capability to generate high quality images (Ho et al., 2020; 2022b), audio (Yang et al., 2022; Popov et al., 2021) and video (Ho et al., 2022c;a). Cascaded diffusion models (Ho et al., 2022b; Saharia et al., 2022b), which use a series of diffusion models to generate a low-fidelity image and refine it in succession to produce a high-fidelity image, have been adapted to audio in this work.

Audio generation: Various methods have been employed to generate audio conditioned on external input. Some relevant examples are provided in the context of the text-to-audio task, in which text-conditioned spectrogram generation and spectrogram-conditioned audio generation have been intensively studied (Popov et al., 2021; Chen et al., 2021; Kong et al., 2021; Wu & Shi, 2021; Chen et al., 2022). Restricting our attention to audio generation based on descriptive text, text-conditioned general sound event generation has been approached with auto-regressive methods by AudioGen (Kreuk et al., 2022) as well as diffusion-based methods that operate on discrete audio codes by DiffSound (Yang et al., 2022). If we narrow our scope to music generation, Jukebox (Dhariwal et al., 2020), Mubert (MubertAI, 2022), and MusicLM (Agostinelli et al., 2023) have taken an autoregressive approach, while Riffusion (Forsgren & Martiros, 2022) employed diffusion for spectrogram generation.

Conditional signals in audio generation: Broadly speaking, two approaches have been taken on how the conditional signal, which steers the model to generate a specific style of music, is parameterized and communicated to an audio generation model. One approach is to project the signal to a pre-defined, interpretable embedding space: Jukebox (Dhariwal et al., 2020) relies on a fixed vocabulary of artists and genres mined from the training data to condition the decoder, while Mubert (MubertAI, 2022) matches the user prompt to a set of tags in a predefined vocabulary. The other, taken by works such as AudioGen (Kreuk et al., 2022) and MusicLM (Agostinelli et al., 2023), is to use a pre-trained text encoder to encode arbitrary user prompts.

Authors' Note: During the completion of this work, concurrent research which has overlap with this work has appeared (Schneider et al., 2023).
3. Methods

3.1. Diffusion models

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are powerful generative models that generate a sample by iteratively denoising random noise. Here we review the minimal amount of information on diffusion models required for understanding our work. More details can be found in the supplementary material.

The input to a diffusion model, which we consider to be a generative model of some sample space, is the conditioning signal c, a randomly sampled time step t, and a sample x_t obtained by corrupting the original sample x via a Gaussian diffusion process with a noise schedule parameterized by the standard deviation σ_t of the noise at time t. The range of time t is set to be [0, 1], from which it is uniformly sampled during training, and the diffusion is viewed to progress in the direction of increasing time. The dynamics of Gaussian diffusion are well understood: the distribution of x_t is completely parameterized by a single noise vector ε that belongs to a standard normal distribution, as x_t may be written as a function of the original sample, the deterministic noise schedule, and the noise vector ε, i.e., x_t(x, σ, ε), where it should be understood that σ is used to denote the entire noise schedule. The model ε_θ is trained to identify the noise vector given this input. The diffusion loss can be written as

\[ \mathbb{E}_{x, c, \epsilon, t}\left[ w_t \,\| \epsilon_\theta(x_t, c, t) - \epsilon \|^2 \right], \tag{1} \]

where w_t is a fixed weight function of choice.

Inference is carried out by taking random noise at time t = 1 and denoising it by utilizing the noise predictions given by the model. We use ancestral (or DDPM) sampling (Ho et al., 2020), which provides a flexible framework for inference allowing multiple parameters that can affect the quality of the generated sample. First, the level of stochasticity of the denoising process can be controlled by varying the stochasticity parameter γ of the sampler. Also, an arbitrary denoising schedule can be used, where one may choose an arbitrary partition of the interval 0 = t_0 < · · · < t_n = 1 to discretize the denoising steps. Thus a variety of choices present themselves when one wishes to train a diffusion model. We utilize multiple options with respect to the following elements, further details of which can be found in the supplementary material:

- Loss weight (w_t): simplified weight w_t = 1 (Ho et al., 2020) and sigma weight w_t = σ_t^2
- Variance schedule: linear (Ho et al., 2020) and cosine (Nichol & Dhariwal, 2021) schedules
- Stochasticity parameter: γ = 0 or 1
- Denoising step schedule

Classifier-free guidance (CFG): CFG (Ho & Salimans, 2022) is a method for improving the alignment between generated samples and conditional inputs. The conditional input of a portion of the training samples in each training batch is hidden from the network during training, enabling the network to learn how to predict the noise vector unconditionally and conditionally. At inference, the noise vectors with and without the conditional input are computed, and the final noise vector applied is set to w ε_θ(x_t, c) + (1 − w) ε_θ(x_t, ·) with w > 1. Dynamic clipping (Saharia et al., 2022b) is applied to avoid over-saturation due to CFG.

3.2. Architecture

We deploy the 1D Efficient U-Net, a one-dimensional version of the Efficient U-Net introduced in (Saharia et al., 2022b), for the diffusion model. The U-Net model, depicted in Figure 1, consists of a series of down-sampling and up-sampling blocks which are connected by residual connections. A down/up-sampling block consists of a down/up-sampling layer followed by a series of blocks obtained by composing 1D convolutional layers, self/cross-attention layers and combine layers. The combine layer enables a single vector to interact with a sequence of vectors, where the single vector is used to produce a channel-wise scaling and bias. These blocks closely follow the structure of the blocks of the Efficient U-Nets constructed in (Saharia et al., 2022b), with the two-dimensional convolutions replaced by their one-dimensional counterparts. The exact structure of the blocks is further elaborated in the supplementary material.

There are four possible routes of entry to the model. The stacked input and output both consist of sequences of some length T, while the diffusion time t is encoded into a single time embedding vector and interacts with the model through the aforementioned combine layers within the down- and up-sampling blocks. Given that we would like to produce a sequence of length T, the noisy sample x_t is always part of the stacked input on the left-most side of the figure, while the output is interpreted as the noise prediction ε. For the cascading models, the low-fidelity audio on which the model is conditioned can be up-sampled and stacked. Meanwhile, a sequence of vectors with an arbitrary length may interact with the blocks through cross-attention.
This is the route through which the text prompts are fed into the model. There is also room for the model to be conditioned on an aligned, but compressed representation of the sequence by addition at the bottom of the “U” of the U-Net. 3.3. Cascaded diffusion We train two kinds of diffusion models in this work to produce high-quality 30-second audio from text prompts. Following (Ho et al., 2022b), we train generator models that generate some intermediate representation of the final audio conditioned on a text prompt, and cascader models that produce the final audio based on the intermediate representation. For the intermediate representation, we consider both low-fidelity audio and spectrograms. 3.3.1. WAVEFORM MODEL Generator Model: The generator model generates 3.2kHz audio that is conditioned on the text input. A sequence of vectors derived from the text input is produced and fed into the network as a cross-attention sequence. Cascader Model: The cascader model generates 16kHz audio that is conditioned on both the text prompt and the low-fidelity audio generated by the generator model based on the text prompt. The text conditioning takes place via cross attention. Meanwhile, the low-fidelity audio is upsampled and stacked with xt and fed into the model. The upsampling is done by applying fast Fourier transform (FFT) to the low-fi audio sequence and then applying inverse FFT to obtain the high-fi audio from the low-fi Fourier coefficients. 3.3.2. SPECTROGRAM MODEL Generator Model: This model generates a log-mel spectrogram conditioned on the text input. The spectrgram has 80 channels and a frequency of 100 features per second. The input and output sequences now have a channel dimension in addition to the sequence dimension. The pixel values of the log-mel spectrogram are normalized to lie within [−1, 1]. Text conditioning is achieved through cross attention. Noise2Music Figure 1. U-Net architecture used for the diffusion models. A series of down-sampling, then up-sampling blocks is applied to the main input sequence of length T to produce an output sequence of length T. The outputs of the down-sampling blocks are added to the outputs of up-sampling blocks via residual connections. There are four modes of inputs to the model. The first is the (possibly stacked) main input of sequence length T, entering on the left-hand side of the diagram. T is the target sequence length. Second, there is a time embedding vector. Third, there can be a text embedding sequence that can be attended to by the down/up-sampling blocks via cross attention. Lastly, there can be input of length T /C that is aligned with the sequence of length T with compression rate C. Vocoder Model: The vocoder model generates 16kHz audio that is conditioned only on the spectrogram, which is treated as aligned input. The down and up-sampling rates of the U-Net model are tuned to achieve the compression rate of the spectrogram against the audio. 3.3.3. SUPER-RESOLUTION CASCADER A final light-weight cascader is used to generate 24kHz audio from the 16kHz waveform produced by either model. The 16kHz audio is up-sampled and stacked with xt as input to the model. Text conditioning is not used for this model. 3.4. Text understanding It has been shown in the context of text-to-image diffusion models (Saharia et al., 2022b; Rombach et al., 2021) that powerful text encoders are able to capture the complexity and compositionality of music descriptive text prompts. 
We adopt the T5 encoder (Raffel et al., 2020) and use the nonpooled token embedding sequence to condition the diffusion models. A thorough comparison with alternative contextual signals such as embeddings from different large language models, or a single vector embedding derived from CLIPlike (Radford et al., 2021) text encoders trained on musictext pairs (Huang et al., 2022; Manco et al., 2022) is beyond the scope of this work. 3.5. Pseudo labeling for music audio Having large scale training data is a necessary component for ensuring the quality of generative deep neural networks. For example, Imagen (Saharia et al., 2022b) was trained on O(1B) image-text pairs. Despite the fact that music content is widely available, high quality paired music-text data is scarce, especially in the case of free-form text that describes the music attributes beyond high-level metadata such as title, artist name, album name, and release year. To generate such music-text pairs, we take a pseudo-labeling approach via leveraging MuLan (Huang et al., 2022), a pre-trained text and music audio joint embedding model, together with LaMDA (Thoppilan et al., 2022), a pre-trained large language model, to assign pseudo labels with finegrained semantic to unlabeled music audio clips. We first curate several music caption vocabulary sets, each consisting of a large list of music descriptive texts. As demonstrated below, these texts vastly differ from the captions from the label classes in standard music classification benchmarks, e.g., MagnaTagATune, FMA, and AudioSet, in their scale and the fine-grained semantic granularity. We consider the following three caption vocabularies: LaMDA-LF: We prime the large language model LaMDA to describe a list of 150k popular songs provided the song title and artist names. The precise prompt template is provided in the supplementary material. We then process the LaMDA responses into 4 million clean long-form sentences that are likely to be describing music. We use LaMDA as our LM of choice because it is trained for dialogue applications, and expect the generated text to be closer to user prompts for generating music. Rater-LF: We obtain 10,028 rater written captions from MusicCaps (Agostinelli et al., 2023), and split each caption into individual sentences. This produces 35,333 musicdescribing long-form sentences. Rater-SF: From the same evaluation set above, we collect all the short-form music aspect tags written by the raters, which amounts to a vocabulary of size 23,906. Examples of the caption vocabulary are presented in Table 1. Noise2Music Table 1. Caption vocabulary examples. Vocabulary Examples LaMDA-LF (4M) “A light, atmospheric drum groove provides a tropical feel.”, “A light EDM drumbeat carries a bass guitar, strings, a simple piano, and percussion in the background.” Rater-LF (35k) “A Scottish tenor drum plays a marching beat.” “A bass guitar with a punchy sound contrasts the guitar.” Rater-SF (24k) “50’s pop”, “wide passionate male vocal”, “vintage vibes”, “patriotic mood”, “vivacious cello”, “exercise music” We use the MuLan model as a zero-shot music classifier to assign captions from the vocabulary to unlabeled audio clips. MuLan consists of a text encoder and an audio encoder, which are trained on a large amount of highly noisy textmusic pairs with a contrastive learning scheme. 
Similar to how CLIP (Radford et al., 2021) co-embeds image and text, a 10-second long music audio clip and a sentence that describes the music are placed closely in the same semantic embedding space learned by MuLan. For each audio clip, we compute its audio embedding by first segmenting the clip into non-overlapping 10-second windows, and computing the average of the MuLan audio embeddings of each window. The text embeddings of all the candidate captions in the vocabulary are also computed. The top K captions that are closest to the audio in the embedding space are selected for each clip. We compute the frequency counts of the captions among all clips, and for each clip further sample K0 out of the K captions with probabilities inverse to the frequency counts. This last sampling step serves to balance the label distribution and increase the diversity of the captions. We use K = 10 and K0 = 3. As a warm up for pseudo-labeling our large training set, we produce MuLaMCap, a music captioning dataset derived from the AudioSet (Gemmeke et al., 2017), by applying this pseudo-labeling method to 388,262 / 4,497 examples from the AudioSet train / test sets which have labels in the music subtree of AudioSet ontology. Each 10-second audio with music content is associated with 3 captions from the LaMDA-LF vocabulary, 3 captions from Rater-LF, and 6 short form captions from Rater-SF. 3.6. Training data mining To assemble a large-scale collection of audio-text pairs, we collect approximately 6.8M music audio source files. From each soundtrack, we extract six non-overlapping 30-second clips. This amounts to nearly 340k hours of music. Audio is sampled at 24kHz for training the super-resoluton model and 16kHz for training all other models. For each soundtrack, we consider three types of noisy text labels—the song title, named entity tags associated with the soundtrack (e.g., genre, artist name, instrument), and the pseudo labels. We use three pseudo labels from the Table 2. Models trained in this work. The token length refers to the token length of the text prompts at training time. Model # Params # Training Token Loss Noise steps length weight schedule Waveform generator 724M 1.6M 64 Simplified Cosine Waveform cascader 487M 460k 64 Simplified Linear Spectrogram generator 745M 1.8M 96 Sigma Linear Spectrogram vocoder 25.7M 840k - Simplified Linear Super-resolution cascader 81M 270k - Simplified Linear LaMDA-LF vocabulary, and six pseudo labels from the Rater-SF vocabulary. The pseudo labels from LaMDA-LF and Rater-SF provide complementary information to the named entity tags. Compared to the objective and highlevel tags, the pseudo labels include subjective descriptions related to activity (“music for highway driving”) and mood (“a laid back feel”), and also include compositional elements with fine-grained semantics. Since we evaluate our model on MusicCaps (Agostinelli et al., 2023) from which the sentences of Rater-LF were derived, we exclude any pseudo labels from the Rater-LF vocabulary from our training data. We include a small amount of high-quality audio to the large pseudo-labeled training set. The audio is taken from a subset of music tracks, which does not require attribution, from an internally maintained music library. The music tracks are segmented to non-overlapping 30-second clips, while the metadata of the tracks are concatenated to form the text prompt of the audio. This contributes O(300) hours of annotated audio to our training data. 4. Experiments and Results 4.1. 
Model training details We train four 1D U-Net models, the waveform generator and cascader, and the spectrogram generator and vocoder for this work. We have summarized some basic information about the models in Table 2, while we relegate further details about the models to the supplementary material. We note that we found the sigma-weighted loss, which weighs the loss more heavily on the “back end” of the denoising schedule, crucial for convergence of the spectrogram generator. All the models, with the exception of the vocoder, are trained on audio-text pairs, while the vocoder is only trained on audio. For each audio sample, a text batch is formed. The three long prompts constitute three independent elements of the text batch, while the shorter prompts are concatenated, then segmented into a set token length reported in Table 2 and added to the text batch. For each audio clip, a random element of the corresponding text batch is selected at training time and fed to the model as the paired text to the audio. The models are trained with Adam optimization with β1 = 0.9 and β2 = 0.999. A cosine learning rate schedule with the end point set to 2.5 M steps is used with peak learning Noise2Music Table 3. Inference parameters for the models used in this work. Model Denoising Stochasticity CFG scale step schedule parameter Waveform generator Front-heavy 0 10.0 Waveform cascader Front-heavy 1 5.0 Spectrogram generator Back-heavy 0 5.0 Spectrogram vocoder Front-heavy 0 N/A Super-resolution cascader Front-heavy 0 N/A rate 1e-4 and 10k warm-up steps. An exponential moving average (EMA) of the model parameters are taken with decay rate 0.9999 and used at inference time. The superresolution cascader is trained with batch size 4096, while all other models use batch size 2048. To apply CFG at inference time, we occlude the text prompts for 10% of the samples in each training batch. For these samples, the output of the cross attention layers are set to zero. While the generator models use self-attention, the cascaders and vocoder do not. Thus while we need to train the generator models on the entire 30-second representation of the audio, the cascader and vocoder models are trained on 3 to 4-second randomly sampled snippets. Following (Ho et al., 2022b), two augmentations are applied at training time for the cascader/vocoder models. One is to randomly corrupt the conditioning low-fidelity audio or the spectrogram input by applying diffusion noise. To do so, a random diffusion time is chosen within [0, tmax] and applied to the intermediate representation of the audio, i.e., the upsampled low-fi audio or the spectrogram. For the cascader tmax is set to 0.5 while for the vocoder and super-resolution cascader it is set to 1.0. The other is blur augmentation. For the cascader model, a 1D blur kernel of size 10 is used with a Gaussian blur kernel whose standard deviation ranges from 0.1 to 5.0. For the vocoder model, a 2D 5x5 blur kernel is applied with the standard deviation ranging from 0.2 to 1.0. 4.2. Model inference and serving 4.2.1. MODEL INFERENCE We adjust three inference hyperparameters, the denoising schedule, the stochasticity parameter, and the CFG scale. The parameters used for each model are listed in Table 3. We parameterize the denoising step schedule by the time step sizes [δ1, · · · , δN ] that translate into denoising steps introduced in section 3.1 via accumulation: tn = Pn i=1 δn. The inference cost is proportional to the number of timesteps. 
Thus optimizing the time step schedule with a fixed inference cost amounts to distributing a fixed number of time steps that add up to the total time, 1. The parameter space for the denoising step schedule is extremely large. Nevertheless, we experiment with three different kinds of schedules we denote “front-heavy,” “uniform,” and “back-heavy.” The front-heavy schedule allots many steps to the “front” of the Table 4. Inference time cost on four TPU V4 for four samples. Model time/step (ms) steps time (s) Waveform generator 25.0 1000 25.0 Waveform cascader 75.0 800 60.0 Spectrogram generator 8.3 1000 8.3 Spectrogram vocoder 29.9 100 0.3 Super-resolution cascader 71.7 800 57.3 schedule near t = 0 whereas the “back-heavy” schedule expends more steps near t = 1. The uniform schedule uses evenly-spaced time steps. The exact schedules used are produced in the supplementary material. 4.2.2. MODEL SERVING We serve the models on Google Cloud TPU V4, where each service request generates four 30-second music clips. We apply GSPMD (Xu et al., 2021) to partition the model on four TPU V4 devices, reducing the serving time by more than 50%. Table 4 shows the inference time cost when the model is served on four TPU V4 to produce four samples. 4.3. Evaluation 4.3.1. PARAMETER SELECTION FOR THE MODELS Model parameters, including the architecture, training hyperparameters, checkpoints and inference parameters are selected in a heuristic fashion. A small set of dev prompts, independent of the prompts in any of the evaluation sets presented, are devised by the authors, which are used to generate audio from the trained models. Model parameters are selected based on the quality of the generation results, evaluated according the judgement of the authors, as well as practical limitations such as the availability of computational resources and time. Evaluations are conducted on 16kHz waveforms—the super-resolution cascader is not utilized to generate waveforms for producing evalution metrics. 4.3.2. EVALUATION METRICS We measure the quality of our text conditioned music generation model with two kinds of metrics: the Frechet Audio ´ Distance (FAD) (Kilgour et al., 2018) and the MuLan similarity score (Huang et al., 2022). FAD measures how the audio quality of the generated audio examples compare to that of a set of reference audio clips. In particular, an audio encoder is used to compute the audio embeddings of both the set of generated audio examples and the background audio clips in the evaluation dataset. Assuming that the distribution of the embeddings from each set are Gaussian, and the Freechet distance between the ´ two distributions are computed from the mean embedding vectors and the correlation matrix of the two sets. Three audio encoders are utilized for computing the FAD Noise2Music metric: a VGG1 audio event embedding model (Hershey et al., 2017) trained on YouTube-8M (Abu-El-Haija et al., 2016); the Trill (Shor et al., 2020) model2 , a convolutional speech representation learning model trained on speech containing clips from AudioSet; and the MuLan audio encoder. VGG and Trill produce frame-wise embeddings while MuLan’s embeddings are clip-wise. Since the audio encoders are trained on different datasets and tasks, FAD computed with those audio representations focus on different aspects of the audio. We hypothesize that FADVGG evaluates the general audio quality, FADTrill is more indicative of the vocal quality, and FADMulan captures global musical semantics. 
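A minimal sketch of the Fréchet distance computation described above, assuming two arrays of per-clip embeddings (rows are clips); this is an illustration, not the paper's evaluation code, and in the paper the embeddings would come from the VGG, Trill, or MuLan encoders:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two embedding sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```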
The contrastive model MuLan provides a way to quantify the similarity between audio-text pairs as well as audio-audio pairs. For a given text-audio or audio-audio pair, we define the MuLan similarity as the cosine similarity between the MuLan embeddings of the two entities. For a given evaluation set of music-text pairs, we compute the average similarity between the audio generated from the text prompts of the dataset and either the text or the ground-truth audio associated with the text. As a reference, we also compute the average MuLan similarity of the evaluation set against the ground-truth audio, as well as against a "random" audio pairing obtained by shuffling the ground-truth audio.

4.3.3. EVALUATION DATASETS
We report the FAD and MuLan similarity with respect to the following three datasets consisting of text-music pairs.

First, we re-purpose the audio tagging benchmark MagnaTagATune (MTAT) (Law et al., 2009) to evaluate the 29-second-long music clips generated by our models. MTAT contains 25,863 music clips, 21,638 of which are associated with multiple tags from a vocabulary of 188 music tags. We only utilize these 21,638 examples for evaluation; for each of them, we concatenate the music tags into a single string that we use as the associated text prompt. During evaluation, we generate a single 29-second-long audio clip for the prompt associated with each of the 21,638 examples.

Second, we use AudioSet-Music-Eval, the music portion of AudioSet (Gemmeke et al., 2017). There are 1,482 music-related examples in the evaluation split of AudioSet, where each 10-second clip is associated with labels from the non-trivial part of the music subtree of the AudioSet ontology. For each example in this set, we use the concatenated labels as the text prompt to generate a 30-second-long clip, the middle 10-second portion of which is used for evaluation.

Lastly, we evaluate on MusicCaps (Agostinelli et al., 2023), which consists of 5.5K 10-second clips from AudioSet paired with rater-written captions. We use the rater captions as text prompts, and report the metrics on the middle 10-second excerpts of the generated audio samples.

[1] tfhub.dev/google/vggish/1
[2] tfhub.dev/google/nonsemantic-speech-benchmark/trill/3

Table 5. The FAD between the reference dataset audio and the generated audio with prompts from the reference dataset. Three audio encoders (VGG, Trill and MuLan) have been used to measure FAD. A lower value indicates better proximity of quality.
Dataset/Model | FAD_VGG | FAD_Trill | FAD_MuLan
MusicCaps (Agostinelli et al., 2023)
Riffusion (Forsgren & Martiros, 2022) | 13.371 | 0.763 | 0.487
Mubert (MubertAI, 2022) | 9.620 | 0.449 | 0.366
MusicLM (Agostinelli et al., 2023) | 4.0 | 0.44 | -
Noise2Music Waveform | 2.134 | 0.405 | 0.110
Noise2Music Spectrogram | 3.840 | 0.474 | 0.180
AudioSet-Music-Eval
Noise2Music Waveform | 2.240 | 0.252 | 0.193
Noise2Music Spectrogram | 3.498 | 0.323 | 0.276
MagnaTagATune
Noise2Music Waveform | 3.554 | 0.352 | 0.235
Noise2Music Spectrogram | 5.553 | 0.419 | 0.346

4.4. Evaluation results
In Table 5, we report the FAD of our models on the three evaluation datasets, and compare them with the baseline models Riffusion [3] and Mubert [4]. In Table 6, we report the average audio-text and audio-audio MuLan similarity scores between the generated audio and the evaluation datasets. We also include the metrics computed for the ground-truth audio, as well as for the shuffled ground truth. The evaluation metrics should be interpreted with care, since our results have potential advantages over the baselines presented.
First, there is a possibility that our training data distribution is closer to the evaluation datasets than that of the baselines. Also, one may suspect that the MuLan-based metrics are biased towards our models, since the MuLan model has been used to pseudo-label our data. The reader should thus be cautious about drawing conclusions on the effectiveness of the methods used in this paper compared to those of the baselines based on these metrics. The metrics, however, are indeed representative of the performance of the trained models themselves in the AudioSet domain and provide a quantitative measure of final model performance. This ideally should hold true for the MuLan-based metrics as well, if we assume that the MuLan model, much like CLIP (Radford et al., 2021), has learned an unbiased, faithful representation of text and audio data.

We also conduct human listening tests to measure semantic alignment, with a setup identical to that used in (Agostinelli et al., 2023), now with five sources, listed in Table 7. The participants of the test are presented with a text caption from the MusicCaps evaluation set and 10-second clips from two different sources, then asked which clip is better described by the text of the caption on a 5-point Likert scale. We collect 3k ratings, with each source involved in 1.2k pair-wise comparisons. The head-to-head comparisons between each pair of sources are evenly distributed. In Table 7, we report the total number of "wins" each model achieved among the 1.2k comparisons it was subject to. Our waveform model shows performance comparable to MusicLM, while being behind the ground-truth audio.

[3] We query the Mubert API at github.com/MubertAI as of Dec 24, 2022 to generate 10-second audio clips given the text prompts in the evaluation datasets.
[4] We ran inference with riffusion-model-v1 provided by github.com/riffusion/riffusion-app as of Dec 24, 2022 to generate 10-second audio clips.

Table 6. The average MuLan similarity between the generated audio and either the text prompt or the ground-truth audio for each evaluation set. A higher value indicates better semantic alignment.
Dataset/Model | audio ↔ gt-text | audio ↔ gt-audio
MusicCaps (Agostinelli et al., 2023)
Ground Truth Audio | 0.452 | (1.000)
Randomly Shuffled Audio | 0.248 | 0.278
Riffusion (Forsgren & Martiros, 2022) | 0.342 | 0.312
Mubert (MubertAI, 2022) | 0.323 | 0.280
MusicLM (Agostinelli et al., 2023) | 0.51 | -
Noise2Music Waveform | 0.478 | 0.489
Noise2Music Spectrogram | 0.434 | 0.464
AudioSet-Music-Eval
Ground Truth Audio | 0.470 | (1.000)
Randomly Shuffled Audio | 0.274 | 0.265
Noise2Music Waveform | 0.563 | 0.429
Noise2Music Spectrogram | 0.490 | 0.389
MagnaTagATune
Ground Truth Audio | 0.498 | (1.000)
Randomly Shuffled Audio | 0.277 | 0.315
Noise2Music Waveform | 0.518 | 0.479
Noise2Music Spectrogram | 0.459 | 0.444

Table 7. The number of wins in pair-wise comparisons of the human listening study. Higher indicates better semantic alignment.
Source | MusicCaps (ground-truth audio) | Noise2Music Waveform | MusicLM | Mubert | Riffusion
Wins | 959 | 718 | 692 | 254 | 308

4.5. Inference parameter ablations
We vary the inference parameters of the models and observe their effects. We note that the ablations were conducted with model checkpoints that are slightly less trained than the checkpoints used to produce the evaluation numbers in the previous subsection. The ablations are conducted with respect to the base parameters listed in Table 3.
In Figure 2, we depict how FAD measured with VGG and the MuLan similarity score change as the denoising step schedule and the CFG scale are varied during inference. Only one parameter is varied at a time, and all other parameters stay fixed at the baseline values. We find an overall correlation between the FAD metric and the similarity score, except in the case of the cascader, where FAD can get worse while the similarity score improves. We also find that there is an optimal CFG scale: too large a CFG scale hurts the generation quality. It can also be seen that the generator CFG scale is a bigger factor than the denoising schedule of the generator, while the impact of the cascader denoising schedule is extremely large.

Figure 2. We plot how FAD_VGG and the MuLan similarity score vary as inference parameters are adjusted. The CFG parameters take values from [1, 2, 5, 10, 15], while "B"ack-heavy, "U"niform and "F"ront-heavy denoising step schedules have been applied.

Figure 3. Quality metrics of the generated examples plotted against computational cost parameterized by inference time.

4.6. Inference cost and performance
In Figure 3, we plot the quality metrics against the inference cost measured by inference time. We reduce or increase the number of inference steps of the generator or the cascader/vocoder and inverse-proportionally scale the step sizes in the inference schedule. We find that the effect of increasing the inference cost of the generator is mixed, while the generative quality generally improves with more cascader/vocoder inference steps.

5. Qualitative analysis
Content representation: We present generation examples at google-research.github.io/noise2music#table-2 to illustrate that the model is able to ground the musical aspects represented in the text prompt. In particular, we find that the genre, instrument, mood, vocal traits, and era of music implied in the text are manifested in the generated music.

Creative prompts: While our models often struggle to produce high-quality audio from out-of-distribution prompts, they are nevertheless able to generate some interesting examples. In google-research.github.io/noise2music#table-3, we have collected examples of creative prompts for which the model was able to generate quality music.

6. Discussion
Spectrogram vs. waveform approach: The spectrogram and waveform approaches have their comparative advantages. The spectrogram models employed in this work are much cheaper to train and serve than the waveform models, and are more scalable in time length. This is because the sequence length of the spectrogram is much shorter than that of a low-fi waveform. In addition, the spectrogram contains high-frequency information which is missing in the low-fidelity audio. Meanwhile, the waveform model produces interpretable representations at every step of the generation process, making the model easy to debug and tune. This is partially responsible for our ability to train the waveform models with more ease.

Future directions: While we have demonstrated the potential of text-prompt-based music generation, there is much room for improvement beyond our work. Increasing model interpretability, further improving text-audio alignment, reducing training and inference cost, and scaling up the generation length of the audio are just a few directions in which our work needs to be improved.
Another interesting direction is to fine-tune the models trained in this work for diverse audio tasks, including music completion and modification, as was done for images by Saharia et al. (2022a).

7. Broader Impact
We believe our work has the potential to grow into a useful tool for artists and content creators that can further enrich their creative pursuits. To live up to this promise, more work is needed with musicians and other stakeholders to develop the models into a meaningful co-creation tool.

We acknowledge the limitations of the proposed model. In particular, large generative models learn to imitate patterns and biases inherent in the training sets, and in our case, the model can propagate the potential biases built into the text and music corpora used to train our models. Such biases can be hard to detect, as they manifest in often subtle, unpredictable ways, which are not fully captured by our current evaluation benchmarks. Demeaning or other harmful language may be generated in model outputs, due to learned associations or by chance.

Beyond this, we recognize that musical genres are complex and that key musical attributes are contextual and change over time. Training data reflect a limited corpus of musical samples and genres, given the uneven recording and digitization of samples from global musical cultures. How music is categorized and labeled can essentialize genres, and these labels may be constructed and applied without the participation of communities. When readers examine the released generation examples on the accompanying website, we caution them not to presume that each sample can generalize to an entire musical genre, or that one label can capture the diversity of musical genres produced within a region (e.g., "Latin music" contains a broad range of cultures and styles). Moreover, musical samples may sound "authentic" to those outside these communities, as nuances in musical traditions need trained ears/cultural knowledge to recognize. In generating vocals, there may be possible caricatures, "mock accents," parodies, or other demeaning linguistic harms (e.g., "mock Black singing" in a request for "soulful vocals" or "mock Spanish" in a Latin music request) that arise in text prompts requesting cultural or religious musical genres, or genres that emerged as part of the political struggles of certain communities (e.g., Black American music, Nueva canción, Chicano folk, Brazilian Tropicalismo, Sufi Qawwali).

As with any other technology, the results of our research can be misused or abused. We acknowledge the risk of potential misappropriation when the created content exactly matches examples in the training data. In accordance with responsible model development practices, duplication checks are a built-in part of our current pipeline for producing and releasing examples, and will continue to be for any future work. Efforts to identify potential safety issues and address them are important components of improving these generative models. Until there is a clearer understanding of the limitations and risks, we do not intend to release the model.

Acknowledgements
We are grateful to Aren Jansen for building MuLan, which is an indispensable component of this project. We give thanks to Austin Tarango, Fernando Diaz, Kathy Meier-Hellstern, Molly FitzMorris, and Renee Shelby for helping us incorporate important responsible AI practices around this project.
We acknowledge support from Blake Cunningham, Cara Adams, for giving us advice along the project and assisting us with the publication process. We appreciate valuable feedback and support from Alex Ku, Andrea Agostinelli, Ankur Bapna, Chen Liang, Ed Chi, Ekin Dogus Cubuk, Erica Moreira, Esteban Real, Heiga Zen, Jaehoon Lee, James Qin, Nathan Park, Stephen Kelly, Thang Luoung, Weizhe Hua, Ye Jia, Yifeng Lu, Yonghui Wu, Yu Zhang, Yuma Koizumi. Special thanks to authors of MusicLM for helpful discussions and cooperation, and especially for sharing their evaluation set and manuscript before publication. Noise2Music References Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube8m: A large-scale video classification benchmark, 2016. URL https://arxiv.org/abs/1609.08675. Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., and Frank, C. Musiclm: Generating music from text. arxiv preprint arxiv:2301.11325, 2023. Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Teboul, O., Grangier, D., Tagliasacchi, M., and Zeghidour, N. Audiolm: a language modeling approach to audio generation, 2022. URL https://arxiv.org/abs/2209.03143. Briot, J.-P. From artificial neural networks to deep learning for music generation: history, concepts and trends. Neural Computing and Applications, 33(1):39–65, 2021. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020. Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. Chen, Z., Tan, X., Wang, K., Pan, S., Mandic, D., He, L., and Zhao, S. Infergrad: Improving diffusion models for vocoder by considering inference in training. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8432–8436. IEEE, 2022. Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., and Sutskever, I. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020. Forsgren, S. and Martiros, H. Riffusion - Stable diffusion for real-time music generation. 2022. URL https: //riffusion.com/about. Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. IEEE, 2017. Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R., and Wilson, K. Cnn architectures for large-scale audio classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017. URL https:// arxiv.org/abs/1609.09430. Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. 
Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47-1, 2022b.
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022c.
Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J. Y., and Ellis, D. P. W. MuLan: A joint embedding of music audio and natural language. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022.
Kilgour, K., Zuluaga, M., Roblek, D., and Sharifi, M. Fréchet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018.
Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021.
Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Défossez, A., Copet, J., Parikh, D., Taigman, Y., and Adi, Y. AudioGen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.
Law, E., West, K., Mandel, M. I., Bay, M., and Downie, J. S. Evaluation of algorithms using games: The case of music tagging. In ISMIR, pp. 387-392, 2009.
Manco, I., Benetos, E., Quinton, E., and Fazekas, G. Contrastive audio-language learning for music, 2022. URL https://arxiv.org/abs/2208.12208.
MubertAI. https://github.com/mubertai/mubert-text-to-music. 2022.
Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162-8171. PMLR, 2021.
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. Grad-TTS: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pp. 8599-8608. PMLR, 2021.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1-67, 2020.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models, 2021.
Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., and Norouzi, M. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1-10, 2022a.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022b.
Schneider, F., Jin, Z., and Schölkopf, B. Moûsai: Text-to-music generation with long-context latent diffusion, 2023.
Shor, J., Jansen, A., Maor, R., Lang, O., Tuval, O., de Chaumont Quitry, F., Tagliasacchi, M., Shavitt, I., Emanuel, D., and Haviv, Y. Towards Learning a Universal Non-Semantic Representation of Speech. In Proc.
Interspeech 2020, pp. 140-144, 2020. doi: 10.21437/Interspeech.2020-1242.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
Wang, Y., Stanton, D., Zhang, Y., Ryan, R.-S., Battenberg, E., Shor, J., Xiao, Y., Jia, Y., Ren, F., and Saurous, R. A. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning, pp. 5180-5189. PMLR, 2018.
Wu, S. and Shi, Z. ItôTTS and ItôWave: Linear stochastic differential equation is all you need for audio generation. arXiv preprint arXiv:2105.07583, 2021.
Xu, Y., Lee, H., Chen, D., Hechtman, B. A., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M., Pang, R., Shazeer, N., Wang, S., Wang, T., Wu, Y., and Chen, Z. GSPMD: General and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021. URL https://arxiv.org/abs/2105.04663.
Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y., and Yu, D. Diffsound: Discrete diffusion model for text-to-sound generation. arXiv preprint arXiv:2207.09983, 2022.
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.

A. Diffusion models
In this section, we review some relevant background on diffusion models and set up the notation used in the main text of the paper. We follow (Saharia et al., 2022b) in our presentation.

A diffusion model assumes a set-up where a sample x from a distribution is corrupted by a Gaussian diffusion process with a noise schedule, represented by a monotonically increasing standard deviation σ_t at time t. More precisely, the distribution of the corrupted sample x_t at time t, conditioned on x or on x_s for s < t, is given by

$q(x_t \mid x) = \mathcal{N}(\alpha_t x, \sigma_t^2 I), \qquad q(x_t \mid x_s) = \mathcal{N}\big((\alpha_t/\alpha_s)\, x_s, \sigma_{t|s}^2 I\big). \quad (2)$

We define a set of variables that make the equations cleaner:

$\alpha_t = \sqrt{1 - \sigma_t^2}, \qquad \lambda_t = \ln(\alpha_t^2 / \sigma_t^2), \qquad \sigma_{t|s}^2 = (1 - e^{\lambda_t - \lambda_s})\, \sigma_t^2, \qquad \tilde{\sigma}_{s|t}^2 = (1 - e^{\lambda_t - \lambda_s})\, \sigma_s^2. \quad (3)$

The time variables s and t are assumed to lie in the range [0, 1]. As seen in the first equation, the randomness of the corrupted sample x_t is encoded in a single noise vector ε ∼ N(0, I). The aim of the diffusion model is to model this noise vector, given the corrupted sample, the time t and the context c: ε_θ(x_t, c, t).

Once the model is trained, we sample noise at time t = 1, i.e., x_1 ∼ N(0, I), and reverse the diffusion process to produce a "clean" sample x_0 from the original distribution. We employ ancestral (or DDPM) sampling (Ho et al., 2020) to do so. In this sampling method, we select time steps 0 = t_0 < ··· < t_N = 1 and reverse the diffusion process by applying the update rule that obtains x_s from x_t:

$x_s = \frac{\alpha_s}{\alpha_t} x_t - (1 - e^{\lambda_t - \lambda_s}) \cdot \frac{\alpha_s}{\alpha_t} \cdot \sigma_t \cdot \epsilon_\theta(x_t, c, t) + \tilde{\sigma}_{s|t}^{1-\gamma} \cdot \sigma_{t|s}^{\gamma} \cdot \tilde{\epsilon}, \quad (4)$

where s = t_{N-n} and t = t_{N-n+1} at the n-th update step. Here, ε̃ is a random standard normal vector sampled at each inference step, and γ is a hyperparameter that controls the stochasticity of the diffusion process. As γ increases, a larger variance is introduced at a given inference step, since σ_{t|s} > σ̃_{s|t} for t > s.
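Written out in code, one update of Eq. (4) looks roughly as follows. This is our own illustrative NumPy sketch, not the paper's implementation: eps_model stands in for the trained network ε_θ, and sigma_sq_fn for a noise schedule such as those in Appendix A.1 below.

```python
import numpy as np

def ancestral_step(x_t, t, s, eps_model, c, sigma_sq_fn, gamma=0.0, rng=None):
    """One reverse-diffusion update x_t -> x_s (s < t), following Eq. (4).

    eps_model(x, c, t): trained noise predictor; sigma_sq_fn(t): sigma_t^2 under the
    chosen noise schedule; gamma: stochasticity parameter.
    """
    rng = rng or np.random.default_rng()
    sig2_t, sig2_s = sigma_sq_fn(t), sigma_sq_fn(s)
    alpha_t, alpha_s = np.sqrt(1.0 - sig2_t), np.sqrt(1.0 - sig2_s)
    lam_t, lam_s = np.log(alpha_t**2 / sig2_t), np.log(alpha_s**2 / sig2_s)
    sig2_t_given_s = (1.0 - np.exp(lam_t - lam_s)) * sig2_t        # sigma^2_{t|s}
    sig2_s_given_t = (1.0 - np.exp(lam_t - lam_s)) * sig2_s        # tilde sigma^2_{s|t}
    mean = (alpha_s / alpha_t) * x_t \
        - (1.0 - np.exp(lam_t - lam_s)) * (alpha_s / alpha_t) * np.sqrt(sig2_t) * eps_model(x_t, c, t)
    noise = rng.standard_normal(x_t.shape)
    return mean + (sig2_s_given_t**0.5)**(1.0 - gamma) * (sig2_t_given_s**0.5)**gamma * noise

# Toy usage with a dummy noise predictor and the linear schedule of Appendix A.1.
linear_sigma_sq = lambda t: (0.02 - 0.0001) * t + 0.0001
dummy_eps = lambda x, c, t: np.zeros_like(x)
x = np.random.default_rng(0).standard_normal(16)
x = ancestral_step(x, t=1.0, s=0.995, eps_model=dummy_eps, c=None, sigma_sq_fn=linear_sigma_sq)
```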
A.1. Noise schedules
We utilize two noise schedules: the linear (Ho et al., 2020) and cosine (Nichol & Dhariwal, 2021) schedules. The linear schedule (Ho et al., 2020) is defined such that σ_t² increases linearly with respect to time:

$\sigma_t^2 = (\sigma_1^2 - \sigma_0^2)\, t + \sigma_0^2. \quad (5)$

We use the initial and final values σ_0² = 0.0001 and σ_1² = 0.02 in this work. We use a slight variant of the cosine schedule (Nichol & Dhariwal, 2021):

$\alpha_t = \cos(a t + b), \quad (6)$

where a = arctan(e^{10}) − arctan(e^{−10}) and b = arctan(e^{−10}) are chosen so that λ_{t=0} = 20 and λ_{t=1} = −20.

A.2. Denoising step schedules
Here we list the actual denoising step schedules used at inference time for the generator and cascader/vocoder models. These schedules have been hand-tuned by trial and error, as explained in Section 4.3.1. The numerical values of the time step sizes for each model and schedule used in the ablations are given in Python notation in Table 8. The schedules used for final evaluation are italicized. Notice that for the spectrogram generator, the back-heavy schedule has been modified further for evaluation. A plot of the denoising time reached at a given inference step for each model schedule is shown in Figure 4.

Table 8. The denoising time step values for the denoising schedules. The schedules selected for use with each model are italicized.
Waveform generator
  Front-heavy: front_heavy = [0.01 / 200] * 200 + [0.04 / 400] * 400 + [0.15 / 200] * 200 + [0.3 / 150] * 150 + [0.5 / 50] * 50
  Back-heavy: back_heavy = front_heavy[::-1]
  Uniform: uniform = [1.0 / 1000] * 1000
Waveform cascader
  Front-heavy: front_heavy = [0.05 / 400] * 400 + [0.15 / 200] * 200 + [0.3 / 150] * 150 + [0.5 / 50] * 50
  Back-heavy: back_heavy = front_heavy[::-1]
  Uniform: uniform = [1.0 / 800] * 800
Spectrogram generator
  Front-heavy: front_heavy = [0.01 / 400] * 400 + [0.04 / 800] * 800 + [0.15 / 400] * 400 + [0.3 / 300] * 300 + [0.5 / 100] * 100
  Back-heavy: back_heavy = front_heavy[::-1]
  Uniform: uniform = [1.0 / 1000] * 1000
  Back-heavy (eval): back_heavy_eval = [0.3 / 50] * 50 + [0.3 / 150] * 150 + [0.2 / 300] * 300 + [0.2 / 500] * 500
Spectrogram vocoder
  Front-heavy: front_heavy = [0.05 / 50] * 50 + [0.15 / 30] * 30 + [0.3 / 15] * 15 + [0.5 / 5] * 5
  Back-heavy: back_heavy = front_heavy[::-1]
  Uniform: uniform = [1.0 / 100] * 100
Super-resolution cascader
  Front-heavy: front_heavy = [0.05 / 400] * 400 + [0.15 / 200] * 200 + [0.3 / 150] * 150 + [0.5 / 50] * 50

Figure 4. The denoising time reached at a given inference step for the four denoising time step schedules used in the paper. A front-heavy schedule expends most of its steps near t = 0, while a back-heavy schedule expends most of its steps near t = 1.

B. Model architecture
We present some details of the architecture of the models used in the paper. The overall structure of the U-Net is depicted in Figure 1. After an entry convolutional layer is applied to the input, the input is passed through a series of down-sampling and up-sampling layers with convolutional blocks in between. Here we expand upon how the convolutional blocks in the down-sampling/up-sampling portion of the U-Nets are structured, and explain how each network utilized in the generation pipeline is configured.
B.1. Down-sampling and up-sampling blocks
The model architecture closely follows that of the efficient U-Net (Saharia et al., 2022b), with two-dimensional convolutional layers replaced by one-dimensional convolutional layers. There are small differences, which we review here.

As seen in Figure 1, the U-Net model is a mirror image of itself. Given the model depth D, the model employs D down-sampling and D up-sampling layers, which we label by I = 1, ..., D. Each down-sampling layer is a one-dimensional convolutional layer with stride S_I. Thus, denoting the feature length and feature dimension of the forward-propagated input at depth I by T_I and C_I, the down-sampling layer takes an input with (length, channel) dimensions (T_{I-1}, C_{I-1}) and maps it to an output with dimensions (T_I, C_I), with T_I = T_{I-1}/S_I. The up-sampling layer performs the mirror operation, where an input with dimensions (T_I, C_I) is mapped to an output with dimensions (T_{I-1}, C_{I-1}), with up-sampling stride S_I.

Between the down-sampling and up-sampling layers, "convolutional blocks" of uniform dimension are used. The exact same block is used for both the down-sampling and the up-sampling portion of the U-Net, and the same number of blocks is employed at the same depth. The structure of a block is depicted in Figure 5. While all blocks interact with the time embedding vector (obtained by converting the float into a vector via positional embedding and applying a linear layer) through the "combine embedding" layer, the self-attention and cross-attention layers are only turned on at selected depths, or not used at all in some cases. The "combine embedding" layer applies a fully connected layer to the time embedding to compute a channel-wise scaling and bias vector, which is applied to the input sequence. The self- and cross-attention layers include the standard post-attention residual layer with a hidden layer of twice the dimension of C_I. Meanwhile, the entry convolutional layer and the exit convolutional layer both have kernel size 3. The "zeroth" channel width of the input to the first down-sampling layer and the output of the last up-sampling layer are both set to a "base model dimension". The time embedding dimension is also set to this value.

Figure 5. The structure of the convolutional blocks that form the base unit of operation in the 1D U-Nets. The self- and cross-attention layers are optional.

Let us now summarize the key hyperparameters that determine the architecture of the network:
• Base model dimension
• Convolutional kernel size
• Depth of the network D
• Length-D list of down-sampling factors
• Length-D list of the number of blocks used at each depth
• Length-D list of booleans indicating whether self/cross-attention is used at a given depth
• Number of heads used for attention
The rest of the architecture is fixed.

B.2. Model specifications
The architectural parameters of the models used in this paper are summarized in Table 9.

Table 9. Architectural parameters for the models used in this work (columns: Waveform generator | Waveform cascader | Spectrogram generator | Spectrogram vocoder | Super-resolution cascader).
Base dimension: 256 | 256 | 256 | 128 | 256
Kernel size: 7 | 7 | 9 | 5 | 7
Depth: 6 | 5 | 5 | 4 | 4
Strides: [4, 4, 4, 4, 4, 4] | [4, 4, 4, 4, 4] | [4, 3, 2, 2, 2] | [2, 4, 4, 5] | [4, 4, 4, 4]
Channels: 256 * [1, 1, 2, 3, 4, 4] | 256 * [1, 1, 2, 3, 4] | 256 * [1, 2, 2, 3, 4] | 128 * [1, 1, 2, 4] | 256 * [1, 1, 2, 2]
Blocks: [3, 3, 3, 4, 4, 4] | [3, 3, 4, 6, 6] | [6, 6, 6, 6, 6] | [4, 4, 4, 4] | [4, 4, 4, 4]
Self-attention: [F, F, F, T, T, T] | [F, F, F, F, F] | [F, F, F, T, T] | [F, F, F, F] | [F, F, F, F]
Cross-attention: [F, F, F, T, T, T] | [F, F, F, T, T] | [F, F, F, T, T] | [F, F, F, F] | [F, F, F, F]
Attention heads: 8 | 8 | 8 | - | -

C. Prompt template to prime the LaMDA model to generate music-descriptive text
"Walking on Sunshine" by Katrina & The Waves : The song is a pop / rock song. It has a happy, upbeat mood, with a driving bassline and a simple, repetitive drumbeat. The song is backed by a synthesizer and a guitar.
"Born This Way" by Lady Gaga : The electropop song is backed by rumbling synth sounds, a humming bass and additional chorus percussion, with sole organ toward the end. The female singer has a confident and strong voice.
"Till I Collapse" by Eminem : The song is a hip hop song. It has a strong, aggressive mood, with a powerful, energetic beat.
"My Heart Will Go On" by Celine Dion : The pop song has a sad and heartbroken mood. It contains heavy emphasis on the instrumental arranging. Usage of Tin Whistle is prominent, backed by melodic use of strings and rhythm guitars. The song features both acoustic and electronic instrumentation. The female singer's vocal performance is emotionally demanding.
"Me Too" by Meghan Trainor : The pop song is a mid-tempo pop song. It has a happy, upbeat mood, with a driving bassline and a simple, repetitive drumbeat. The song is backed by a synthesizer and a guitar.
"Nuvole Bianche" by Einaudi : The song is a piano-based classical piece. It has a gentle, melancholy mood, with a soothing, slow pace.
"Don't Stop Me Now" by Queen : The energetic rock song builds on a piano, bass guitar, and drums. The singers are excited, ready to go, and uplifting.
"Strawberry Swing" by Coldplay : The alternative/indie song contains influences from afro-pop and highlife music, and is built around finger-picked, distortion-free guitars with a heavy bassline and psychedelic synths. It's a mid-tempo track, featuring echoing guitars, piano ballad-inspired melodies and bittersweet, anthemic falsetto vocals.
"Mad World" by Gary Jules : The new wave / synth-pop song is backing its male singer with only a set of piano chords, a mellotron imitating a cello, very light touches of electric piano, and modest use of a vocoder on the chorus.
"A Change is Gonna Come" by Sam Cooke : The soul / R&B song has a reflective, nostalgic mood. The male singer's voice is clearly in the foreground, backed by horns, strings, and the timpani carrying the bridge. The French horn conveys a sense of melancholy.
"Smells Like Teen Spirit" by Nirvana : The alternative rock/hard rock song has quiet verses with wobbly, chorused guitar, followed by big, loud hardcore-inspired choruses. The overall mood is rebellious and pumped up. The guitar chords are double tracked to create a more powerful sound.
"Strawberry Fields Forever" by The Beatles : The psychedelic rock song features reverse-recorded instrumentation, Mellotron flute sounds, an Indian swarmandal, tape loops and a fade-out/fade-in coda, as well as a cello and brass arrangement. The vocals are slightly dissonant, adding a bittersweet and ominous quality.
{title} by {artist} :

D. AudioSet music labels
AudioSet labels are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
/m/0z9c,/m/0mkg,/m/042v gx,/m/0fd3y,/t/dd00036,/m/025td0t,/m/0192l,/m/018j2,/m/0bm02,/m/018vs,/m/02cz 7,/m/0395lw, /m/0gg8l,/m/0155w,/m/0l14 3,/m/01kcd,/m/015vgc,/m/01xqw,/m/02bk07,/m/0l14jd,/m/02mscn,/m/0140xf,/m/01wy6,/m/0ggq0m, /m/01lyv,/m/0239kh,/m/01qbl,/m/0ggx5q,/m/02bxd,/m/026z9,/m/02fsn,/m/0283d,/m/02hnl,/m/02k mr,/m/026t6,/m/07s72n, /m/02sgy,/m/08cyft,/m/02lkt,/m/03xq f,/m/0m0jc,/t/dd00035,/m/0326g,/m/0l14j ,/m/02w4v,/m/0319l,/m/02x8m,/t/dd00032, /m/0dwtp,/m/0mbct,/m/0dls3,/m/0342h,/m/03gvt,/t/dd00031,/m/03qjg,/m/03m5k,/m/03q5t,/m/03lty,/m/0glt670,/m/03mb9, /m/05rwpb,/m/03 d0,/m/03r5q ,/m/05148p4,/m/07pkxdp,/m/0j45pbj,/m/04rzd,/m/0dwsp,/m/06j64v,/m/05fw6t,/m/0164x2, /m/028sqc,/m/0dq0md,/m/0g293,/m/02v2lh,/m/05pd6,/m/013y1f,/m/0l14md,/m/05r5c,/m/0fx80y,/m/064t9,/m/0dl5d,/m/05w3f, /m/05r6t,/m/05r5wn,/m/06cqb,/m/06j6l,/m/03t3fj,/m/07sbbz2,/m/06by7,/t/dd00033,/m/0ln16,/m/06ncr,/t/dd00037, /m/01hgjl,/m/0l14l2,/m/0l14t7,/m/0jtg0,/m/06rqw,/m/06rvn,/m/0gywn,/m/0l14gg,/m/06w87,/m/0l156b,/m/02qmj0d, /m/07s0s5r,/m/015y n,/m/0l14qv,/m/01p970,/m/07brj,/m/01glhc,/m/07gxw,/t/dd00034,/m/02cjck,/m/07kc ,/m/011k j, /m/02p0sh1,/m/07lnk,/m/07c6l,/m/07gql,/m/016622,/m/07xzm,/m/0dwt5,/m/01z7dr,/m/07y 7,/m/0y4f8,/m/04wptg,/m/085jw, /m/01sm1g,/m/01bns
What are the main contributions of this paper?
Cursor
The main contributions of this paper, "Noise2Music: Text-conditioned Music Generation with Diffusion Models," can be summarized as follows:
1. The Noise2Music system: text-conditioned music generation with diffusion models
- The system generates high-quality, 30-second music clips from a text prompt.
- It uses cascaded diffusion models that generate the music in stages, improving audio quality and controllability.
2. A two-stage generation pipeline
- Generator model: first produces an intermediate representation (low-fidelity audio or a log-mel spectrogram) from the text prompt.
- Cascader model: then generates high-fidelity audio from the intermediate representation (and, optionally, the text).
- Finally, a super-resolution model upsamples the audio to high-quality 24 kHz.
3. Exploration of multiple intermediate representations
- Two intermediate representations are explored, low-fidelity audio and log-mel spectrograms, and their respective trade-offs are compared.
4. Automatic construction of a large, high-quality training set
- A large language model (LaMDA) and a joint music-text embedding model (MuLan) are used to automatically pseudo-label a large collection of music clips with high-quality, fine-grained text descriptions, yielding roughly 150k hours of music-text pairs.
- The paper also releases MuLaMCap, a dataset of about 400k music-text pairs, enriching the data resources available for music generation and understanding.
5. Architecture and technical choices
- A 1D Efficient U-Net backbone adapts the diffusion model to audio sequence modeling.
- Cross-attention injects the text prompt into the diffusion model.
- Classifier-free guidance improves the semantic alignment between the text and the generated music.
6. Systematic evaluation and comparison
- Generation quality and text-audio alignment are measured with the Fréchet Audio Distance (FAD) and the MuLan similarity score.
- Comparisons against Riffusion, Mubert, and MusicLM demonstrate Noise2Music's strengths.
- A large-scale human listening test further validates the model's performance.
7. Impact on the music-generation field
- The paper shows that diffusion models can capture not only basic attributes of the music (genre, instruments, tempo) but also finer-grained semantics (mood, feeling, activity).
- It provides new methods and data for downstream music generation, retrieval, and tagging work.
In one sentence
Noise2Music applies cascaded diffusion models to text-conditioned music generation at scale, proposes an efficient automatic labeling pipeline, substantially improves the quality and text alignment of generated music, and contributes a large music-text dataset and systematic evaluation to the community.
If you would like more technical detail or a deeper dive into any part, feel free to ask!
User
How much compute does training require?
Cursor
The paper does not directly state the total compute required to train Noise2Music (number of GPUs/TPUs, total training time, etc.), but based on the details in the paper and experience with similar large models, we can make a reasonable estimate:
Relevant information from the paper
- Model size
- The main models have several hundred million parameters (e.g., the waveform generator has about 720M parameters, the spectrogram generator about 740M).
- Data size
- About 150k hours of music audio, automatically labeled with text descriptions.
- Audio sampled at up to 24 kHz, in 30-second segments.
- Training details
- Batch size: 2048 (4096 for the super-resolution model).
- Optimizer: Adam with a cosine learning rate schedule, up to 2.5M training steps.
- Training and inference are done on Google Cloud TPU v4; inference uses 4 TPU v4 chips in parallel per request.
Back-of-envelope reasoning
- Huge data volume: 150k hours of audio at 30 seconds per sample is roughly 18 million samples.
- Large batches + high sample rates + long sequences place heavy demands on memory and bandwidth.
- Parameter count: ~0.7B parameters per model, and there are several models (generator, cascader, vocoder, etc.).
Comparable projects
- MusicLM (Google, 2023): trained on hundreds to thousands of TPU v4 chips over several weeks.
- Imagen / Stable Diffusion-class models: typically tens to hundreds of A100s or TPU v4s for one to two weeks.
Rough estimate (see the quick calculation below)
- Single model: training a ~0.7B-parameter diffusion model with batch size 2048 for 2.5M steps on 8-16 TPU v4 chips would typically take several days to a week.
- Full pipeline (multiple models + large data): realistically tens to hundreds of TPU v4 chips, with total training time on the order of one to two weeks or more.
- Storage: raw audio + intermediate representations + model weights likely amount to at least tens of TB.
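A quick back-of-envelope calculation of the data volume (my own rough numbers, not figures reported in the paper):

```python
# Rough back-of-envelope numbers (my own assumptions, not reported in the paper).
hours_of_audio = 150_000                      # ~150k hours of training audio
clip_seconds = 30                             # 30-second training segments
clips = hours_of_audio * 3600 / clip_seconds  # ~18 million clips
batch_size = 2048
steps = 2_500_000                             # cosine schedule end point
clips_seen = batch_size * steps               # ~5.1 billion clip draws
epochs = clips_seen / clips                   # ~280 passes over the data
print(f"{clips:.2e} clips, {clips_seen:.2e} examples seen, ~{epochs:.0f} epochs")
```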
Relevant quote from the paper
"We serve the models on Google Cloud TPU V4, where each service request generates four 30-second music clips. We apply GSPMD (Xu et al., 2021) to partition the model on four TPU V4 devices, reducing the serving time by more than 50%."
Summary
- Training Noise2Music requires a very large amount of compute, roughly on the order of tens to hundreds of TPU v4 chips (or equivalent A100 GPUs) for one to two weeks or more.
- The exact cost depends on the actual parallelism, data preprocessing, model tuning, and other details.
- Training at this scale is usually only feasible for large companies or research institutes.
If you'd like a more detailed estimate or advice on a small-scale reproduction, feel free to ask!
User
What datasets are used?
Cursor
The datasets used in the Noise2Music paper consist mainly of a large, internally constructed music-text training set, plus public datasets used for evaluation. In detail:
1. Training data (internally constructed)
Audio sources
- Raw audio: roughly 6,800,000 music audio files (about 150k hours), each split into multiple 30-second segments.
- Sample rates: 16 kHz for the main models, 24 kHz for the super-resolution model.
Where do the text labels (descriptions) come from?
- Because public music-text pairs are scarce, the authors use an automatic pseudo-labeling approach:
- A large language model (LaMDA) generates a large pool of music-descriptive sentences (style, mood, instruments, setting, etc.).
- MuLan (a joint music-text embedding model) matches each audio segment against the candidate descriptions and keeps the most relevant ones as pseudo-labels (see the sketch after this list).
- Additional labels come from song metadata (title, artist, genre, etc.) and human-annotated short tags.
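Here is a minimal sketch of the MuLan-style pseudo-labeling idea; the embed_audio and embed_text functions below are hypothetical stand-ins for the MuLan audio and text towers, which are not publicly released:

```python
import numpy as np

def pseudo_label(audio_clip, candidate_captions, embed_audio, embed_text, top_k=3):
    """Score each candidate caption against the clip by cosine similarity; keep the best.

    embed_audio / embed_text are placeholders for a joint audio-text embedding model.
    """
    a = embed_audio(audio_clip)
    a = a / np.linalg.norm(a)
    t = np.stack([embed_text(c) for c in candidate_captions])
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    scores = t @ a                               # cosine similarities in the joint space
    best = np.argsort(-scores)[:top_k]
    return [(candidate_captions[i], float(scores[i])) for i in best]
```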
Final training set size
- About 150k hours of music audio, with several high-quality text descriptions per segment.
- The paper also builds MuLaMCap, a music-text dataset of roughly 400k music-text pairs (automatically labeled on the music subset of AudioSet).
2. Evaluation datasets (public)
Noise2Music is evaluated on the following public datasets:
- MusicCaps
- Released by Google; 5,500 ten-second music clips, each with a detailed human-written caption.
- AudioSet-Music-Eval
- The music subset of AudioSet; 1,482 ten-second music clips, each labeled with tags (genre, instruments, etc.).
- MagnaTagATune (MTAT)
- 21,638 music clips, each with multiple music tags (genre, mood, instruments, etc.).
3. The MuLaMCap dataset (a contribution of the paper)
- MuLaMCap is built by automatically captioning the music subset of AudioSet.
- It contains roughly 400k music-text pairs, with semantically rich, fine-grained descriptions coming from a large language model and human tags.
Summary
- Training data: a large internally built corpus of music audio with automatically generated text descriptions (pseudo-labels), about 150k hours / 6.8 million tracks.
- Evaluation data: MusicCaps, AudioSet-Music-Eval, MagnaTagATune.
- Contributed data: MuLaMCap (about 400k music-text pairs, publicly available).
If you would like details on the pseudo-labeling pipeline, data preprocessing, or links to the public datasets, just ask!
User
Are there modern open-source alternatives to OpenMusic?
Cursor
You're asking about modern open-source alternatives to OpenMusic. OpenMusic here usually refers to the visual composition environment developed at IRCAM, used mainly for algorithmic composition, music-theory experiments, and sound design. OpenMusic is known for its graphical programming and flexible handling of musical data, but its interface and technology stack are dated, and it mainly targets the Mac platform.
If you are looking for modern, open-source, cross-platform alternatives, here are some mainstream options covering visual programming, algorithmic composition, music generation, and MIDI/audio processing:
1. Pure Data (Pd)
- Overview: a very popular open-source visual audio programming environment, suitable for real-time audio, MIDI, algorithmic composition, and interactive art.
- Strengths: cross-platform (Windows/Mac/Linux), active community, highly extensible.
- Website: https://puredata.info/
2. OM# (the OpenMusic successor)
- Overview: a modern rewrite of OpenMusic, aiming to be a cross-platform, modernized replacement.
- Strengths: supports Windows/Mac/Linux, has a more modern interface, and is compatible with OpenMusic patches.
- Website: https://github.com/openmusic-project/om#
3. ossia score
- Overview: an open-source authoring environment for real-time interactive and multimedia works, supporting audio, MIDI, OSC, DMX, and other protocols.
- Strengths: modern UI, cross-platform, well suited to complex interactive music/multimedia pieces.
- Website: https://ossia.io/score/
4. SuperCollider
- Overview: a powerful audio synthesis and algorithmic composition platform; primarily code-based, with a rich community and visual front-ends (e.g., tidalcycles.org).
- Strengths: very strong real-time audio processing, cross-platform, active community.
- Website: https://supercollider.github.io/
5. NodeBox
- Overview: a visual programming environment for creative coding, supporting generative music, graphics, and animation.
- Strengths: modern and easy to use, well suited to artists and designers.
- Website: https://www.nodebox.net/
6. LilyPond + Python/Music21
- Overview: LilyPond handles score engraving; Music21 is an MIT-developed Python library for music analysis and generation, suited to algorithmic composition and musicology research.
- Strengths: fully open source, Python ecosystem, good for batch processing and automation.
- Websites: http://lilypond.org/, https://web.mit.edu/music21/
7. Orca
- Overview: a minimalist, creative text-based music programming environment, suited to live coding and experimental music.
- Strengths: minimal, cross-platform, active community.
- Website: https://github.com/hundredrabbits/Orca
8. INScore
- Overview: an open-source platform focused on interactive scores and multimedia display, suited to experimental music and visualization.
- Website: https://inscore.sourceforge.io/
Summary
- If you want the experience closest to OpenMusic, look at OM# (the OpenMusic successor).
- If real-time audio and interaction matter more, Pure Data and ossia score are good choices.
- If you prefer code and algorithmic composition, SuperCollider, Music21, and Orca are all worth trying.
If you'd like a detailed introduction, comparison, or getting-started guide for any of these tools, feel free to ask!
User
What type of model architecture is best suited for generative music? [D] : r/MachineLearning
r/MachineLearning • posted 10 mo. ago by redditTee123
[Discussion] In a previous life before $$$ became a necessity & I started software, I was a musician. I played guitar, piano, drums, have a solid understanding of music theory, and produced a few of my own tracks.
Not that much of that likely matters for building my own generative music model, but it definitely means my interest is there :)
Would anyone have recs where to start, specifically what type of model architecture? GPT is suggesting RNNs, Transformers, or VAEs. Has anyone experimented with various models & could give some guidance on architecture selection + how to process data appropriately?
20 upvotes · 7 comments
not_particulary • 10mo ago: Start with a transformer, maybe on midi files?
mvreich • 10mo ago: Look into diffusion models and audio diffusion. Just off the top of my head, it should be very easy to train a text/lyric-to-music generative model following the generalized approach in Rombach's LDM paper. KL-regularized audio autoencoder + CLAP prompt embedding + transformer-based diffusion model.
AsianMusicProgrammer • 10mo ago: I'd recommend checking out MidiTok for getting started with Transformer-based approaches
veganshakzuka • 10mo ago: I'd look into Jamba.
Karioth1 • 10mo ago: Hardest part will be securing enough compute to train. But I would go on Google Scholar and just search "music generation transformer" and see what's there. I would probably go for a symbolic approach trained on midi-like tokens. But there are audio-to-audio models as well.
What was discussed?
Cursor
This Reddit thread is about which model architecture is best suited for generative music. Here is a summary of the discussion:
1. The mainstream suggestion: Transformers (see the minimal sketch at the end of this answer)
- Several users suggest starting with a Transformer architecture, especially trained on MIDI (symbolic music) data.
- Transformers excel at sequence modeling (text, MIDI note sequences) and have become the mainstream choice for music generation.
2. Diffusion models
- One commenter recommends looking into diffusion models and audio diffusion, in particular combining an autoencoder with a joint text/audio embedding (such as CLAP).
- This approach suits text-to-music generation and can borrow the latent diffusion model (LDM) recipe from the image domain.
3. Other architectures
- RNNs and VAEs are mentioned as more traditional options, but the consensus is that Transformers and diffusion models are more modern and effective.
4. Data-processing advice
- Several commenters suggest starting with MIDI or other symbolic data, since modeling raw audio directly is harder and needs much more compute.
- Tools such as MidiTok are recommended for tokenizing MIDI data for Transformer-style models.
5. Compute and data
- One commenter warns that training large models requires a lot of compute, and suggests reading recent papers to understand current practice and available datasets.
6. Other resources
- Searching Google Scholar for "music generation transformer" is suggested for recent work.
- One user also recommends looking into Jamba.
In one sentence
- Transformers are currently the mainstream architecture for music generation, especially on MIDI-like symbolic data.
- Diffusion models are also promising for audio generation and text-to-music.
- Data processing and compute are the main practical considerations.
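If you want a concrete starting point for the Transformer-on-tokens route the commenters suggest, here is a minimal PyTorch sketch (illustrative only; the vocabulary size and dimensions are made up, and the token IDs could come from a MIDI tokenizer such as MidiTok):

```python
import torch
import torch.nn as nn

# Tiny decoder-style Transformer over symbolic-music token IDs — a toy starting point,
# not a production model; vocab_size and dimensions are placeholder values.
class TinyMusicLM(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, n_heads=4, n_layers=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq)
        seq = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.encoder(x, mask=mask))  # next-token logits

logits = TinyMusicLM()(torch.randint(0, 512, (2, 64)))
print(logits.shape)  # torch.Size([2, 64, 512])
```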
If you'd like specific model, code, or paper recommendations, just ask!
User
Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber∗ Barak Lenz∗ Hofit Bata Gal Cohen Jhonathan Osin Itay Dalmedigos Erez Safahi Shaked Meirom Yonatan Belinkov Shai Shalev-Shwartz Omri Abend Raz Alon Tomer Asida Amir Bergman Roman Glozman Michael Gokhman Avshalom Manevich Nir Ratner Noam Rozen Erez Schwartz Mor Zusman Yoav Shoham
∗Equal contribution. arXiv:2403.19887v2 [cs.CL] 3 Jul 2024

Abstract
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and a small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
Model: https://huggingface.co/ai21labs/Jamba-v0.1

1 Introduction
We introduce Jamba, a new publicly available large language model. Jamba is based on a novel hybrid architecture, which combines Transformer layers [51] with Mamba layers [17], a recent state-space model [18, 19], as well as a mixture-of-experts (MoE) module [14, 46]. Jamba thus combines two orthogonal architectural designs that together give it improved performance and higher throughput, while maintaining a manageable memory footprint. The 7B-based Jamba model (12B active parameters, 52B total available parameters) we are releasing was designed to fit in a single 80GB GPU, but the Jamba architecture supports other design choices, depending on one's hardware and performance requirements.

The fundamental novelty of Jamba is its hybrid Transformer-Mamba architecture (though see mention below of recent related efforts). Despite the immense popularity of the Transformer as the predominant architecture for language models, it suffers from two main drawbacks. First, its high memory and compute requirements hinder the processing of long contexts, where the key-value (KV) cache size becomes a limiting factor. Second, its lack of a single summary state entails slow inference and low throughput, since each generated token performs a computation on the entire context. In contrast, older recurrent neural network (RNN) models, which summarize an arbitrarily long context in a single hidden state, do not suffer from these limitations. RNN models have their own shortcomings, however. They are costly to train since training cannot be parallelized across time steps.
And they struggle with long distance relationships, which the hidden state captures to only a limited extent. Recent state space models (SSMs) like Mamba are more efficient to train than RNNs and are more capable at handling long distance relationships, but still lag behind the performance of comparably sized Transformer language models.

Taking advantage of both model families, Jamba combines Transformer and Mamba layers, at a certain ratio. Varying the ratio of Transformer/Mamba layers allows balancing memory usage, efficient training, and long context capabilities.

A few other recent attempts to combine Attention and SSM modules are worth noting. [55] mixes an S4 layer [18] with a local attention layer, followed by a sequence of local attention layers; it shows experiments with small models and simple tasks. [17] reports that interleaving Mamba and attention layers is only slightly better than pure Mamba in terms of perplexity, with models up to 1.3B parameters. [37] starts with an SSM layer followed by chunk-based Transformers, with models up to 1.3B showing improved perplexity. [13] adds an SSM layer before the self-attention in a Transformer layer, while [43] adds the SSM after the self-attention, both showing improvements on speech recognition. [36] replaces the MLP layers in the Transformer by Mamba layers, and shows benefits in simple tasks. These efforts are different from Jamba both in the particular way in which the SSM component is mixed with the attention one, and in the scale of implementation. Closest are perhaps H3 [15], a specially designed SSM that enables induction capabilities, and a generalization called Hyena [39]. The former proposed a hybrid architecture that replaces the second and middle layers with self-attention, and was implemented with up to 2.7B parameters and 400B training tokens. However, as shown in [17], its performance lags that of pure Mamba. Based on Hyena, StripedHyena [40] interleaves attention and SSM layers in a 7B parameter model. However, it lags behind the attention-only Mistral-7B [23]. All of this renders Jamba the first production-grade Attention-SSM hybrid model. Scaling the hybrid Jamba architecture required overcoming several obstacles, which we discuss in Section 6.

Jamba also includes MoE layers [14, 46], which allow increasing the model capacity (total number of available parameters) without increasing compute requirements (number of active parameters). MoE is a flexible approach that enables training extremely large models with strong performance [24]. In Jamba, MoE is applied to some of the MLP layers. The more MoE layers, and the more experts in each MoE layer, the larger the total number of model parameters. In contrast, the more experts we use at each forward pass, the larger the number of active parameters as well as the compute requirement. In our implementation of Jamba, we apply MoE at every other layer, with 16 experts and the top-2 experts used at each token (a more detailed discussion of the model architecture is provided below).

We evaluated our implementation of Jamba on a wide range of benchmarks and found it performs comparably to Mixtral-8x7B [24], which has a similar number of parameters, and also to the larger Llama-2 70B [50]. In addition, our model supports a context length of 256K tokens – the longest supported context length for production-grade publicly available models. On long-context evaluations, Jamba outperforms Mixtral on most of the evaluated datasets.
At the same time, Jamba is extremely efficient; for example, its throughput is 3x that of Mixtral-8x7B for long contexts. Moreover, our model fits in a single GPU (with 8-bit weights) even with contexts of over 128K tokens, which is impossible with similar-size attention-only models such as Mixtral-8x7B.

Somewhat unusual for a new architecture, we release Jamba (12B active parameters, 52B total available parameters) under the Apache 2.0 license: https://huggingface.co/ai21labs/Jamba-v0.1. We do so since we feel that the novel architecture of Jamba calls for further study, experimentation, and optimization by the community. Our design was based on various ablation experiments we conducted to explore the effect of different tradeoffs and design choices, and insights gleaned from those. These ablations were performed at scales of up to 7B parameters, and training runs of up to 250B tokens. We plan to release model checkpoints from these runs.

Figure 1: (a) A single Jamba block. (b) Different types of layers (Transformer layer, Attention MoE layer, Mamba layer, Mamba MoE layer). The implementation shown here is with l = 8, an a : m = 1 : 7 ratio of attention-to-Mamba layers, and MoE applied every e = 2 layers.

Important notice: The Jamba model released is a pretrained base model, which did not go through alignment or instruction tuning, and does not have moderation mechanisms. It should not be used in production environments or with end users without additional adaptation.

2 Model Architecture
Jamba is a hybrid decoder architecture that mixes Transformer layers [51] with Mamba layers [17], a recent state-space model (SSM) [18, 19], in addition to a mixture-of-experts (MoE) module [14, 46]. We call the combination of these three elements a Jamba block. See Figure 1 for an illustration.

Combining Transformer, Mamba, and MoE elements allows flexibility in balancing among the sometimes conflicting objectives of low memory usage, high throughput, and high quality. In terms of memory usage, note that comparing the total size of the model parameters can be misleading. In an MoE model, the number of active parameters that participate in any given forward step may be much smaller than the total number of parameters. Another important consideration is the KV cache – the memory required to store the attention keys and values in the context. When scaling Transformer models to long contexts, the KV cache becomes a limiting factor. Trading off attention layers for Mamba layers reduces the total size of the KV cache. Our architecture aims to provide not only a small number of active parameters but also an 8x smaller KV cache compared to a vanilla Transformer. Table 1 compares Jamba with recent publicly available models, showing its advantage in maintaining a small KV cache even with 256K token contexts.

Table 1: Comparison of Jamba and recent open models in terms of total available parameters, active parameters, and KV cache memory on long contexts. Jamba provides a substantial reduction in the KV cache memory requirements.
Model | Available params | Active params | KV cache (256K context, 16bit)
LLAMA-2 | 6.7B | 6.7B | 128GB
Mistral | 7.2B | 7.2B | 32GB
Mixtral | 46.7B | 12.9B | 32GB
Jamba | 52B | 12B | 4GB
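The KV-cache column of Table 1 can be reproduced approximately with a back-of-envelope formula. The sketch below is illustrative only: the Llama-2-like head counts and the grouped-query KV head count assumed for Jamba are not taken from the paper; the point is simply that cutting the number of attention layers cuts the cache proportionally.

```python
def kv_cache_gib(attention_layers, kv_heads, head_dim, context_len, bytes_per_val=2):
    """Approximate KV-cache size: 2 tensors (K and V) per attention layer."""
    return 2 * attention_layers * kv_heads * head_dim * context_len * bytes_per_val / 2**30

CONTEXT = 256 * 1024
# Llama-2-7B-like: 32 attention layers, 32 KV heads of dim 128 (full multi-head attention).
print(round(kv_cache_gib(32, 32, 128, CONTEXT)))  # ~128 GiB, in line with Table 1
# Jamba-like: 4 Jamba blocks x 8 layers at a 1:7 attention-to-Mamba ratio = 4 attention
# layers; 8 grouped-query KV heads of dim 128 assumed for illustration (not from the paper).
print(round(kv_cache_gib(4, 8, 128, CONTEXT)))    # ~4 GiB
```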
In terms of throughput, with short sequences, attention operations take up a small fraction of the inference and training FLOPS [7]. However, with long sequences, attention hogs most of the compute. In contrast, Mamba layers are more compute-efficient. Thus, increasing the ratio of Mamba layers improves throughput, especially for long sequences.

Here is a description of the main configuration, which provides improved performance and efficiency. Section 6 contains results from ablation experiments supporting the design choices. The basic component is a Jamba block, which may be repeated in sequence. Each Jamba block is a combination of Mamba or Attention layers. Each such layer contains either an attention or a Mamba module, followed by a multi-layer perceptron (MLP). The different possible types of layers are shown in Figure 1(b).² A Jamba block contains l layers, which are mixed at a ratio of a : m, meaning a attention layers for every m Mamba layers.

In Jamba, some of the MLPs may be replaced by MoE layers, which helps increase the model capacity while keeping the active number of parameters, and thus the compute, small. The MoE module may be applied to MLPs every e layers. When using MoE, there are n possible experts per layer, with a router choosing the top-K experts at each token. In summary, the different degrees of freedom in the Jamba architecture (laid out concretely in the sketch at the end of this section) are:
• l: the number of layers.
• a : m: the ratio of attention-to-Mamba layers.
• e: how often to use MoE instead of a single MLP.
• n: the total number of experts per layer.
• K: the number of top experts used at each token.

Given this design space, Jamba provides flexibility in preferring certain properties over others. For example, increasing m and decreasing a, that is, increasing the ratio of Mamba layers at the expense of attention layers, reduces the required memory for storing the key-value cache. This reduces the overall memory footprint, which is especially important for processing long sequences. Increasing the ratio of Mamba layers also improves throughput, especially at long sequences. However, decreasing a might lower the model's capabilities. Additionally, balancing n, K, and e affects the relationship between active parameters and total available parameters. A larger n increases the model capacity at the expense of memory footprint, while a larger K increases the active parameter usage and the compute requirement. In contrast, a larger e decreases the model capacity, while decreasing both compute (when K > 1) and memory requirements, and allowing for fewer communication dependencies (decreasing memory transfers as well as inter-GPU communication during expert-parallel training and inference).

Jamba's implementation of Mamba layers incorporates several normalizations that help stabilize training at large model scales. In particular, we apply RMSNorm [53] in the Mamba layers. We found that with the Mamba layer, positional embeddings or mechanisms like RoPE [47] are not necessary, and so we do not use any explicit positional information. Other architecture details are standard, including grouped-query attention (GQA), the SwiGLU activation function [7, 45, 50], and load balancing for the MoE [14]. The vocabulary size is 64K. The tokenizer is trained with BPE [16, 33, 44] and each digit is a separate token [7]. We also remove the dummy space used in Llama and Mistral tokenizers for more consistent and reversible tokenization.

²The figure shows a potential Attention MoE layer, which our architecture does not use, but future variants could.
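To make these degrees of freedom concrete, here is a small illustrative sketch of how one Jamba block could be laid out from l, the attention-to-Mamba ratio, and e. The exact layer positions chosen here are an assumption for illustration, not the released implementation.

```python
def jamba_block_layout(l=8, attention_every=8, moe_every=2):
    """Lay out the layer types of one Jamba block.

    With l = 8 and an a:m = 1:7 ratio there is one attention layer per block (placed
    mid-block here for illustration); MoE replaces the MLP on every `moe_every`-th layer.
    """
    layers = []
    for i in range(l):
        mixer = "attention" if (i % attention_every) == (attention_every // 2) else "mamba"
        mlp = "moe" if (i % moe_every) == 1 else "mlp"
        layers.append(f"{mixer}+{mlp}")
    return layers

print(jamba_block_layout())
# ['mamba+mlp', 'mamba+moe', 'mamba+mlp', 'mamba+moe',
#  'attention+mlp', 'mamba+moe', 'mamba+mlp', 'mamba+moe']
```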
3 Reaping the Benefits 3.1 Jamba Implementation for a Single 80GB GPU The specific configuration in our implementation was chosen to fit in a single 80GB GPU, while achieving best performance in the sense of quality and throughput. In our implementation we have a sequence of 4 Jamba blocks. Each Jamba block has the following configuration: • l = 8: The number of layers. • a : m = 1 : 7: ratio attention-to-Mamba layers. • e = 2: how often to use MoE instead of a single MLP. • n = 16: total number of experts. • K = 2: number of top experts used at each token. The a : m = 1 : 7 ratio was chosen according to preliminary ablations, as shown in Section 6, since this ratio was the most compute-efficient variant amongst the best performing variants in terms of quality. The configuration of the experts was chosen to enable the model to fit in a single 80GB GPU (with int8 weights), while including sufficient memory for the inputs. In particular, n and e were balanced to have an average of ∼8 experts per layer. In addition, we balanced n, K, and e to allow for high quality, while keeping both compute requirements and communication dependencies (memory transfers) checked. Accordingly, we chose to replace the MLP module with MoE on every other layer, as well as have a total of 16 experts, two of which are used at each token. These choices were inspired by prior work on MoE [8, 54] and verified in preliminary experiments. Figure 2 shows the maximal context length that fits a single 80GB GPU with our Jamba implementation compared to Mixtral 8x7B and Llama-2-70B. Jamba provides 2x the context length of Mixtral and 7x that of Llama-2-70B. Context Length 0K 20K 40K 60K 80K 100K 120K 140K Llama-2 70B Mixtral 8x7B Jamba Context length fitting a single 80GB A100 GPU Figure 2: Comparison of maximum context length fitting in a single A100 80GB GPU. Jamba enables 2x the context length of Mixtral and 7x that of Llama-2-70B. Overall, our Jamba implementation was successfully trained on context lengths of up to 1M tokens. The released model supports lengths of up to 256K tokens. 5 3.2 Throughput Analysis For concreteness, we present results of the throughput in two specific settings.3 In the first setting, we have varying batch size, a single A100 80 GB GPU, int8 quantization, 8K context length, generating output of 512 tokens. As Figure 3a shows, Jamba allows processing of large batches, leading to a 3x increase in throughput (tokens/second) over Mixtral, which does not fit with a batch of 16 despite having a similar number of active parameters. In the second setting, we have a single batch, 4 A100 GPUs, no quantization, varying context lengths, generating output of 512 tokens. As demonstrated in Figure 3b, at small context lengths all models have a similar throughput. Jamba excels at long contexts; with 128K tokens its throughput is 3x that of Mixtral. Note that this is despite the fact that Jamba has not yet enjoyed optimizations of the kind the community has developed for pure Transformer models over the past six years. We can expect the throughut gap to increase as such optimizations are developed also for Jamba. Batch size Throughput (t/s) 0 500 1000 1500 2000 1 2 4 8 16 Llama-2 13B Llama-2 70B Mixtral 8x7B Jamba Throughput (single GPU) (a) Throughput at different batch sizes (single A100 GPU, 8K context length). Jamba allows processing large batches, with a throughput 3x greater than Mixtral. 
Context Window Throughput (t/s) 0 500 1000 1500 2000 1K 2K 4K 8K 16K 32K 64K 128K Llama-2 13B Llama-2 70B Mixtral 8x7B Jamba Throughput (4 A100 GPUs) (b) Throughput at different context lengths (single batch, 4 A100 GPUs). With a context of 128K tokens, Jamba obtains 3x the throughput of Mixtral, while Llama-2-70B does not fit with this long context. Figure 3: Comparison of throughput (tokens/second) with Jamba and recent open models. 4 Training Infrastructure and Dataset The model was trained on NVIDIA H100 GPUs. We used an in-house proprietary framework allowing efficient large-scale training including FSDP, tensor parallelism, sequence parallelism, and expert parallelism. Jamba is trained on an in-house dataset that contains text data from the Web, books, and code, with the last update in March 2024. Our data processing pipeline includes quality filters and deduplication. 5 Evaluation In general we approach benchmarks cautiously, as they correlate only partially with what matters in real applications, and furthermore invite gaming the system in order to boast vanity numbers. Nevertheless, we present several indicative results. 5.1 Academic Benchmarks We report results with a wide range of standard academic benchmarks: Common sense reasoning: HellaSwag (10-shot) [52], WinoGrande (5-shot) [42], ARC-E (0-shot) and ARC-Challenge (25-shot) [10], and PIQA (zero-shot) [3]. Reading Comprehension: BoolQ (10-shots) [9] and QuAC (zero-shot) [6]. Others: GSM8K (3-shot CoT) [11], HumanEval (pass@1) [5], Natural Questions closed-book (NQ; 5-shot) [27], and TruthfulQA (zero-shot) [30]. Aggregate benchmarks: MMLU (5-shot) [21] and BBH (3-shot) [48]. 3Referring to end-to-end throughput (encoding+decoding). The results should be taken relatively rather than absolutely, as they are without possible optimizations. 6 Reasoning HellaSwag WinoGrande ARC-E ARC-C PIQA NQ TruthfulQA Llama-2 13B 80.7 72.8 77.3 59.4 80.5 37.7 37.4 Llama-2 70B 85.3 80.2 80.2 67.3 82.8 46.9 44.9 Gemma 81.2 72.3 81.5 53.2 81.2 32.6 44.8 Mixtral 86.7 81.2 77.6 66 83 44.8 46.8 Jamba 87.1 82.5 73.5 64.4 83.2 45.9 46.4 Comprehension Aggregate BoolQ QuAC GSM8K HumanEval MMLU BBH Llama-2 13B 81.7 42.7 34.7 18.3 54.8 39.4 Llama-2 70B 85 42.4 55.3 29.9 69.8 51.2 Gemma 87.2 39.2 54.5 32.3 64.3 55.1 Mixtral 88.4 40.9 60.4 34.8 70.6 50.3 Jamba 88.2 40.9 59.9 29.3 67.4 45.4 Table 2: Comparison of Jamba with other publicly available models. Jamba obtains similar performance with much better throughput. Table 2 compares Jamba to several publicly available models on common academic benchmarks for evaluating language models. We compare with Llama-2 13B [50], which has about the same number of active paramters as our model, Llama-2 70B, which is larger than our model, Gemma [49], which has 7B parameters, and Mixtral [24], which has about the same number of active and total parameters as our model. In most tasks, Jamba performs comparably to leading publicly available models of similar or larger size, including Llama-2 70B and Mixtral. At the same time, our model has a smaller number of total available parameters than Llama-2 (52B compared to 70B). Moreover, as a sparse model, Jamba has only 12B active parameters, similar to Mixtral’s 12.9B active parameters. However, as a fullyattentional model, Mixtral has a large memory footprint with long sequences, requiring 32GB for the KV cache with 256K tokens. In contrast, thanks to its hybrid Attention-Mamba architecture, Jamba’s KV cache takes only 4GB even at such a long context (Section 2). 
Importantly, our Jamba achieves such a strong performance while having much better throughput than Llama-2 70B and Mixtral, up to 3x improvement (Section 3.2). In summary, Jamba demostrates the ability of hybrid architectures to reach the performance of state-of-the-art Transformer models of the same size class, while having the benefits of an SSM. 5.2 Long-Context Evaluations We have successfully trained Jamba models with context lengths of up to 1M tokens. The released model handles context lengths of up to 256K tokens. In this section, we evaluate it on synthetic and naturalistic benchmarks that test its long-context capabilities. 5.2.1 Needle-in-a-haystack As Figure 4 shows, Jamba has excellent performance in the needle-in-a-haystack evaluation, which requires retrieving a simple statement planted in a long context window [25]. This result is noteworthy especially given that our implementation of Jamba uses only 4 attention layers. 5.2.2 Naturalistic long-context evaluation We evaluate Jamba’s ability to handle long contexts in two settings. First, we evaluate the model on several classification tasks commonly used for assessing in-context learning, with an increasing number of few-shot examples. In particular, we use the four datasets with the largest label space from [41], which showed that such tasks benefit most from using more few-shot examples: Trec-Fine (fine-grained question type classification, 50 labels; [29]), NLU Intent (intent classification in natural language understanding, 68 labels; [31]), Banking77 (intent classification in the banking domain, 77 7 2K 4K 8K 32K 128K 256K Tokens 0.0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0 Depth % Retrieval Across Context Lengths (“Needle-in-a-haystack”) 0.0 0.2 0.4 0.6 0.8 1.0 Score Figure 4: A needle-in-a-haystack evaluation showing Jamba’s ability to recall statements placed in the middle of contexts of up to 256K tokens length. labels; [4]), and CLINC150 (intent classification, 150 labels; [28]). In each case we add examples up to a context length of 128K tokens. Figure 5 shows the results, comparing Jamba to Mixtral. In Trec-Fine and Banking77, Jamba outperforms Mixtral, especially with a large number of few-shot examples. In NLU Intent and CLINC150, the two models are on par. 0 1000 2000 3000 4000
samples
0.0 0.2 0.4 0.6 0.8 1.0 greedy_em greedy_em vs. # samples for TREC-Fine Jamba Mixtral (a) TREC Fine-grained. 0 1000 2000 3000 4000 5000
samples
0.0 0.2 0.4 0.6 0.8 1.0 greedy_em greedy_em vs. # samples for NLU Intent Jamba Mixtral (b) NLU Intent. 0 500 1000 1500 2000 2500 3000 3500
samples
0.0 0.2 0.4 0.6 0.8 1.0 greedy_em greedy_em vs. # samples for Banking77 Jamba Mixtral (c) Banking77. 0 1000 2000 3000 4000
samples
0.0 0.2 0.4 0.6 0.8 1.0 greedy_em greedy_em vs. # samples for CLINC150 Jamba Mixtral (d) CLINC150. Figure 5: Comparison of Jamba and Mixtral on few-shot classification with a large number of few-shot examples. Results are exact match with greedy decoding. Second, we use question-answering datasets, consisting of long inputs. To this end, we repurpose five of the longest-context datasets from L-Eval [2], by structuring them in a few-shot format (we use 3-shots in these experiments). Specifically, we evaluated the models on the following datasets: NarrativeQA (QA on narratives; [26]), LongFQA (finance; [2]), Natural Questions (NQ; Wikipedia; 8 [27]), CUAD (law; [22]), and SFiction (science fiction). The average input length in these datasets ranges from 6K to 62K tokens. These lengths are further highly expanded by the few-shot format. Table 3 summarizes the results, in terms of F1.4 Jamba outperforms Mixtral on most of the datasets as well as on average. In addition, as these long-context tasks require substantial computation, here Jamba’s efficiency shines, with much better throughput with long contexts (Section 3.2). LongFQA CUAD NarrativeQA NQ SFiction Avg Mixtral 0.42 0.46 0.29 0.58 0.42 0.43 Jamba 0.44 0.44 0.30 0.60 0.40 0.44 Table 3: Results (F1) on long-context QA benchmarks, with a 3-shot format. 6 Ablations and Insights This section discusses ablation experiments we ran for different design choices in our implementation of the Jamba architecture. First we show the benefit of combining attention and Mamba layers, at which ratio they should be combined, and how to interleave them. We investigate cases where pure Mamba fails, suggesting that it struggles to develop in-context learning capabilities, while the Attention-Mamba hybrid exhibits in-context learning similar to vanilla Transformers. Then we show the benefit of adding MoE on top of a hybrid Attention-Mamba model. Finally, we share two additional learnings that we found useful: explicit positional information is not needed in Jamba, and Mamba layers necessitate special normalization to stabilize training at large scale.5 For these ablations, we report the following measures, which exhibit informative performance even at small data or model scale. • Academic benchmarks: HellaSwag (10-shot) [52], WinoGrande (5-shot) [42], Natural Questions closed-book (NQ; 5-shot) [27]. • HuggingFace OpenLLM leaderboard (OLLM) [12]: a summary statistic of several datasets. We report results with our reproduction. • Perplexity evaluations: we report log-prob (per byte) on texts from three domains: C4, Books, and code. 6.1 Benefits of combining Attention and Mamba We first investigate the ratio of Attention to Mamba layers (a : m), with 1.3B parameters models trained for 250B tokens. As Table 4 shows, the hybrid Jamba model outperforms the pure attention or Mamba models. The ratio of attention-to-Mamba layers may be 1:3 or 1:7 with virtually no performance difference. Figure 6 shows the training loss of these models, where Jamba exhibits improved loss during training. Given that a 1:7 ratio is more compute-efficient and shows similar performance, we opt for it in our larger-scale experiments. 
Hella Swag Wino Grande log-prob OLLM NQ C4 Books Code Attention 36.4 62.4 59.6 14.5 -0.543 -0.659 -0.331 Mamba 36.1 62.6 59.4 14.5 -0.543 -0.661 -0.334 Jamba (a : m = 1 : 3, no MoE) 37.2 65.1 61.7 16.5 -0.533 -0.649 -0.321 Jamba (a : m = 1 : 7, no MoE) 37.2 65.1 61.7 16.0 -0.533 -0.650 -0.321 Table 4: Results on academic benchmarks and log-probability evaluations showing an improved performance of Attention-Mamba (no MoE) compared to vanilla Attention and Mamba models. There is no substantial difference between 1:3 and 1:7 ratios of Attention-to-Mamba layers. Models are 1.3B parameters, trained for 250B tokens. 4 F1 score is the recommended metric in L-Eval [2]. In addition, our setup calibrates the length of the few-shot completions to approximately match the length of the test completion, increasing the credibility of F1. 5 In all the ablations, “pure Mamba” refers to models with Mamba layers interleaved with MLP layers. 9 Figure 6: Training loss curves for pure Attention, pure Mamba, and Attention-Mamba hybrids (no MoE), with ratios a : m of 1:3 and 1:7. All models are 1.3B parameters. The two hybrids achieve better loss throughout this training run, without any noticeable difference between the different Attention/Mamba ratios. Next, we compare performance of vanilla Transformer, vanilla Mamba, and Attention-Mamba hybrid models, at 7B model size, after training on 50B tokens. As Table 5 shows, the pure Mamba model is quite competitive, but lags slightly behind pure Attention. The hybrid Attention-Mamba (without MoE) outperforms the pure models while obtaining better throughput than the vanilla Transformer (Section 3.2). Hella Swag Wino Grande log-prob OLLM NQ C4 Books Code Attention 36.1 60.4 59.7 13.7 -0.555 -0.666 -0.347 Mamba 35.3 60.2 55.8 14.0 -0.554 -0.667 -0.355 Jamba (a : m = 1 : 7, no MoE) 36.6 62.5 58.8 15.4 -0.547 -0.658 -0.340 Table 5: Results on academic benchmarks and log-prob evaluations, comparing pure Attention, pure Mamba, and Attention-Mamba hybrid (no MoE). Models are 7B parameters, trained for 50B tokens. Figure 7 shows the training loss of the three architectures. While the pure Transformer and Mamba models have a similar convergence, the hybrid Jamba (no MoE) has a lower loss throughout this run. Figure 7: Training loss curves for pure Attention, pure Mamba, and an Attention-Mamba hybrid (no MoE). All models are 7B parameters. The hybrid achieves better loss throughout this training run. 10 6.2 Why does the Combination Work? The pure Mamba model showed fairly good results in most tasks early on, including in general perplexity evaluations. However, it performed substantially worse than the pure Attention model in three common benchmark tasks: IMDB [32], QuAC [6], and NarrativeQA [26]. In contrast, the hybrid Attention-Mamba performed similarly to the Attention model on these datasets. Table 6 shows the results for 1.3B models after 250B tokens. IMDB QuAC NarrativeQA Attention 84.1 27.9 45.8 Mamba 48.8 20.2 27.7 Attention-Mamba 90.9 26.6 43.7 Table 6: Mamba performs poorly on certain datasets, while the Attention-Mamba hybrid performs on par with the Attention model. Looking into these results further, we found out that the pure Mamba model often does not follow the correct format. For instance, in the IMDB dataset, answer choices are “Positive” or “Negative”. While the Attention model adheres to this format, the pure Mamba model often produces other answers, such as “Very Good”, “Very Positive”, “Funny”, “Bad”, “Poor”, and “3/10”. 
While these may be considered correct answers, the difficulty of Mamba to adhere to the format suggests a potential problem. Indeed, to perform successful in-context learning, it is important for models to capture the input-output format [34]. The hybrid Attention-Mamba model follows the format successfully, just like the pure Attention model. We hypothesize that this phenomenon points to a limitation of SSMs – a potential difficulty in in-context learning (ICL). Indeed, the ability to perform ICL has been linked to the emergence of socalled induction heads in Transformer language models during training, which perform approximate copying operations that are supportive of ICL [35]. We conjecture that the lack of an attention mechanism in the pure Mamba model makes it difficult for it to learn in-context. While Mamba may learn to copy and perform simple ICL when explicitly trained to do so ([17, 36], it is not clear if ICL is an emergent capability in SSM as is typical of Transformer models. In contrast, the hybrid Attention–Mamba model does perform successful ICL, even when only 1 out of 8 layers is an Attention one. As anecdotal evidence of an emergent induction mechanism, we visualize in Figure 8 the attention of an example head from a 1.3B Attention-Mamba hybrid model (no MoE), on an IMDB example where the pure Mamba failed and the hybrid succeeded. Clearly, the attention from the last token (“:”) is focused on the labels from the few-shot examples. We have found 12 such heads in our hybrid model, in all three attention layers (which correspond to layers 4, 12, 20 in the model). Future work can further investigate the emergence of ICL in hybrid models at large scale. Our released checkpoints would hopefully facilitate such investigations. Finally, very recent work has attempted to extract attention-like scores from state-space models like Mamba [1], which opens another direction to search for induction capabilities in state-space models. 6.3 The Effect of Mixture-of-Experts (MoE) Recent work has shown that MoE improves Transformer language models while keeping compute manageable [24].6 However, it is not clear if MoE integrates well with state-space models at a large scale, and specifically with our hybrid Attention–Mamba architecture. Indeed, Table 7 shows that MoE improves the performance of the hybrid Attention-Mamba architecture at large scale (7B parameters trained on 50B tokens). The MoE variant has n = 16 total experts, K = 2 experts used at each token, and MoE is applied every e = 2 layers, as described in Section 3.1. 6There is also initial evidence that MoE helps Mamba layers, albeit at small model and data scale [38]. 11 […] […] […] […] Figure 8: Example induction head (H3, first attention layer) from a hybrid Attention-Mamba model. Highlighted words reflect strong attention from the last token, “:”, just before the model is about to predict the label. We see that the attention is focused on label tokens from the few-shot examples. Hella Swag Wino Grande log-prob OLLM NQ C4 Books Code Jamba (no MoE) 36.6 62.5 58.8 15.4 -0.547 -0.658 -0.340 Jamba+MoE 38.1 66.0 61.2 18.9 -0.534 -0.645 -0.326 Table 7: Mixture-of-experts improves the Attention-Mamba hybrid. 6.4 Stabilizing Mamba at large scale When training Jamba models of up to 1.3B parameters, we observed stable training without special problems. However, when scaling to the largest model released here (7B-based, which has 12B/52B active/total parameters), we encountered large loss spikes. 
Investigating this revealed that inner parts of the Mamba layers suffer from large activation values, leading to the spikes. We therefore added RMSNorm [53] to internal activations. As Figure 9 shows, this stabilized training and prevented additional loss spikes. 6.5 Jamba does not Require Explicit Positional Information Table 8 shows results of the Jamba architecture (with MoE) with no positional information and when applying RoPE [47] in the attention layers (1.3B parameter models, 250B tokens). The results are similar, suggesting that explicit positional information may not be required for the hybrid architecture. Presumably, the Mamba layers, which are placed before attention layers, provide implicit position information.7 7 Some prior evidence suggested that Transformer decoder models do not need positional encodings [20]. However, all existing large scale models do use some sort of explicit position information. 12 Figure 9: Adding RMSNorm to Mamba layers prevents loss spikes. Hella Swag Wino Grande Narrative QA log-prob OLLM ARC-C NQ BoolQ C4 Books Code Jamba 39.6 71.5 64.2 40.7 50.5 22.2 68.9 -0.516 -0.623 -0.299 Jamba+RoPE 40.1 71.8 65.5 40.4 46.2 22.2 67.9 -0.516 -0.623 -0.299 Table 8: Comparison of Jamba with and without explicit positional information. 7 Conclusion We presented Jamba, a novel architecture which combines Attention and Mamba layers, with MoE modules, and an open implementation of it, reaching state-of-the-art performance and supporting long contexts. We showed how Jamba provides flexibility for balancing performance and memory requirements, while maintaining a high throughput. We experimented with several design choices such as the ratio of Attention-to-Mamba layers and discussed some discoveries made during the development process, which will inform future work on hybrid attention–state-space models. To facilitate such research, we plan to release model checkpoints from smaller-scale training runs. The largest model we provide with this release has 12B active and 52B total available parameters, supporting context lengths of up to 256K tokens and fitting in a single 80GB GPU even when processing 140K-token texts. References [1] Ameen Ali, Itamar Zimerman, and Lior Wolf. The hidden attention of mamba models. arXiv preprint arXiv:2403.01590, 2024. [2] Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023. [3] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. [4] Iñigo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli ˇ c. Efficient ´ intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45, 2020. [5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 13 [6] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, 2018. 
[7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1– 113, 2023. [8] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pages 4057–4086. PMLR, 2022. [9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019. [10] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. [11] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. [12] Hugging Face. Open LLM leaderboard. https://huggingface.co/spaces/ HuggingFaceH4/open_llm_leaderboard, 2024. [13] Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, and Mark J. F. Gales. Multi-head state space model for speech recognition. In Proceedings of INTERSPEECH 2023, pages 241–245, 2023. [14] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. [15] Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2022. [16] Philip Gage. A new algorithm for data compression. The C Users Journal, 12(2):23–38, 1994. [17] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. [18] Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021. [19] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021. [20] Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1382–1390, 2022. [21] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020. 14 [22] Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 
CUAD: An expert-annotated NLP dataset for legal contract review. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. [23] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023. [24] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. [25] Greg Kamradt. Needle in a haystack - pressure testing llms. https://github.com/ gkamradt/LLMTest_NeedleInAHaystack/, 2023. [26] Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. [27] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. [28] Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1311–1316, 2019. [29] Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002. [30] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. [31] Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. Benchmarking natural language understanding services for building conversational agents. In Increasing Naturalness and Flexibility in Spoken Dialogue Interaction: 10th International Workshop on Spoken Dialogue Systems, pages 165–183. Springer, 2021. [32] Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011. [33] Sabrina J Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y Lee, Benoît Sagot, et al. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. arXiv preprint arXiv:2112.10508, 2021. [34] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022. 
[35] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022. 15 [36] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can mamba learn how to learn? a comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024. [37] Jonathan Pilault, Mahan Fathi, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, and Ross Goroshin. Block-state transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. [38] Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, and Sebastian Jaszczur. MoE-Mamba: Efficient selective state space models with mixture of experts. arXiv preprint arXiv:2401.04081, 2024. [39] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023. [40] Michael Poli, Jue Wang, Stefano Massaroli, Jeffrey Quesnelle, Ryan Carlow, Eric Nguyen, and Armin Thomas. StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models. https://github.com/togethercomputer/stripedhyena, 2023. [41] Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. Parallel context windows for large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6383–6402, 2023. [42] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. [43] George Saon, Ankit Gupta, and Xiaodong Cui. Diagonal state space augmented transformers for speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. [44] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016. [45] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. [46] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. [47] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. [48] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging BIGBench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023. [49] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 
Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024. [50] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. [51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [52] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019. [53] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. [54] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022. [55] Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, and Jianfeng Gao. Efficient long sequence modeling via state space augmented transformer. arXiv preprint arXiv:2212.08136, 2022.
What is Mamba?
Cursor
Mamba is a state-space model (SSM) designed for efficient long-sequence modeling, and one of the most notable new sequence architectures of recent years. It was proposed by Albert Gu and Tri Dao in 2023 in the paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
1. Core ideas behind Mamba
- State-space models (SSMs): classical SSMs (e.g., linear dynamical systems) carry information through a hidden "state", which in principle can capture very long-range dependencies.
- Efficient long-sequence modeling: through its particular structure and optimizations, Mamba can be trained in parallel roughly as efficiently as a Transformer, while its inference compute and memory cost are far lower, which makes it well suited to very long sequences (100K+ tokens).
- Selective state spaces: Mamba adds a "selection" mechanism that dynamically decides which information gets remembered and propagated, improving both expressiveness and efficiency.
2. How Mamba differs from a Transformer
- A Transformer relies on self-attention: every token interacts with every other token, so compute and memory grow quadratically with sequence length, which becomes inefficient on long sequences.
- Mamba uses an SSM structure: at inference time, compute and memory grow linearly with sequence length, greatly improving long-sequence handling.
- Mamba can also be combined with Transformer layers (as in Jamba), balancing long-range modeling quality with efficient inference.
3. Advantages of Mamba
- Linear complexity: inference compute and memory grow linearly with sequence length, suitable for very long text, audio, DNA, and similar tasks.
- Parallel training: thanks to its design, Mamba trains in parallel like a Transformer, so training is efficient.
- Mixable with attention: Mamba layers and Transformer layers can be interleaved, as in Jamba, to balance quality and efficiency.
4. Typical applications
- Modeling long text, code, and documents
- Long-sequence tasks such as audio, speech, and genomic data
- Long-context support in large language models (LLMs)
5. In one sentence
Mamba is an efficient state-space sequence architecture that handles very long sequences with linear complexity, making it a strong complement to, and in some settings an alternative to, the Transformer.
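To make the "linear complexity" claim concrete, here is a toy JavaScript sketch (my own illustration, not Mamba's actual selective-scan kernel): a scalar state-space recurrence processes a length-T sequence in O(T) steps with a fixed-size state, whereas self-attention compares every pair of tokens, which costs O(T^2).
// Toy scalar state-space recurrence (illustration only, not the real Mamba layer).
// Each step updates one hidden state in O(1), so the whole sequence costs O(T).
function ssmScan(x, a, b, c) {
  let h = 0;            // hidden state: a decaying summary of everything seen so far
  return x.map((xt) => {
    h = a * h + b * xt; // state update
    return c * h;       // readout
  });
}

console.log(ssmScan([1, 2, 3, 4], 0.9, 1.0, 0.5)); // linear-time "filtering" of the input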
If you'd like more detail on how it works, code examples, or a deeper comparison with Transformers, feel free to ask!
User
Extract the text from this image.
Cursor
The text content of the code image is as follows:
const Store = {
  jwt: null,
  get loggedIn() {
    return this.jwt !== null
  }
}
const proxiedStore = new Proxy(Store, {
  set: (target, prop, value) => {
    if (prop=="jwt") {
      target[prop] = value;
      localStorage.setItem("jwt", value);
    }
  }
});
If you'd like an explanation of what the code does, just let me know!
User
Extract the text from this image.
Cursor
The code in the image is as follows:
const errors = [];
if (name.length < 4) errors.push("Enter your complete name");
if (password.length < 7) errors.push("Enter a password with at least 7 characters");
if (email.length < 4) errors.push("Enter your complete email");
if (password!=passwordConfirmation) errors.push("Passwords don't match");
if (errors.length==0) {
  const response = await API.register(name, email, password);
  if (response.success) {
    app.Store
    app.Router.go("/account/");
  } else {
    app.showError(response.message);
  }
} else {
  app.showError(errors.join(". "));
}
If you'd like an explanation of what the code does, just let me know!
User
How do I handle formdata on the backend in Go?
Cursor
When a Go backend needs to handle formdata sent from the frontend (usually multipart/form-data, e.g., file uploads or ordinary form fields), the common approaches are as follows:
1. Handling ordinary form fields
If the frontend sends plain key-value pairs via formdata (such as a username and password), the Go backend can handle them like this:
func handler(w http.ResponseWriter, r *http.Request) {
    // Parse the form (handles both multipart/form-data and application/x-www-form-urlencoded)
    err := r.ParseMultipartForm(32 << 20) // keep at most 32MB in memory
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    // Read form fields
    username := r.FormValue("username")
    password := r.FormValue("password")
    fmt.Fprintf(w, "username: %s, password: %s", username, password)
}
If it is a plain form with no files, r.ParseForm() also works, but ParseMultipartForm is more broadly compatible.
2. Handling file uploads
If the frontend uploads a file via formdata, the Go backend handles it like this:
func uploadHandler(w http.ResponseWriter, r *http.Request) {
    // Limit the maximum upload size (e.g., 10MB)
    r.ParseMultipartForm(10 << 20)
    // Get the uploaded file and its header
    file, handler, err := r.FormFile("file")
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    defer file.Close()
    // Read the file contents or save them locally
    f, err := os.Create("./uploads/" + handler.Filename)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer f.Close()
    io.Copy(f, file)
    fmt.Fprintf(w, "File uploaded successfully: %s", handler.Filename)
}
3. Combining form fields and files
You can read form fields and a file in the same handler:
func uploadHandler(w http.ResponseWriter, r *http.Request) {
    r.ParseMultipartForm(10 << 20)
    // Read a text field
    username := r.FormValue("username")
    // Read a file
    file, handler, err := r.FormFile("avatar")
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    defer file.Close()
    // ...save the file as shown above
    fmt.Fprintf(w, "User: %s, File: %s", username, handler.Filename)
}
4. Example route registration
http.HandleFunc("/upload", uploadHandler)
http.ListenAndServe(":8080", nil)
5. Things to watch out for
- The keys in the frontend formdata must match the field names the backend reads.
- For file uploads, the "file" in r.FormFile("file") must match the key used by the frontend, i.e., <input type="file" name="file"> or formdata.append("file", ...) (see the frontend sketch below).
- If you are only handling plain form data (no files), r.ParseForm() is enough.
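To make the key-matching point concrete, here is a minimal frontend sketch in JavaScript (an illustration only, not code from your project): the field names "username" and "file" and the /upload path mirror the Go snippets above, while the file-input selector and the example value are assumptions.
// Hypothetical frontend counterpart: each append() key must match what the Go
// handler reads via r.FormValue(...) or r.FormFile(...).
async function submitForm() {
  const fileInput = document.querySelector('input[type="file"]'); // assumes such an input exists
  const formData = new FormData();
  formData.append("username", "alice");        // read on the server via r.FormValue("username")
  formData.append("file", fileInput.files[0]); // read on the server via r.FormFile("file")
  const response = await fetch("/upload", {
    method: "POST",
    body: formData, // the browser sets the multipart/form-data Content-Type and boundary itself
  });
  console.log(await response.text());
}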
If you'd like a more detailed example (e.g., with a framework such as Gin or Echo) or have special requirements, just say so!
User
Extract the text from this image.
Cursor
The code in the image is as follows:
login: async (event) => {
  event.preventDefault();
  const email = document.getElementById("login-email").value;
  const password = document.getElementById("login-password").value;
  const errors = [];
  if (password.length < 7) errors.push("Enter a password with at least 7 characters");
  if (email.length < 4) errors.push("Enter your complete email");
  if (errors.length==0) {
    const response = await API.login(email, password);
    if (response.success) {
      app.Store.jwt = response.jwt;
      app.Router.go("/account/");
    } else {
      app.showError(response.message);
    }
  } else {
    app.showError(errors.join(". "));
  }
}
If you'd like an explanation of what the code does, just let me know!
User
Uncaught (in promise) TypeError: ‘set’ on proxy: trap returned falsish for property ‘jwt’ at Object.login (app.js:95:31) login @ app.js:95 await in login onsubmit @ login:140
What is this error?
Cursor
This error message:
Uncaught (in promise) TypeError: 'set' on proxy: trap returned falsish for property 'jwt'
at Object.login (app.js:95:31)
It means: you wrapped an object (such as Store) in a Proxy and defined a custom set method (trap), but your set trap does not return true, so the Proxy treats the property assignment as failed and throws this error.
Detailed explanation
In a JavaScript Proxy, the set trap must return a boolean (true means the assignment succeeded, false means it failed).
If you don't explicitly return true, the trap returns undefined by default (i.e., "falsish"), which produces this error.
Your code looks something like this:
const proxiedStore = new Proxy(Store, {
  set: (target, prop, value) => {
    if (prop == "jwt") {
      target[prop] = value;
      localStorage.setItem("jwt", value);
      // missing: return true
    }
    // also missing: return true/false here
  }
});
How to fix it
Add return true; at the end of the set trap:
const proxiedStore = new Proxy(Store, {
  set: (target, prop, value) => {
    if (prop == "jwt") {
      target[prop] = value;
      localStorage.setItem("jwt", value);
    } else {
      target[prop] = value;
    }
    return true; // must return true
  }
});
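As a quick sanity check (hypothetical usage, reusing the Store and proxiedStore names above with a placeholder value), an assignment through the proxy should now both update the object and persist the value:
proxiedStore.jwt = "example.token.value";   // placeholder value, not a real JWT
console.log(Store.jwt);                     // "example.token.value"
console.log(localStorage.getItem("jwt"));   // "example.token.value"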
Summary
- A Proxy's set trap must return true (success) or false (failure).
- Yours returns nothing, which defaults to undefined, hence the error.
- Adding return true; fixes it.
If you'd like further explanation or run into other errors, feel free to ask!
User
What do these metrics mean?
Cursor
This screenshot shows the Core Web Vitals, the key web performance metrics used to measure user experience. Here is what each one means:
1. Largest Contentful Paint (LCP)
- Meaning: the time it takes for the largest visible piece of content on the page (such as a hero image or main heading) to finish rendering.
- Your value: 1.35s (green, which is good)
- Thresholds: under 2.5s is good, 2.5-4s needs improvement, over 4s is poor.
- LCP element: img, meaning the largest content element is an image.
2. Cumulative Layout Shift (CLS)
- Meaning: cumulative layout shift, i.e., how much the page content "jumps around" during loading (for example, an image loads and pushes the text down).
- Your value: 0.19 (orange, needs improvement)
- Thresholds: under 0.1 is good, 0.1-0.25 needs improvement, over 0.25 is poor.
- Worst cluster: 2 shifts, meaning there were two noticeable layout shifts.
3. Interaction to Next Paint (INP)
- Meaning: the time from a user interaction (click, keystroke, etc.) until the page responds and completes its next visual update.
- Your value: 64ms (green, very fast)
- Thresholds: under 200ms is good, 200-500ms needs improvement, over 500ms is poor.
Summary
- LCP 1.35s: good, the main content loads quickly.
- CLS 0.19: a bit high; the page visibly shifts during load and needs optimization (e.g., reserve space for images, ads, and fonts while they load).
- INP 64ms: very good, the page responds quickly.
Optimization advice:
- Focus on CLS: reduce layout changes during loading, e.g., set explicit width/height for images and ads, and avoid layout jumps caused by asynchronously inserted content.
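If you want to watch these numbers yourself while trying out fixes, a rough JavaScript sketch like the one below logs LCP candidates and accumulates CLS in the browser console (for real measurement you would more likely use something like the web-vitals library; this snippet only uses standard PerformanceObserver APIs):
// Rough sketch: log LCP candidates and accumulate CLS in the console.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log("LCP candidate:", entry.startTime, entry.element);
  }
}).observe({ type: "largest-contentful-paint", buffered: true });

let clsScore = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (!entry.hadRecentInput) clsScore += entry.value; // ignore shifts caused by user input
  }
  console.log("CLS so far:", clsScore.toFixed(3));
}).observe({ type: "layout-shift", buffered: true });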
If you'd like concrete optimization techniques or a more detailed explanation, feel free to ask!
User
What does "Plumnber" mean?
Cursor
The word you're asking about, "Plumnber", is actually a typo; the correct word is Plumber.
What Plumber means
1. Dictionary meaning
- Plumber (noun): a worker who installs and repairs water pipes, drains, heating pipes, and so on.
2. Common usage
- He called a plumber to fix the leaking pipe.
3. In IT/programming
- In programming, Plumber is also the name of a few tools/libraries, for example:
- Plumber for R: a package that quickly turns R code into a web API.
- Other languages or frameworks may also have libraries called plumber, usually related to "pipes" or stream processing.
Summary
- "Plumnber" has no actual meaning; it should be Plumber.
- Plumber mainly means a pipe fitter, and in IT it is also a common tool/library name.
If you have more context (e.g., where you saw "Plumnber"), add it and I can give a more precise explanation!
User
PS C:\Users\tannal\tannalwork\projects\parley> cp .\slides-typst-meetup-24-lilaq\ .. cp : 具有指定名称 C:\Users\tannal\tannalwork\projects\slides-typst-meetup-24-lilaq 的项已存在。 所在位置 行:1 字符: 1
- cp .\slides-typst-meetup-24-lilaq\ ..
+ CategoryInfo : ResourceExists: (C:\Users\tannal...meetup-24-lilaq:String) [Copy-Item], IOException + FullyQualifiedErrorId : DirectoryExist,Microsoft.PowerShell.Commands.CopyItemCommand
PS C:\Users\tannal\tannalwork\projects\parley>
What's the right PowerShell command?
Cursor
The PowerShell error you hit:
cp : 具有指定名称 C:\Users\tannal\tannalwork\projects\slides-typst-meetup-24-lilaq 的项已存在。
means that the target directory already exists, so cp (Copy-Item) cannot copy over an existing directory with the same name.
How to fix it
1. Copy to a new directory name (with -Recurse)
If you want to copy the directory under a new name (e.g., slides-typst-meetup-24-lilaq-copy):
cp .\slides-typst-meetup-24-lilaq ..\slides-typst-meetup-24-lilaq-copy -Recurse
2. Overwrite the existing directory (use with care!)
PowerShell's cp / Copy-Item does not overwrite an existing directory by default. If you really do want to overwrite it, delete the target directory first:
Remove-Item ..\slides-typst-meetup-24-lilaq -Recurse -Force
cp .\slides-typst-meetup-24-lilaq ..\ -Recurse
⚠️ Note: Remove-Item deletes the target directory and everything in it, so proceed carefully!
3. Copy only the contents into the existing directory
If you just want to merge the contents into the same-named directory one level up:
cp .\slides-typst-meetup-24-lilaq\* ..\slides-typst-meetup-24-lilaq\ -Recurse
This copies the contents into it, rather than the directory itself.
Summary
- Cause of the error: the target directory already exists, and PowerShell will not overwrite it directly.
- Fixes: copy under a new name, delete the target first and then copy, or copy only the contents.
If you need a more specific command or have other requirements, let me know!
PS C:\Users\tannal\tannalwork\projects\slides-typst-meetup-24-lilaq> git submodule update --init fatal: Unable to find current revision in submodule path 'lilaq' PS C:\Users\tannal\tannalwork\projects\slides-typst-meetup-24-lilaq>
Can you take a look at this?
Cursor
The error you're seeing:
fatal: Unable to find current revision in submodule path 'lilaq'
means that when Git tried to initialize or update the submodule lilaq, it could not find the commit (revision) it is supposed to check out. The usual causes are:
Common causes and fixes
1. The main repo's .gitmodules or the lilaq entry points at a commit that doesn't exist
- The main repo's submodule entry may point to a commit in the lilaq repo that was deleted or was never pushed.
- Fix:
- Go into the lilaq directory and run git fetch to make sure all remote branches and commits are present.
- Go back to the main repo and run again:
git submodule update --init --recursive
2. Your local submodule directory is dirty or corrupted
- An earlier operation may have left the lilaq directory incomplete or in conflict.
- Fix:
- Back up anything you need, then delete the submodule directory:
rm -rf lilaq
- Then re-initialize the submodule:
git submodule update --init --recursive
3. A problem with the remote submodule repository
- For example, the submodule's remote repo was deleted, or there are permission or network issues.
- Fix:
- Check that the URL in .gitmodules is correct; try cloning it manually.
- If it's a permissions issue, check your SSH key or HTTPS credentials.
4. The main repo points to a submodule commit that was never pushed
- For example, the maintainer updated the submodule locally but never pushed the submodule's new commit to its remote.
- Fix:
- Ask the main repo's maintainer to push the latest commit of the submodule repository.
Commands to check
You can diagnose with the following commands:
# Inspect the .gitmodules file
cat .gitmodules
# See which commit the submodule is checked out at
cat .git/modules/lilaq/HEAD
# Go into the submodule and check its log
cd lilaq
git log --oneline
Summary
- This error most likely means the main repo points at a submodule commit that exists locally but not on the remote.
- The most common fix is to have the submodule repo's maintainer push the latest commit, or to run git fetch yourself so that all commits are available.
- If you just want to skip the submodule, you can temporarily remove the related entries from .gitmodules, but that's a workaround, not a real fix.
If you want to dig further, paste the contents of .gitmodules and the output of git status and git log, and I can help you analyze it more precisely!
User
What's the difference between running go get with and without -u?
Cursor
Your question: what is the difference between go get with and without -u?
1. Without -u
go get example.com/some/module
- What it does: adds (or updates) the named module in the current project's go.mod; it only fetches the package you named and the versions its dependencies currently require.
- Dependency upgrades: it does not automatically upgrade your other existing dependencies (it only downloads missing ones).
2. With -u
go get -u example.com/some/module
- What it does: adds (or updates) the named module and recursively upgrades it and all of its dependencies to their latest minor or patch versions (never crossing a major version).
- Dependency upgrades: everything in the dependency tree that can be upgraded is upgraded (without changing major versions).
3. With -u=patch
go get -u=patch example.com/some/module
- What it does: upgrades dependencies only to their latest patch versions (minor versions are not bumped).
4. Comparison
Command | Dependency upgrade behavior
---|---
go get xxx | Fetches/updates only the named package; existing dependencies are left alone
go get -u xxx | Upgrades the named package and all of its dependencies to the latest minor/patch versions
go get -u=patch xxx | Upgrades only to the latest patch versions
5. Notes
- From Go 1.17 on, the recommendation is to use go install for installing executables and go get for managing dependencies.
- From Go 1.18 on, go get no longer installs executables at all; it is only for dependency management.
6. Example
Suppose your go.mod already contains github.com/gin-gonic/gin v1.7.0 and you run:
go get github.com/gin-gonic/gin
This updates gin itself (to the latest v1.x) but leaves your other existing dependencies alone; to stay on v1.7.0 you would pin it explicitly with go get github.com/gin-gonic/gin@v1.7.0.
go get -u github.com/gin-gonic/gin
This upgrades gin to the latest v1.x (e.g., v1.9.1) and also recursively upgrades its dependencies to their latest minor/patch versions.
If you'd like more detail on version management or run into a specific dependency conflict, feel free to ask!
User
authorization Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJlbWFpbCI6IjI5MzQ0ODIxMzNAcXEuY29tIiwiZXhwIjoxNzUwNDg5NTcyLCJpZCI6MCwibmFtZSI6IiJ9.3P5LDZX7amz74G6sVGaMhd7OpEEC6aufRCUDSG2Av1w
What does this Bearer mean?
Cursor
You asked what Bearer means; here is a detailed explanation:
1. What Bearer means
- Bearer is an HTTP authentication scheme, commonly used to authenticate API requests.
- It stands for a bearer token: whoever "bears" (holds) the token is treated as authorized to access the resource.
2. Typical usage
In an HTTP request header it is usually written as:
Authorization: Bearer <token>
For example, the one you pasted:
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJlbWFpbCI6IjI5MzQ0ODIxMzNAcXEuY29tIiwiZXhwIjoxNzUwNDg5NTcyLCJpZCI6MCwibmFtZSI6IiJ9.3P5LDZX7amz74G6sVGaMhd7OpEEC6aufRCUDSG2Av1w
3. What is this token?
- Your token is a JWT (JSON Web Token), a very common bearer-token format.
- A JWT typically carries user information, an expiry time, and so on, which the server uses to verify the user's identity (see the decoding sketch below).
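As an illustration only, here is a rough browser-side JavaScript sketch that decodes a JWT's payload so you can look at its claims; it performs no signature verification, and the jwt variable is assumed to hold the string that follows "Bearer " in the header:
// Sketch: peek at a JWT's claims in the browser console. This does NOT verify the
// signature; real validation happens on the server with the signing key.
function decodeJwtPayload(token) {
  const payloadPart = token.split(".")[1]; // a JWT is <header>.<payload>.<signature>
  const base64 = payloadPart.replace(/-/g, "+").replace(/_/g, "/"); // base64url -> base64
  return JSON.parse(atob(base64));
}

const claims = decodeJwtPayload(jwt);     // `jwt` is assumed to hold the token string
console.log(claims);                      // e.g. an object with email, exp, id, name claims
console.log(new Date(claims.exp * 1000)); // exp is a UNIX timestamp in seconds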
4. Characteristics of bearer tokens
- Stateless: the server only needs to validate the token; it doesn't have to store a session.
- Security: anyone who obtains the token can access the resource, so it must be sent over HTTPS to prevent theft.
- Common in: OAuth 2.0, RESTful APIs, projects with separate frontends and backends, and so on.
5. Summary
- Bearer denotes the "bearer token" authentication scheme.
- The string that follows is the token (in your case a JWT), which proves your identity.
- Whoever holds the token can call the endpoints that require authentication.
If you'd like to know more about JWT structure, how to generate and verify them, or the security caveats, feel free to ask!
User
Enabling Language Models to Fill in the Blanks Chris Donahue Stanford University Mina Lee Stanford University {cdonahue,minalee,pliang}@cs.stanford.edu Percy Liang Stanford University Abstract We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document. While infilling could enable rich functionality especially for writing assistance tools, more attention has been devoted to language modeling—a special case of infilling where text is predicted at the end of a document. In this paper, we aim to extend the capabilities of language models (LMs) to the more general task of infilling. To this end, we train (or fine-tune) off-the-shelf LMs on sequences containing the concatenation of artificially-masked text and the text which was masked. We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics. Furthermore, we show that humans have difficulty identifying sentences infilled by our approach as machinegenerated in the domain of short stories. 1 Introduction Text infilling is the task of predicting missing spans of text which are consistent with the preceding and subsequent text.1 Systems capable of infilling have the potential to enable rich applications such as assisting humans in editing or revising text (Shih et al., 2019), connecting fragmented ideas (AI21, 2019), and restoring ancient documents (Assael et al., 2019). Rather than targeting a particular application, our goal here is to provide a general, flexible, and simple infilling framework which can convincingly infill in a variety of domains. A special case of infilling is language modeling: predicting text given preceding but not subsequent text.2 Language models are (1) capable of generat1Text infilling is a generalization of the cloze task (Taylor, 1953)—cloze historically refers to infilling individual words. 2 In this paper, language modeling always refers to ordinary LMs, i.e., “unidirectional,” “autoregressive,” or “left-to-right.” She ate leftover pasta for lunch. She ate [blank] for [blank]. leftover pasta [answer] lunch [answer] Data Input Target Our Infilling Framework She ate [blank] for [blank]. She ate leftover pasta for lunch. Infilling Task Input Output Train Language Model Infilling Input [sep] Target Data Input [sep] Target Output Figure 1: We consider the task of infilling, which takes incomplete text as input and outputs completed text. To tackle this task, our framework constructs training examples by masking random spans to generate pairs of inputs (text with blanks) and targets (answers for each blank). We then train unidirectional language models on the concatenation of each pair. Once trained, a model takes text input with blanks, predicts the answers, and then combines them to produce the output. ing remarkably coherent text (Zellers et al., 2019; See et al., 2019), (2) efficient at generating text, and (3) conceptually simple, but cannot infill effectively as they can only leverage context in a single direction (usually the past). On the other hand, strategies such as BERT (Devlin et al., 2019) and SpanBERT (Joshi et al., 2019) are able to infill using both preceding and subsequent text. However, their use of bidirectional attention limits their infilling capabilities to fixed-length spans. This is problematic as—for many applications—we may not know the length of a missing span a priori. Zhu et al. 
(2019) propose a method capable of infilling variable-length spans, but it uses a specialized architecture and hence cannot easily leverage large-scale pre-trained models. In this work, we present infilling by language modeling (ILM), a simple framework which enarXiv:2005.05339v2 [cs.CL] 10 Sep 2020 ables LMs to infill variable-length spans while preserving their aforementioned benefits: generation quality, efficient sampling, and conceptual simplicity. Our framework involves a straightforward formulation of the infilling task which, as we demonstrate, can be learned effectively by existing LM architectures. As shown in Figure 1, our approach concatenates artificially-masked text with the text which was masked, and adopts a standard LM training (or fine-tuning) procedure on such examples. Once trained, infilling can be performed for a document with blanks by using the LM to generate text and then replacing the blanks with this text. In addition to its conceptual simplicity, our experiments show that ILM enables off-the-shelf LMs to infill effectively. Furthermore, we find that infilling performance improves when starting from a large-scale pre-trained LM (as opposed to training from scratch), suggesting an additional benefit of using our model-agnostic framework compared to approaches which require specialized architectures. We provide an interactive web demo of models trained under our framework. This demo can infill multiple variable-length spans with different granularities (e.g. words, n-grams, and sentences) on the domains of short stories, scientific abstracts, and song lyrics: https://chrisdonahue.com/ilm. All code, data, and trained models are available at https://github.com/chrisdonahue/ilm and also on the CodaLab platform at https: //worksheets.codalab.org/worksheets/ 0x9987b5d9cce74cf4b2a5f84b54ee447b. 2 Problem Statement The task of infilling is to take incomplete text x˜, containing one or more missing spans, and return completed text x. Let [blank] be a placeholder for a contiguous sequence (span) of one or more missing tokens. Then, incomplete text x˜ is a sequence of tokens some of which are [blank]. In order to map x˜ to x, an infilling strategy must specify both how many and which tokens to generate for each [blank]. Note that there may be many reasonable x for a given x˜. Hence, we are interested in learning a distribution p(x | x˜). 3 Infilling by Language Modeling In this section, we describe our ILM framework. We first outline a simple reparametrization of the infilling task. Then, we define a procedure for automatically generating suitable training examples which can be fed to an off-the-shelf LM. 3.1 Formulation Fedus et al. (2018) explore an infilling framework where LMs are trained on concatenations of x˜ and x, i.e., they use LMs to directly predict x given x˜. While their approach is effective at infilling individual words, it is somewhat redundant as the model must “predict” the unmasked text in x˜. Additionally, a model is not guaranteed to exactly reproduce the unmasked text. Instead, we make the trivial observation that it suffices to predict only the missing spans y which will replace the [blank] tokens in x˜. We can then construct x by simply replacing [blank] tokens in x˜ with predicted spans y in a deterministic fashion. In order to handle multiple variable-length spans, we pose y as the concatenation of all missing spans separated by special [answer] tokens (one [answer] per [blank]) (Figure 1). 
We can thus cast infilling as learning p(y | x˜) without loss of generality. 3.2 Training Given a corpus consisting of complete text examples, our framework first manufactures infilling examples and then trains an LM on these examples. To produce an infilling example for a given x, we first sample an x˜ from a stochastic function Mask(x) which randomly replaces some number of spans in x with [blank] tokens. Then, we concatenate together the spans which were replaced— separated by [answer] tokens—to form a training target y. Finally, we construct the complete infilling example by concatenating x˜, [sep], and y (see Figure 2 for a complete example). We train (or fine-tune) LMs on these infilling examples using standard LM training methodology, yielding models of the form pθ(y | x˜). Specifically, we train GPT-2 (Radford et al., 2019) off the shelf, but any LM can potentially be used. This framework has several advantages. First, it incurs almost no computational overhead compared to language modeling. Specifically, if there are k missing spans in x˜, the concatenation of x˜ and y contains only 2k+ 1 more tokens than x (one [blank] and one [answer] per missing span plus one [sep]). As k is usually small (averaging around 2 per example in our experiments), sequence lengths remain similar to those encountered for the same x during language modeling. In contrast, using LMs to directly predict x from x˜ as in Fedus et al. (2018) effectively doubles the sequence length of x. This is particularly problematic when considering models like GPT-2 whose memory usage grows quadratically with sequence length. Second, our framework requires minimal change (three additional tokens) to an existing LM’s vocabulary. Finally, because the entirety of x˜ is in the “past” when predicting y, the ILM framework combines the ability to attend to incorporate context on both sides of a blank with the simplicity of decoding from LMs. 4 Experimental Setup We design our experiments to determine if training an off-the-shelf LM architecture with our ILM framework can produce effective infilling models for a variety of datasets. Specifically, we train on three datasets of different sizes and semantics (details in Appendix A): short STORIES (Mostafazadeh et al., 2016), CS paper ABSTRACTS, and song LYRICS. 4.1 Mask Function A benefit of the ILM framework is that it can be trained to infill spans corrupted by arbitrary mask functions. Here, we explore a mask function which simultaneously trains models to infill different granularities of text; specifically, words, n-grams, sentences, paragraphs, and documents. By using a unique special token per granularity (e.g. [blank word]), this mask function offers users coarse but intuitive control over the length of the spans to be infilled. We configure our mask function to mask each token in a given document with around 15% probability, echoing the configuration of Devlin et al. (2019). However, instead of masking individual tokens uniformly at random, we perform a preorder traversal of the granularity hierarchy tree, randomly masking entire subtrees with 3% probability. For the datasets we consider, this results in a marginal token mask rate of about 15% (details in Appendix B). While we train to infill several different granularities, we primarily evaluate and discuss the ability of our models to infill sentences for brevity. Quantitative results of our models on other granularities can be found in Appendix D, and granularity functionality can also be explored in our web demo. 
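The training construction in Section 3.2 is simple enough to sketch in a few lines. The following is a minimal illustration rather than the authors' released code: it assumes whitespace tokenization and a toy mask function, and the helper names (mask_spans, build_ilm_example) are invented for this sketch; only the [blank], [answer], and [sep] token conventions come from the paper.

```python
import random

# Special tokens from the ILM formulation: [blank] marks a masked span in the
# input x~, [answer] terminates each target span in y, [sep] separates x~ from y.
BLANK, ANSWER, SEP = "[blank]", "[answer]", "[sep]"

def mask_spans(words, n_spans=1, max_len=3):
    """Toy Mask(x): replace `n_spans` random word spans with a single [blank].
    (Overlapping spans for n_spans > 1 are not handled in this sketch.)"""
    words = list(words)
    answers = []
    for _ in range(n_spans):
        length = random.randint(1, max_len)
        start = random.randrange(0, max(1, len(words) - length))
        answers.append(" ".join(words[start:start + length]))
        words[start:start + length] = [BLANK]
    return words, answers

def build_ilm_example(text):
    """Concatenate the masked text x~, [sep], and the target y for LM training."""
    masked_words, answers = mask_spans(text.split())
    x_tilde = " ".join(masked_words)
    y = " ".join(answer + " " + ANSWER for answer in answers)
    return x_tilde + " " + SEP + " " + y

random.seed(0)
print(build_ilm_example("She ate leftover pasta for lunch ."))
# prints something like: She ate [blank] for lunch . [sep] leftover pasta [answer]
```

Feeding such concatenated strings to any left-to-right LM with its standard next-token objective is the only change the framework asks for; no architectural modification is involved.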
4.2 Task and Model Configurations
For all experiments, we train the same architecture (GPT-2 "small") using the same hyperparameters (Appendix C) while varying the infilling strategy and dataset. In addition to our proposed ILM strategy for infilling, we consider three baseline strategies: (1) language modeling (LM; "infilling" based only on past context), (2) reverse language modeling (LM-Rev; "infilling" based only on future context), and (3) language modeling based on all available context (LM-All). LM-All simply concatenates x and x̃ together as in Fedus et al. (2018). LM-All represents arguably the simplest way one could conceive of infilling with LMs, but results in long sequence lengths. Training examples for all strategies are depicted in Figure 2.
[Figure 2: Training examples for three baseline infilling strategies and ILM on a given artificially-masked sentence.
  Data:   She ate leftover pasta for lunch.
  Masked: She ate [blank] for [blank].
  LM:     She ate leftover pasta for lunch. [end]
  LM-Rev: .lunch for leftover pasta ate She [end]
  LM-All: She ate [blank] for [blank]. She ate leftover pasta for lunch. [end]
  ILM:    She ate [blank] for [blank]. [sep] leftover pasta [answer] lunch [answer]
For each strategy, we train the same architecture (GPT-2) on such examples. At both training and test time, examples are fed from left to right; anything to the left of a green target is available to the model as context when predicting the target. Precisely, LM only considers past context, and LM-Rev only considers future. LM-All considers all available context but uses long sequence lengths. Our proposed ILM considers all context while using fewer tokens.]
For each strategy, we also vary whether training is initialized from the pre-trained GPT-2 model or from scratch. Despite discrepancies between the pre-training and our fine-tuning for most infilling strategies, all of the infilling experiments initialized from the pre-trained checkpoint performed better than their from-scratch counterparts. This indicates that ILM can effectively leverage large-scale language modeling pre-training to improve infilling performance. Henceforth, we will only discuss the models initialized from the pre-trained checkpoint, though we report quantitative performance for all models in Appendix D.
For the models trained on STORIES and ABSTRACTS, we trained models to convergence using early stopping based on the validation set perplexity (PPL) of each model computed only on the masked tokens. These models took about a day to reach their early stopping criteria on a single GPU. For the larger LYRICS dataset, we trained models for 2 epochs (about two days on a single GPU).
[Table 1: Quantitative evaluation results. We report test set perplexity (PPL) on the sentence infilling task for different model configurations on all three datasets, as well as average length of all test set examples in tokens relative to that of the original sequence (lower is better for all columns). Our proposed ILM framework achieves better PPL than both LM and LM-Rev, implying that it is able to take advantage of both past and future context. ILM achieves similar PPL to LM-All with shorter sequence lengths (hence less memory).]
          STO    ABS    LYR    Length
  LM      18.3   27.9   27.7   1.00
  LM-Rev  27.1   46.5   34.3   1.00
  LM-All  15.6   22.3   21.4   1.81
  ILM     15.6   22.4   22.6   1.01
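To complement the training-time picture above, the sketch below illustrates test-time infilling under this formulation. It is an illustration rather than the authors' implementation: generate_token stands in for sampling one token from a trained ILM model; decoding stops once one [answer] has been produced per [blank], and the answers are then spliced back into the blanks deterministically.

```python
BLANK, ANSWER, SEP = "[blank]", "[answer]", "[sep]"

def infill(x_tilde, generate_token, max_steps=256):
    """Decode one answer per [blank] in x_tilde, then splice the answers back in.

    `generate_token(prefix)` is assumed to return the next token (a string)
    sampled from a trained ILM model given the text generated so far.
    """
    words = x_tilde.split()
    n_blanks = words.count(BLANK)
    prefix = x_tilde + " " + SEP
    answers, current = [], []
    for _ in range(max_steps):
        if len(answers) == n_blanks:
            break
        token = generate_token(prefix)
        prefix += " " + token
        if token == ANSWER:                 # end of one infilled span
            answers.append(" ".join(current))
            current = []
        else:
            current.append(token)
    # Deterministically replace each [blank] with its generated span (Section 3.1).
    filled = iter(answers)
    return " ".join(next(filled) if w == BLANK else w for w in words)

# Toy stand-in for a trained model: emits a fixed answer sequence.
canned = iter("leftover pasta [answer] lunch [answer]".split())
print(infill("She ate [blank] for [blank] .", lambda prefix: next(canned)))
# -> She ate leftover pasta for lunch .
```

Because the infilled text is simply appended to the prefix, no part of the input needs to be reprocessed per blank, which is the efficiency point the paper makes relative to iterative approaches.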
5 Quantitative Evaluation
We evaluate the quantitative performance of our models on the sentence infilling task by measuring PPL on test data.[3] In this setting, a sentence is selected at random and masked out, and we measure the likelihood assigned by a model to the masked sentence in the context of the rest of the document. Regardless of differences in the ordering and number of tokens that each strategy uses to represent a test example, PPL is always computed only for the span of tokens comprising the original sentence (e.g. green tokens in Figure 2).
Table 1 shows that across all datasets, ILM outperforms models which see only past or future context (LM and LM-Rev respectively), implying that our proposed framework is able to take advantage of bidirectional context despite using unidirectional models. Additionally, while one might expect LM-All to outperform ILM because its training examples more closely "resemble" those of standard LMs, ILM achieves similar performance to LM-All. This indicates that GPT-2 is able to effectively learn the "syntax" of ILM examples and achieve reasonable infilling performance with shorter sequences (and hence with much less memory usage). We also observe that models trained via ILM perform similarly on the special case of language modeling compared to the models which were trained only on language modeling (Appendix D.1). This suggests that ILM does not just repurpose LMs to infill, but rather extends their capabilities while maintaining their original functionality.
[3] Overlap-based metrics such as BLEU score (Papineni et al., 2002) are not appropriate for evaluating infilling as there are many realistic infills that have no word-level overlap with the original, e.g., "a sandwich" instead of "leftover pasta."
6 Human Evaluation
In addition to our quantitative evaluation, we seek to evaluate the qualitative performance of ILM. To this end, we sample a story from the STORIES test set and randomly replace one of its five human-written sentences with a model output. Then, we task human annotators on Amazon Mechanical Turk with identifying which of the sentences in a story was machine-generated (details in Appendix E). We compare our ILM model to three baseline infilling strategies: an LM (context beyond the replaced sentence was discarded), the best model (self-attention; SA) from Zhu et al. (2019), and the pre-trained BERT (base) model (Devlin et al., 2019). All approaches except for BERT were first fine-tuned on the STORIES dataset. To infill using BERT, we replace the tokens representing the original sentence with mask tokens, and then generate text by replacing mask tokens one at a time (conditioning on previously-generated tokens). While vocabulary differences make it less useful to compare PPL for the SA and BERT baselines to our GPT-2-based strategies, we can still meaningfully compare them in this human evaluation setting.
For each approach we compute a score, which we define as the percentage of examples where the annotator did not correctly identify the machine-generated sentence. Therefore, a higher score implies a better (more natural, human-like) model. We collect 100 responses for each model and report the scores in Table 2, with qualitative examples in Figure 3 and Appendix E. Of the four strategies, ILM achieves the highest score, implying that sentences infilled by ILM are harder for humans to recognize as fake than those produced by other strategies.
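(Returning briefly to the quantitative evaluation in Section 5: PPL there is computed only over the tokens of the original masked sentence, never over the context. One common way to restrict a next-token loss to a span is to set the labels of all other positions to an ignore index. The snippet below is a minimal sketch of that idea in PyTorch, not the authors' evaluation code; the usual shift between logits and labels for next-token prediction is omitted for brevity.)

```python
import torch
import torch.nn.functional as F

def span_perplexity(logits, labels, ignore_index=-100):
    """Perplexity over only the positions whose label is not `ignore_index`.

    logits: (seq_len, vocab_size) scores from the LM for each position.
    labels: (seq_len,) token ids, with every position outside the masked
            span set to `ignore_index` so it contributes nothing to the loss.
    """
    nll = F.cross_entropy(logits, labels, ignore_index=ignore_index)  # mean NLL over span tokens
    return torch.exp(nll)

# Toy example: a 6-token sequence where only positions 2-4 form the masked sentence.
logits = torch.randn(6, 50257)                       # e.g. GPT-2's vocabulary size
labels = torch.tensor([-100, -100, 314, 466, 588, -100])
print(span_perplexity(logits, labels).item())
```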
Somewhat surprisingly, we observed that despite only observing past context, the LM model performed better than BERT and SA. BERT may have performed poorly due to the intrinsic difficulty of finding convincing infills with a precise length in tokens. SA may have performed poorly because, unlike LM and ILM, it was not initialized from a large-scale pre-trained LM.
[Table 2: Human evaluation results. We use BERT (Devlin et al., 2019), the best model from Zhu et al. (2019) (SA), and our LM and ILM models to replace random sentences in five-sentence stories from the STORIES test set. Then, we task humans with identifying which sentence of the five was generated by a machine. We report the score of each model: the percentage of infilled stories where the human failed to identify the machine-generated sentence. Our ILM model achieves a higher score than all of the other models. Note that the max score is effectively 80%, as a perfect model would cause annotators to randomly choose one of the five sentences.]
            BERT   SA   LM   ILM
  Score (%)  20    29   41   45
[Figure 3: Example of a short story in our STORIES dataset with its third sentence masked, and sentences infilled by different models.
  Story: Patty was excited about having her friends over. She had been working hard preparing the food. [blank] All of her friends arrived and were seated at the table. Patty had a great time with her friends.
  BERT:  favoritea ", Mary brightly said.
  SA:    She wasn't sure she had to go to the store.
  LM:    She went to check the tv.
  ILM:   Patty knew her friends wanted pizza.
  Human: She also had the place looking spotless.
The sentences generated by the BERT and SA models are off-topic, the sentence generated by the LM model is irrelevant to the future context, while the ones generated by ILM and Human successfully account for both previous and future context.]
7 Related Work
Methodology. A number of systems have the capability to infill but have practical drawbacks. Many systems are unable to automatically determine span length, and thus, can only infill fixed-length spans (Fedus et al., 2018; Devlin et al., 2019; Yang et al., 2019; Joshi et al., 2019; Gu et al., 2019; Liu et al., 2019). Methods such as BERT present additional challenges during inference (Wang and Cho, 2019). Rudinger et al. (2015) frame narrative cloze as a generation task and employ language models, but they only consider one infill of a fixed length. Zhu et al. (2019); Shen et al. (2020) infill multiple variable-length sequences, but these approaches require the masked context to be iteratively updated and reprocessed to fill in blanks one at a time. In contrast, our approach appends infilled text to the context and does not require reprocessing the entire input sequence for each blank. AI21 (2019) train an LM which can fill in the middle of a paragraph given the first and last sentences—our work generalizes to such capabilities.
Task. The cloze task (Taylor, 1953) evaluates language proficiency by asking systems to fill in randomly-deleted words by examining context. Cloze has been extended in the forms of discourse cloze (Deyes, 1984) and narrative cloze (Chambers and Jurafsky, 2008), which remove phrases and narrative events respectively. Recently, cloze has been used not only for evaluation, but also to improve text generation quality (Fedus et al., 2018) and transfer learning (Devlin et al., 2019) (under the name "masked language modeling"). Text infilling can be thought of as generalizing the cloze task from single words to spans of unknown length.
Raffel et al. (2019) explore infilling as a pre-training objective to improve downstream performance on inference tasks; our work focuses on generation. Story generation. Recent work seeks to generate stories given a title and storyline (Yao et al., 2019), entities (Clark et al., 2018), premise (Fan et al., 2018), or surrounding context and rare words (Ippolito et al., 2019). Our work differs in that we aim to build systems capable of making predictions based only on text context, rather than aspects specific to stories (e.g. storyline). 8 Conclusion We presented a simple strategy for the task of infilling which leverages language models. Our approach is capable of infilling sentences which humans have difficulty recognizing as machinegenerated. Furthermore, we demonstrated that our infilling framework is effective when starting from large-scale pre-trained LMs, which may be useful in limited data settings. In future work, we plan to incorporate these features into co-creation systems which assist humans in the writing process. We hope that our work encourages more investigation of infilling, which may be a key missing element of current writing assistance tools. Acknowledgments This work was funded by DARPA CwC under ARO prime contract no. W911NF-15-1-0462. We thank all reviewers for their helpful comments. References AI21. 2019. HAIM: A modest step towards controllable text generation. AI21 Labs Blog. Yannis Assael, Thea Sommerschield, and Jonathan Prag. 2019. Restoring ancient text using deep learning: a case study on greek epigraphy. arXiv:1910.06262. N. Chambers and D. Jurafsky. 2008. Unsupervised learning of narrative event chains. In Human Language Technology and Association for Computational Linguistics (HLT/ACL). Elizabeth Clark, Yangfeng Ji, and Noah A Smith. 2018. Neural text generation in stories using entity representations as context. In Association for Computational Linguistics: Human Language Technologies. J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Association for Computational Linguistics (ACL), pages 4171– 4186. T. Deyes. 1984. Towards an authentic ‘discourse cloze’. Applied Linguistics, 5(2):128–137. A. Fan, M. Lewis, and Y. Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833. W. Fedus, I. Goodfellow, and A. M. Dai. 2018. Maskgan: Better text generation via filling in the. In International Conference on Learning Representations (ICLR). J. Gu, Q. Liu, and K. Cho. 2019. Insertion-based decoding with automatically inferred generation order. arXiv preprint arXiv:1902.01370. D. Ippolito, D. Grangier, C. Callison-Burch, and D. Eck. 2019. Unsupervised hierarchical story infilling. In NAACL Workshop on Narrative Understanding, pages 37–43. M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. 2019. SpanBERT: Improving pretraining by representing and predicting spans. arXiv preprint arXiv:1907.10529. D. Liu, J. Fu, P. Liu, and J. Lv. 2019. TIGS: An inference algorithm for text infilling with gradient search. arXiv preprint arXiv:1905.10752. N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In North American Association for Computational Linguistics (NAACL). Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL. A. Radford, J. 
Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8). C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. R. Rudinger, P. Rastogi, F. Ferraro, and B. V. Durme. 2015. Script induction as language modeling. In Empirical Methods in Natural Language Processing (EMNLP). Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D Manning. 2019. Do massively pretrained language models make better storytellers? arXiv:1909.10705. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Tianxiao Shen, Victor Quach, Regina Barzilay, and Tommi Jaakkola. 2020. Blank language models. arXiv:2002.03079. Y. Shih, W. Chang, and Y. Yang. 2019. XL-Editor: Post-editing sentences with xlnet. arXiv preprint arXiv:1910.10479. W. L. Taylor. 1953. “Cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433. A. Wang and K. Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. 2019. HuggingFace’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. L. Yao, N. Peng, R. Weischedel, K. Knight, D. Zhao, and R. Yan. 2019. Plan-and-write: Towards better automatic storytelling. In Association for the Advancement of Artificial Intelligence (AAAI). Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In NeurIPS. W. Zhu, Z. Hu, and E. Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158. A Datasets
- STORIES (100K examples, 5M words) Short stories from the ROCStories dataset (Mostafazadeh et al., 2016). Each story contains a title and five sentences.
- ABSTRACTS (200K examples, 30M words) Abstracts from CS papers on arXiv
- LYRICS (2M examples, 60M words) Song lyrics from lyrics.com We experimented on multiple datasets to demonstrate that our framework was not custom tailored to a single domain. On the STORIES and ABSTRACTS datasets, we include metadata (story title, paper subject matter, etc.), as the first “paragraph” of the document. By providing these paragraphs (Appendix B), our infilling model implicitly learns to summarize (e.g. infill a title given a story), and do conditional generation (e.g. infill a story given a title). On the LYRICS dataset, infilling models may be especially helpful to humans; external aid in the form of rhyming dictionaries is already commonly employed in this domain. To ensure that all experiments were trained on the same data, we removed infilling examples which would have exceeded our training sequence length of 256 tokens for the model with the longest sequence length (LM-All). This removed no examples from STORIES, a small fraction of examples from LYRICS, and a substantial number of examples from ABSTRACTS. B Masking function We design a mask function which takes the entire document and selectively masks several span granularities: words, n-grams, sentences, paragraphs, and entire documents. Accordingly, models trained via ILM on this masking function offer users the ability to specify the granularity of text to infill at a particular location. This allows users to have coarse but intuitive control over infilling length, so that multiple paragraphs are not generated when the user was expecting a single word. Our masking function first constructs a tree of the training example (using the natural hierarchy of documents, paragraphs, sentences, and words). Then, using a pre-order tree traversal, each subtree is masked with 3% probability (or ignored if any of its ancestors are already masked). If the entire document (root node of the tree) is masked, then the infilling model’s job is equivalent to that of a language model. If a word (leaf) is selected to be masked, 50% of the time we mask that individual word, otherwise we mask an n-gram of random length between 1 and min(8, # words left in the sentence) words (inclusive). Note that a word may comprise multiple tokens, as GPT-2 uses sub-word tokenization (Sennrich et al., 2015). We chose the value of 3% as, for the datasets we considered, it resulted in a marginal token mask rate of around 15%, echoing the configuration of Devlin et al. (2019). We add special tokens for each granularity to our model’s vocabulary (e.g. [blank word]), so that the user may specify which granularity they would like the infilling model to produce. This functionality can be explored in our demo: https: //chrisdonahue.com/ilm. While we focus on this specific mask function in this paper, we structured the ILM codebase to allow users to train infilling models for completely different use cases. Users need only define a new mask function which takes complete documents and outputs lists of character-level spans representing the desired spans to be masked. C Hyperparameters We use early stopping based on the PPL of the model on infilling the masked token for the validation set. We train all models using the default fine-tuning parameters specified in the transformers library (Wolf et al., 2019), except that we use a batch size of 24 and a sequence length of 256. Note that the most straightforward way of training an LM on ILM examples (Section 3.2) is to maximize the likelihood of the entire concatenated example: x˜, [sep], and y. 
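To make the hierarchical masking described in Appendix B concrete, here is a simplified sketch, not the released ILM code: documents are represented as nested Python lists rather than character-level spans, and the function returns (granularity, text) pairs that would be blanked out. The 3% subtree probability, the pre-order document, paragraph, sentence, word traversal, and the 50% word-vs-n-gram choice follow the description above; everything else is an assumption made to keep the sketch short.

```python
import random

MASK_PROB = 0.03  # per-subtree masking probability (yields roughly 15% of tokens masked)

def mask_document(paragraphs, p=MASK_PROB):
    """Simplified hierarchical mask function.

    `paragraphs` is a list of paragraphs, each a list of sentences,
    each a list of words. Returns a list of (granularity, span_text) pairs.
    """
    masked = []
    if random.random() < p:                         # whole document masked
        return [("document", " ".join(w for para in paragraphs
                                      for sent in para for w in sent))]
    for para in paragraphs:                         # pre-order traversal
        if random.random() < p:                     # whole paragraph masked
            masked.append(("paragraph", " ".join(w for sent in para for w in sent)))
            continue
        for sent in para:
            if random.random() < p:                 # whole sentence masked
                masked.append(("sentence", " ".join(sent)))
                continue
            i = 0
            while i < len(sent):                    # words and n-grams
                if random.random() < p:
                    if random.random() < 0.5:
                        n = 1                       # mask the individual word
                    else:                           # n-gram of 1..min(8, words left)
                        n = random.randint(1, min(8, len(sent) - i))
                    masked.append(("ngram" if n > 1 else "word",
                                   " ".join(sent[i:i + n])))
                    i += n
                else:
                    i += 1
    return masked

doc = [[["Patty", "was", "excited", "."], ["She", "prepared", "food", "."]]]
print(mask_document(doc, p=0.5))  # inflated p so the toy example masks something
```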
This trains the model to predict tokens in x˜ even though such behavior is not necessary at inference time as x˜ will always be fully-specified. Nevertheless, we found that this additional supervision improved performance when evaluating model PPL of y. Conveniently, this is also the default behavior when adapting existing LM training code for use with ILM. D Evaluation on language modeling and infilling other granularities Our quantitative evaluation (Section 5) examined the sentence infilling performance of GPT-2 initialized from the large-scale pre-trained checkpoint STO ABS LYR LM (scratch) 33.4 52.1 25.1 LM-Rev (scratch) 32.9 53.9 24.7 LM-All (scratch) 30.4 44.6 26.2 ILM (scratch) 30.8 45.3 30.6 LM 17.6 25.7 20.8 LM-Rev 25.1 36.7 23.7 LM-All 17.8 25.2 21.5 ILM 18.1 23.9 23.0 Table 3: Document infilling PPL (or language modeling) of ILM and baselines initialized either from scratch or from the pre-trained checkpoint across three datasets. Note that PPL of ILM is similar to LM, implying that our infilling strategy can reasonably maintain the ability to perform language modeling while extending the ability to infill. STO ABS LYR LM (scratch) 34.0 52.8 28.9 LM-Rev (scratch) 34.9 59.3 30.4 LM-All (scratch) 27.0 46.2 24.3 ILM (scratch) 25.5 46.0 27.5 LM 17.5 25.5 23.9 LM-Rev 26.5 39.0 29.2 LM-All 15.1 24.4 19.3 ILM 14.9 23.5 20.2 Table 4: Mixture infilling PPL of all models (a mixture of all granularities). after fine-tuning on different datasets and infilling strategies. Here, we report PPL for GPT-2 both initialized from scratch and from the pre-trained checkpoint for several other configurations: language modeling, a mixture of granularities, specific granularities, and language modeling. D.1 Language modeling In Table 3, we report PPL for “document infilling,” which is equivalent to language modeling (because x˜ is always [blank document]). Because of how we structured our mask function (Appendix B), 3% of infilling examples consist of the entire document masked out, which results in the ability of our ILM framework to perform standard infilling. We see that performance of ILM is similar to that of LM on this task, even though ILM sees far fewer examples of language modeling compared to LM. STO ABS LYR LM (scratch) 35.6 51.5 25.1 LM-Rev (scratch) 34.8 65.1 24.7 LM-All (scratch) 33.4 45.0 26.2 ILM (scratch) 34.3 45.3 30.6 LM 18.3 24.2 20.8 LM-Rev 26.5 42.8 23.7 LM-All 20.4 23.4 21.5 ILM 20.7 22.5 23.0 Table 5: Paragraph infilling PPL of all models. STO ABS LYR LM (scratch) 36.0 65.4 33.5 LM-Rev (scratch) 35.1 92.2 35.8 LM-All (scratch) 27.1 53.8 27.1 ILM (scratch) 26.7 51.0 31.0 LM 18.3 27.9 27.7 LM-Rev 27.1 46.5 34.3 LM-All 15.6 22.3 21.4 ILM 15.6 22.4 22.6 Table 6: Sentence infilling PPL of all models. D.2 Mixture of granularities In Table 4, we report results for a mixture of granularities. Specifically, we run the same mask function we use for training (Appendix B) on our test data and evaluate PPL on the masked spans. This reflects general infilling ability across a wide variety of granularities (and hence lengths). Unlike our other quantitative evaluations, there may be multiple variable-length spans missing from each example in this evaluation. Results are similar to that of sentence infilling. Namely, that ILM outperforms LM and LM-Rev and is similar to LM-All despite using much less memory. D.3 Individual granularities In Tables 5 to 8 we report PPL values for infilling performance on paragraphs, sentences, n-grams, and words, respectively, across the three datasets. 
For each granularity, we create one infilling example per document from the test set with exactly one masked span (randomly chosen from all spans of that granularity for that document). Then, we compute PPL only on the tokens which comprise the masked span, i.e., PPL is computed for all models on exactly the same set of tokens. Across all granularities, we observe that ILM outperforms STO ABS LYR LM (scratch) 36.1 62.5 34.1 LM-Rev (scratch) 36.4 89.1 36.3 LM-All (scratch) 26.4 60.1 24.3 ILM (scratch) 23.1 49.5 26.3 LM 19.2 25.5 28.2 LM-Rev 26.6 45.0 34.8 LM-All 14.5 20.5 18.6 ILM 13.8 21.5 18.8 Table 7: N-gram infilling PPL of all models. STO ABS LYR LM (scratch) 32.3 57.2 34.8 LM-Rev (scratch) 31.6 100.0 36.7 LM-All (scratch) 12.6 51.8 12.5 ILM (scratch) 9.2 37.9 12.2 LM 17.1 23.0 28.7 LM-Rev 24.1 45.0 35.1 LM-All 7.5 15.8 9.5 ILM 5.4 14.2 8.5 Table 8: Word infilling PPL of all models. LM and LM-Rev and either outperforms or is comparable with LM-All while using less memory. E Details on human evaluation For human evaluation, we sampled 100 stories from the test set of the STORIES dataset. From each story, we masked out one sentence at a time, thereby resulting in 500 stories with masked sentences. Then we used these stories as context and tasked each model with infilling the masked sentence. We compared 8 models in total. In addition to the four models reported in Section 6 (BERT, SA, LM, and ILM), we included the models which are initialized from scratch (as opposed to initialized from the large-scale pre-trained checkpoint) for exhaustive comparison. Furthermore, to filter out spam, we used a control model which always generates “This sentence was generated by a computer.” Lastly, we included the original sentence from the dataset as a reference model (Human) to sanity check the max score is around 80%. Each annotator was shown 8 stories, one from each model, and was asked to identify one of the five sentences generated by machine (see Figure 4 for an example). Among the 100 collected responses, we filtered out 5 responses whose annotation for the control model was wrong. The quantitative and qualitative results can be found in Table 9 and Figure 5, respectively. All model outputs and responses of human evaluation can be found at https://github.com/chrisdonahue/ilm. Score (%) Control 0 BERT 20 SA 29 LM (scratch) 40 LM 41 ILM (scratch) 39 ILM 45 Human 78 Table 9: Human evaluation results. Identify one of the five sentences generated by machine. ○ Patty was excited about having her friends over. ○ She had been working hard preparing the food. ○ Patty knew her friends wanted pizza. ○ All of her friends arrived and were seated at the table. ○ Patty had a great time with her friends. Figure 4: Example of a task and instruction for human evaluation on Amazon Mechanical Turk. Example Story with Masked Sentence Lily always loved to read. She wondered sometimes, what it would be like to write a book? [blank] Lily did well in the course, and during it, wrote a short book. BERT SA LM ILM Human I held her hand and helped her sit. Of her, but she didn’t know her. She practiced reading a lot every week. Finally, in middle school, her teacher introduced her to writing that. She decided to take a course on fiction writing. BERT SA LM ILM Human Or rather, what the next job would be now. I was going out I was going to the beach. I put on about thirty sugar cubes. The issues are getting so many people crazy. I could never catch up and each week got worse. 
Example Story with Masked Sentence My old job kept asking too much of me. Every Wednesday there was a ton of new work. [blank] Eventually I got too far behind and had to quit. I will look for a new job. BERT SA LM ILM Human Today was the first concert that she had to see every where. She was going to go to the play. When she went on stage she smoothly walked right past the audience. When she got on stage the band was amazing. As soon as she got on the audience applauded. Example Story with Masked Sentence Yesterday was Kelly’s first concert. She was nervous to get on stage. [blank] Kelly was then happy. She couldn’t wait to do it again. Figure 5: Examples of sentence-level infills by different models.
Topic background / research background (analyze the academic or practical problems within the scope of this topic that have not yet been satisfactorily solved, and explain the origin of and rationale for the topic)
Cursor
Sure. Based on the paper you provided, here is a "topic background / research background" analysis suitable for a thesis proposal or the introduction of a paper:
Topic Background / Research Background
1. The importance and application prospects of text infilling
Text infilling is the task of predicting missing spans at arbitrary positions in a document so that the completed text is semantically and grammatically consistent with its context. The task has broad practical value, for example:
- Writing assistance: helping users edit, polish, expand, or repair text, improving writing efficiency and quality.
- Connecting fragmented information: automatically linking scattered ideas, fragments, or notes into coherent text.
- Restoring ancient or damaged documents: recovering content lost from historical documents, classical texts, and inscriptions.
- Dialogue systems and question answering: improving a system's ability to understand and respond to incomplete input.
2. Limitations of existing methods
Although text infilling is an important task, mainstream natural language generation models still fall clearly short on it, mainly in the following respects:
- Unidirectionality: conventional language models (such as the GPT family) are built for language modeling, predicting text only from "past" context, and cannot fully exploit "future" context. As a result they perform poorly when completing missing content in the middle of a document.
- Fixed-length masks: bidirectional masked models such as BERT and SpanBERT can use context on both sides, but their infilling ability is typically limited to fixed-length masks, making it hard to handle the variable-length missing spans that arise in practice.
- Architectural complexity and poor transferability: some work (e.g., Zhu et al., 2019) proposes variable-length infilling but relies on specially designed architectures, making it difficult to directly leverage large-scale pre-trained language models and limiting transfer and extension.
- Inference efficiency and memory cost: some approaches (such as LM-All) can use all available context but greatly increase the input sequence length, incurring higher computation and memory costs that hinder practical deployment.
3. Open academic and practical problems
- How can existing large-scale pre-trained language models (such as GPT-2) be given efficient, flexible infilling capabilities that exploit bidirectional context while supporting variable-length spans?
- How can the naturalness and diversity of infilled text be improved without significantly increasing model complexity or inference cost?
- How can infilling remain general and controllable across many text types (stories, abstracts, lyrics, and so on)?
4. Origin of and rationale for the topic
With the widespread use of large language models (LLMs) in natural language processing, further extending their generative capabilities and improving their usefulness in real writing and editing scenarios has become a focus of both academia and industry. Text infilling, an important branch of language generation, has not yet received sufficient attention or systematic study. Existing methods are either restricted to unidirectional prediction, unable to handle variable-length gaps, or dependent on specialized architectures that are hard to transfer and deploy.
Motivated by these academic and practical problems, this work proposes a general text infilling framework based on language modeling (Infilling by Language Modeling, ILM). It aims to fully exploit existing pre-trained language models and, through a simple restructuring of the training data and minimal architectural changes, achieve efficient, flexible, and general text infilling. The approach not only improves infilling quality but also takes inference efficiency and scalability into account, offering a new solution and technical foundation for intelligent writing assistance, text restoration, and related applications.
If you would like a more concise or more detailed version, or want this expanded for a specific application scenario (such as writing assistance or restoring ancient texts), just let me know!
User
Literature review (systematically survey the representative literature related to this topic, analyze how the relevant research has developed and progressed, and comment on the problems and shortcomings of existing work)
Cursor
Sure. Here is a literature review for the text infilling task, covering representative works, the development of the field, progress, and remaining shortcomings, suitable for a thesis proposal or the review section of a paper:
Literature Review
1. Task origins and early work
Text infilling can be traced back to the cloze test (Taylor, 1953), in which a system fills in randomly deleted words based on the surrounding context. The cloze task was originally used to assess language proficiency and was later widely adopted for model training and evaluation in natural language processing.
- Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability.
Introduced the cloze test and initiated research on context-based text completion.
As NLP techniques advanced, the cloze task was extended to more complex settings, such as narrative cloze (Chambers & Jurafsky, 2008), which infers missing events in stories.
2. Infilling methods based on language models
2.1 Unidirectional autoregressive language models
- The GPT family (Radford et al., 2019)
Generates text autoregressively (left to right) and can efficiently produce coherent text, but it can only use "past" context and therefore struggles to complete missing content in the middle of a document.
2.2 Bidirectional masked language models
- BERT (Devlin et al., 2019)
Uses a masking mechanism to predict masked words from both preceding and following context.
- Strengths: exploits bidirectional context and infills well.
- Weaknesses: can only fill masks of a fixed length, struggles with variable-length gaps, and must generate tokens one at a time at inference, which is inefficient.
- SpanBERT (Joshi et al., 2019)
Extends BERT to span-level mask infilling, but remains limited to fixed-length spans.
2.3 Variable-length infilling and insertion-based generation
- Text Infilling (Zhu et al., 2019)
Proposes infilling of variable-length spans using a specialized insertion-style generation architecture.
- Strengths: supports variable-length infilling.
- Weaknesses: requires a purpose-built architecture and cannot easily leverage large-scale pre-trained models.
- Insertion-based decoding (Gu et al., 2019)
Achieves insertion-based generation by automatically inferring the generation order, improving flexibility but with higher training and inference complexity.
3. Generative infilling and transfer learning
- MaskGAN (Fedus et al., 2018)
Applies GAN-style training to text infilling and can generate diverse completions, but training is unstable and hard to apply at scale.
- T5 (Raffel et al., 2019)
A unified text-to-text pre-training framework that supports span infilling as a pre-training objective and improves downstream performance, but it mainly targets transfer learning rather than generation quality.
4. Recent representative work
- ILM (Infilling by Language Modeling, Donahue et al., 2020)
Restructures training examples by concatenating text containing blanks with the corresponding answers, and trains and decodes with a standard autoregressive language model (e.g., GPT-2).
- Strengths: requires no architectural changes, can exploit large-scale pre-trained models, supports variable-length infilling, and is efficient at inference.
- Progress: achieves strong infilling results across stories, abstracts, lyrics, and other domains, with infilled sentences that humans find hard to distinguish from real ones.
5. Shortcomings and challenges of existing work
- Limited flexibility: masked models such as BERT can only fill fixed-length spans, while insertion-based methods are architecturally complex and hard to transfer directly.
- Inference efficiency: some methods require many iterations at inference time, which is inefficient and impractical for real applications.
- Limited generalization: some methods are effective only for particular domains or text types and lack generality.
- Underuse of large-scale pre-trained models: many methods cannot directly exploit the generative power of large models such as GPT-2 and GPT-3.
- Diversity and naturalness: generating completions that are diverse, natural, and highly consistent with the surrounding context remains difficult.
Summary
Overall, text infilling has evolved from the cloze test to a range of deep-learning approaches, including autoregressive, masked, and insertion-based methods. In recent years, efficiently leveraging large-scale pre-trained language models to support variable-length, flexible, and natural infilling has become a central research question in this area. New methods such as ILM offer a promising way to address these problems, but there is still room for improvement in the diversity of completions, inference efficiency, and cross-domain generalization.
If you need more detailed citation formatting, a deeper analysis of a particular direction, or additional Chinese-language references, feel free to continue asking!
User
One continuous passage, no bullet points
Cursor
Text infilling originated with the cloze test (Taylor, 1953), in which a system fills in randomly deleted words from context; the task was first used to assess language proficiency and was later widely applied to model training and evaluation in natural language processing. As NLP advanced, cloze was extended to more complex settings such as narrative cloze (Chambers & Jurafsky, 2008), which infers missing events in stories. In recent years, with the rise of deep learning and large-scale pre-trained language models, research on text infilling has made notable progress. Early autoregressive language models (e.g., the GPT family, Radford et al., 2019) generate text unidirectionally and can efficiently produce coherent text, but they can only use "past" context and therefore struggle to complete missing content in the middle of a document. To address this, bidirectional masked language models such as BERT (Devlin et al., 2019) predict masked words using context on both sides, improving infilling ability, but their infilling is generally limited to fixed-length masks and cannot flexibly handle the variable-length gaps that arise in practice. SpanBERT (Joshi et al., 2019) extends BERT to span-level masking but remains restricted to fixed-length spans. To support variable-length infilling, Zhu et al. (2019) proposed a method based on a specialized insertion-style architecture; it handles variable lengths but requires a purpose-built model and cannot directly leverage large-scale pre-trained models. Insertion-based decoding (Gu et al., 2019) infers the generation order automatically, improving flexibility at the cost of higher training and inference complexity. MaskGAN (Fedus et al., 2018) applies GAN-style training to infilling and can produce diverse completions, but training is unstable and difficult to scale. T5 (Raffel et al., 2019) proposed a unified text-to-text pre-training framework with span infilling as a pre-training objective, improving downstream performance, though its focus is transfer learning rather than generation quality. More recently, the ILM (Infilling by Language Modeling) method of Donahue et al. (2020) restructures training examples by concatenating text containing blanks with the corresponding answers and trains and decodes with a standard autoregressive language model (e.g., GPT-2); it requires no architectural changes, fully exploits large-scale pre-trained models, supports variable-length infilling, is efficient at inference, and achieves strong results on stories, abstracts, lyrics, and other domains, producing infilled sentences that humans find hard to distinguish from real ones. Nevertheless, existing work still has shortcomings: masked models such as BERT can only fill fixed lengths; insertion-based methods are architecturally complex and hard to transfer; some methods require repeated iterations at inference and are too slow for practical use; some are effective only in particular domains or text types and lack generality; many cannot directly exploit the generative power of large models such as GPT-2 or GPT-3; and generating completions that are both diverse and natural while remaining highly consistent with the context remains difficult. Overall, text infilling has evolved from the cloze test to a range of deep-learning approaches (autoregressive, masked, and insertion-based); efficiently leveraging large-scale pre-trained language models to support variable-length, flexible, and natural infilling has become a central research question, and while new methods such as ILM offer a promising direction, there is still room for improvement in the diversity of completions, inference efficiency, and cross-domain generalization.