Dual-Process Image Generation

¹UC Berkeley   ²Runway

*Equal advising contribution

TL;DR: We distill deliberation from a VLM into a feed-forward image generator.




Abstract

Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates the resulting gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image-based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes.


Dual-Process Distillation



Our method distills deliberation into a feed-forward image generation process. When generating an image, we ask a VLM questions about that image and backpropagate the resulting gradient to update the weights of the image generator. We construct our method such that it supports off-the-shelf VLMs and image generators without special re-training.
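
As a rough illustration, the core loop can be sketched as follows. This is a minimal sketch assuming a differentiable feed-forward generator and a VLM scored by the log-probability of answering "yes" to a question; ToyGenerator, ToyVLM, and the placeholder question embedding are illustrative stand-ins, not the actual models or interfaces used in our system.

    # Minimal sketch of dual-process distillation: a frozen VLM critiques the image
    # and its gradient flows back into the generator. ToyGenerator and ToyVLM are
    # placeholder stand-ins for an off-the-shelf image generator and VLM.
    import torch
    import torch.nn as nn

    class ToyGenerator(nn.Module):
        """Stand-in for a feed-forward image generator."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(64, 3 * 32 * 32), nn.Sigmoid())

        def forward(self, z):
            return self.net(z).view(-1, 3, 32, 32)

    class ToyVLM(nn.Module):
        """Stand-in for a VLM that returns log p("yes" | image, question)."""
        def __init__(self):
            super().__init__()
            self.head = nn.Linear(3 * 32 * 32, 1)

        def forward(self, image, question_embedding):
            logit = self.head(image.flatten(1)).squeeze(-1) + question_embedding.sum(-1)
            return nn.functional.logsigmoid(logit)

    generator, vlm = ToyGenerator(), ToyVLM()
    for p in vlm.parameters():
        p.requires_grad_(False)  # the VLM only critiques; only the generator is updated

    optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)
    question = torch.zeros(1, 8)  # placeholder embedding of a yes/no question

    for step in range(100):
        z = torch.randn(1, 64)
        image = generator(z)                 # generate an image
        loss = -vlm(image, question).mean()  # deliberate: reward p("yes")
        optimizer.zero_grad()
        loss.backward()                      # backpropagate the VLM's judgment
        optimizer.step()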


Visual Prompting



VLMs can be prompted not only with text but also with visual prompts, i.e., multimodal instructions jointly defined in image and text. We simply overlay the image instruction on top of the generated image.
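
A rough sketch of this compositing step, assuming the image instruction is provided as an RGBA overlay; the function and variable names are illustrative, not from our released code.

    # Minimal sketch of visual prompting: composite the image instruction (e.g., red
    # guide lines or a palette strip) on top of the generated image before it is
    # shown to the VLM.
    from PIL import Image

    def overlay_instruction(generated: Image.Image, instruction: Image.Image) -> Image.Image:
        base = generated.convert("RGBA")
        overlay = instruction.convert("RGBA").resize(base.size)
        return Image.alpha_composite(base, overlay).convert("RGB")

    # The composite, together with a text question, is what the VLM actually sees:
    # prompted = overlay_instruction(generated_image, instruction_image)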


Commonsense Inferences



We evaluate our method on CommonsenseT2I, a benchmark for commonsense understanding. For each example, we show an automatically generated question used by our method to verify an inference, as well as the image it generated. We also compare against prompt expansion and vanilla prompting.
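
For illustration, one way to automatically produce such a verification question is to query a language model with a simple template; the wording below and the ask_llm helper are hypothetical assumptions, not the exact procedure we use.

    # Hypothetical sketch of generating a yes/no verification question for a
    # CommonsenseT2I prompt. The template wording and ask_llm() are assumptions
    # for illustration only.
    QUESTION_TEMPLATE = (
        'A text-to-image model was asked to draw: "{prompt}".\n'
        "Write a single yes/no question that checks whether the image shows the "
        "commonsense consequence of this prompt."
    )

    def verification_question(prompt: str, ask_llm) -> str:
        # ask_llm is any callable that maps an instruction string to a model response.
        return ask_llm(QUESTION_TEMPLATE.format(prompt=prompt))

    # e.g., for a prompt describing a candle left burning all night, a suitable
    # question might ask whether the candle has mostly melted down.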


Color Palette



While the base image generators cannot be natively instructed with color palettes, our method can implement this control as a visual prompt. We simply overlay the palette at the bottom of the generated image and ask the VLM if the colors match.
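
A minimal sketch of this palette overlay, assuming the palette is given as a list of RGB tuples and rendered as a strip of swatches along the bottom of the image; the function name and question wording are illustrative.

    # Minimal sketch of the color-palette visual prompt: draw equal-width swatches
    # along the bottom of the generated image, then ask the VLM about the match.
    from PIL import Image, ImageDraw

    def add_palette_strip(generated: Image.Image, palette, strip_frac: float = 0.1) -> Image.Image:
        img = generated.copy()
        draw = ImageDraw.Draw(img)
        w, h = img.size
        strip_h = int(h * strip_frac)
        swatch_w = w / len(palette)
        for i, color in enumerate(palette):
            x0, x1 = int(i * swatch_w), int((i + 1) * swatch_w)
            draw.rectangle([x0, h - strip_h, x1, h], fill=color)
        return img

    QUESTION = "Do the colors in the image match the palette shown at the bottom?"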


Line Weight



We use the thickness of red lines to represent the desired line weight of a cartoon. Through visual prompting, we can create visual abstractions and bind different meanings to them based on the question: here red lines represent line weight, and in the next example they represent horizon position.


Horizon Position



Here red lines specify the horizon position, i.e., the boundary at which the earth and sky meet. The overlay, whose visibility can be toggled in the examples, is also fed to the VLM for visual prompting. These examples make it clear why overlaying the image instruction is helpful: one can check the spatial alignment simply by assessing the distance between the actual horizon and the red line.
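
As a sketch, the overlay is just a horizontal red line at the requested height, paired with a question about alignment; names and wording below are illustrative.

    # Minimal sketch of the horizon-position visual prompt: draw a red line at a
    # target height given as a fraction of the image height.
    from PIL import Image, ImageDraw

    def add_horizon_line(generated: Image.Image, horizon_frac: float, width: int = 4) -> Image.Image:
        img = generated.copy()
        draw = ImageDraw.Draw(img)
        w, h = img.size
        y = int(h * horizon_frac)
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=width)
        return img

    QUESTION = "Does the horizon in the image line up with the red line?"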


Relative Depth



We use two labeled red points to specify relative depth ordering. Our method is able to follow truly multimodal instructions, where the VLM must distinguish between the two points by associating the red labels in the image instruction with the references to those labels in the question.
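
A sketch of this labeled-point overlay, assuming the points are given in pixel coordinates with string labels; the question shows how the labels bind the image instruction to the text. Names and wording are illustrative.

    # Minimal sketch of the relative-depth visual prompt: two labeled red points
    # drawn on the image, referenced by label in the question.
    from PIL import Image, ImageDraw

    def add_labeled_points(generated: Image.Image, points: dict, radius: int = 6) -> Image.Image:
        img = generated.copy()
        draw = ImageDraw.Draw(img)
        for label, (x, y) in points.items():
            draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=(255, 0, 0))
            draw.text((x + radius + 2, y - radius), label, fill=(255, 0, 0))
        return img

    QUESTION = "Is the object at point A closer to the camera than the object at point B?"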


Visual Composition



We use abstract paintings by Piet Mondrian [1, 2] to control visual composition, and we annotate, in red, additional instructions for how to interpret each painting. Mondrian's paintings can be read as highly abstracted Dutch landscapes, with elements like trees and the horizon reduced to vertical and horizontal lines. Our method can solve the inverse problem and produce images that match the structure of these paintings.


Acknowledgements

We would like to thank Anastasis Germanidis, Yining Shi, Alexander Pan, Chung Min Kim, Boyi Li, and David Chan for helpful discussions and feedback. We would also like to thank the folks at Stochastic Labs, especially Vero Bollow, Alexander Reben, and Joel Simon, for previewing early prototypes of this work.

BibTeX


    @article{luo2025dualprocess,
      title={Dual-Process Image Generation},
      author={Grace Luo and Jonathan Granskog and Aleksander Holynski and Trevor Darrell},
      journal={arXiv preprint arXiv:2506.01955},
      year={2025}
    }