Learning an Image Editing Model without Image Editing Pairs

1Carnegie Mellon University, 2Adobe Research
arXiv 2025

We propose NP-Edit (No-Pair Edit), a framework for training image editing models using gradient feedback from a Vision-Language Model (VLM), requiring no paired supervision. For efficient training and effective VLM feedback, our formulation combines the VLM feedback with a distribution matching loss to learn a few-step image editing model. Performance improves consistently with more powerful VLMs and larger datasets, highlighting the method's scalability and potential.

Abstract

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates whether an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate a distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data in the few-step setting.
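As a concrete illustration of the objective described above, the sketch below shows one PyTorch-style training step. The helpers are hypothetical placeholders rather than names from any released code: generator is the few-step editing model being trained, vlm_feedback_loss is a differentiable VLM-based loss checking instruction-following and content preservation, dmd_loss is a distribution-matching regularizer against a pretrained diffusion model, and lambda_dmd is an illustrative weighting.

import torch

def training_step(generator, vlm_feedback_loss, dmd_loss, batch, optimizer, lambda_dmd=1.0):
    """One end-to-end step: unroll the few-step generator and backprop both losses."""
    src_img, instruction = batch["image"], batch["instruction"]

    # Unroll the few-step generator; the edit stays on the computation graph
    # so VLM gradients can flow back into the generator weights.
    edited = generator(src_img, instruction)

    # Differentiable VLM feedback: does the edit follow the instruction,
    # and is unchanged content preserved?
    loss_vlm = vlm_feedback_loss(src_img, edited, instruction)

    # Distribution matching term keeps edits on the pretrained image manifold.
    loss_dmd = dmd_loss(edited, instruction)

    loss = loss_vlm + lambda_dmd * loss_dmd
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.detach()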

Results

Qualitative comparison (Local Image Editing)

We compare our method with leading image editing models on Local Image Editing using GEdit-Benchmark [1]. Since none of the baseline models explicitly target few-step editing, we evaluate them with few-step sampling and also show an example in the multi-step setting (second column) as an upper-bound comparison. Our method performs comparably to the baselines, especially in the few-step setting, and can successfully follow different editing instructions while remaining consistent with the input reference image.


Qualitative comparison (Free-form Image Editing)

We compare our method with state-of-the-art baselines (trained on paired data) on Free-form Image Editing, also known as customization, using the DreamBooth [2] benchmark. Even without using any paired data during training, our method can successfully incorporate the text prompt while preserving the object identity of the reference image.



Role of Dataset Scale and VLM Size


Our training dataset consists of reference images and corresponding editing instructions, without ground-truth edited images. To study the impact of dataset size and VLM scale, we vary both and evaluate performance on GEdit-Benchmark using VIEScore [3]. We observe consistent gains with larger datasets. Similarly, a VLM backbone with more parameters leads to better performance, underscoring that our method can improve further as more powerful VLMs are developed.
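For reference, this ablation boils down to the sweep sketched below. The helpers train_editor and viescore are hypothetical placeholders, not a released API: the former trains our model on a subset of the unpaired data with a given VLM backbone, and the latter returns VIEScore's semantic-consistency (SC) and perceptual-quality (PQ) ratings, which are typically combined via their geometric mean.

import math

def run_ablation(benchmark, dataset_sizes, vlm_backbones, train_editor, viescore):
    """Sweep dataset size and VLM backbone; report the mean VIEScore on the benchmark."""
    results = {}
    for num_samples in dataset_sizes:
        for vlm_name in vlm_backbones:
            editor = train_editor(num_samples=num_samples, vlm_name=vlm_name)
            scores = []
            for src_img, instruction in benchmark:
                edited = editor(src_img, instruction)
                sc, pq = viescore(src_img, edited, instruction)  # each rated in [0, 10]
                scores.append(math.sqrt(sc * pq))                # overall VIEScore
            results[(num_samples, vlm_name)] = sum(scores) / len(scores)
    return results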



Comparison with the RL-based technique Flow-GRPO


Reinforcement Learning (RL)-based techniques are a common post-training strategy for improving pretrained models without paired supervision and can also leverage VLMs as the reward model, a setup similar to ours. However, RL relies on a reasonable initialization, so we first train an image editing model via Supervised Fine-Tuning (SFT) on a paired dataset. We then use Flow-GRPO [4], a widely used RL method for text-to-image diffusion, to further post-train the SFT model. Given the same VLM as the reward model, our method outperforms Flow-GRPO on GEdit-Benchmark as evaluated by VIEScore.
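To make the comparison concrete, the core of a GRPO-style update is the group-relative advantage: several edits are sampled for the same input and instruction, scored by the VLM reward model, and each sample's reward is normalized against its group. The sketch below illustrates only that normalization; it is not the flow_grpo repository API, and the reward values are made up.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) VLM scores for edits of one (image, instruction) pair."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts for one editing prompt, scored by the same VLM reward model.
rewards = torch.tensor([0.62, 0.80, 0.55, 0.71])
advantages = group_relative_advantages(rewards)  # above-average edits get positive advantage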

Limitation

Since our method is trained without pixel-level supervision from a ground-truth edited image, we observe at inference that the edited image may deviate from the input image in spatial details or fail to fully preserve subject identity. Adding a perceptual similarity loss (e.g., LPIPS [5]) between the input and edited images alleviates this to some extent, though often at the cost of editing quality, such as the failure to remove the bangs in the second row of the figure above.
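Below is a minimal sketch of this auxiliary perceptual term, using the official lpips package (pip install lpips). The tensors are assumed to be (N, 3, H, W) and scaled to [-1, 1], and the weight lambda_lpips is an illustrative hyperparameter, not a value from the paper.

import lpips
import torch

lpips_fn = lpips.LPIPS(net="vgg")  # net="alex" is the other common backbone

def perceptual_similarity_loss(input_img: torch.Tensor, edited_img: torch.Tensor,
                               lambda_lpips: float = 0.1) -> torch.Tensor:
    """Penalize large perceptual deviation of the edited image from the input."""
    return lambda_lpips * lpips_fn(input_img, edited_img).mean()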

References
  1. Step1X-Edit: A Practical Framework for General Image Editing. https://github.com/stepfun-ai/Step1X-Edit/
  2. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. https://dreambooth.github.io
  3. VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation. https://github.com/TIGER-AI-Lab/VIEScore
  4. Flow-GRPO: Training Flow Matching Models via Online RL. https://github.com/yifan123/flow_grpo
  5. LPIPS: The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. https://github.com/richzhang/PerceptualSimilarity

BibTeX

@article{kumari2025npedit,
  title={Learning an Image Editing Model without Image Editing Pairs},
  author={Kumari, Nupur and Wang, Sheng-Yu and Zhao, Nanxuan and Nitzan, Yotam and Li, Yuheng and Singh, Krishna Kumar and Zhang, Richard and Shechtman, Eli and Zhu, Jun-Yan and Huang, Xun},
  journal={arXiv preprint arXiv:},
  year={2025}
}

Acknowledgements

We thank Gaurav Parmar, Maxwell Jones, and Ruihan Gao for their feedback and helpful discussions. This work was partly done while Nupur Kumari was interning at Adobe Research. The project was partly supported by Adobe Inc., the Packard Fellowship, the IITP grant funded by the Korean Government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project), NSF IIS-2239076, and NSF ISS-2403303.