Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

1Bilkent University, 2Hacettepe University, 3Koc University
Inst-Inpaint Teaser

Image inpainting results of Inst-Inpaint on GQA-Inpaint, the benchmark dataset we created for the instructional image inpainting task.


The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From an application point of view, a user needs to generate masks for the objects they would like to remove, which can be time-consuming and error-prone.

In this work, we are interested in an image inpainting algorithm that, in a single step, estimates which object to remove based on a natural language instruction and removes it. For this purpose, we first construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on instructions given as text prompts.

We set up various GAN- and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare the methods with evaluation metrics that measure both the quality and the accuracy of the models, and show significant quantitative and qualitative improvements.


Inst-Inpaint Architecture

Our work begins with generating a dataset, GQA-Inpaint, for the proposed instructional image inpainting task. To create input/output pairs, we utilize the images and their scene graphs from the GQA dataset. (a) We first select an object of interest. (b) We perform instance segmentation to locate the object in the image. (c) We apply a state-of-the-art image inpainting method to erase the object. (d) Finally, we create a template-based textual prompt to describe the removal operation.
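The steps (a)–(d) above can be sketched as a small pipeline. This is an illustrative sketch only: the segmentation and inpainting calls are stand-in callables, and the prompt template and helper names (`relative_position`, `build_instruction`, `make_pair`) are assumptions, not the exact template or code used to build GQA-Inpaint.

```python
# Hedged sketch of the GQA-Inpaint pair-construction steps (a)-(d).
# segment_fn / inpaint_fn are placeholders for an instance segmentation
# model and a state-of-the-art inpainting model; the instruction
# template is an illustrative assumption.

def relative_position(box, image_width):
    """Coarse horizontal location of a bounding box: left/center/right."""
    cx = (box[0] + box[2]) / 2
    if cx < image_width / 3:
        return "left"
    if cx < 2 * image_width / 3:
        return "center"
    return "right"

def build_instruction(obj_name, box, image_width):
    """(d) Template-based removal instruction for the selected object."""
    return f"Remove the {obj_name} at the {relative_position(box, image_width)}"

def make_pair(image, obj_name, box, image_width, segment_fn, inpaint_fn):
    """Create one (source image, target image, instruction) training triplet.

    (a) The object is already selected by the caller;
    (b) segment_fn locates it; (c) inpaint_fn erases it.
    """
    mask = segment_fn(image, box)      # (b) instance segmentation mask
    target = inpaint_fn(image, mask)   # (c) erase the object from the image
    return image, target, build_instruction(obj_name, box, image_width)
```

At training time, the model sees the source image and the instruction as input and the inpainted image as the target, so no mask ever needs to be supplied by a user.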

As a result, our GQA-Inpaint dataset includes a total of 147,165 unique images and 41,407 different instructions. Trained on this dataset, our Inst-Inpaint model is a text-based image inpainting method built on a conditioned Latent Diffusion Model: it requires no user-specified binary mask and performs object removal in a single step, without predicting a mask as similar works do.
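To make the "conditioned, mask-free" idea concrete, the denoising loop below is a minimal sketch of instruction-conditioned latent diffusion sampling in the spirit of this setup: the source-image latent and the encoded text instruction condition every denoising step, and the model emits the edited latent directly, with no mask branch. The function signatures (`eps_model`, `sample_edited_latent`) and the deterministic DDIM-style update are assumptions for illustration, not the paper's exact sampler.

```python
# Minimal sketch of instruction-conditioned latent diffusion sampling.
# eps_model(z_t, t, z_src, text_emb) is a placeholder for the denoising
# U-Net; z_src is the encoded input image and text_emb the encoded
# instruction. Deterministic DDIM-style update (eta = 0) is assumed.
import numpy as np

def sample_edited_latent(z_src, text_emb, eps_model, alphas_cumprod, rng):
    """Sample the edited-image latent, conditioned on source latent + text."""
    num_steps = len(alphas_cumprod)
    z = rng.standard_normal(z_src.shape)          # start from pure noise
    for t in reversed(range(num_steps)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        eps = eps_model(z, t, z_src, text_emb)    # conditioned noise estimate
        # Predict the clean latent, then step back one noise level.
        z0 = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0 + np.sqrt(1.0 - a_prev) * eps
    return z  # decode with the LDM decoder to obtain the edited image
```

Because the conditioning carries both the appearance of the scene and the instruction, localizing and erasing the object happen inside this single sampling pass rather than as a separate mask-prediction stage.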


We compare our inpainting results with state-of-the-art models below and visualize the attention maps obtained from the given instructions.

Clevr SOTA Comparison
GQA-Inpaint SOTA Comparison
GQA-Inpaint Paired Attention Maps

Hugging Face Demo