SDEdit

Task - guided image synthesis - create/edit images w/ min. effort

previous = CGAN / GAN inversions, requires user annotations/training data or custom loss functions

SDEdit adds noise to the brush strokes, then denoises.
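
A minimal sketch of the noise-then-denoise idea, assuming a variance-exploding (VE) noise schedule and an Euler-Maruyama discretization of the reverse SDE. The names `sigma` and `sdedit`, the schedule constants, and the step count are illustrative assumptions; `score_fn` is a hypothetical stand-in for a trained score network.

```python
import numpy as np

def sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Assumed geometric VE noise schedule, t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t

def sdedit(guide, t0, score_fn, n_steps=100, seed=0):
    """Perturb the guide to noise level sigma(t0), then denoise back to t = 0.

    score_fn(x, t) is a placeholder for a trained score model (assumption).
    """
    rng = np.random.default_rng(seed)
    # Step 1: add Gaussian noise to the guide (strokes, painting, composite).
    x = guide + sigma(t0) * rng.standard_normal(guide.shape)
    # Step 2: integrate the reverse SDE from t0 down to 0.
    ts = np.linspace(t0, 0.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        eps2 = sigma(t) ** 2 - sigma(t_next) ** 2  # variance removed in this step
        z = rng.standard_normal(guide.shape)
        x = x + eps2 * score_fn(x, t) + np.sqrt(eps2) * z
    return x
```

For a quick check without a trained model, a toy score such as `lambda x, t: (mu - x) / sigma(t) ** 2` (the exact score when the "data distribution" is a single image `mu`) pulls any guide toward `mu`.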

Can do stroke-based editing, or painting-to-image.

CGAN learns a direct mapping from image to edited image via training; GAN inversion turns the image into a latent code, then edits the code.

more noise = more realistic, less faithful.
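
The faithfulness half of this tradeoff is already visible at the perturbation step. A short derivation, assuming a variance-exploding perturbation with noise scale \sigma(t_0) and a guide x^{(g)} with d pixel values (notation assumed here, not taken from these notes):

```latex
x(t_0) = x^{(g)} + \sigma(t_0)\, z, \qquad z \sim \mathcal{N}(0, I_d)

\mathbb{E}\,\lVert x(t_0) - x^{(g)} \rVert_2^2
  = \sigma^2(t_0)\, \mathbb{E}\,\lVert z \rVert_2^2
  = \sigma^2(t_0)\, d
```

So the expected squared L2 distance between the guide and the denoiser's starting point grows with \sigma^2(t_0): a larger t0 leaves less of the guide to stay faithful to, while giving the denoiser more room to reach a realistic image.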

stroke-based synthesis: better in terms of realism and user satisfaction

compositing: better faithfulness, better user satisfaction

Experiments

Tasks

stroke-based image generation
- metrics: user study for realism and faithfulness, L2, KID
- 4 GAN-based baselines
stroke-based image editing
- qualitative
image blending
- metrics: L2, LPIPS within region that should be kept the same, user study for realism and faithfulness
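
One way to restrict a faithfulness metric to the region that should stay the same is a masked mean squared L2 distance. This is a sketch under assumptions: the name `masked_l2`, the mask convention (1 = keep, 0 = edited), and the normalization by the number of kept entries are illustrative choices, not the paper's evaluation code. LPIPS needs a pretrained network and is left out.

```python
import numpy as np

def masked_l2(original, edited, keep_mask):
    """Mean squared error over the pixels marked 1 in keep_mask.

    original, edited: H x W x C images; keep_mask: H x W array of 0/1.
    """
    keep = keep_mask.astype(bool)
    if not keep.any():
        raise ValueError("keep_mask selects no pixels")
    diff = (original.astype(np.float64) - edited.astype(np.float64)) ** 2
    # Boolean indexing with an H x W mask selects all channels of kept pixels.
    return float(diff[keep].mean())
```

A value of 0 means the region that should be preserved is untouched; larger values mean the edit leaked into it.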

Figures

Teaser figure (fig 1) showing image editing and stroke-based synthesis
fig 3 - plots showing the tradeoff between KID and squared L2
fig 4 - qualitative comparisons to baselines
fig 5 - conditional generation examples
fig 6 - stroke-based editing examples

Tables

Comparing faithfulness, quality, and satisfaction on stroke-based generation - two MTurk surveys asking about realism and satisfaction, and L2 faithfulness score - LSUN, CelebA-HQ. Use open-source models.
L2, KID of LSUN bedroom/church. Simulated strokes.

Baselines:

In-domain GAN 1
In-domain GAN 2
StyleGAN
e4e

Metrics

Realism - measured by humans or NN
Faithfulness - similar to guide, L2 distance
User metrics on realism and faithfulness
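
KID, listed above as the automatic realism metric, is an unbiased estimate of squared MMD with a cubic polynomial kernel over network features. A minimal sketch, assuming the features (normally Inception activations) are already extracted into arrays; the function names are illustrative, and common implementations additionally average the estimate over several random subsets.

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3 used by KID."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two feature sets (rows = samples)."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = poly_kernel(real_feats, real_feats)
    k_ff = poly_kernel(fake_feats, fake_feats)
    k_rf = poly_kernel(real_feats, fake_feats)
    # Drop diagonal terms so the within-set averages are unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())
```

Lower is better: values near zero mean the generated features are hard to tell apart from the real ones.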

More Notes

Stroke-based image editing:

human-created guides
simulated stroke paintings

- theoretical upper bound on the L2 distance between guide and generation
- to have a shot at being realistic, need a high noise level; to be faithful, noise level should not be too high
- bad guides (e.g. white image) mean a higher noise level is needed, since the closest realistic images to the input are quite far
- binary search for t0 based on user preference
- same t0 usually works for all reasonable guides in the same task
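
The binary search over t0 can be sketched as follows. This is an assumption-laden illustration: `is_realistic_enough` is a hypothetical callback standing in for showing the user the result at a given t0 and recording their judgment, and the search assumes judgments are monotone in t0 (more noise never looks less realistic).

```python
def search_t0(is_realistic_enough, lo=0.0, hi=1.0, n_rounds=6):
    """Find roughly the smallest t0 whose output the user accepts as realistic.

    Assumes the result at the initial hi is acceptable and that acceptance
    is monotone in t0. Each round halves the search interval.
    """
    for _ in range(n_rounds):
        mid = (lo + hi) / 2.0
        if is_realistic_enough(mid):
            hi = mid  # realistic: try less noise to gain faithfulness
        else:
            lo = mid  # not realistic yet: more noise is needed
    return hi
```

Returning the upper end of the interval keeps the chosen t0 on the realistic side while staying as faithful as the search resolution allows.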

we can also mask out part of the image we do not want to edit - additional channel
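
A sketch of one way masking could be combined with denoising: after each reverse step, keep the denoised values inside the edited region and reset everything else to the original image noised to the current level. The function name, the mask convention (1 = edit, 0 = keep), and applying this at every step are assumptions for illustration; feeding the mask to the model as an additional channel is not shown.

```python
import numpy as np

def apply_edit_mask(x, original, edit_mask, sigma_t, rng):
    """Blend the current sample with a noised copy of the original image.

    x, original, edit_mask share one shape; sigma_t is the current noise level.
    """
    noised_original = original + sigma_t * rng.standard_normal(original.shape)
    # Inside the mask: keep the sample being denoised. Outside: follow the original.
    return edit_mask * x + (1.0 - edit_mask) * noised_original
```

At `sigma_t = 0` the kept region equals the original exactly, so the part of the image that should not be edited is preserved in the final output.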

Conditional GANs - trained on original and edited images; require data collection and model retraining

GAN inversion - input is projected into latent space of a GAN, code is modified, image is resynthesized - all need different losses for different tasks.

other generative models - less used for editing; Choi 2021 does conditional image synthesis, assuming conditions can be measured from the underlying true image.

Other notes

unlike other inverse problems, do not know the measurement function (image-to-sketch function)
realism and faithfulness are not positively correlated: there are random realistic images (realistic but unfaithful)
use of score-based models to solve inverse problems, and methods requiring paired datasets, do not apply
key hyperparameter is t_0

experimental details in appendix