ODISE

Open-vocabulary Diffusion-based panoptic segmentation

k-means clustering of diffusion model’s internal representation

Use both CLIP and SD

Previous methods use only CLIP, which is weak at scene-level understanding: it captures spatial relations between objects poorly.

diffusion models compute cross-attention between the text embedding and the internal visual representation

Fig 1: simply clustering the diffusion model’s internal features already yields a rough segmentation
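
To make the clustering idea concrete, here is a minimal sketch of k-means over a per-pixel feature map. The feature map is a toy stand-in (two noisy halves), not real diffusion features, and the function name `kmeans_segment` is made up for these notes.

```python
import numpy as np

def kmeans_segment(features, k, iters=10, seed=0):
    """Cluster an (H, W, C) feature map into k groups; returns an (H, W) label map."""
    h, w, c = features.shape
    x = features.reshape(-1, c).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct pixels.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every pixel to every centroid: (H*W, k).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels.reshape(h, w)

# Toy stand-in for diffusion features: left and right halves differ.
toy = np.zeros((4, 6, 8))
toy[:, 3:, :] = 1.0
toy += 0.01 * np.random.default_rng(1).standard_normal(toy.shape)
segmentation = kmeans_segment(toy, k=2)
```

With real features the input would be an intermediate UNet activation reshaped to (H, W, C); which layer to take is not specified in these notes.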

diffusion model -> mask generator that proposes masks for all possible concepts, trained with annotated masks; each mask is then classified into one of many categories by associating it with text embeddings.

Training

sample a noisy image
feed it into the UNet, together with its caption
the diffusion model’s visual representation for x depends on its caption
use an implicit captioner when the caption is not available.
- instead of using a network to generate an explicit caption, encode the image with CLIP and pass the result through an MLP to get an implicit text embedding.
- only finetune the MLP
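
A rough sketch of the implicit captioner as described above: a frozen image encoder followed by a small trainable MLP. Everything here is hypothetical for illustration: `frozen_image_encoder` is a fixed random projection standing in for CLIP, and the layer sizes and the two-layer ReLU MLP are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_HIDDEN, D_TXT = 16, 32, 24  # toy sizes, not the real model's

# Frozen stand-in for the CLIP image encoder: a fixed random projection.
_frozen_proj = rng.standard_normal((8 * 8 * 3, D_IMG)) / np.sqrt(8 * 8 * 3)

def frozen_image_encoder(image):
    """Map an (8, 8, 3) image to a (D_IMG,) embedding; never updated."""
    return image.reshape(-1) @ _frozen_proj

class ImplicitCaptionerMLP:
    """The only trainable part: image embedding -> implicit text embedding."""

    def __init__(self, d_img, d_hidden, d_txt, rng):
        self.w1 = rng.standard_normal((d_img, d_hidden)) * 0.1
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.standard_normal((d_hidden, d_txt)) * 0.1
        self.b2 = np.zeros(d_txt)

    def __call__(self, image_embedding):
        hidden = np.maximum(image_embedding @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2

    def trainable_parameters(self):
        # Only these would receive gradient updates; the encoder stays frozen.
        return [self.w1, self.b1, self.w2, self.b2]

image = rng.random((8, 8, 3))
captioner = ImplicitCaptionerMLP(D_IMG, D_HIDDEN, D_TXT, rng)
implicit_text_embedding = captioner(frozen_image_encoder(image))
```

The resulting vector takes the place of the caption's text embedding when the image is fed to the diffusion UNet.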

Mask generator

mask generator outputs N class-agnostic binary masks, can be any panoptic segmentation network
pixel-wise BCE loss against the ground-truth masks
mask GT category label: if we have a lot of categories in the training set, encode all categories with the frozen text encoder, then use a classification loss over all the training categories.
the probability is Softmax( net_out dot Text_encoder(C_train)) (this is probably what allows it to generalize).
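
A small sketch of the two mask-generator losses as I read them from the notes above: pixel-wise BCE on the masks, and a softmax over dot products with category text embeddings. The toy embeddings are invented, and any temperature or normalization the method may apply is left out because these notes do not mention it.

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def category_probabilities(mask_embeddings, text_embeddings):
    """Softmax(net_out dot Text_encoder(C_train)): (N, D) x (K, D) -> (N, K)."""
    return softmax(mask_embeddings @ text_embeddings.T, axis=-1)

def classification_loss(probs, labels):
    """Cross-entropy between predicted category probabilities and GT labels."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def mask_bce_loss(mask_logits, gt_masks):
    """Numerically stable pixel-wise binary cross-entropy on mask logits."""
    per_pixel = (np.maximum(mask_logits, 0.0) - mask_logits * gt_masks
                 + np.log1p(np.exp(-np.abs(mask_logits))))
    return per_pixel.mean()

# Toy example: 3 categories with one-hot "text embeddings", 2 predicted masks.
text_embeddings = np.eye(3)
mask_embeddings = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
probs = category_probabilities(mask_embeddings, text_embeddings)
cls_loss = classification_loss(probs, np.array([0, 2]))

gt_masks = np.array([[[1.0, 0.0], [0.0, 1.0]]])
mask_logits = 10.0 * (2.0 * gt_masks - 1.0)  # confident, correct predictions
bce = mask_bce_loss(mask_logits, gt_masks)
```

Because the category set only enters through `text_embeddings`, swapping in embeddings for unseen category names at test time needs no retraining, which is consistent with the guess above about generalization.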

Image caption supervision

extract nouns from each caption, treat them as grounding category labels
compute the similarity between each image-caption pair, with a grounding loss encouraging each noun to be grounded by one or a few masked regions in the image.

Grounding loss

the loss is built on an overall image-word similarity: take the probability of each mask embedding z_i matching each word w_k, times the similarity (z_i, T(w_k)), aggregated over masks and words.
so it’s two similarities, both against all the words in the caption.
avoids penalizing regions that are not grounded by any word
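
A hedged sketch of the similarity term described above. I am reading "probability of z_i with each word" as a softmax over the caption's words for each mask, and averaging over the K words; both are my assumptions from these notes, not checked against the paper. How this similarity is turned into the final loss is not covered here.

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def image_caption_similarity(mask_embeddings, word_embeddings):
    """Probability-weighted sum of mask-word similarities, averaged over words.

    mask_embeddings: (N, D) mask embedding features z_i
    word_embeddings: (K, D) text-encoder outputs T(w_k) for the caption's nouns
    """
    sims = mask_embeddings @ word_embeddings.T  # (N, K): (z_i, T(w_k))
    probs = softmax(sims, axis=-1)              # (N, K): assumed softmax over words
    return (probs * sims).sum() / word_embeddings.shape[0]

# Toy example: two masks, a caption that matches them and one that does not.
masks = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
matching_caption = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
unrelated_caption = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

matched = image_caption_similarity(masks, matching_caption)
mismatched = image_caption_similarity(masks, unrelated_caption)
```

In the toy case the matching caption scores higher than the unrelated one, which is the behavior the grounding loss would reward.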

still have to keep reading

Last Reviewed: 10/28/2025