Mason Wang

ODISE

Open-vocabulary Diffusion-based panoptic segmentation

k-means clustering of diffusion model’s internal representation

Uses both CLIP and Stable Diffusion (SD)

Previous methods use only CLIP, which is weak at scene-level understanding and captures spatial relations between objects poorly.

diffusion models compute cross-attention between the text embedding and their internal visual representation
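
To make the cross-attention note concrete, here is a minimal single-head sketch in NumPy. It is not the actual Stable Diffusion layer: the shapes, the random projection matrices (w_q, w_k, w_v) and the helper name cross_attention are all assumptions for illustration. Visual tokens act as queries and text tokens as keys and values, so each text token ends up with a spatial attention map.

```python
import numpy as np

def cross_attention(visual, text, w_q, w_k, w_v):
    """Single-head cross-attention: visual tokens query text tokens.

    visual: (N, d_vis) flattened spatial features, N = H * W
    text:   (T, d_txt) text token embeddings
    """
    q = visual @ w_q                               # (N, d) queries from image features
    k = text @ w_k                                 # (T, d) keys from text tokens
    v = text @ w_v                                 # (T, d) values from text tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (N, T) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over text tokens
    return attn @ v, attn

# Toy sizes and random weights (hypothetical, not taken from any real model).
rng = np.random.default_rng(0)
H, W, d_vis, T, d_txt, d = 8, 8, 32, 5, 16, 24
visual = rng.normal(size=(H * W, d_vis))
text = rng.normal(size=(T, d_txt))
w_q = rng.normal(size=(d_vis, d))
w_k = rng.normal(size=(d_txt, d))
w_v = rng.normal(size=(d_txt, d))

out, attn = cross_attention(visual, text, w_q, w_k, w_v)
attn_maps = attn.T.reshape(T, H, W)  # one spatial attention map per text token
```

Each row of attn sums to 1, so every pixel distributes its attention over the text tokens; reshaping a column back to (H, W) shows where a given word is attended.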

Fig. 1: simply clustering the diffusion model’s internal features already yields a rough segmentation
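
A toy sketch of the clustering idea, assuming the internal features are available as an (H, W, C) array. The feature map here is synthetic (two regions with different means), not real diffusion features, and the kmeans helper with its farthest-point initialization is my own simplification rather than whatever the paper used for the figure.

```python
import numpy as np

def kmeans(x, k, iters=10):
    """Plain k-means on the rows of x, with deterministic farthest-point init."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen center.
        d2 = np.min([((x - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(x[d2.argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(iters):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, k)
        labels = d2.argmin(axis=1)                                  # assign step
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)            # update step
    return labels, centers

# Synthetic stand-in for a feature map: the right half is a different "object".
rng = np.random.default_rng(0)
H, W, C = 16, 16, 8
feats = rng.normal(scale=0.1, size=(H, W, C))
feats[:, W // 2:, :] += 3.0

labels, _ = kmeans(feats.reshape(-1, C), k=2)
seg = labels.reshape(H, W)  # per-pixel cluster id, i.e. a crude segmentation
```

The clusters carry no category names, which is why the method still needs a step that ties masks to text.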

diffusion model features -> mask generator that proposes masks for all possible concepts (trained with annotated masks) -> each mask is classified into one of many categories by associating it with text embeddings.
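
A rough sketch of the last step, associating masks with text embeddings. Assumptions: features are mask-average-pooled, compared to category text embeddings by cosine similarity, and passed through a softmax with a temperature. Whether ODISE pools and normalizes exactly this way is not something these notes confirm. The one-hot "text embeddings" are placeholders, not real text-encoder outputs, and classify_masks is a hypothetical helper.

```python
import numpy as np

def classify_masks(feats, masks, text_emb, temperature=0.07):
    """feats: (H, W, C); masks: (M, H, W) with values in [0, 1]; text_emb: (K, C)."""
    M = masks.shape[0]
    flat_f = feats.reshape(-1, feats.shape[-1])   # (H*W, C)
    flat_m = masks.reshape(M, -1)                 # (M, H*W)
    # Mask-weighted average of the features inside each mask.
    pooled = flat_m @ flat_f / (flat_m.sum(axis=1, keepdims=True) + 1e-6)
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-6
    t = text_emb / (np.linalg.norm(text_emb, axis=1, keepdims=True) + 1e-6)
    logits = pooled @ t.T / temperature           # (M, K) scaled cosine similarity
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax over categories
    return probs

# Toy setup: 3 hypothetical categories with one-hot placeholder embeddings.
rng = np.random.default_rng(0)
H, W, C = 16, 16, 8
categories = ["cat", "dog", "sky"]
text_emb = np.eye(len(categories), C)

feats = rng.normal(scale=0.05, size=(H, W, C))
feats[:, : W // 2] += text_emb[0]   # left half looks like "cat"
feats[:, W // 2 :] += text_emb[2]   # right half looks like "sky"

masks = np.zeros((2, H, W))
masks[0, :, : W // 2] = 1.0         # mask 0 covers the left half
masks[1, :, W // 2 :] = 1.0         # mask 1 covers the right half

probs = classify_masks(feats, masks, text_emb)
pred = [categories[i] for i in probs.argmax(axis=1)]
```

Because the category list only enters through text_emb, swapping in embeddings for new category names changes the label set without retraining the mask generator, which is the open-vocabulary part.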

Training

Mask generator

Image caption supervision

Grounding loss

still have to keep reading

Last Reviewed: 10/28/2025