Masked Image Modeling

Context Encoders - Feature Learning by Inpainting, 2016, Berkeley

2022 - MAE - masked autoencoder

Each patch = one visual token
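
A minimal sketch of how an image turns into patch tokens, assuming a single-channel image stored as a list of rows and a side length divisible by the patch size. The helper name `patchify` and the 224 / 16 sizes are illustrative assumptions, not something stated in these notes.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch tiles.

    Returns tokens in row-major order; each token is a flat list of pixel values.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Assumed sizes: a 224 x 224 image with 16 x 16 patches gives 14 * 14 = 196 tokens.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
tokens = patchify(image, 16)
```

Each token here is just raw pixels; a real ViT would additionally project each patch to an embedding and add a positional embedding.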

random masking
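
One possible way to do random masking, sketched under assumptions: shuffle the patch indices and keep a prefix as the visible set. The function name `random_masking`, the 196-patch count, and the fixed seed are illustrative choices only.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Choose visible patches by shuffling indices and keeping the first (1 - ratio) share."""
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)                      # uniform random permutation of patch indices
    keep = sorted(order[:num_keep])         # indices the encoder will see
    masked = sorted(order[num_keep:])       # indices the decoder must predict
    return keep, masked


rng = random.Random(0)                      # fixed seed only so the sketch is reproducible
keep, masked = random_masking(196, 0.75, rng)
```

With 196 patches and a 0.75 ratio this leaves 49 visible and 147 masked patches, and every patch lands in exactly one of the two sets.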

Use a ViT to encode only the visible patches into latents. Put mask tokens back after encoding, then the decoder predicts the masked (unknown) patches.
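
The flow above can be sketched with stand-in functions. This is a toy illustration, not the real model: `mae_forward`, the string tokens, and the stub encoder / decoder are all hypothetical, and a real implementation would use transformer blocks and positional embeddings.

```python
def mae_forward(tokens, keep, encoder, decoder, mask_token):
    """Encode visible tokens only, re-insert mask tokens, then decode the full sequence."""
    visible = [tokens[i] for i in keep]
    latents = encoder(visible)              # encoder never sees the masked patches
    full = [mask_token] * len(tokens)       # start with a mask token at every position
    for i, z in zip(keep, latents):
        full[i] = z                         # put each latent back at its original position
    return decoder(full)                    # decoder works on the full-length sequence


# Stub modules (assumptions for illustration): the "encoder" upper-cases, the "decoder" is identity.
seen_by_encoder = []


def toy_encoder(seq):
    seen_by_encoder.extend(seq)
    return [t.upper() for t in seq]


def toy_decoder(seq):
    return list(seq)


out = mae_forward(["p0", "p1", "p2", "p3"], [1, 3], toy_encoder, toy_decoder, "[MASK]")
```

In this toy run the encoder sees only `p1` and `p3`, while the decoder receives all four positions, two of them holding the mask token.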

Need to mask a very large portion of the patches for the task to be useful.

Initially, MAE just interpolates colors; eventually it recovers the image.

Good for transfer learning - a 75% masking ratio results in the best transfer accuracy. In language it's 15%.

Information in language is less redundant (since it is generated by human beings)

NeurIPS 2022 - Masked Autoencoders As Spatiotemporal Learners.

Decoder is lightweight: <10% of the computation per token vs the encoder. The full token set is only processed by the decoder.
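
A back-of-envelope sketch of why this split is cheap. Assumptions, all illustrative: 196 patches, a 75% masking ratio, cost that scales linearly with the number of tokens (attention's quadratic term is ignored), and a decoder that costs exactly 10% of the encoder per token, which is the upper bound from the note above.

```python
def tokens_processed(num_patches, mask_ratio):
    """Count how many tokens each module handles per image."""
    visible = int(num_patches * (1 - mask_ratio))
    return {"encoder": visible, "decoder": num_patches}


def relative_cost(num_patches, mask_ratio, decoder_cost_per_token=0.1):
    """Cost relative to running the encoder on every token (encoder cost per token = 1).

    Assumes linear per-token cost; decoder_cost_per_token is an assumed upper bound.
    """
    counts = tokens_processed(num_patches, mask_ratio)
    total = counts["encoder"] * 1.0 + counts["decoder"] * decoder_cost_per_token
    return total / num_patches


counts = tokens_processed(196, 0.75)
cost = relative_cost(196, 0.75)
```

Under these assumptions the encoder handles 49 tokens, the decoder 196, and the total comes to roughly a third of what encoding all tokens with the heavy encoder would cost.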

MAE - independent

Applications:

robotics - multi-view, medical images, 3D geometry, graphs, audio (spectrogram).

Last Reviewed: 10/25/2025