Mason Wang

CNNs

If we need to stride by more than two, we should do this later.

EDM’s lesson - if we want to aggregate spatial information, using attention instead of downsampling is better.

Zero-initialize layers before residual connections

Zero-initialize biases