CNNs
If we need to stride by more than two, we should do this later.
EDM’s lesson - if we want to aggregate spatial information, using attention instead of downsampling is better.
Zero-initialize layers before residual connections
Zero-initialize biases