Lightning
1. LightningModule
- No need to manually loop over epochs/batches, switch between train/eval modes, or enable/disable gradients; the Trainer handles all of that (see the sketch below)
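A minimal LightningModule sketch (the toy model, layer sizes, and learning rate are made up for illustration). Only training_step and configure_optimizers are strictly required; the Trainer runs the loops, train/eval switching, and gradient bookkeeping around them:

```python
import torch
from torch import nn
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    """Toy classifier; the architecture is illustrative only."""

    def __init__(self, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        # No optimizer.zero_grad()/loss.backward()/optimizer.step() here;
        # the Trainer handles the loop, gradients, and device placement.
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        # Lightning switches to eval mode and disables gradients for this.
        x, y = batch
        self.log("val_loss", self.loss_fn(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```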
2. DataModule
- Hooks: prepare_data, setup, train_dataloader / val_dataloader / test_dataloader (see the sketch below)
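A sketch of a LightningDataModule and its main hooks; MNIST, the split sizes, and the batch size are just placeholders:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST

class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir="./data", batch_size=64):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def prepare_data(self):
        # Called once (per node) for downloads etc.; don't assign state here.
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Called on every process under DDP; split and assign datasets here.
        if stage in (None, "fit"):
            full = MNIST(self.data_dir, train=True, transform=self.transform)
            self.train_set, self.val_set = random_split(full, [55000, 5000])
        if stage in (None, "test"):
            self.test_set = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=self.batch_size)
```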
3. Trainer
- Gradient clipping (gradient_clip_val)
- Distributed training (DDP); a combined Trainer sketch follows this list
- min_epochs = minimum number of epochs (default 1)
- max_epochs = 1000 (default 1000)
- min_steps, max_steps (takes precedence over epochs)
- check_val_every_n_epoch (default 1; raise to 10 or 100 to validate less often)
- val_check_interval (useful when one epoch takes days; an int validates every n training steps, a float validates after that fraction of an epoch)
- num_sanity_val_steps (runs a few validation batches before training as a sanity check; default 2, 0 to turn off, -1 for the full validation loop)
- limit_train_batches, limit_val_batches, limit_test_batches (shorten the train/val/test loops, e.g. to run 10-20 quick epochs while debugging)
- limit_val_batches=0.1 uses 10% of the validation batches; an int means a fixed number of batches
- gpus=8 (use 8 GPUs), or pass a list of GPU indices (PCI bus ordering), or -1 for all GPUs
- auto_select_gpus=True -> automatically picks that many available GPUs (helpful when some GPUs are busy or in exclusive mode)
- log_gpu_memory='all' or 'min_max' -> logs GPU memory usage via nvidia-smi, which may slow training
- useful for catching memory leaks
- benchmark=True -> enables cudnn.benchmark for speedups, but hurts if input sizes vary between batches
- deterministic=True -> reproducible results, at the cost of speed
- num_nodes - number of compute nodes.
- "ddp" -> PyTorch DistributedDataParallel: one process per GPU, gradients synced across processes
- effective batch_size = num_nodes * num_gpus * per_gpu_batch_size
- need to set the seed (seed_everything), since otherwise each process initializes different model weights
- can't use DDP in a notebook/Colab, or if you call fit multiple times; then you need ddp_spawn, but that pickles everything, num_workers > 0 doesn't work well, and the model on the original process will not be updated
- DDP is not supported on Windows.
- DataParallel ("dp") - splits each batch across GPUs within a single process, which involves a lot of data transfer
- DDP2 - DP within a node, DDP across nodes; useful e.g. for negative samples/contrastive training that needs the full batch on each node
- ddp_cpu - runs DDP on CPU processes; useful for debugging distributed code without GPUs
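A combined Trainer sketch pulling the flags above together. Flag names follow the older 1.x API these notes use (gpus, accelerator='ddp', log_gpu_memory); newer releases rename some of them. LitClassifier and MNISTDataModule refer to the sketches above:

```python
import pytorch_lightning as pl

# Make weight init and data splits identical across DDP processes.
pl.seed_everything(42)

trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=1000,
    check_val_every_n_epoch=10,   # or val_check_interval=0.25 for long epochs
    num_sanity_val_steps=2,       # 0 = off, -1 = full validation loop
    limit_train_batches=0.2,      # float = fraction, int = number of batches
    limit_val_batches=0.1,
    gradient_clip_val=0.5,
    gpus=8,                       # or a list of GPU indices, or -1 for all
    num_nodes=2,
    accelerator="ddp",            # one process per GPU, gradients synced
    log_gpu_memory="min_max",     # via nvidia-smi; may slow training
    benchmark=True,               # cudnn.benchmark; only if input sizes are fixed
    deterministic=False,
    precision=16,                 # mixed precision (see below)
)

model = LitClassifier()
dm = MNISTDataModule()
trainer.fit(model, datamodule=dm)
```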
GPU training
- delete all .cuda(), .to(device) calls
- initialize new tensors with device=self.device, and register module-level tensors with register_buffer so they move with the model
- z = z.type_as(x) matches both the dtype and the device of x (see the sketch below)
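A sketch of device-agnostic tensor creation inside a LightningModule (the class, layer, and tensor names are illustrative):

```python
import torch
from torch import nn
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)
        # Registered buffers move to the correct device together with the module.
        self.register_buffer("running_mean", torch.zeros(10))

    def training_step(self, batch, batch_idx):
        x, _ = batch
        # Option 1: create the tensor directly on the module's device.
        noise = torch.randn(x.size(0), 10, device=self.device)
        # Option 2: match the dtype and device of an existing tensor.
        z = torch.randn(x.size(0), 10).type_as(x)
        return self.layer(x + noise + z - self.running_mean).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```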
Mixed Precision
- enabled with precision=16; Lightning also casts buffers (sketch below)
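A minimal sketch of turning on mixed precision through the Trainer:

```python
import pytorch_lightning as pl

# 16-bit mixed precision on a single GPU; per the note above, Lightning handles
# the casting (including buffers), so no manual .half() calls are needed.
trainer = pl.Trainer(gpus=1, precision=16)
```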
Last Reviewed: 4/28/25