Engine

Hook

Introduction

The hook mechanism is widely used in the OpenMMLab open-source algorithm libraries. Hooks are inserted into the Runner so that the entire life cycle of the training process can be managed easily. You can learn more about hooks in the related article.

Hooks only work after being registered into the runner. At present, hooks are mainly divided into two categories:

  • default hooks

These hooks are registered by the runner by default. They generally fulfill basic functions and come with a default priority, so you do not need to modify the priority.

  • custom hooks

Custom hooks are registered through custom_hooks. They generally provide enhanced functions, and their priority needs to be specified in the configuration file. If you do not specify a priority, it is set to 'NORMAL' by default.

Priority list:

| Level            | Value |
| ---------------- | ----- |
| HIGHEST          | 0     |
| VERY_HIGH        | 10    |
| HIGH             | 30    |
| ABOVE_NORMAL     | 40    |
| NORMAL (default) | 50    |
| BELOW_NORMAL     | 60    |
| LOW              | 70    |
| VERY_LOW         | 90    |
| LOWEST           | 100   |

The priority determines the execution order of the hooks. Before training, the log will print out the execution order of the hooks at each stage to facilitate debugging.
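For illustration, a custom hook is typically a subclass of MMEngine's Hook that overrides one or more life-cycle methods and is registered into the HOOKS registry. The sketch below is a made-up example (the hook name, its behavior, and the mmselfsup.registry import path are assumptions for illustration, not an existing hook):

import torch
from mmengine.hooks import Hook

from mmselfsup.registry import HOOKS  # assumed registry location


@HOOKS.register_module()
class CheckInvalidLossHook(Hook):
    """Warn if any logged loss becomes NaN or Inf (illustrative only)."""

    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        # `outputs` holds the log variables (including losses) of this iteration.
        if outputs is None:
            return
        for name, value in outputs.items():
            if isinstance(value, torch.Tensor) and not torch.isfinite(value).all():
                runner.logger.warning(f'{name} is NaN or Inf at iter {runner.iter}')

It would then be enabled through custom_hooks:

custom_hooks = [
    dict(type='CheckInvalidLossHook', priority='NORMAL')
]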

Default hooks

The following common hooks are already registered by default, which is implemented through register_default_hooks in MMEngine:

| Hooks               | Usage                                                                                                                     | Priority          |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------ | ----------------- |
| RuntimeInfoHook     | Update runtime information into the message hub.                                                                          | VERY_HIGH (10)    |
| IterTimerHook       | Log the time spent during each iteration.                                                                                 | NORMAL (50)       |
| DistSamplerSeedHook | Ensure the distributed Sampler shuffle is active.                                                                         | NORMAL (50)       |
| LoggerHook          | Collect logs from different components of the Runner and write them to the terminal, JSON file, TensorBoard, wandb, etc.  | BELOW_NORMAL (60) |
| ParamSchedulerHook  | Update some hyper-parameters in the optimizer, e.g., learning rate and momentum.                                          | LOW (70)          |
| CheckpointHook      | Save checkpoints periodically.                                                                                            | VERY_LOW (90)     |
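The arguments of these default hooks can be overridden through the default_hooks field of the config. For example (the interval values below are only illustrative, not the defaults):

default_hooks = dict(
    # print logs every 50 iterations
    logger=dict(type='LoggerHook', interval=50),
    # save a checkpoint every epoch and keep at most 3 checkpoints
    checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))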

Common Hooks implemented in MMEngine

Some hooks have already been implemented in MMEngine:

| Hooks                  | Usage                                                                                            | Priority     |
| ---------------------- | ------------------------------------------------------------------------------------------------ | ------------ |
| EMAHook                | Apply Exponential Moving Average (EMA) to the model during training.                              | NORMAL (50)  |
| EmptyCacheHook         | Release all unoccupied cached GPU memory during training.                                         | NORMAL (50)  |
| SyncBuffersHook        | Synchronize model buffers, such as running_mean and running_var in BN, at the end of each epoch.  | NORMAL (50)  |
| NaiveVisualizationHook | Show or write the predicted results during testing.                                               | LOWEST (100) |
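These hooks are not registered by default; to use one of them, add it to custom_hooks. For example, to apply EMA to the model during training (the momentum value here is only illustrative):

custom_hooks = [
    dict(type='EMAHook', momentum=0.0001, priority='ABOVE_NORMAL')
]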

Hooks implemented in MMSelfSup

Some hooks have already been implemented in MMSelfSup.

An example:

Take DenseCLHook as an example. This hook performs loss_lambda warmup in DenseCL.

loss_lambda is the loss weight for the single and dense contrastive losses. It defaults to 0.5.

losses = dict()
losses['loss_single'] = loss_single * (1 - self.loss_lambda)
losses['loss_dense'] = loss_dense * self.loss_lambda

DenseCLHook is implemented as follows:

...
@HOOKS.register_module()
class DenseCLHook(Hook):
...
    def before_train_iter(self,
                          runner,
                          batch_idx: int,
                          data_batch: Optional[Sequence[dict]] = None) -> None:
...
        cur_iter = runner.iter
        if cur_iter >= self.start_iters:
            get_model(runner.model).loss_lambda = self.loss_lambda
        else:
            get_model(runner.model).loss_lambda = 0.

If the hook is already implemented in MMEngine or MMSelfSup, you can directly modify the config to use it, as below:

custom_hooks = [
    dict(type='MMEngineHook', a=a_value, b=b_value, priority='NORMAL')
]

For example, to use DenseCLHook with start_iters set to 500:

custom_hooks = [
    dict(type='DenseCLHook', start_iters=500)
]

Optimizer

We will introduce the optimizer-related components in three parts: optimizer, optimizer wrapper, and constructor.

Optimizer

Customize optimizer supported by PyTorch

We already support all the optimizers implemented by PyTorch; see mmengine/optim/optimizer/builder.py. To use and modify them, change the optimizer field of the config files.

For example, if you want to use SGD, the modification could be as follows.

optimizer = dict(type='SGD', lr=0.0003, weight_decay=0.0001)
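In MMEngine-style configs, this optimizer dict is then passed to the runner through an optimizer wrapper (see the Optimizer wrapper section below):

optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)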

To modify the learning rate of the model, just modify the lr in the config of optimizer. You can also directly set other arguments according to the API doc of PyTorch.

For example, if you want to use Adam with the setting torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False) in PyTorch, the config should look like:

optimizer = dict(type='Adam', lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)

Parameter-wise configuration

Some models may need parameter-specific settings for optimization, for example, no weight decay for the BatchNorm layers or for the bias in each layer. To configure them finely, we can use the paramwise_cfg in optim_wrapper.

For example, in MAE, we do not want to apply weight decay to the parameters of ln, bias, pos_embed, mask_token and cls_token, so we can use the following config:

optimizer = dict(
    type='AdamW', lr=1.5e-4 * 4096 / 256, betas=(0.9, 0.95), weight_decay=0.05)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    paramwise_cfg=dict(
        custom_keys={
            'ln': dict(decay_mult=0.0),
            'bias': dict(decay_mult=0.0),
            'pos_embed': dict(decay_mult=0.),
            'mask_token': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.)
        }))
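The keys in custom_keys are matched against the parameter names (a key applies when it appears as a substring of the name). As a quick sanity check, the following sketch lists the parameters a key would affect; model is assumed to be the built algorithm, and this snippet is not part of the config:

# List the parameters whose weight decay will be zeroed out by the
# custom_keys above; keys are matched as substrings of parameter names.
keys = ('ln', 'bias', 'pos_embed', 'mask_token', 'cls_token')
for name, _ in model.named_parameters():
    if any(key in name for key in keys):
        print(name)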

Implemented optimizers in MMSelfSup

In addition to the optimizers implemented by PyTorch, we also implement a customized LARS optimizer in mmselfsup/engine/optimizers/lars.py, which applies layer-wise adaptive rate scaling to SGD.

optimizer = dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6)

Optimizer wrapper

Besides the basic functions of PyTorch optimizers, we also provide some enhanced features, such as gradient clipping, gradient accumulation and automatic mixed precision training. Please refer to MMEngine for more details.

Gradient clipping

Currently we support the clip_grad option in optim_wrapper; you can refer to OptimWrapper and the PyTorch documentation for more arguments. Here is an example:

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    clip_grad=dict(
        max_norm=0.2,
        norm_type=2))
# norm_type: type of the used p-norm, here norm_type is 2.

If clip_grad is not None, its contents will be passed as the arguments of torch.nn.utils.clip_grad_norm_().
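For reference, the config above makes the wrapper run the equivalent of the following call right before each optimizer step (model is assumed to be the trained module):

import torch

# Clip the global L2 norm of all gradients to 0.2 before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.2, norm_type=2)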

Gradient accumulation

When there are not enough computing resources, the batch size can only be set to a small value, which may degrade the performance of the model. Gradient accumulation can be used to solve this problem.

Here is an example:

train_dataloader = dict(batch_size=64)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    accumulative_counts=4)

This indicates that, during training, gradients are accumulated and the optimizer performs one parameter update every 4 iterations. The above is equivalent to:

train_dataloader = dict(batch_size=256)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    accumulative_counts=1)
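Conceptually, gradient accumulation with accumulative_counts=4 corresponds to the following plain PyTorch loop (a simplified sketch, not MMEngine's actual implementation; model, criterion, dataloader and optimizer are assumed to exist):

accumulative_counts = 4
optimizer.zero_grad()
for idx, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets)
    # scale the loss so the accumulated gradient matches one large-batch step
    (loss / accumulative_counts).backward()
    if (idx + 1) % accumulative_counts == 0:
        optimizer.step()
        optimizer.zero_grad()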

Automatic mixed precision (AMP) training

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(type='AmpOptimWrapper', optimizer=optimizer)

The default loss_scale setting of AmpOptimWrapper is 'dynamic'.
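Under the hood, AMP training relies on PyTorch's native AMP utilities; one training iteration roughly corresponds to the following sketch (simplified; model, criterion, dataloader and optimizer are assumed to exist):

import torch

scaler = torch.cuda.amp.GradScaler()  # 'dynamic' loss scaling
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscale gradients, then step
    scaler.update()                # adjust the scale factor for the next iteration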

Constructor

The constructor builds the optimizer and the optimizer wrapper, and customizes hyper-parameters for different layers. The paramwise_cfg key of optim_wrapper in the configs controls this customization.

Constructors implemented in MMSelfsup

LearningRateDecayOptimWrapperConstructor sets different learning rates for different layers of the backbone. Note: currently, this optimizer constructor is built for ViT, Swin and MixMIM.

An example:

optim_wrapper = dict(
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW', lr=5e-3, model_type='swin', layer_decay_rate=0.9),
    clip_grad=dict(max_norm=5.0),
    paramwise_cfg=dict(
        norm_decay_mult=0.0,
        bias_decay_mult=0.0,
        custom_keys={
            '.absolute_pos_embed': dict(decay_mult=0.0),
            '.relative_position_bias_table': dict(decay_mult=0.0)
        }),
    constructor='mmselfsup.LearningRateDecayOptimWrapperConstructor')

Note: paramwise_cfg only supports the customization of weight_decay in LearningRateDecayOptimWrapperConstructor.
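For reference, layer-wise learning-rate decay multiplies the base learning rate of each layer by a power of layer_decay_rate, so earlier layers are updated more gently. A rough sketch of the rule (illustrative, not the constructor's exact code):

def layer_lr(base_lr, layer_decay_rate, layer_id, num_layers):
    # layer_id = 0 for the embedding, larger for layers closer to the head
    return base_lr * layer_decay_rate ** (num_layers - layer_id)

# e.g. with lr=5e-3, layer_decay_rate=0.9 and 12 layers, the embedding gets
# about 5e-3 * 0.9**12 ≈ 1.4e-3, while the last layer keeps roughly 5e-3.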
