Engine

Hook

Introduction

The hook mechanism is widely used in the OpenMMLab open-source algorithm libraries. Hooks are inserted into the Runner so that the entire life cycle of the training process can be managed easily. You can learn more about hooks in the related article.

Hooks only work after being registered into the runner. At present, hooks are mainly divided into two categories:

  • default hooks

These hooks are registered by the runner by default. They generally fulfill basic functions and come with a default priority, so you do not need to modify the priority.

  • custom hooks

Custom hooks are registered through custom_hooks. They generally provide enhanced functions, and their priority needs to be specified in the configuration file. If you do not specify a priority, it is set to 'NORMAL' by default.

Priority list:

| Level            | Value |
| ---------------- | ----- |
| HIGHEST          | 0     |
| VERY_HIGH        | 10    |
| HIGH             | 30    |
| ABOVE_NORMAL     | 40    |
| NORMAL (default) | 50    |
| BELOW_NORMAL     | 60    |
| LOW              | 70    |
| VERY_LOW         | 90    |
| LOWEST           | 100   |

The priority determines the execution order of the hooks. Before training, the log will print out the execution order of the hooks at each stage to facilitate debugging.
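For illustration, a custom hook is typically a subclass of MMEngine's Hook that overrides one or more life-cycle methods and is registered into the HOOKS registry. The sketch below is a made-up example (the hook name, its behavior, and the mmselfsup.registry import path are assumptions for illustration, not an existing hook):

import torch
from mmengine.hooks import Hook

from mmselfsup.registry import HOOKS  # assumed registry location


@HOOKS.register_module()
class CheckInvalidLossHook(Hook):
    """Warn if any logged loss becomes NaN or Inf (illustrative only)."""

    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        # `outputs` holds the log variables (including losses) of this iteration.
        if outputs is None:
            return
        for name, value in outputs.items():
            if isinstance(value, torch.Tensor) and not torch.isfinite(value).all():
                runner.logger.warning(f'{name} is NaN or Inf at iter {runner.iter}')

It would then be enabled through custom_hooks:

custom_hooks = [
    dict(type='CheckInvalidLossHook', priority='NORMAL')
]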

Default hooks

The following common hooks are already registered by default, which is implemented through register_default_hooks in MMEngine:

| Hooks               | Usage                                                                                                                     | Priority          |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------ | ----------------- |
| RuntimeInfoHook     | Update runtime information into the message hub.                                                                          | VERY_HIGH (10)    |
| IterTimerHook       | Log the time spent during each iteration.                                                                                 | NORMAL (50)       |
| DistSamplerSeedHook | Ensure the distributed Sampler shuffle is active.                                                                         | NORMAL (50)       |
| LoggerHook          | Collect logs from different components of the Runner and write them to the terminal, JSON file, TensorBoard, wandb, etc.  | BELOW_NORMAL (60) |
| ParamSchedulerHook  | Update some hyper-parameters in the optimizer, e.g., learning rate and momentum.                                          | LOW (70)          |
| CheckpointHook      | Save checkpoints periodically.                                                                                            | VERY_LOW (90)     |
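The arguments of these default hooks can be overridden through the default_hooks field of the config. For example (the interval values below are only illustrative, not the defaults):

default_hooks = dict(
    # print logs every 50 iterations
    logger=dict(type='LoggerHook', interval=50),
    # save a checkpoint every epoch and keep at most 3 checkpoints
    checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))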

Common Hooks implemented in MMEngine

Some hooks have already been implemented in MMEngine:

| Hooks                  | Usage                                                                                            | Priority     |
| ---------------------- | ------------------------------------------------------------------------------------------------ | ------------ |
| EMAHook                | Apply Exponential Moving Average (EMA) to the model during training.                              | NORMAL (50)  |
| EmptyCacheHook         | Release all unoccupied cached GPU memory during training.                                         | NORMAL (50)  |
| SyncBuffersHook        | Synchronize model buffers, such as running_mean and running_var in BN, at the end of each epoch.  | NORMAL (50)  |
| NaiveVisualizationHook | Show or write the predicted results during testing.                                               | LOWEST (100) |
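These hooks are not registered by default; to use one of them, add it to custom_hooks. For example, to apply EMA to the model during training (the momentum value here is only illustrative):

custom_hooks = [
    dict(type='EMAHook', momentum=0.0001, priority='ABOVE_NORMAL')
]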

Hooks implemented in MMSelfSup

Some hooks have already been implemented in MMSelfSup.

An example:

Take DenseCLHook as an example. This hook performs loss_lambda warmup in DenseCL.

loss_lambda is the loss weight for the single and dense contrastive losses. It defaults to 0.5.

losses = dict()
losses['loss_single'] = loss_single * (1 - self.loss_lambda)
losses['loss_dense'] = loss_dense * self.loss_lambda

DenseCLHook is implemented as follows:

...
@HOOKS.register_module()
class DenseCLHook(Hook):
...
    def before_train_iter(self,
                          runner,
                          batch_idx: int,
                          data_batch: Optional[Sequence[dict]] = None) -> None:
...
        cur_iter = runner.iter
        if cur_iter >= self.start_iters:
            get_model(runner.model).loss_lambda = self.loss_lambda
        else:
            get_model(runner.model).loss_lambda = 0.

If the hook is already implemented in MMEngine or MMSelfSup, you can directly modify the config to use it, as below:

custom_hooks = [
    dict(type='MMEngineHook', a=a_value, b=b_value, priority='NORMAL')
]

For example, to use DenseCLHook with start_iters set to 500:

custom_hooks = [
    dict(type='DenseCLHook', start_iters=500)
]

Optimizer

We will introduce the optimizer-related components in three parts: optimizer, optimizer wrapper, and constructor.

Optimizer

Customize optimizer supported by PyTorch

We already support all the optimizers implemented by PyTorch; see mmengine/optim/optimizer/builder.py. To use and modify them, change the optimizer field of the config files.

For example, if you want to use SGD, the modification could be as follows.

optimizer = dict(type='SGD', lr=0.0003, weight_decay=0.0001)
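In MMEngine-style configs, this optimizer dict is then passed to the runner through an optimizer wrapper (see the Optimizer wrapper section below):

optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)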

To modify the learning rate of the model, just modify the lr in the config of optimizer. You can also directly set other arguments according to the API doc of PyTorch.

For example, if you want to use Adam with the setting torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False) in PyTorch, the config should look like:

optimizer = dict(type='Adam', lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)

Parameter-wise configuration

Some models may need parameter-specific settings for optimization, for example, no weight decay for the BatchNorm layers or for the bias in each layer. To configure them finely, we can use the paramwise_cfg in optim_wrapper.

For example, in MAE, we do not want to apply weight decay to the parameters of ln, bias, pos_embed, mask_token and cls_token, so we can use the following config:

optimizer = dict(
    type='AdamW', lr=1.5e-4 * 4096 / 256, betas=(0.9, 0.95), weight_decay=0.05)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    paramwise_cfg=dict(
        custom_keys={
            'ln': dict(decay_mult=0.0),
            'bias': dict(decay_mult=0.0),
            'pos_embed': dict(decay_mult=0.),
            'mask_token': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.)
        }))
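The keys in custom_keys are matched against the parameter names (a key applies when it appears as a substring of the name). As a quick sanity check, the following sketch lists the parameters a key would affect; model is assumed to be the built algorithm, and this snippet is not part of the config:

# List the parameters whose weight decay will be zeroed out by the
# custom_keys above; keys are matched as substrings of parameter names.
keys = ('ln', 'bias', 'pos_embed', 'mask_token', 'cls_token')
for name, _ in model.named_parameters():
    if any(key in name for key in keys):
        print(name)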

Implemented optimizers in MMSelfSup

In addition to the optimizers implemented by PyTorch, we also implement a customized LARS optimizer in mmselfsup/engine/optimizers/lars.py, which applies layer-wise adaptive rate scaling to SGD.

optimizer = dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6)

Optimizer wrapper

Besides the basic functions of PyTorch optimizers, we also provide some enhanced features, such as gradient clipping, gradient accumulation and automatic mixed precision training. Please refer to MMEngine for more details.

Gradient clipping

Currently we support the clip_grad option in optim_wrapper; you can refer to OptimWrapper and the PyTorch documentation for more arguments. Here is an example:

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    clip_grad=dict(
        max_norm=0.2,
        norm_type=2))
# norm_type: type of the used p-norm, here norm_type is 2.

If clip_grad is not None, its contents will be passed as the arguments of torch.nn.utils.clip_grad_norm_().
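For reference, the config above makes the wrapper run the equivalent of the following call right before each optimizer step (model is assumed to be the trained module):

import torch

# Clip the global L2 norm of all gradients to 0.2 before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.2, norm_type=2)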

Gradient accumulation

When there are not enough computing resources, the batch size can only be set to a small value, which may degrade the performance of the model. Gradient accumulation can be used to solve this problem.

Here is an example:

train_dataloader = dict(batch_size=64)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    accumulative_counts=4)

This indicates that, during training, gradients are accumulated and the optimizer performs one parameter update every 4 iterations. The above is equivalent to:

train_dataloader = dict(batch_size=256)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    accumulative_counts=1)
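Conceptually, gradient accumulation with accumulative_counts=4 corresponds to the following plain PyTorch loop (a simplified sketch, not MMEngine's actual implementation; model, criterion, dataloader and optimizer are assumed to exist):

accumulative_counts = 4
optimizer.zero_grad()
for idx, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets)
    # scale the loss so the accumulated gradient matches one large-batch step
    (loss / accumulative_counts).backward()
    if (idx + 1) % accumulative_counts == 0:
        optimizer.step()
        optimizer.zero_grad()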

Automatic mixed precision (AMP) training

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(type='AmpOptimWrapper', optimizer=optimizer)

The default loss_scale setting of AmpOptimWrapper is 'dynamic'.
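Under the hood, AMP training relies on PyTorch's native AMP utilities; one training iteration roughly corresponds to the following sketch (simplified; model, criterion, dataloader and optimizer are assumed to exist):

import torch

scaler = torch.cuda.amp.GradScaler()  # 'dynamic' loss scaling
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscale gradients, then step
    scaler.update()                # adjust the scale factor for the next iteration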

Constructor

The constructor builds the optimizer and the optimizer wrapper, and customizes hyper-parameters for different layers. The paramwise_cfg key of optim_wrapper in the configs controls this customization.

Constructors implemented in MMSelfsup

LearningRateDecayOptimWrapperConstructor sets different learning rates for different layers of the backbone. Note: currently, this optimizer constructor is built for ViT, Swin and MixMIM.

An example:

optim_wrapper = dict(
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW', lr=5e-3, model_type='swin', layer_decay_rate=0.9),
    clip_grad=dict(max_norm=5.0),
    paramwise_cfg=dict(
        norm_decay_mult=0.0,
        bias_decay_mult=0.0,
        custom_keys={
            '.absolute_pos_embed': dict(decay_mult=0.0),
            '.relative_position_bias_table': dict(decay_mult=0.0)
        }),
    constructor='mmselfsup.LearningRateDecayOptimWrapperConstructor')

Note: paramwise_cfg only supports the customization of weight_decay in LearningRateDecayOptimWrapperConstructor.
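For reference, layer-wise learning-rate decay multiplies the base learning rate of each layer by a power of layer_decay_rate, so earlier layers are updated more gently. A rough sketch of the rule (illustrative, not the constructor's exact code):

def layer_lr(base_lr, layer_decay_rate, layer_id, num_layers):
    # layer_id = 0 for the embedding, larger for layers closer to the head
    return base_lr * layer_decay_rate ** (num_layers - layer_id)

# e.g. with lr=5e-3, layer_decay_rate=0.9 and 12 layers, the embedding gets
# about 5e-3 * 0.9**12 ≈ 1.4e-3, while the last layer keeps roughly 5e-3.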
