注意
您正在阅读 MMSelfSup 0.x 版本的文档,而 MMSelfSup 0.x 版本将会在 2022 年末 开始逐步停止维护。我们建议您及时升级到 MMSelfSup 1.0.0rc 版本,享受由 OpenMMLab 2.0 带来的更多新特性和更佳的性能表现。阅读 MMSelfSup 1.0.0rc 的 发版日志, 代码 和 文档 获取更多信息。
欢迎来到 MMSelfSup 的中文文档!¶
前提¶
在这一节中,我们展示了如何用 PyTorch 准备环境。
MMSelfSup 可在 Linux 上运行 (Windows 和 macOS 平台不完全支持)。要求 Python 3.6+, CUDA 9.2+ 和 PyTorch 1.5+。
如果您对 PyTorch 很熟悉,或者已经安装了它,可以忽略这部分并转到 下一节, 不然你可以按照下列步骤进行准备。
Step 0. 从 官方网址 下载并安装 Miniconda。
Step 1. 创建 conda 环境并激活
conda create --name openmmlab python=3.8 -y
conda activate openmmlab
Step 2. 按照 官方教程 安装 PyTorch, 例如
GPU 平台:
conda install pytorch torchvision -c pytorch
CPU 平台:
conda install pytorch torchvision cpuonly -c pytorch
安装¶
我们推荐用户按照我们的最优方案来安装 MMSelfSup,不过整体流程也可以是自定义的, 可参考 自定义安装 章节
最优方案¶
Step 0. 使用 MIM 安装 MMCV:
pip install -U openmim
mim install mmcv-full
Step 1. 安装 MMSelfSup.
实例 a: 如果您需要基于 MMSelfSup 进行开发并直接运行,请从源码安装:
git clone https://github.com/open-mmlab/mmselfsup.git
cd mmselfsup
pip install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.
实例 b: 如果您以 mmselfsup 为依赖项或者第三方库, 可使用 pip 安装:
pip install mmselfsup
安装校验¶
完成上述步骤后,为了确保您正确安装了 MMSelfSup 及其各项依赖库,请使用下面的脚本完成校验:
import torch
from mmselfsup.models import build_algorithm

model_config = dict(
    type='Classification',
    backbone=dict(
        type='ResNet',
        depth=50,
        in_channels=3,
        num_stages=4,
        strides=(1, 2, 2, 2),
        dilations=(1, 1, 1, 1),
        out_indices=[4],  # 0: conv-1, x: stage-x
        norm_cfg=dict(type='BN'),
        frozen_stages=-1),
    head=dict(
        type='ClsHead', with_avg_pool=True, in_channels=2048,
        num_classes=1000))

model = build_algorithm(model_config).cuda()

image = torch.randn((1, 3, 224, 224)).cuda()
label = torch.tensor([1]).cuda()

loss = model.forward_train(image, label)
如果您能顺利运行上面脚本,恭喜您已成功配置好所有环境。
自定义安装¶
基准测试¶
依照 最优方案 可以保证基本功能, 如果您需要一些下游任务来对您的预训练模型进行评测,例如检测或者分割, 请安装 MMDetection 和 MMSegmentation。
如果您不运行 MMDetection 和 MMSegmentation 基准测试, 可以不进行安装。
您可以使用以下命令进行安装:
pip install mmdet mmsegmentation
若需要更详细的信息, 您可以参考 MMDetection 和 MMSegmentation 的安装指导页面。
CUDA 版本¶
在安装 PyTorch 时, 您需要确认 CUDA 版本。 若您对此不清楚,可以按照我们的建议:
对于安培架构的 NVIDIA GPUs, 例如 GeForce 30 系列或者 NVIDIA A100, CUDA 11 是必须的。
对于较老版本的 NVIDIA GPUs, CUDA 11 是兼容的, 但是 CUDA 10.2 具有更好的兼容性以及更加轻量化。
请确认您的 GPU 驱动满足最小版本需求。 请参考 此表 获取更多信息。
注解
如果您按照我们的最优方案进行安装,仅安装 CUDA runtime 库就足够了,因为不会在本地编译 CUDA 代码。但是,如果您希望从源码编译 MMCV 或开发其它 CUDA 算子,您需要从 NVIDIA 官网 (https://developer.nvidia.com/cuda-downloads) 安装完整的 CUDA 工具包,并且其版本需要和 PyTorch 的 CUDA 版本相匹配,即 conda install 命令中指定的 cudatoolkit 版本。
不使用 MIM 安装 MMCV¶
MMCV 包含了 C++ 和 CUDA 扩展, 因此以一种复杂的方式依赖于 PyTorch。 MIM 自动解决了这种依赖关系,并使安装更加容易,然而,这不是必须的。
使用 pip 安装 MMCV, 而不是 MIM, 请参考 MMCV 安装指南。 这需要根据 PyTorch 版本及其 CUDA 版本手动指定一个链接。
例如, 下列命令安装了 mmcv-full, 基于 PyTorch 1.10.x 和 CUDA 11.3。
pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html
另一种选择: 使用 Docker¶
我们提供了一个配置好所有环境的 Dockerfile。
# build an image with PyTorch 1.10.0, CUDA 11.3, CUDNN 8.
docker build -f ./docker/Dockerfile --rm -t mmselfsup:torch1.10.0-cuda11.3-cudnn8 .
重要: 请确保您安装了 nvidia-container-toolkit。
运行下面命令:
docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/workspace/mmselfsup/data mmselfsup:torch1.10.0-cuda11.3-cudnn8 /bin/bash
{DATA_DIR}
是保存你所有数据集的根目录。
在 Google Colab 上安装¶
Google Colab 一般已经安装了 PyTorch,因此我们只需要使用以下命令安装 MMCV 和 MMSelfSup。
Step 0. 使用 MIM 安装 MMCV
!pip3 install openmim
!mim install mmcv-full
Step 1. 安装 MMSelfSup
!git clone https://github.com/open-mmlab/mmselfsup.git
%cd mmselfsup
!pip install -e .
Step 2. 安装校验
import mmselfsup
print(mmselfsup.__version__)
# Example output: 0.9.0
注解
在 Jupyter 中,感叹号 ! 用于调用外部可执行程序,%cd 是一个魔术命令,用于改变 Python 的当前工作目录。
使用不同版本的 MMSelfSup¶
如果在您本地安装了多个版本的 MMSelfSup, 我们推荐您为这多个版本创建不同的虚拟环境。
另外一个方式就是在您程序的入口脚本处(train.py、test.py 或者其他任何程序入口脚本)插入以下代码片段:
import os.path as osp
import sys
sys.path.insert(0, osp.join(osp.dirname(osp.abspath(__file__)), '../'))
或者在不同版本的 MMSelfSup 的主目录中运行以下命令:
export PYTHONPATH="$(pwd)":$PYTHONPATH
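下面是一个简单的校验片段(仅作示意),用于确认当前 Python 环境实际加载的是哪一份 MMSelfSup:
import mmselfsup
print(mmselfsup.__version__)  # 当前使用的版本号
print(mmselfsup.__file__)     # 实际加载的代码路径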
准备数据集¶
MMSelfSup 支持多个数据集。请遵循相应的数据准备指南。建议将您的数据集根目录软链接到 $MMSELFSUP/data
。如果您的文件夹结构不同,您可能需要更改配置文件中的相应路径。
mmselfsup
├── mmselfsup
├── tools
├── configs
├── docs
├── data
│ ├── imagenet
│ │ ├── meta
│ │ ├── train
│ │ ├── val
│ ├── places205
│ │ ├── meta
│ │ ├── train
│ │ ├── val
│ ├── inaturalist2018
│ │ ├── meta
│ │ ├── train
│ │ ├── val
│ ├── VOCdevkit
│ │ ├── VOC2007
│ ├── cifar
│ │ ├── cifar-10-batches-py
准备 ImageNet 数据集¶
对于 ImageNet,它有多个版本,但最常用的是 ILSVRC 2012,请从官方网站下载并按说明准备。
准备 iNaturalist2018 数据集¶
对于 iNaturalist2018,您需要:
从 下载页面 下载训练集和验证集图像及标注
解压下载的文件
使用脚本
tools/data_converters/convert_inaturalist.py
将原来的 json 标注格式转换为列表格式
准备 PASCAL VOC 数据集¶
假设您通常将数据集存储在 $YOUR_DATA_ROOT
中。下面的命令会自动将 PASCAL VOC 2007 下载到 $YOUR_DATA_ROOT
中,准备好所需的文件,在 $MMSELFSUP
下创建一个文件夹 data
,并制作一个软链接 VOCdevkit
。
bash tools/data_converters/prepare_voc07_cls.sh $YOUR_DATA_ROOT
准备 CIFAR10 数据集¶
如果没有找到 CIFAR10,系统将会自动下载。此外,由 MMSelfSup
实现的 dataset
也会自动将 CIFAR10 转换为适当的格式。
基础教程¶
本文档提供 MMSelfSup 相关用法的基础教程。 如果您对如何安装 MMSelfSup 以及其相关依赖库有疑问, 请参考 安装文档.
训练已有的算法¶
注意: 当您启动一个任务的时候,默认会使用 8 块显卡。如果您想使用少于或多于 8 块显卡,您的总 batch size 也会同比例缩放,此时学习率应遵循线性缩放原则,即使用以下公式调整学习率: new_lr = old_lr * new_ngpus / old_ngpus
. 除此之外,我们推荐您使用 tools/dist_train.sh
来启动训练任务,即便您只使用一块显卡, 因为 MMSelfSup 中有些算法不支持非分布式训练。
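下面给出一个按线性缩放原则换算学习率的简单示意(其中的卡数和学习率数值均为假设):
# 仅作示意:按照线性缩放原则换算学习率
old_ngpus, new_ngpus = 8, 4   # 假设从 8 块显卡改为 4 块显卡
old_lr = 0.03                 # 假设的原始学习率
new_lr = old_lr * new_ngpus / old_ngpus
print(new_lr)  # 0.015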
使用 CPU 训练¶
export CUDA_VISIBLE_DEVICES=-1
python tools/train.py ${CONFIG_FILE}
注意: 我们不推荐用户使用 CPU 进行训练,因为 CPU 的训练速度很慢,并且一些算法依赖分布式训练,例如 SyncBN
需要在分布式模式下运行。我们支持 CPU 训练只是为了方便用户在没有 GPU 的机器上进行调试。
使用 单张/多张 显卡训练¶
bash tools/dist_train.sh ${CONFIG_FILE} ${GPUS} --work-dir ${YOUR_WORK_DIR} [optional arguments]
可选参数:
--resume-from ${CHECKPOINT_FILE}: 从某个 checkpoint 处继续训练。
--deterministic: 开启 “deterministic” 模式,虽然开启会使得训练速度降低,但是会保证结果可复现。
例如:
# checkpoints and logs saved in WORK_DIR=work_dirs/selfsup/odc/odc_resnet50_8xb64-steplr-440e_in1k/
bash tools/dist_train.sh configs/selfsup/odc/odc_resnet50_8xb64-steplr-440e_in1k.py 8 --work-dir work_dirs/selfsup/odc/odc_resnet50_8xb64-steplr-440e_in1k/
注意: 在训练过程中, checkpoints 和 logs 被保存在同一目录层级下.
此外, 如果您在一个被 slurm 管理的集群中训练, 您可以使用以下的脚本开展训练:
GPUS_PER_NODE=${GPUS_PER_NODE} GPUS=${GPUS} SRUN_ARGS=${SRUN_ARGS} bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${YOUR_WORK_DIR} [optional arguments]
例如:
GPUS_PER_NODE=8 GPUS=8 bash tools/slurm_train.sh Dummy Test_job configs/selfsup/odc/odc_resnet50_8xb64-steplr-440e_in1k.py work_dirs/selfsup/odc/odc_resnet50_8xb64-steplr-440e_in1k/
使用多台机器训练¶
如果您想使用由 ethernet 连接起来的多台机器, 您可以使用以下命令:
在第一台机器上:
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
在第二台机器上:
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
但是,如果您不使用高速网络连接这几台机器的话,训练将会非常慢。
如果您使用的是 slurm 来管理多台机器,您可以使用同在单台机器上一样的命令来启动任务,但是您必须得设置合适的环境变量和参数,具体可以参考slurm_train.sh。
在一台机器上启动多个任务¶
如果您想在一台机器上启动多个任务,比如说,在一台 8 卡的机器上启动两个 4 卡的任务,您需要为每个任务指定不同的端口来防止端口冲突。
如果您使用 dist_train.sh
来启动训练任务:
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash tools/dist_train.sh ${CONFIG_FILE} 4 --work-dir tmp_work_dir_1
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash tools/dist_train.sh ${CONFIG_FILE} 4 --work-dir tmp_work_dir_2
如果您使用 slurm 来启动训练任务,你有两种方式来为每个任务设置不同的端口:
方法 1:
在 config1.py
中, 做如下修改:
dist_params = dict(backend='nccl', port=29500)
在 config2.py
中,做如下修改:
dist_params = dict(backend='nccl', port=29501)
然后您可以通过 config1.py 和 config2.py 来启动两个不同的任务.
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
方法 2:
除了修改配置文件之外,您可以通过 --cfg-options 参数来重写默认的端口号:
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1 --cfg-options dist_params.port=29500
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2 --cfg-options dist_params.port=29501
基准测试¶
我们同时提供多种命令来评估您的预训练模型, 具体您可以参考Benchmarks。
工具和建议¶
统计模型的参数¶
python tools/analysis_tools/count_parameters.py ${CONFIG_FILE}
发布模型¶
在您发布一个模型之前,您可能需要做以下几件事情:
将模型的参数转为 CPU tensor.
删除 optimizer 的状态参数.
计算 checkpoint 文件的哈希值,并将其添加到 checkpoint 的文件名中.
您可以使用以下命令来完成上面几件事情:
python tools/model_converters/publish_model.py ${INPUT_FILENAME} ${OUTPUT_FILENAME}
使用 t-SNE 来做模型可视化¶
我们提供了一个开箱即用的图片向量可视化方法:
python tools/analysis_tools/visualize_tsne.py ${CONFIG_FILE} --checkpoint ${CKPT_PATH} --work-dir ${WORK_DIR} [optional arguments]
参数:
CONFIG_FILE: 训练预训练模型的参数配置文件。
CKPT_PATH: 预训练模型的路径。
WORK_DIR: 保存可视化结果的路径。
[optional arguments]: 可选参数,具体可以参考 visualize_tsne.py
MAE 可视化¶
我们提供了一个对 MAE 掩码效果和重建效果进行可视化的方法:
python tools/misc/mae_visualization.py ${IMG_PATH} ${CONFIG_FILE} ${CKPT_PATH} ${OUT_FILE} --device ${DEVICE}
参数:
IMG_PATH: 用于可视化的图片。
CONFIG_FILE: 训练预训练模型的参数配置文件。
CKPT_PATH: 预训练模型的路径。
OUT_FILE: 用于保存可视化结果的图片路径。
DEVICE: 用于推理的设备。
示例:
python tools/misc/mae_visualization.py tests/data/color.jpg configs/selfsup/mae/mae_vit-base-p16_8xb512-coslr-400e_in1k.py mae_epoch_400.pth results.jpg --device 'cuda:0'
可复现性¶
如果您想确保模型精度的可复现性,您可以设置 --deterministic
参数。但是,开启 --deterministic
意味着关闭 torch.backends.cudnn.benchmark
, 所以会使模型的训练速度变慢。
模型库¶
所有模型和部分基准测试如下。
预训练模型¶
备注:
训练细节记录在配置文件名中。
可以点击算法名获得更加全面的信息。
基准测试¶
在下列表格中,我们只展示了基于 ImageNet 数据集的线性评估,COCO17 数据集的目标检测和实例分割以及 PASCAL VOC12 Aug 数据集的语义分割任务,您可以点击预训练模型表格中的算法名查看更多基准测试结果。
ImageNet 线性评估¶
如果没有特殊说明,下列实验采用 MoCo 的设置,或者采用的训练设置写在备注中。
ImageNet 微调¶
| 算法 | 配置文件 | 备注 | Top-1 (%) |
| --- | --- | --- | --- |
| MAE | mae_vit-base-p16_8xb512-coslr-400e_in1k | | 83.1 |
| SimMIM | simmim_swin-base_16xb128-coslr-100e_in1k-192 | | 82.9 |
| CAE | cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k | | 83.2 |
| MaskFeat | maskfeat_vit-base-p16_8xb256-fp16-coslr-300e_in1k | | 83.5 |
COCO17 目标检测和实例分割¶
在 COCO17 数据集的目标检测和实例分割任务中,我们选用 MoCo 的评估设置,基于 Mask-RCNN FPN 网络架构,下列结果通过同样的 配置文件 训练得到。
| 算法 | 配置文件 | mAP (Box) | mAP (Mask) |
| --- | --- | --- | --- |
| Relative Location | relative-loc_resnet50_8xb64-steplr-70e_in1k | 37.5 | 33.7 |
| Rotation Prediction | rotation-pred_resnet50_8xb16-steplr-70e_in1k | 37.9 | 34.2 |
| NPID | npid_resnet50_8xb32-steplr-200e_in1k | 38.5 | 34.6 |
| SimCLR | simclr_resnet50_8xb32-coslr-200e_in1k | 38.7 | 34.9 |
| MoCo v2 | mocov2_resnet50_8xb32-coslr-200e_in1k | 40.2 | 36.1 |
| BYOL | byol_resnet50_8xb32-accum16-coslr-200e_in1k | 40.9 | 36.8 |
| SwAV | swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 40.2 | 36.3 |
| SimSiam | simsiam_resnet50_8xb32-coslr-100e_in1k | 38.6 | 34.6 |
| SimSiam | simsiam_resnet50_8xb32-coslr-200e_in1k | 38.8 | 34.9 |
Pascal VOC12 Aug 语义分割¶
在 Pascal VOC12 Aug 语义分割任务中,我们选用 MMSeg 的评估设置, 基于 FCN 网络架构, 下列结果通过同样的 配置文件 训练得到。
| 算法 | 配置文件 | mIOU |
| --- | --- | --- |
| Relative Location | relative-loc_resnet50_8xb64-steplr-70e_in1k | 63.49 |
| Rotation Prediction | rotation-pred_resnet50_8xb16-steplr-70e_in1k | 64.31 |
| NPID | npid_resnet50_8xb32-steplr-200e_in1k | 65.45 |
| SimCLR | simclr_resnet50_8xb32-coslr-200e_in1k | 64.03 |
| MoCo v2 | mocov2_resnet50_8xb32-coslr-200e_in1k | 67.55 |
| BYOL | byol_resnet50_8xb32-accum16-coslr-200e_in1k | 67.16 |
| SwAV | swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 63.73 |
| DenseCL | densecl_resnet50_8xb32-coslr-200e_in1k | 69.47 |
| SimSiam | simsiam_resnet50_8xb32-coslr-100e_in1k | 48.35 |
| SimSiam | simsiam_resnet50_8xb32-coslr-200e_in1k | 46.27 |
教程 0: 学习配置¶
MMSelfSup 主要使用python文件作为配置。我们设计的配置文件系统集成了模块化和继承性,方便用户实施各种实验。所有的配置文件都放在 configs
文件夹。如果你想概要地审视配置文件,你可以执行 python tools/misc/print_config.py
查看完整配置。
配置文件与检查点命名约定¶
我们遵循下述约定来命名配置文件并建议贡献者也遵循该命名风格。配置文件名字被分成4部分:算法信息、模块信息、训练信息和数据信息。逻辑上,不同部分用下划线连接 '_'
,同一部分中的单词使用破折线 '-'
连接。
{algorithm}_{module}_{training_info}_{data_info}.py
algorithm info:算法信息,包含算法名字,例如 simclr、mocov2 等;
module info:模块信息,用来表示一些 backbone、neck 和 head 信息;
training info:训练信息,即一些训练调度,包括批大小、学习率调度、数据增强等;
data info:数据信息,包括数据集名字、输入大小等,例如 imagenet、cifar 等。
算法信息¶
{algorithm}-{misc}
Algorithm
表示论文中的算法缩写和版本。例如:
relative-loc
:不同单词之间使用破折线连接'-'
simclr
mocov2
misc
提供一些其他算法相关信息。例如:
npid-ensure-neg
deepcluster-sobel
模块信息¶
{backbone setting}-{neck setting}-{head_setting}
模块信息主要包含 backbone 信息。例如:
resnet50
vit
(将会用在mocov3中)
或者其他一些需要在配置名字中强调的特殊的设置。例如:
resnet50-nofrz
:在一些下游任务的训练中,该 backbone 不会冻结 stages
训练信息¶
训练相关的配置,包括 batch size, lr schedule, data augment 等。
Batch size,格式是 {gpu x batch_per_gpu},例如 8xb32;
Training recipe,该部分以如下顺序组织:{pipeline aug}-{train aug}-{loss trick}-{scheduler}-{epochs}
例如:
8xb32-mcrop-2-6-coslr-200e:mcrop 是 SwAV 提出的 pipeline 中名为 multi-crop 的一部分,2 和 6 表示 2 个 pipeline 分别输出 2 个和 6 个裁剪图,而且裁剪信息记录在数据信息中;
8xb32-accum16-coslr-200e:accum16 表示权重会在梯度累积 16 个迭代之后更新。
数据信息¶
数据信息包含数据集,输入大小等。例如:
in1k:ImageNet1k 数据集,默认使用的输入图像大小是 224x224;
in1k-384px:表示输入图像大小是 384x384;
cifar10;
inat18:iNaturalist2018 数据集,包含 8142 类;
places205
配置文件命名示例¶
swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96.py
swav:算法信息
resnet50:模块信息
8xb32-mcrop-2-6-coslr-200e:训练信息
  8xb32:共使用 8 张 GPU,每张 GPU 上的 batch size 是 32
  mcrop-2-6:使用 multi-crop 数据增强方法
  coslr:使用余弦学习率调度器
  200e:训练模型 200 个周期
in1k-224-96:数据信息,在 ImageNet1k 数据集上训练,输入大小是 224x224 和 96x96
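为了直观说明上述命名约定,下面给出一个简单的拆分示意脚本(仅作演示,并非 MMSelfSup 提供的工具):
# 仅作演示:按照命名约定将配置文件名拆分为四个部分
fname = 'swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96.py'
algorithm, module, training, data = fname[:-len('.py')].split('_')
print(algorithm)  # swav
print(module)     # resnet50
print(training)   # 8xb32-mcrop-2-6-coslr-200e
print(data)       # in1k-224-96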
配置文件结构¶
在 configs/_base_
文件中,有 4 种类型的基础组件文件,即
models
datasets
schedules
runtime
你可以通过继承一些基础配置文件快捷地构建你自己的配置。由 _base_
下的组件组成的配置被称为 原始配置(primitive)。
为了易于理解,我们使用 MoCo v2 作为一个例子,并对它的每一行做出注释。若想了解更多细节,请参考 API 文档。
配置文件 configs/selfsup/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
如下所述。
_base_ = [
'../_base_/models/mocov2.py', # 模型
'../_base_/datasets/imagenet_mocov2.py', # 数据
'../_base_/schedules/sgd_coslr-200e_in1k.py', # 训练调度
'../_base_/default_runtime.py', # 运行时设置
]
# 在这里,我们继承运行时设置并修改 max_keep_ckpts。
# max_keep_ckpts 控制在你的 work_dirs 中最大的ckpt文件的数量
# 如果它是3,当 CheckpointHook (在mmcv中) 保存第 4 个 ckpt 时,
# 它会移除最早的那个,使总的 ckpt 文件个数保持为 3
checkpoint_config = dict(interval=10, max_keep_ckpts=3)
注解
配置文件中的 ‘type’ 是一个类名,而不是参数的一部分。
../_base_/models/mocov2.py
是 MoCo v2 的基础模型配置。
model = dict(
type='MoCo', # 算法名字
queue_len=65536, # 队列中维护的负样本数量
feat_dim=128, # 紧凑特征向量的维度,等于 neck 的 out_channels
momentum=0.999, # 动量更新编码器的动量系数
backbone=dict(
type='ResNet', # Backbone name
depth=50, # backbone 深度,ResNet 可以选择 18、34、50、101、 152
in_channels=3, # 输入图像的通道数
out_indices=[4], # 输出特征图的输出索引,0 表示 conv-1,x 表示 stage-x
norm_cfg=dict(type='BN')), # 构建一个字典并配置 norm 层
neck=dict(
type='MoCoV2Neck', # Neck name
in_channels=2048, # 输入通道数
hid_channels=2048, # 隐层通道数
out_channels=128, # 输出通道数
with_avg_pool=True), # 是否在 backbone 之后使用全局平均池化
head=dict(
type='ContrastiveHead', # Head name, 表示 MoCo v2 使用 contrastive loss
temperature=0.2)) # 控制分布聚集程度的温度超参数
../_base_/datasets/imagenet_mocov2.py
是 MoCo v2 的基础数据集配置。
# 数据集配置
data_source = 'ImageNet' # 数据源名字
dataset_type = 'MultiViewDataset' # 组成 pipeline 的数据集类型
img_norm_cfg = dict(
mean=[0.485, 0.456, 0.406], # 用来预训练 backbone 模型的均值
std=[0.229, 0.224, 0.225]) # 用来预训练 backbone 模型的标准差
# mocov2 和 mocov1 之间的差异在于 pipeline 中的 transforms
train_pipeline = [
dict(type='RandomResizedCrop', size=224, scale=(0.2, 1.)), # RandomResizedCrop
dict(
type='RandomAppliedTrans', # 以0.8的概率随机使用 ColorJitter 增强方法
transforms=[
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.4,
hue=0.1)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2), # 0.2概率的 RandomGrayscale
dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0, p=0.5), # 0.5概率的随机 GaussianBlur
dict(type='RandomHorizontalFlip'), # 随机水平翻转图像
]
# prefetch
prefetch = False # 是否使用 prefetch 加速 pipeline
if not prefetch:
    train_pipeline.extend(
        [dict(type='ToTensor'),
         dict(type='Normalize', **img_norm_cfg)])
# 数据集汇总
data = dict(
samples_per_gpu=32, # 单张 GPU 的批大小, 共 32*8=256
workers_per_gpu=4, # 每张 GPU 用来 pre-fetch 数据的 worker 个数
drop_last=True, # 是否丢弃最后一个 batch 的数据
train=dict(
type=dataset_type, # 数据集名字
data_source=dict(
type=data_source, # 数据源名字
data_prefix='data/imagenet/train', # 数据集根目录, 当 ann_file 不存在时,类别信息自动从该根目录获取
ann_file='data/imagenet/meta/train.txt', # 若 ann_file 存在,类别信息从该文件获取
),
num_views=[2], # pipeline 中不同的视图个数
pipelines=[train_pipeline], # 训练 pipeline
prefetch=prefetch, # 布尔值
))
../_base_/schedules/sgd_coslr-200e_in1k.py
是 MoCo v2 的基础调度配置。
# 优化器
optimizer = dict(
    type='SGD', # 优化器类型
    lr=0.03, # 优化器的学习率, 参数的详细使用请参阅 PyTorch 文档
    weight_decay=1e-4, # 权重衰减
    momentum=0.9) # SGD 的动量参数
# 用来构建优化器钩子的配置,请参考 https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/optimizer.py#L8 中的实现细节。
optimizer_config = dict() # 这个配置可以设置 grad_clip,coalesce,bucket_size_mb 等。
# 学习策略
# 用来注册 LrUpdater 钩子的学习率调度配置
lr_config = dict(
policy='CosineAnnealing', # 调度器策略,也支持 Step,Cyclic 等。 LrUpdater 支持的细节请参考 https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/lr_updater.py#L9。
min_lr=0.) # CosineAnnealing 中的最小学习率设置
# 运行时设置
runner = dict(
type='EpochBasedRunner', # 使用的 runner 的类型 (例如 IterBasedRunner 或 EpochBasedRunner)
max_epochs=200) # 运行工作流周期总数的 Runner 的 max_epochs,对于IterBasedRunner 使用 `max_iters`
../_base_/default_runtime.py
是运行时的默认配置。
# 保存检查点
checkpoint_config = dict(interval=10) # 保存间隔是10
# yapf:disable
log_config = dict(
interval=50, # 打印日志的间隔
hooks=[
dict(type='TextLoggerHook'), # 也支持 Tensorboard logger
# dict(type='TensorboardLoggerHook'),
])
# yapf:enable
# 运行时设置
dist_params = dict(backend='nccl') # 设置分布式训练的参数,端口也支持设置。
log_level = 'INFO' # 日志的输出 level。
load_from = None # 加载 ckpt
resume_from = None # 从给定的路径恢复检查点,将会从检查点保存时的周期恢复训练。
workflow = [('train', 1)] # Workflow for runner. [('train', 1)] 表示有一个 workflow,该 workflow 名字是 'train' 且执行一次。
persistent_workers = True # Dataloader 中设置 persistent_workers 的布尔值,详细信息请参考 PyTorch 文档
继承和修改配置文件¶
为了易于理解,我们推荐贡献者从现有方法继承。
对于同一个文件夹下的所有配置,我们推荐只使用一个 原始(primitive) 配置。其他所有配置应当从 原始(primitive) 配置继承,这样最大的继承层次为 3。
例如,如果你的配置文件是基于 MoCo v2 做一些修改,首先你可以通过指定 _base_ ='./mocov2_resnet50_8xb32-coslr-200e_in1k.py'
(相对于你的配置文件的路径)继承基本的 MoCo v2 结构,数据集和其他训练设置,接着在配置文件中修改一些必要的参数。现在,我们举一个更具体的例子,我们想使用 configs/selfsup/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
中几乎所有的配置,但是将训练周期数从 200 修改为 800,修改学习率衰减的时机和数据集路径,你可以创建一个名为 configs/selfsup/mocov2/mocov2_resnet50_8xb32-coslr-800e_in1k.py
的新配置文件,内容如下:
_base_ = './mocov2_resnet50_8xb32-coslr-200e_in1k.py'
runner = dict(max_epochs=800)
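如果还需要像上文所说修改学习率调度和数据集路径,可以在同一个配置文件中继续追加相应字段,下面是一个假设的草图(其中的路径和数值仅作示意,请按实际情况修改):
# 仅作示意:在继承的基础上同时覆盖学习率调度和数据集路径(路径为假设值)
lr_config = dict(policy='CosineAnnealing', min_lr=0.)
data = dict(
    train=dict(
        data_source=dict(
            data_prefix='data/my_imagenet/train',           # 假设的数据集根目录
            ann_file='data/my_imagenet/meta/train.txt')))   # 假设的标注文件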
使用配置中的中间变量¶
在配置文件中使用一些中间变量会使配置文件更加清晰和易于修改。
例如:数据中的中间变量有 data_source
, dataset_type
, train_pipeline
, prefetch
. 我们先定义它们再将它们传进 data
。
data_source = 'ImageNet'
dataset_type = 'MultiViewDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
train_pipeline = [...]
# prefetch
prefetch = False # 是否使用 prefetch 加速 pipeline
if not prefetch:
    train_pipeline.extend(
        [dict(type='ToTensor'),
         dict(type='Normalize', **img_norm_cfg)])
# dataset summary
data = dict(
    samples_per_gpu=32,
    workers_per_gpu=4,
    drop_last=True,
    train=dict(
        type=dataset_type,
        data_source=dict(type=data_source, data_prefix=...),
        num_views=[2],
        pipelines=[train_pipeline],
        prefetch=prefetch,
    ))
忽略基础配置中的字段¶
有时候,你需要设置 _delete_=True
来忽略基础配置文件中一些域的内容。 你可以参考 mmcv 获得更多说明。
接下来是一个例子。如果你希望在 simclr 的设置中使用 MoCoV2Neck
,仅仅继承并直接修改将会报 get unexpected keyword 'num_layers'
错误,因为在 model.neck
域信息中,基础配置 'num_layers'
字段被保存下来了, 你需要添加 _delete_=True
来忽略 model.neck
在基础配置文件中的有关字段的内容。
_base_ = 'simclr_resnet50_8xb32-coslr-200e_in1k.py'
model = dict(
neck=dict(
_delete_=True,
type='MoCoV2Neck',
in_channels=2048,
hid_channels=2048,
out_channels=128,
with_avg_pool=True))
使用基础配置中的字段¶
有时候,你可能引用 _base_
配置中一些字段,以避免重复定义。你可以参考mmcv 获取更多的说明。
下面是引用基础配置中变量的一个例子,请参考 configs/selfsup/odc/odc_resnet50_8xb64-steplr-440e_in1k.py
。当需要使用 num_classes
时,只需要将定义 num_classes 的文件名添入到 _base_
,并使用 {{_base_.num_classes}}
来引用该变量:
_base_ = [
'../_base_/models/odc.py',
'../_base_/datasets/imagenet_odc.py',
'../_base_/schedules/sgd_steplr-200e_in1k.py',
'../_base_/default_runtime.py',
]
# model settings
model = dict(
head=dict(num_classes={{_base_.num_classes}}),
memory_bank=dict(num_classes={{_base_.num_classes}}),
)
# optimizer
optimizer = dict(
type='SGD',
lr=0.06,
momentum=0.9,
weight_decay=1e-5,
paramwise_options={'\\Ahead.': dict(momentum=0.)})
# learning policy
lr_config = dict(policy='step', step=[400], gamma=0.4)
# runtime settings
runner = dict(type='EpochBasedRunner', max_epochs=440)
# max_keep_ckpts 控制在你的 work_dirs 中保存的 ckpt 的最大数目
# 如果它等于3,CheckpointHook(在mmcv中)在保存第 4 个 ckpt 时,
# 它会移除最早的那个,使总的 ckpt 文件个数保持为 3
checkpoint_config = dict(interval=10, max_keep_ckpts=3)
通过脚本参数修改配置¶
当用户使用脚本 “tools/train.py” 或 “tools/test.py” 提交任务,或者其他工具时,可以通过指定 --cfg-options
参数来直接修改配置文件中内容。
更新字典链中的配置的键
配置项可以通过遵循原始配置中键的层次顺序来指定。例如,--cfg-options model.backbone.norm_eval=False 会把模型 backbone 中的所有 BN 模块改为 train 模式。
更新列表中配置的键
你的配置中的一些配置字典是由列表组成的。例如,训练 pipeline data.train.pipeline 通常是一个列表,例如 [dict(type='LoadImageFromFile'), dict(type='TopDownRandomFlip', flip_prob=0.5), ...]。如果你想要把 pipeline 中的 'flip_prob=0.5' 修改为 'flip_prob=0.0',你可以指定 --cfg-options data.train.pipeline.1.flip_prob=0.0。
更新 list/tuple 中的值
如果想要更新的值是一个列表或者元组,例如:配置文件通常设置 workflow=[('train', 1)]。如果你想要改变这个键,你可以指定 --cfg-options workflow="[(train,1),(val,1)]"。注意:对于 list/tuple 数据类型,引号是必须的,并且在指定值时,引号内不能有空白字符。
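作为补充说明,下面用一个简单的草图演示点分键是如何映射到嵌套配置上的(基于 mmcv.Config 的 merge_from_dict,仅作示意):
# 仅作示意:--cfg-options 的点分键等价于对嵌套字典做如下合并
from mmcv import Config

cfg = Config(dict(model=dict(backbone=dict(norm_eval=True))))
cfg.merge_from_dict({'model.backbone.norm_eval': False})
print(cfg.model.backbone.norm_eval)  # False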
导入用户定义模块¶
注解
这部分内容初学者可以跳过,只在使用其他 MM-codebase 时会用到,例如使用 mmcls 作为第三方库来构建你的工程。
你可能使用其他的 MM-codebase 来完成你的工程,并在工程中创建新的数据集类,模型类,数据增强类等。为了简化代码,你可以使用 MM-codebase 作为第三方库,只需要保存你自己额外的代码,并在配置文件中导入自定义模块。你可以参考 OpenMMLab Algorithm Competition Project 中的例子。
在你自己的配置文件中添加如下所述的代码:
custom_imports = dict(
imports=['your_dataset_class',
'your_transform_class',
'your_model_class',
'your_module_class'],
allow_failed_imports=False)
教程 1: 添加新的数据格式¶
在本节教程中,我们将介绍创建自定义数据格式的基本步骤。
如果你的算法不需要任何定制的数据格式,你可以使用datasets目录中这些现成的数据格式。但是要使用这些现有的数据格式,你必须将你的数据集转换为现有的数据格式。
自定义数据格式示例¶
假设你的数据集的注释文件格式是:
000001.jpg 0
000002.jpg 1
要编写一个新的数据格式,你需要实现:
子类 DataSource:继承自父类 BaseDataSource,负责加载标注文件和读取图像。
子类 Dataset:继承自父类 BaseDataset,负责对图像进行转换和打包。
创建 DataSource
子类¶
假设你基于父类DataSource
创建的子类名为 NewDataSource
, 你可以在mmselfsup/datasets/data_sources
目录下创建一个文件,文件名为 new_data_source.py
,并在这个文件中实现 NewDataSource
创建。
import mmcv
import numpy as np

from ..builder import DATASOURCES
from .base import BaseDataSource


@DATASOURCES.register_module()
class NewDataSource(BaseDataSource):

    def load_annotations(self):

        assert isinstance(self.ann_file, str)
        data_infos = []
        # writing your code here.
        return data_infos
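下面给出 load_annotations 的一个可能实现草图,假设标注文件每行形如上文的 “000001.jpg 0”;其中 img_prefix、img_info、gt_label 等字段名参考 OpenMMLab 的常见约定,具体请以 BaseDataSource 的实现为准:
# 仅作示意的实现草图,并非 MMSelfSup 的官方实现
import numpy as np

from ..builder import DATASOURCES
from .base import BaseDataSource


@DATASOURCES.register_module()
class NewDataSource(BaseDataSource):

    def load_annotations(self):
        assert isinstance(self.ann_file, str)
        data_infos = []
        with open(self.ann_file) as f:
            for line in f:
                filename, gt_label = line.strip().rsplit(' ', 1)
                info = dict(
                    img_prefix=self.data_prefix,
                    img_info=dict(filename=filename),
                    gt_label=np.array(gt_label, dtype=np.int64))
                data_infos.append(info)
        return data_infos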
然后, 在 mmselfsup/datasets/data_sources/__init__.py
中添加NewDataSource
。
from .base import BaseDataSource
...
from .new_data_source import NewDataSource
__all__ = [
'BaseDataSource', ..., 'NewDataSource'
]
创建 Dataset
子类¶
假设你基于父类 Dataset
创建的子类名为 NewDataset
,你可以在mmselfsup/datasets
目录下创建一个文件,文件名为new_dataset.py
,并在这个文件中实现 NewDataset
创建。
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmcv.utils import build_from_cfg
from torchvision.transforms import Compose
from .base import BaseDataset
from .builder import DATASETS, PIPELINES, build_datasource
from .utils import to_numpy
@DATASETS.register_module()
class NewDataset(BaseDataset):

    def __init__(self, data_source, num_views, pipelines, prefetch=False):
        # writing your code here
        pass

    def __getitem__(self, idx):
        # writing your code here
        return dict(img=img)

    def evaluate(self, results, logger=None):
        return NotImplemented
然后,在 mmselfsup/datasets/__init__.py
中添加 NewDataset
。
from .base import BaseDataset
...
from .new_dataset import NewDataset
__all__ = [
'BaseDataset', ..., 'NewDataset'
]
修改配置文件¶
为了使用 NewDataset
,你可以修改配置如下:
train=dict(
    type='NewDataset',
    data_source=dict(
        type='NewDataSource',
    ),
    num_views=[2],
    pipelines=[train_pipeline],
    prefetch=prefetch,
))
教程 2:自定义数据管道¶
Pipeline 概览¶
DataSource
和 Pipeline
是 Dataset
的两个重要组件。我们已经在 add_new_dataset 中介绍了 DataSource
。 Pipeline
负责对图像进行一系列的数据增强,例如随机翻转。
这是用于 SimCLR
训练的 Pipeline
的配置示例:
train_pipeline = [
dict(type='RandomResizedCrop', size=224),
dict(type='RandomHorizontalFlip'),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='ColorJitter',
brightness=0.8,
contrast=0.8,
saturation=0.8,
hue=0.2)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2),
dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0, p=0.5)
]
Pipeline
中的每个增强都接收一张图像作为输入,并输出一张增强后的图像。
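下面是一个简单的草图(仅作示意),展示上面的 train_pipeline 配置如何被构建成可调用的 transforms 并作用在一张 PIL 图像上;其中图片路径为假设值:
# 仅作示意:根据配置构建 Pipeline 并作用在一张图像上
from PIL import Image
from mmcv.utils import build_from_cfg
from torchvision.transforms import Compose

from mmselfsup.datasets.builder import PIPELINES

# train_pipeline 即上文给出的配置列表
pipeline = Compose([build_from_cfg(p, PIPELINES) for p in train_pipeline])
img = Image.open('tests/data/color.jpg')  # 假设的示例图片路径
aug_img = pipeline(img)  # 得到增强后的图像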
在 Pipeline 中创建新的数据增强¶
1.在 transforms.py 中编写一个新的数据增强函数,并覆盖 __call__
函数,该函数接收一张 Pillow
图像作为输入:
@PIPELINES.register_module()
class MyTransform(object):

    def __call__(self, img):
        # apply transforms on img
        return img
2.在配置文件中使用它。我们重新使用上面的配置文件,并在其中添加 MyTransform
。
train_pipeline = [
dict(type='RandomResizedCrop', size=224),
dict(type='RandomHorizontalFlip'),
dict(type='MyTransform'),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='ColorJitter',
brightness=0.8,
contrast=0.8,
saturation=0.8,
hue=0.2)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2),
dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0, p=0.5)
]
教程 3:添加新的模块¶
在自监督学习领域,每个模型可以被分为以下四个部分:
backbone:用于提取图像特征。
projection head:将 backbone 提取的特征映射到另一空间。
loss:用于模型优化的损失函数。
memory bank(可选):一些方法(例如 odc)需要额外的 memory bank 用于存储图像特征。
添加新的 backbone¶
假设我们要创建一个自定义的 backbone CustomizedBackbone
。
1.创建新文件 mmselfsup/models/backbones/customized_backbone.py
并在其中实现 CustomizedBackbone
。
import torch.nn as nn

from ..builder import BACKBONES


@BACKBONES.register_module()
class CustomizedBackbone(nn.Module):

    def __init__(self, **kwargs):
        ## TODO
        pass

    def forward(self, x):
        ## TODO
        pass

    def init_weights(self, pretrained=None):
        ## TODO
        pass

    def train(self, mode=True):
        ## TODO
        pass
2.在 mmselfsup/models/backbones/__init__.py
中导入自定义的 backbone。
from .customized_backbone import CustomizedBackbone
__all__ = [
..., 'CustomizedBackbone'
]
3.在你的配置文件中使用它。
model = dict(
...
backbone=dict(
type='CustomizedBackbone',
...),
...
)
添加新的 Necks¶
我们在 mmselfsup/models/necks
中包含了所有的 projection heads。假设我们要创建一个 CustomizedProjHead
。
1.创建一个新文件 mmselfsup/models/necks/customized_proj_head.py
并在其中实现 CustomizedProjHead
。
import torch.nn as nn
from mmcv.runner import BaseModule

from ..builder import NECKS


@NECKS.register_module()
class CustomizedProjHead(BaseModule):

    def __init__(self, *args, init_cfg=None, **kwargs):
        super(CustomizedProjHead, self).__init__(init_cfg)
        ## TODO

    def forward(self, x):
        ## TODO
        pass
你需要实现前向函数,该函数从 backbone 中获取特征,并输出映射后的特征。
2.在 mmselfsup/models/necks/__init__.py
中导入 CustomizedProjHead
。
from .customized_proj_head import CustomizedProjHead
__all__ = [
    ...,
    'CustomizedProjHead',
    ...
]
3.在你的配置文件中使用它。
model = dict(
...,
neck=dict(
type='CustomizedProjHead',
...),
...)
添加新的损失¶
为了增加一个新的损失函数,我们主要在损失模块中实现 forward
函数。
1.创建一个新的文件 mmselfsup/models/heads/customized_head.py
并在其中实现你自定义的 CustomizedHead
。
import torch
import torch.nn as nn
from mmcv.runner import BaseModule
from ..builder import HEADS
@HEADS.register_module()
class CustomizedHead(BaseModule):

    def __init__(self, *args, **kwargs):
        super(CustomizedHead, self).__init__()
        ## TODO

    def forward(self, *args, **kwargs):
        ## TODO
        pass
2.在 mmselfsup/models/heads/__init__.py
中导入该模块。
from .customized_head import CustomizedHead
__all__ = [..., 'CustomizedHead', ...]
3.在你的配置文件中使用它。
model = dict(
...,
head=dict(type='CustomizedHead')
)
合并所有改动¶
在创建了上述每个组件后,我们需要创建一个 CustomizedAlgorithm
来有逻辑的将他们组织到一起。 CustomizedAlgorithm
接收原始图像作为输入,并将损失输出给优化器。
1.创建一个新文件 mmselfsup/models/algorithms/customized_algorithm.py
并在其中实现 CustomizedAlgorithm
。
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from ..builder import ALGORITHMS, build_backbone, build_head, build_neck
from ..utils import GatherLayer
from .base import BaseModel
@ALGORITHMS.register_module()
class CustomizedAlgorithm(BaseModel):

    def __init__(self, backbone, neck=None, head=None, init_cfg=None):
        super(CustomizedAlgorithm, self).__init__(init_cfg)
        ## TODO

    def forward_train(self, img, **kwargs):
        ## TODO
        pass
2.在 mmselfsup/models/algorithms/__init__.py
中导入该模块。
from .customized_algorithm import CustomizedAlgorithm
__all__ = [..., 'CustomizedAlgorithm', ...]
3.在你的配置文件中使用它。
model = dict(
type='CustomizedAlgorithm',
backbone=...,
neck=...,
head=...)
教程 4:自定义优化策略¶
在本教程中,我们将介绍如何在运行自定义模型时,进行构造优化器、定制学习率、动量调整策略、参数化精细配置、梯度裁剪、梯度累计以及用户自定义优化方法等。
构造 PyTorch 内置优化器¶
我们已经支持使用PyTorch实现的所有优化器,要使用和修改这些优化器,请修改配置文件中的optimizer
字段。
例如,如果您想使用SGD,可以进行如下修改。
optimizer = dict(type='SGD', lr=0.0003, weight_decay=0.0001)
要修改模型的学习率,只需要在优化器的配置中修改 lr
即可。 要配置其他参数,可直接根据 PyTorch API 文档进行。
例如,如果想使用 Adam
并设置参数为 torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
, 则需要进行如下配置
optimizer = dict(type='Adam', lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
除了PyTorch实现的优化器之外,我们还在 mmselfsup/core/optimizer/optimizers.py
中构造了一个LARS。
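例如,一个使用 LARS 的优化器配置可能如下(学习率等数值仅作示意):
optimizer = dict(type='LARS', lr=4.8, weight_decay=1e-6, momentum=0.9)  # 数值仅作示意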
定制学习率调整策略¶
定制学习率衰减曲线¶
深度学习研究中,广泛应用学习率衰减来提高网络的性能。要使用学习率衰减,可以在配置中设置 lr_config
字段。
例如,在 SimCLR 网络训练中,我们使用 CosineAnnealing 的学习率衰减策略,配置文件为:
lr_config = dict(
policy='CosineAnnealing',
...)
在训练过程中,程序会周期性地调用 MMCV 中的 CosineAnnealingLrUpdaterHook 来进行学习率更新。
此外,我们也支持其他学习率调整方法,如 Poly
等。详情可见 这里
定制学习率预热策略¶
在训练的早期阶段,网络容易不稳定,而学习率的预热就是为了减少这种不稳定性。通过预热,学习率将会从一个很小的值逐步提高到预定值。
在 MMSelfSup 中,我们同样使用 lr_config
配置学习率预热策略,主要的参数有以下几个:
warmup: 学习率预热曲线类别,必须为 'constant'、'linear'、'exp' 或者 None 其一,如果为 None,则不使用学习率预热策略。
warmup_by_epoch: 是否以轮次(epoch)为单位进行预热,默认为 True。如果被设置为 False,则以 iter 为单位进行预热。
warmup_iters: 预热的迭代次数,当 warmup_by_epoch=True 时,单位为轮次(epoch);当 warmup_by_epoch=False 时,单位为迭代次数(iter)。
warmup_ratio: 预热的初始学习率为 lr = lr * warmup_ratio。
例如:
1.逐迭代次数地线性预热
lr_config = dict(
policy='CosineAnnealing',
by_epoch=False,
min_lr_ratio=1e-2,
warmup='linear',
warmup_ratio=1e-3,
warmup_iters=20 * 1252,
warmup_by_epoch=False)
2.逐轮次地指数预热
lr_config = dict(
policy='CosineAnnealing',
min_lr=0,
warmup='exp',
warmup_iters=5,
warmup_ratio=0.1,
warmup_by_epoch=True)
定制动量调整策略¶
我们支持动量调整器根据学习率修改模型的动量,从而使模型收敛更快。
动量调整策略通常与学习率调整策略一起使用,例如,以下配置用于加速收敛。更多细节可参考 CyclicLrUpdater 和 CyclicMomentumUpdater。
例如:
lr_config = dict(
policy='cyclic',
target_ratio=(10, 1e-4),
cyclic_times=1,
step_ratio_up=0.4,
)
momentum_config = dict(
policy='cyclic',
target_ratio=(0.85 / 0.95, 1),
cyclic_times=1,
step_ratio_up=0.4,
)
参数化精细配置¶
一些模型的优化策略,包含作用于特定参数的精细设置,例如 BatchNorm 层不添加权重衰减或者对不同的网络层使用不同的学习率。为了进行精细配置,我们通过 optimizer
中的 paramwise_options
参数进行配置。
例如,如果我们不想对 BatchNorm 或 GroupNorm 的参数以及各层的 bias 应用权重衰减,我们可以使用以下配置文件:
optimizer = dict(
type=...,
lr=...,
paramwise_options={
'(bn|gn)(\\d+)?.(weight|bias)':
dict(weight_decay=0.),
'bias': dict(weight_decay=0.)
})
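下面用一个小脚本示意上面的正则是如何命中参数名的(参数名为假设值;这里只演示第一个键,配置中另外的 'bias' 键会单独匹配所有 bias 参数):
import re

# 仅作示意:检查哪些参数名会命中正则,从而不施加权重衰减
pattern = r'(bn|gn)(\d+)?.(weight|bias)'
names = ['backbone.bn1.weight', 'backbone.layer1.0.conv1.weight', 'neck.fc0.bias']
for name in names:
    hit = re.search(pattern, name) is not None
    print(name, '-> weight_decay=0.' if hit else '-> 使用默认 weight_decay')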
梯度裁剪与梯度累计¶
梯度裁剪¶
除了 PyTorch 优化器的基本功能,我们还提供了一些增强功能,例如梯度裁剪、梯度累计等。更多细节参考 MMCV。
目前我们支持在 optimizer_config
字段中添加 grad_clip
参数来进行梯度裁剪,更详细的参数可参考 PyTorch 文档。
用例如下:
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# norm_type: 使用的范数类型,此处使用范数2。
当使用继承并修改基础配置时,如果基础配置中 grad_clip=None
,需要添加 _delete_=True
。
梯度累计¶
计算资源缺乏时,每个批次的大小(batch size)只能设置为较小的值,这可能会影响模型的性能。可以使用梯度累计来规避这一问题。
用例如下:
data = dict(samples_per_gpu=64)
optimizer_config = dict(type="DistOptimizerHook", update_interval=4)
表示训练时,每 4 个 iter 累积一次梯度并更新一次参数。由于此时单张 GPU 上的批次大小为 64,也就等价于单张 GPU 上一次迭代的批次大小为 256,也即:
data = dict(samples_per_gpu=256)
optimizer_config = dict(type="OptimizerHook")
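为帮助理解,下面给出一个与 update_interval=4 等价的手写训练循环草图(玩具模型,仅作示意):
# 仅作示意:梯度累积的等价逻辑
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
update_interval = 4

for i in range(16):
    x = torch.randn(64, 8)                     # 相当于 samples_per_gpu=64 的一个 batch
    loss = model(x).mean() / update_interval   # 损失按累积步数缩放
    loss.backward()                            # 每个 iter 都计算并累积梯度
    if (i + 1) % update_interval == 0:
        optimizer.step()                       # 每累积 4 个 iter 更新一次参数
        optimizer.zero_grad()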
用户自定义优化方法¶
在学术研究和工业实践中,可能需要使用 MMSelfSup 未实现的优化方法,可以通过以下方法添加。
在 mmselfsup/core/optimizer/optimizers.py
中实现您的 CustomizedOptim
。
import torch
from torch.optim import *  # noqa: F401,F403
from torch.optim.optimizer import Optimizer, required
from mmcv.runner.optimizer.builder import OPTIMIZERS


@OPTIMIZERS.register_module()
class CustomizedOptim(Optimizer):

    def __init__(self, *args, **kwargs):
        ## TODO
        pass

    @torch.no_grad()
    def step(self):
        ## TODO
        pass
修改 mmselfsup/core/optimizer/__init__.py
,将其导入
from .optimizers import CustomizedOptim
from .builder import build_optimizer
__all__ = ['CustomizedOptim', 'build_optimizer', ...]
在配置文件中指定优化器
optimizer = dict(
type='CustomizedOptim',
...
)
教程 5:自定义模型运行参数¶
在本教程中,我们将介绍如何在运行自定义模型时,进行自定义工作流和钩子的方法。
定制工作流¶
工作流是一个形如 (任务名,周期数) 的列表,用于指定运行顺序和周期。这里“周期数”的单位由执行器的类型来决定。
比如,我们默认使用基于轮次的执行器(EpochBasedRunner
),那么“周期数”指的就是对应的任务在一个周期中要执行多少个轮次。通常,我们只希望执行训练任务,那么只需要使用以下设置:
workflow = [('train', 1)]
有时我们可能希望在训练过程中穿插检查模型在验证集上的一些指标(例如,损失,准确率)。在这种情况下,可以将工作流程设置为:
[('train', 1), ('val', 1)]
这样一来,程序会一轮训练一轮验证地反复执行。
默认情况下,我们更推荐在每个训练轮次后使用 EvalHook
进行模型验证。
钩子¶
钩子机制在 OpenMMLab 开源算法库中应用非常广泛,结合执行器可以实现对训练过程的整个生命周期进行管理,可以通过相关文章进一步理解钩子。
钩子只有被注册进执行器才起作用,目前钩子主要分为两类:
默认训练钩子
默认训练钩子由运行器默认注册,一般为一些基础型功能的钩子,已经有确定的优先级,一般不需要修改优先级。
定制钩子
定制钩子通过 custom_hooks
注册,一般为一些增强型功能的钩子,需要在配置文件中指定优先级;若不指定,该钩子的优先级将默认被设定为 ‘NORMAL’。
优先级列表
| Level | Value |
| --- | --- |
| HIGHEST | 0 |
| VERY_HIGH | 10 |
| HIGH | 30 |
| ABOVE_NORMAL | 40 |
| NORMAL (default) | 50 |
| BELOW_NORMAL | 60 |
| LOW | 70 |
| VERY_LOW | 90 |
| LOWEST | 100 |
优先级确定钩子的执行顺序,每次训练前,日志会打印出各个阶段钩子的执行顺序,方便调试。
默认训练钩子¶
有一些常见的钩子未通过 custom_hooks
注册,但会在运行器(Runner
)中默认注册,它们是:
| Hooks | Priority |
| --- | --- |
| LrUpdaterHook | VERY_HIGH (10) |
| MomentumUpdaterHook | HIGH (30) |
| OptimizerHook | ABOVE_NORMAL (40) |
| CheckpointHook | NORMAL (50) |
| IterTimerHook | LOW (70) |
| EvalHook | LOW (70) |
| LoggerHook(s) | VERY_LOW (90) |
OptimizerHook
,MomentumUpdaterHook
和 LrUpdaterHook
在 优化策略 部分进行了介绍, IterTimerHook
用于记录所用时间,目前不支持修改。
下面介绍如何定制 CheckpointHook
、LoggerHooks
以及 EvalHook
。
权重文件钩子 CheckpointHook¶
MMCV 的 runner 使用 checkpoint_config
来初始化 CheckpointHook
。
checkpoint_config = dict(interval=1)
用户可以设置 max_keep_ckpts
来仅保存少量模型权重文件,或者通过 save_optimizer
决定是否存储优化器的状态字典。更多细节可参考 这里。
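例如,一个可能的组合配置如下(仅作示意):
# 每 10 个 epoch 保存一次,最多保留 3 个权重文件,并保存优化器状态
checkpoint_config = dict(interval=10, max_keep_ckpts=3, save_optimizer=True)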
日志钩子 LoggerHooks¶
log_config
包装了多个记录器钩子,并可以设置间隔。
目前,MMCV 支持 TextLoggerHook
、 WandbLoggerHook
、MlflowLoggerHook
、 NeptuneLoggerHook
、 DvcliveLoggerHook
和 TensorboardLoggerHook
。
更多细节可参考这里。
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')
])
验证钩子 EvalHook¶
配置中的 evaluation
字段将用于初始化 EvalHook
.
EvalHook
有一些保留参数,如 interval
,save_best
和 start
等。其他的参数,如 metrics
将被传递给 dataset.evaluate()
。
evaluation = dict(interval=1, metric='accuracy', metric_options={'topk': (1, )})
我们可以通过参数 save_best
保存取得最好验证结果时的模型权重:
# "auto" 表示自动选择指标来进行模型的比较。
# 也可以指定一个特定的 key 比如 "accuracy_top-1"。
evaluation = dict(interval=1, save_best="auto", metric='accuracy', metric_options={'topk': (1, )})
在跑一些大型实验时,可以通过修改参数 start
跳过训练靠前轮次时的验证步骤,以节约时间。如下:
evaluation = dict(interval=1, start=200, metric='accuracy', metric_options={'topk': (1, )})
表示在第 200 轮之前,只执行训练流程,不执行验证;从轮次 200 开始,在每一轮训练之后进行验证。
使用其他内置钩子¶
一些钩子已在 MMCV 和 MMClassification 中实现:
如果要用的钩子已经在MMCV中实现,可以直接修改配置以使用该钩子,如下格式:
mmcv_hooks = [
dict(type='MMCVHook', a=a_value, b=b_value, priority='NORMAL')
]
例如使用 EMAHook
,进行一次 EMA 的间隔是100个 iter:
custom_hooks = [
dict(type='EMAHook', interval=100, priority='HIGH')
]
自定义钩子¶
1. 创建一个新钩子¶
这里举一个在 MMSelfSup 中创建一个新钩子的示例:
from mmcv.runner import HOOKS, Hook


@HOOKS.register_module()
class MyHook(Hook):

    def __init__(self, a, b):
        pass

    def before_run(self, runner):
        pass

    def after_run(self, runner):
        pass

    def before_epoch(self, runner):
        pass

    def after_epoch(self, runner):
        pass

    def before_iter(self, runner):
        pass

    def after_iter(self, runner):
        pass
根据钩子的功能,用户需要指定钩子在训练的每个阶段将要执行的操作,比如 before_run
,after_run
,before_epoch
,after_epoch
,before_iter
和 after_iter
。
2. 导入新钩子¶
之后,需要导入 MyHook
。假设该文件在 mmselfsup/core/hooks/my_hook.py
,有两种办法导入它:
修改
mmselfsup/core/hooks/__init__.py
进行导入,如下:
from .my_hook import MyHook
__all__ = [..., 'MyHook', ...]
使用配置文件中的
custom_imports
变量手动导入
custom_imports = dict(imports=['mmselfsup.core.hooks.my_hook'], allow_failed_imports=False)
3. 修改配置¶
custom_hooks = [
dict(type='MyHook', a=a_value, b=b_value)
]
还可通过 priority
参数设置钩子优先级,如下所示:
custom_hooks = [
dict(type='MyHook', a=a_value, b=b_value, priority='ABOVE_NORMAL')
]
默认情况下,在注册过程中,钩子的优先级设置为 NORMAL
。
教程 6:运行基准评测¶
在 MMSelfSup 中,我们提供了许多基准评测,因此模型可以在不同的下游任务中进行评估。这里提供了全面的教程和例子来解释如何用 MMSelfSup 运行所有的基准。
首先,你应该通过tools/model_converters/extract_backbone_weights.py
提取你的 backbone 权重。
python ./tools/model_converters/extract_backbone_weights.py {CHECKPOINT} {MODEL_FILE}
参数:
CHECKPOINT:selfsup 方法的权重文件,名称为 epoch_*.pth。
MODEL_FILE:输出的 backbone 权重文件。如果没有指定,下面的 PRETRAIN 会使用这个提取的模型文件。
分类¶
关于分类,我们在tools/benchmarks/classification/
文件夹中提供了脚本,其中有 4 个 .sh
文件,1 个用于 VOC SVM 相关的分类任务的文件夹,1 个用于 ImageNet 最邻近分类任务的文件夹。
VOC SVM / Low-shot SVM¶
为了运行这个基准评测,你应该首先准备你的 VOC 数据集,数据准备的细节请参考prepare_data.md。
为了评估预训练的模型,你可以运行以下命令。
# 分布式版本
bash tools/benchmarks/classification/svm_voc07/dist_test_svm_pretrain.sh ${SELFSUP_CONFIG} ${GPUS} ${PRETRAIN} ${FEATURE_LIST}
# slurm 版本
bash tools/benchmarks/classification/svm_voc07/slurm_test_svm_pretrain.sh ${PARTITION} ${JOB_NAME} ${SELFSUP_CONFIG} ${PRETRAIN} ${FEATURE_LIST}
此外,如果你想评估 runner 保存的 ckpt 文件,你可以运行下面的命令。
# 分布式版本
bash tools/benchmarks/classification/svm_voc07/dist_test_svm_epoch.sh ${SELFSUP_CONFIG} ${EPOCH} ${FEATURE_LIST}
# slurm 版本
bash tools/benchmarks/classification/svm_voc07/slurm_test_svm_epoch.sh ${PARTITION} ${JOB_NAME} ${SELFSUP_CONFIG} ${EPOCH} ${FEATURE_LIST}
用ckpt测试时,代码使用epoch_*.pth文件,不需要提取权重。
备注:
${SELFSUP_CONFIG} 是自监督实验的配置文件。
${FEATURE_LIST} 是一个字符串,指定 layer1 到 layer5 的特征用于评估;例如,如果你只想评估 layer5,那么 FEATURE_LIST 是 “feat5”,如果你想评估所有的特征,那么 FEATURE_LIST 是 “feat1 feat2 feat3 feat4 feat5”(用空格分隔)。如果留空,默认 FEATURE_LIST 为 “feat5”。
PRETRAIN:预训练的模型文件。
如果你想改变 GPU 的数量,你可以在命令的开头加上 GPUS_PER_NODE=4 GPUS=4。
EPOCH 是你要测试的 ckpt 的 epoch 数。
线性评估¶
线性评估是最通用的基准评测之一,我们整合了几篇论文的配置设置,也包括多头线性评估。我们在自己的代码库中为多头功能编写分类模型,因此,为了运行线性评估,我们仍然使用 .sh
脚本来启动训练。支持的数据集是ImageNet、Places205和iNaturalist18。
# 分布式版本
bash tools/benchmarks/classification/dist_train_linear.sh ${CONFIG} ${PRETRAIN}
# slurm 版本
bash tools/benchmarks/classification/slurm_train_linear.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${PRETRAIN}
备注:
默认的 GPU 数量是 8,当改变 GPUS 时,也请相应改变配置文件中的 samples_per_gpu,以确保总 batch size 为 256。
CONFIG: 使用 configs/benchmarks/classification/ 下的配置文件。具体有 imagenet(不包括 imagenet_*percent 文件夹)、places205 和 inaturalist2018。
PRETRAIN:预训练的模型文件。
ImageNet半监督分类¶
为了运行 ImageNet 半监督分类,我们仍然使用 .sh
脚本来启动训练。
# 分布式版本
bash tools/benchmarks/classification/dist_train_semi.sh ${CONFIG} ${PRETRAIN}
# slurm 版本
bash tools/benchmarks/classification/slurm_train_semi.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${PRETRAIN}
备注:
默认的 GPU 数量是4。
CONFIG: 使用 configs/benchmarks/classification/imagenet/ 下 imagenet_*percent 文件夹中的配置文件。
PRETRAIN:预训练的模型文件。
ImageNet最邻近分类¶
为了使用最邻近基准评测来评估预训练的模型,你可以运行以下命令。
# 分布式版本
bash tools/benchmarks/classification/knn_imagenet/dist_test_knn_pretrain.sh ${SELFSUP_CONFIG} ${PRETRAIN}
# slurm 版本
bash tools/benchmarks/classification/knn_imagenet/slurm_test_knn_pretrain.sh ${PARTITION} ${JOB_NAME} ${SELFSUP_CONFIG} ${PRETRAIN}
此外,如果你想评估 runner 保存的 ckpt 文件,你可以运行下面的命令。
# 分布式版本
bash tools/benchmarks/classification/knn_imagenet/dist_test_knn_epoch.sh ${SELFSUP_CONFIG} ${EPOCH}
# slurm 版本
bash tools/benchmarks/classification/knn_imagenet/slurm_test_knn_epoch.sh ${PARTITION} ${JOB_NAME} ${SELFSUP_CONFIG} ${EPOCH}
用ckpt测试时,代码使用epoch_*.pth文件,不需要提取权重。
备注:
${SELFSUP_CONFIG} 是自监督实验的配置文件。
PRETRAIN:预训练的模型文件。
如果你想改变 GPU 的数量,你可以在命令的开头加上 GPUS_PER_NODE=4 GPUS=4。
EPOCH 是你要测试的 ckpt 的 epoch 数。
检测¶
在这里,我们倾向于使用 MMDetection 来完成检测任务。首先,确保你已经安装了MIM,它也是OpenMMLab的一个项目。
pip install openmim
安装该软件包非常容易。
安装完成后,你可以用简单的命令运行 MMDet
# 分布式版本
bash tools/benchmarks/mmdetection/mim_dist_train.sh ${CONFIG} ${PRETRAIN} ${GPUS}
# slurm 版本
bash tools/benchmarks/mmdetection/mim_slurm_train.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
备注:
CONFIG:使用 configs/benchmarks/mmdetection/ 下的配置文件或编写你自己的配置文件。
PRETRAIN:预训练的模型文件。
或者如果你想用detectron2做检测任务,我们也提供一些配置文件。 请参考INSTALL.md进行安装,并按照目录结构来准备 detectron2 所需的数据集。
conda activate detectron2 # 在这里使用 detectron2 环境,否则使用 open-mmlab 环境
cd benchmarks/detection
python convert-pretrain-to-detectron2.py ${WEIGHT_FILE} ${OUTPUT_FILE} # 必须使用 .pkl 作为输出文件扩展名
bash run.sh ${DET_CFG} ${OUTPUT_FILE}
分割¶
对于语义分割任务,我们使用的是 MMSegmentation。首先,确保你已经安装了 MIM (https://github.com/open-mmlab/mim),它也是 OpenMMLab 的一个项目。
pip install openmim
安装该软件包非常容易。
此外,请参考 MMSeg 的安装 (https://github.com/open-mmlab/mmsegmentation/blob/master/docs/get_started.md) 和数据准备 (https://github.com/open-mmlab/mmsegmentation/blob/master/docs/dataset_prepare.md#prepare-datasets)。
安装后,你可以用简单的命令运行 MMSeg
# 分布式版本
bash tools/benchmarks/mmsegmentation/mim_dist_train.sh ${CONFIG} ${PRETRAIN} ${GPUS}
# slurm 版本
bash tools/benchmarks/mmsegmentation/mim_slurm_train.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
备注:
CONFIG:使用 configs/benchmarks/mmsegmentation/ 下的配置文件或编写自己的配置文件。
PRETRAIN:预训练的模型文件。
BYOL¶
Abstract¶
Bootstrap Your Own Latent (BYOL) is a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network.

Results and Models¶
Back to model_zoo.md to download models.
In this page, we provide benchmarks as much as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on ImageNet-1k dataset.
Classification¶
The classification benchmarks include 4 downstream task datasets, VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates from which layer's feature map the best results are obtained. For example, if the Best Layer is feature3, its best result is obtained from the second stage of ResNet (1 for the stem layer, 2-5 for the 4 stage layers).
Besides, k=1 to 96 indicates the hyper-parameter of Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb32-accum16-coslr-200e | feature5 | 86.31 | 45.37 | 56.83 | 68.47 | 74.12 | 78.30 | 81.53 | 83.56 | 84.73 |
ImageNet Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for details of config.
The AvgPool result is obtained from Linear Evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb512-coslr-90e_in1k for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb32-accum16-coslr-200e | 15.16 | 35.26 | 47.77 | 63.10 | 71.21 | 71.72 |
resnet50_16xb256-coslr-200e | 15.41 | 35.15 | 47.77 | 62.59 | 71.85 | 71.88 |
Places205 Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb32-accum16-coslr-200e | 21.25 | 36.55 | 43.66 | 50.74 | 53.82 |
resnet50_8xb32-accum16-coslr-300e | 21.18 | 36.68 | 43.42 | 51.04 | 54.06 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different number of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb32-accum16-coslr-200e | 63.9 | 64.2 | 62.9 | 61.9 |
resnet50_8xb32-accum16-coslr-300e | 66.1 | 66.3 | 65.2 | 64.4 |
Detection¶
The detection benchmarks include 2 downstream task datasets, Pascal VOC 2007 + 2012 and COCO2017. This benchmark follows the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb32-accum16-coslr-200e | 80.35 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb32-accum16-coslr-200e | 40.9 | 61.0 | 44.6 | 36.8 | 58.1 | 39.5 |
Segmentation¶
The segmentation benchmarks include 2 downstream task datasets, Cityscapes and Pascal VOC 2012 + Aug. It follows the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb32-accum16-coslr-200e | 67.16 |
Citation¶
@inproceedings{grill2020bootstrap,
title={Bootstrap your own latent: A new approach to self-supervised learning},
author={Grill, Jean-Bastien and Strub, Florian and Altch{\'e}, Florent and Tallec, Corentin and Richemond, Pierre H and Buchatskaya, Elena and Doersch, Carl and Pires, Bernardo Avila and Guo, Zhaohan Daniel and Azar, Mohammad Gheshlaghi and others},
booktitle={NeurIPS},
year={2020}
}
DeepCluster¶
Abstract¶
Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network.

Results and Models¶
Back to model_zoo.md to download models.
In this page, we provide benchmarks as much as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on ImageNet-1k dataset.
Classification¶
The classification benchmarks include 4 downstream task datasets, VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates from which layer's feature map the best results are obtained. For example, if the Best Layer is feature3, its best result is obtained from the second stage of ResNet (1 for the stem layer, 2-5 for the 4 stage layers).
Besides, k=1 to 96 indicates the hyper-parameter of Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
sobel_resnet50_8xb64-steplr-200e | feature5 | 74.26 | 29.37 | 37.99 | 45.85 | 55.57 | 62.48 | 66.15 | 70.00 | 71.37 |
ImageNet Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for details of config.
The AvgPool result is obtained from Linear Evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-steplr-100e_in1k for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
sobel_resnet50_8xb64-steplr-200e | 12.78 | 30.81 | 43.88 | 57.71 | 51.68 | 46.92 |
Places205 Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
sobel_resnet50_8xb64-steplr-200e | 18.80 | 33.93 | 41.44 | 47.22 | 42.61 |
Citation¶
@inproceedings{caron2018deep,
title={Deep clustering for unsupervised learning of visual features},
author={Caron, Mathilde and Bojanowski, Piotr and Joulin, Armand and Douze, Matthijs},
booktitle={ECCV},
year={2018}
}
DenseCL¶
Abstract¶
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.

Results and Models¶
Back to model_zoo.md to download models.
In this page, we provide benchmarks as much as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on ImageNet-1k dataset.
Classification¶
The classification benchmarks include 4 downstream task datasets, VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates from which layer's feature map the best results are obtained. For example, if the Best Layer is feature3, its best result is obtained from the second stage of ResNet (1 for the stem layer, 2-5 for the 4 stage layers).
Besides, k=1 to 96 indicates the hyper-parameter of Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | feature5 | 82.5 | 42.68 | 50.64 | 61.74 | 68.17 | 72.99 | 76.07 | 79.19 | 80.55 |
ImageNet Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for details of config.
The AvgPool result is obtained from Linear Evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-steplr-100e_in1k for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 15.86 | 35.47 | 49.46 | 64.06 | 62.95 | 63.34 |
Places205 Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 21.32 | 36.20 | 43.97 | 51.04 | 50.45 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different number of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb32-coslr-200e | 48.2 | 48.5 | 46.8 | 45.6 |
Detection¶
The detection benchmarks include 2 downstream task datasets, Pascal VOC 2007 + 2012 and COCO2017. This benchmark follows the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb32-coslr-200e | 82.14 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e |
Segmentation¶
The segmentation benchmarks include 2 downstream task datasets, Cityscapes and Pascal VOC 2012 + Aug. It follows the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb32-coslr-200e | 69.47 |
Citation¶
@inproceedings{wang2021dense,
title={Dense contrastive learning for self-supervised visual pre-training},
author={Wang, Xinlong and Zhang, Rufeng and Shen, Chunhua and Kong, Tao and Li, Lei},
booktitle={CVPR},
year={2021}
}
MoCo v2¶
Abstract¶
Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR’s design improvements by implementing them in the MoCo framework. With simple modifications to MoCo—namely, using an MLP projection head and more data augmentation—we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible.

Results and Models¶
Back to model_zoo.md to download models.
In this page, we provide benchmarks as much as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on ImageNet-1k dataset.
Classification¶
The classification benchmarks include 4 downstream task datasets, VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates from which layer's feature map the best results are obtained. For example, if the Best Layer is feature3, its best result is obtained from the second stage of ResNet (1 for the stem layer, 2-5 for the 4 stage layers).
Besides, k=1 to 96 indicates the hyper-parameter of Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | feature5 | 84.04 | 43.14 | 53.29 | 65.34 | 71.03 | 75.42 | 78.48 | 80.88 | 82.23 |
ImageNet Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for details of config.
The AvgPool result is obtained from Linear Evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-steplr-100e_in1k for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 15.96 | 34.22 | 45.78 | 61.11 | 66.24 | 67.58 |
Places205 Linear Evaluation¶
The Feature1 - Feature5 don’t have the GlobalAveragePooling, the feature map is pooled to the specific dimensions and then follows a Linear layer to do the classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for details of config.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 20.92 | 35.72 | 42.62 | 49.79 | 52.25 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different number of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb32-coslr-200e | 55.6 | 55.7 | 53.7 | 52.5 |
Detection¶
The detection benchmarks include 2 downstream task datasets, Pascal VOC 2007 + 2012 and COCO2017. This benchmark follows the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb32-coslr-200e | 81.06 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 40.2 | 59.7 | 44.2 | 36.1 | 56.7 | 38.8 |
Segmentation¶
The segmentation benchmarks include 2 downstream task datasets, Cityscapes and Pascal VOC 2012 + Aug. It follows the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb32-coslr-200e | 67.55 |
Citation¶
@article{chen2020improved,
title={Improved baselines with momentum contrastive learning},
author={Chen, Xinlei and Fan, Haoqi and Girshick, Ross and He, Kaiming},
journal={arXiv preprint arXiv:2003.04297},
year={2020}
}
NPID¶
Abstract¶
Neural net classifiers trained on data with annotated class labels can also capture apparent visual similarity among categories without being directed to do so. We study whether this observation can be extended beyond the conventional domain of supervised learning: Can we learn a good feature representation that captures apparent similar- ity among instances, instead of classes, by merely asking the feature to be discriminative of individual instances?
We formulate this intuition as a non-parametric classification problem at the instance-level, and use noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes. Our experimental results demonstrate that, under unsupervised learning settings, our method surpasses the state-of-the-art on ImageNet classification by a large margin.
Our method is also remarkable for consistently improving test performance with more training data and better network architectures. By fine-tuning the learned feature, we further obtain competitive results for semi-supervised learning and object detection tasks. Our non-parametric model is highly compact: With 128 features per image, our method requires only 600MB storage for a million images, enabling fast nearest neighbour retrieval at the run time.

Results and Models¶
Back to model_zoo.md to download models.
In this page, we provide benchmarks as much as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on ImageNet-1k dataset.
Classification¶
The classification benchmarks include 4 downstream task datasets, VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates the layer whose feature map yields the best result. For example, if the Best Layer is feature3, the best result is obtained from the second stage of ResNet (feature1 corresponds to the stem layer, and feature2 - feature5 to the four residual stages).
Besides, k=1 to 96 indicates the hyper-parameter of the Low-shot SVM. A rough sketch of this SVM evaluation is given after the table below.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb32-steplr-200e | feature5 | 76.75 | 26.96 | 35.37 | 44.48 | 53.89 | 60.39 | 66.41 | 71.48 | 73.39 |
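The snippet below is a rough sketch of how such an SVM benchmark can be run with scikit-learn: one binary LinearSVC per VOC class is trained on frozen backbone features. The feature arrays, the cost value and the class weights are illustrative assumptions, not the exact settings used to produce the numbers above.
import numpy as np
from sklearn.svm import LinearSVC

def train_voc_svms(train_feats, train_labels, cost=1.0):
    """Train one binary SVM per class on frozen features.

    train_feats: (N, D) array of extracted features.
    train_labels: (N, num_classes) array in {1, -1, 0}, where 0 marks "difficult" samples.
    """
    classifiers = []
    for cls in range(train_labels.shape[1]):
        mask = train_labels[:, cls] != 0                        # drop "difficult" samples for this class
        clf = LinearSVC(C=cost, class_weight={1: 2, -1: 1},     # illustrative cost and class weights
                        max_iter=2000)
        clf.fit(train_feats[mask], train_labels[mask, cls])
        classifiers.append(clf)
    return classifiers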
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for config details. A simplified sketch of such a pooled linear head is given after the table below.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-steplr-100e_in1k for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb32-steplr-200e | 14.68 | 31.98 | 42.85 | 56.95 | 58.41 | 57.97 |
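The following is a simplified sketch of the pooled-feature linear head described above: a frozen backbone stage's feature map is adaptively average-pooled to a small fixed spatial size, flattened, and classified by a single linear layer, which is the only trained module. The channel count, pooling size and class count here are illustrative assumptions.
import torch
import torch.nn as nn

class PooledLinearHead(nn.Module):
    def __init__(self, in_channels, pool_size, num_classes=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # pool the stage output to pool_size x pool_size
        self.fc = nn.Linear(in_channels * pool_size * pool_size, num_classes)

    def forward(self, feat_map):
        # feat_map: (N, C, H, W) produced by a frozen backbone stage
        x = self.pool(feat_map).flatten(1)
        return self.fc(x)

# e.g. a head for ResNet-50 stage-4 features pooled to 2x2 (illustrative numbers)
head = PooledLinearHead(in_channels=2048, pool_size=2)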
Places205 Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb32-steplr-200e | 19.98 | 34.86 | 41.59 | 48.43 | 48.71 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb32-steplr-200e | 42.9 | 44.0 | 43.2 | 42.2 |
Detection¶
The detection benchmarks include two downstream datasets, Pascal VOC 2007 + 2012 and COCO2017. They follow the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb32-steplr-200e | 79.52 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb32-steplr-200e | 38.5 | 57.7 | 42.0 | 34.6 | 54.8 | 37.1 |
Segmentation¶
The segmentation benchmarks include two downstream datasets, Cityscapes and Pascal VOC 2012 + Aug. They follow the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb32-steplr-200e | 65.45 |
Citation¶
@inproceedings{wu2018unsupervised,
title={Unsupervised feature learning via non-parametric instance discrimination},
author={Wu, Zhirong and Xiong, Yuanjun and Yu, Stella X and Lin, Dahua},
booktitle={CVPR},
year={2018}
}
ODC¶
Abstract¶
Joint clustering and feature learning methods have shown remarkable performance in unsupervised representation learning. However, the training schedule alternating between feature clustering and network parameters update leads to unstable learning of visual representations. To overcome this challenge, we propose Online Deep Clustering (ODC) that performs clustering and network update simultaneously rather than alternatingly. Our key insight is that the cluster centroids should evolve steadily in keeping the classifier stably updated. Specifically, we design and maintain two dynamic memory modules, i.e., samples memory to store samples’ labels and features, and centroids memory for centroids evolution. We break down the abrupt global clustering into steady memory update and batch-wise label re-assignment. The process is integrated into network update iterations. In this way, labels and the network evolve shoulder-to-shoulder rather than alternatingly. Extensive experiments demonstrate that ODC stabilizes the training process and boosts the performance effectively.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include four downstream datasets: VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 accuracy (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates the layer whose feature map yields the best result. For example, if the Best Layer is feature3, the best result is obtained from the second stage of ResNet (feature1 corresponds to the stem layer, and feature2 - feature5 to the four residual stages).
Besides, k=1 to 96 indicates the hyper-parameter of the Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb64-steplr-440e | feature5 | 78.42 | 32.42 | 40.27 | 49.95 | 59.96 | 65.71 | 69.99 | 73.64 | 75.13 |
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for config details.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-steplr-100e_in1k for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb64-steplr-440e | 14.76 | 31.82 | 42.44 | 55.76 | 57.70 | 53.42 |
Places205 Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb64-steplr-440e | 19.28 | 34.09 | 40.90 | 47.04 | 48.35 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb64-steplr-440e | 38.5 | 39.1 | 37.8 | 36.9 |
Citation¶
@inproceedings{zhan2020online,
title={Online deep clustering for unsupervised representation learning},
author={Zhan, Xiaohang and Xie, Jiahao and Liu, Ziwei and Ong, Yew-Soon and Loy, Chen Change},
booktitle={CVPR},
year={2020}
}
Relative Location¶
Abstract¶
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the RCNN framework and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include four downstream datasets: VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 accuracy (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates the layer whose feature map yields the best result. For example, if the Best Layer is feature3, the best result is obtained from the second stage of ResNet (feature1 corresponds to the stem layer, and feature2 - feature5 to the four residual stages).
Besides, k=1 to 96 indicates the hyper-parameter of the Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb64-steplr-70e | feature4 | 65.52 | 20.36 | 23.12 | 30.66 | 37.02 | 42.55 | 50.00 | 55.58 | 59.28 |
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for config details.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-steplr-100e_in1k for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb64-steplr-70e | 15.11 | 30.47 | 42.83 | 51.20 | 40.96 | 39.65 |
Places205 Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb64-steplr-70e | 20.69 | 34.72 | 43.01 | 45.97 | 41.96 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb64-steplr-70e | 14.5 | 15.0 | 15.0 | 14.2 |
Detection¶
The detection benchmarks include two downstream datasets, Pascal VOC 2007 + 2012 and COCO2017. They follow the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb64-steplr-70e | 79.70 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb64-steplr-70e | 37.5 | 56.2 | 41.3 | 33.7 | 53.3 | 36.1 |
Segmentation¶
The segmentation benchmarks include two downstream datasets, Cityscapes and Pascal VOC 2012 + Aug. They follow the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb64-steplr-70e | 63.49 |
Citation¶
@inproceedings{doersch2015unsupervised,
title={Unsupervised visual representation learning by context prediction},
author={Doersch, Carl and Gupta, Abhinav and Efros, Alexei A},
booktitle={ICCV},
year={2015}
}
Rotation Prediction¶
Abstract¶
Over the last years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that are available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method in various unsupervised feature learning benchmarks and we exhibit in all of them state-of-the-art performance. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include four downstream datasets: VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 accuracy (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates the layer whose feature map yields the best result. For example, if the Best Layer is feature3, the best result is obtained from the second stage of ResNet (feature1 corresponds to the stem layer, and feature2 - feature5 to the four residual stages).
Besides, k=1 to 96 indicates the hyper-parameter of the Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb16-steplr-70e | feature4 | 67.70 | 20.60 | 24.35 | 31.41 | 39.17 | 46.56 | 53.37 | 59.14 | 62.42 |
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for config details.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-steplr-100e_in1k.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb16-steplr-70e | 12.15 | 31.99 | 44.57 | 54.20 | 45.94 | 48.12 |
Places205 Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb16-steplr-70e | 18.94 | 34.72 | 44.53 | 46.30 | 44.12 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb16-steplr-70e | 11.0 | 11.9 | 12.6 | 12.4 |
Detection¶
The detection benchmarks include two downstream datasets, Pascal VOC 2007 + 2012 and COCO2017. They follow the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb16-steplr-70e | 79.67 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb16-steplr-70e | 37.9 | 56.5 | 41.5 | 34.2 | 53.9 | 36.7 |
Segmentation¶
The segmentation benchmarks include two downstream datasets, Cityscapes and Pascal VOC 2012 + Aug. They follow the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb16-steplr-70e | 64.31 |
Citation¶
@inproceedings{komodakis2018unsupervised,
title={Unsupervised representation learning by predicting image rotations},
author={Komodakis, Nikos and Gidaris, Spyros},
booktitle={ICLR},
year={2018}
}
SimCLR¶
Abstract¶
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include four downstream datasets: VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 accuracy (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates the layer whose feature map yields the best result. For example, if the Best Layer is feature3, the best result is obtained from the second stage of ResNet (feature1 corresponds to the stem layer, and feature2 - feature5 to the four residual stages).
Besides, k=1 to 96 indicates the hyper-parameter of the Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | feature5 | 79.98 | 35.02 | 42.79 | 54.87 | 61.91 | 67.38 | 71.88 | 75.56 | 77.4 |
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for config details.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb512-coslr-90e_in1k for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 16.29 | 31.11 | 39.99 | 55.06 | 62.91 | 62.56 |
resnet50_16xb256-coslr-200e | 15.44 | 31.47 | 41.83 | 59.44 | 66.41 | 66.66 |
Places205 Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 20.60 | 33.62 | 38.86 | 45.25 | 50.91 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb32-coslr-200e | 47.8 | 48.4 | 46.7 | 45.2 |
Detection¶
The detection benchmarks include two downstream datasets, Pascal VOC 2007 + 2012 and COCO2017. They follow the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb32-coslr-200e | 79.38 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-200e | 38.7 | 58.1 | 42.4 | 34.9 | 55.3 | 37.5 |
Segmentation¶
The segmentation benchmarks include two downstream datasets, Cityscapes and Pascal VOC 2012 + Aug. They follow the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb32-coslr-200e | 64.03 |
Citation¶
@inproceedings{chen2020simple,
title={A simple framework for contrastive learning of visual representations},
author={Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey},
booktitle={ICML},
year={2020},
}
SimSiam¶
Abstract¶
Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our “SimSiam” method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include four downstream datasets: VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 accuracy (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates the layer whose feature map yields the best result. For example, if the Best Layer is feature3, the best result is obtained from the second stage of ResNet (feature1 corresponds to the stem layer, and feature2 - feature5 to the four residual stages).
Besides, k=1 to 96 indicates the hyper-parameter of the Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb32-coslr-100e | feature5 | 84.64 | 39.65 | 49.86 | 62.48 | 69.50 | 74.48 | 78.31 | 81.06 | 82.56 |
resnet50_8xb32-coslr-200e | feature5 | 85.20 | 39.85 | 50.44 | 63.73 | 70.93 | 75.74 | 79.42 | 82.02 | 83.44 |
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for config details.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb512-coslr-90e_in1k for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-100e | 16.27 | 33.77 | 45.80 | 60.83 | 68.21 | 68.28 |
resnet50_8xb32-coslr-200e | 15.57 | 37.21 | 47.28 | 62.21 | 69.85 | 69.84 |
Places205 Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb32-coslr-100e | 21.32 | 35.66 | 43.05 | 50.79 | 53.27 |
resnet50_8xb32-coslr-200e | 21.17 | 35.85 | 43.49 | 50.99 | 54.10 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb32-coslr-100e | 57.4 | 57.6 | 55.8 | 54.2 |
resnet50_8xb32-coslr-200e | 60.2 | 60.4 | 58.8 | 57.4 |
Detection¶
The detection benchmarks include two downstream datasets, Pascal VOC 2007 + 2012 and COCO2017. They follow the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb32-coslr-100e | 79.80 |
resnet50_8xb32-coslr-200e | 79.85 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb32-coslr-100e | 38.6 | 57.6 | 42.3 | 34.6 | 54.8 | 36.9 |
resnet50_8xb32-coslr-200e | 38.8 | 58.0 | 42.3 | 34.9 | 55.3 | 37.6 |
Segmentation¶
The segmentation benchmarks include two downstream datasets, Cityscapes and Pascal VOC 2012 + Aug. They follow the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb32-coslr-100e | 48.35 |
resnet50_8xb32-coslr-200e | 46.27 |
Citation¶
@inproceedings{chen2021exploring,
title={Exploring simple siamese representation learning},
author={Chen, Xinlei and He, Kaiming},
booktitle={CVPR},
year={2021}
}
SwAV¶
Abstract¶
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a “swapped” prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include four downstream datasets: VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 accuracy (%).
VOC SVM / Low-shot SVM¶
The Best Layer indicates the layer whose feature map yields the best result. For example, if the Best Layer is feature3, the best result is obtained from the second stage of ResNet (feature1 corresponds to the stem layer, and feature2 - feature5 to the four residual stages).
Besides, k=1 to 96 indicates the hyper-parameter of the Low-shot SVM.
Self-Supervised Config | Best Layer | SVM | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | feature5 | 87.00 | 44.68 | 55.41 | 67.64 | 73.67 | 78.14 | 81.58 | 83.98 | 85.15 |
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_linear-8xb32-steplr-90e_in1k for config details.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_linear-8xb32-coslr-100e_in1k for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 16.98 | 34.96 | 49.26 | 65.98 | 70.74 | 70.47 |
Places205 Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-28e_places205.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 |
---|---|---|---|---|---|
resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 23.33 | 35.45 | 43.13 | 51.98 | 55.09 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 60.5 | 60.6 | 59.0 | 57.6 |
Detection¶
The detection benchmarks include two downstream datasets, Pascal VOC 2007 + 2012 and COCO2017. They follow the evaluation protocols set up by MoCo.
Pascal VOC 2007 + 2012¶
Please refer to faster_rcnn_r50_c4_mstrain_24k_voc0712.py for details of config.
Self-Supervised Config | AP50 |
---|---|
resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 77.64 |
COCO2017¶
Please refer to mask_rcnn_r50_fpn_mstrain_1x_coco.py for details of config.
Self-Supervised Config | mAP(Box) | AP50(Box) | AP75(Box) | mAP(Mask) | AP50(Mask) | AP75(Mask) |
---|---|---|---|---|---|---|
resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 40.2 | 60.5 | 43.9 | 36.3 | 57.5 | 38.8 |
Segmentation¶
The segmentation benchmarks include two downstream datasets, Cityscapes and Pascal VOC 2012 + Aug. They follow the evaluation protocols set up by MMSegmentation.
Pascal VOC 2012 + Aug¶
Please refer to fcn_r50-d8_512x512_20k_voc12aug.py for details of config.
Self-Supervised Config | mIOU |
---|---|
resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96 | 63.73 |
Citation¶
@inproceedings{caron2020unsupervised,
title={Unsupervised Learning of Visual Features by Contrasting Cluster Assignments},
author={Caron, Mathilde and Misra, Ishan and Mairal, Julien and Goyal, Priya and Bojanowski, Piotr and Joulin, Armand},
booktitle={NeurIPS},
year={2020}
}
MoCo v3¶
Abstract¶
This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include four downstream datasets: VOC, ImageNet, iNaturalist2018 and Places205. If not specified, the results are Top-1 accuracy (%).
ImageNet Linear Evaluation¶
The Linear Evaluation result is obtained by training a linear head on top of the frozen pre-trained backbone. Please refer to vit-small-p16_8xb128-coslr-90e_in1k for config details.
Self-Supervised Config | Linear Evaluation |
---|---|
vit-small-p16_linear-32xb128-fp16-coslr-300e_in1k-224 | 73.19 |
Citation¶
@InProceedings{Chen_2021_ICCV,
title = {An Empirical Study of Training Self-Supervised Vision Transformers},
author = {Chen, Xinlei and Xie, Saining and He, Kaiming},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2021}
}
MAE¶
Abstract¶
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.

Models and Benchmarks¶
Here, we report the results of the model pre-trained on ImageNet-1k for 400 epochs. The details are listed below:
Backbone | Pre-train epoch | Fine-tuning Top-1 | Pre-train Config | Fine-tuning Config | Download |
---|---|---|---|---|---|
ViT-B/16 | 400 | 83.1 | config | config | model | log |
Citation¶
@article{He2021MaskedAA,
title={Masked Autoencoders Are Scalable Vision Learners},
author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and
Piotr Doll{\'a}r and Ross B. Girshick},
journal={ArXiv},
year={2021}
}
SimMIM¶
Abstract¶
This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches without special designs such as blockwise masking and tokenization via discrete VAE or clustering. To study what makes the masked image modeling task learn good representations, we systematically study the major components in our framework, and find that simple designs of each component have revealed very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pre-text task; 2) predicting raw pixels of RGB values by direct regression performs no worse than the patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training also on this dataset, surpassing the previous best approach by +0.6%. When applied to a larger model of about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to facilitate the training of a 3B model (SwinV2-G); with 40× less data than in previous practice, we achieve state-of-the-art results on four representative vision benchmarks. The code and models will be publicly available at https://github.com/microsoft/SimMIM.

Models and Benchmarks¶
Here, we report the results of the model; more results will be coming soon.
Backbone | Pre-train epoch | Fine-tuning Top-1 | Pre-train Config | Fine-tuning Config | Download |
---|---|---|---|---|---|
Swin-Base | 100 | 82.9 | config | config | model | log |
Citation¶
@inproceedings{xie2021simmim,
title={SimMIM: A Simple Framework for Masked Image Modeling},
author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
BarlowTwins¶
Abstract¶
Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow’s redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.

Results and Models¶
Back to model_zoo.md to download models.
On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. If not mentioned, all models are pre-trained on the ImageNet-1k dataset.
Classification¶
The classification benchmarks include one downstream dataset, ImageNet. If not specified, the results are Top-1 accuracy (%).
ImageNet Linear Evaluation¶
For Feature1 - Feature5, GlobalAveragePooling is not applied: each feature map is pooled to a fixed dimension and then fed into a linear layer for classification. Please refer to resnet50_mhead_8xb32-steplr-90e.py for config details.
The AvgPool result is obtained from linear evaluation with GlobalAveragePooling. Please refer to resnet50_8xb32-steplr-100e_in1k.py for config details.
Self-Supervised Config | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | AvgPool |
---|---|---|---|---|---|---|
barlowtwins_resnet50_8xb256-coslr-300e_in1k | 15.51 | 33.98 | 45.96 | 61.90 | 71.01 | 71.66 |
ImageNet Nearest-Neighbor Classification¶
The results are obtained from the features after GlobalAveragePooling. Here, k=10 to 200 indicates different numbers of nearest neighbors.
Self-Supervised Config | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
barlowtwins_resnet50_8xb256-coslr-300e_in1k | 63.6 | 63.8 | 62.7 | 61.9 |
Citation¶
@inproceedings{zbontar2021barlow,
title={Barlow twins: Self-supervised learning via redundancy reduction},
author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane},
booktitle={International Conference on Machine Learning},
year={2021},
}
CAE¶
Abstract¶
We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised learning. We randomly partition the image into two sets: visible patches and masked patches. The CAE architecture consists of: (i) an encoder that takes visible patches as input and outputs their latent representations, (ii) a latent context regressor that predicts the masked patch representations from the visible patch representations that are not updated in this regressor, (iii) a decoder that takes the estimated masked patch representations as input and makes predictions for the masked patches, and (iv) an alignment module that aligns the masked patch representation estimation with the masked patch representations computed from the encoder. In comparison to previous MIM methods that couple the encoding and decoding roles, e.g., using a single module in BEiT, our approach attempts to separate the encoding role (content understanding) from the decoding role (making predictions for masked patches) using different modules, improving the content understanding capability. In addition, our approach makes predictions from the visible patches to the masked patches in the latent representation space that is expected to take on semantics. In addition, we present the explanations about why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation, and object detection and instance segmentation.

Prerequisite¶
Create a new folder cae_ckpt under the root directory and download the weights of the dalle encoder into that folder.
Models and Benchmarks¶
Here, we report the results of the model pre-trained on ImageNet-1k for 300 epochs. The details are listed below:
Backbone | Pre-train epoch | Fine-tuning Top-1 | Pre-train Config | Fine-tuning Config | Download |
---|---|---|---|---|---|
ViT-B/16 | 300 | 83.2 | config | config | model | log |
Citation¶
@article{CAE,
title={Context Autoencoder for Self-Supervised Representation Learning},
author={Xiaokang Chen and Mingyu Ding and Xiaodi Wang and Ying Xin and Shentong Mo and
Yunhao Wang and Shumin Han and Ping Luo and Gang Zeng and Jingdong Wang},
journal={ArXiv},
year={2022}
}
Changelog¶
MMSelfSup¶
v0.11.0 (30/12/2022)¶
v0.10.1 (01/11/2022)¶
v0.10.0 (30/09/2022)¶
v0.9.2 (28/07/2022)¶
v0.9.1 (31/05/2022)¶
v0.5.0 (16/12/2021)¶
Highlights¶
Released after the code refactoring.
Added 3 new self-supervised learning algorithms.
Supported benchmarks with MMDet and MMSeg.
Added comprehensive documentation.
Refactoring¶
Merged redundant dataset files.
Adapted to the new version of MMCV and removed legacy code.
Inherited from MMCV BaseModule.
Optimized the directory structure.
Renamed all config files.
New Features¶
Added the SwAV, SimSiam and DenseCL algorithms.
Added a t-SNE visualization tool.
Supported MMCV-based fp16 training.
Benchmarks¶
More benchmark results, including classification, detection and segmentation.
Supported some new datasets in downstream tasks.
Launched MMDet and MMSeg training with MIM.
Docs¶
Refactored the README, getting_started, install and model_zoo docs.
Added data preparation docs.
Added comprehensive tutorials.
Differences between MMSelfSup and OpenSelfSup¶
This document records the differences between the latest MMSelfSup and the older versions as well as OpenSelfSup.
MMSelfSup has been refactored and resolves many legacy issues. It is not compatible with OpenSelfSup: for example, old config files need to be updated, because the names of some classes or components have been changed.
The main differences are the codebase conventions and the modular design.
Modular Design¶
To build a clearer directory structure, MMSelfSup redesigns some of its modules.
Datasets¶
MMSelfSup merges several datasets to reduce redundant code:
Classification, Extraction, NPID -> OneViewDataset
BYOL, Contrastive -> MultiViewDataset
The data_sources folder has been refactored, and the data loading functions are now more robust.
In addition, this part is still being refactored and will be released in an upcoming version.
Models¶
The registry mechanism has been updated. Now, every part under the models folder is built with a parent registry, MMCV_MODELS, imported from MMCV. Please check mmselfsup/models/builder.py and mmcv/utils/registry.py for more information.
The models folder contains algorithms, backbones, necks, heads, memories and some required utilities. The algorithms part assembles the other major components to build self-supervised learning algorithms, just like classifiers in MMCls or detectors in MMDet.
In OpenSelfSup, the naming of necks was somewhat confusing and the implementations were placed in a single Python file. Now the necks part has been refactored, organized into separate files, and renamed. Please check mmselfsup/models/necks for more information.
Codebase Conventions¶
Since OpenSelfSup had not been updated for a long time, MMSelfSup updates the codebase conventions.
Config Files¶
MMSelfSup renames all config files and establishes a naming convention; please check 0_config for more information.
In the config files, the argument names of some classes and the names of some components have been modified.
One algorithm name was changed: MOCO -> MoCo.
Since all model components inherit from BaseModule in MMCV, models are initialized according to init_cfg. Please follow this convention for initialization settings; init_weights still works.
Please use the new neck names when composing algorithms, and double-check them before writing config files.
Normalization layers are managed through norm_cfg. An illustrative config fragment is given below.
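Below is an illustrative config fragment, not copied from a shipped config file, showing both conventions: weights initialized through init_cfg (inherited from MMCV's BaseModule) and normalization layers selected through norm_cfg. The checkpoint path is a placeholder.
model = dict(
    type='MoCo',  # algorithm names follow the new convention (MoCo, not MOCO)
    backbone=dict(
        type='ResNet',
        depth=50,
        norm_cfg=dict(type='SyncBN'),  # normalization layers are managed through norm_cfg
        init_cfg=dict(type='Pretrained', checkpoint='path/to/pretrained.pth')))  # placeholder path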
Scripts¶
The directory structure of tools has been modified and is now clearer: scripts are categorized and managed in several folders. For example, there are two conversion folders, one for models and one for data formats. Besides, all benchmark-related scripts are in the benchmarks folder, which has the same directory structure as configs/benchmarks.
The arguments of the train.py script have been updated. The two main changes are:
Added the --cfg-options argument to modify the config through command-line arguments.
Removed --pretrained; the pre-trained model is now set through --cfg-options.
mmselfsup.apis¶
- mmselfsup.apis.inference_model(model: torch.nn.modules.module.Module, data: PIL.Image) → Tuple[torch.Tensor, Union[torch.Tensor, dict]][source]¶
Inference an image with the model.
- Parameters
model (nn.Module) – The loaded model.
data (PIL.Image) – The loaded image.
- Returns
Output of model inference.
- data (torch.Tensor): The loaded image to input to the model.
- output (torch.Tensor, dict[str, torch.Tensor]): the output of the test model.
- Return type
Tuple[torch.Tensor, Union(torch.Tensor, dict)]
- mmselfsup.apis.init_model(config: Union[str, mmcv.utils.config.Config], checkpoint: Optional[str] = None, device: str = 'cuda:0', options: Optional[dict] = None) → torch.nn.modules.module.Module[source]¶
Initialize a model from a config file.
- Parameters
config (str or mmcv.Config) – Config file path or the config object.
checkpoint (str, optional) – Checkpoint path. If left as None, the model will not load any weights. Defaults to None.
device (str) – The device where the model will be put on. Defaults to ‘cuda:0’.
options (dict, optional) – Options to override some settings in the used config. Defaults to None.
- Returns
The initialized model.
- Return type
nn.Module
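A minimal usage sketch of the two APIs above; the config path, checkpoint path and image file are placeholders rather than files shipped with the repo.
from PIL import Image
from mmselfsup.apis import init_model, inference_model

# Build the model from a config and (optionally) load a checkpoint.
model = init_model('configs/selfsup/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py',
                   checkpoint='work_dirs/simclr/latest.pth',  # placeholder checkpoint path
                   device='cuda:0')
# Run inference on a single PIL image; returns the preprocessed tensor and the model output.
data, output = inference_model(model, Image.open('demo.jpg'))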
- mmselfsup.apis.init_random_seed(seed=None, device='cuda')[source]¶
Initialize random seed.
If the seed is not set, it will be randomized automatically and then broadcast to all processes to prevent some potential bugs.
- Parameters
seed (int, optional) – The seed. Defaults to None.
device (str) – The device where the seed will be put on. Defaults to ‘cuda’.
- Returns
Seed to be used.
- Return type
int
- mmselfsup.apis.set_random_seed(seed, deterministic=False)[source]¶
Set random seed.
- Parameters
seed (int) – Seed to be used.
deterministic (bool) – Whether to set the deterministic option for CUDNN backend, i.e., set torch.backends.cudnn.deterministic to True and torch.backends.cudnn.benchmark to False. Defaults to False.
mmselfsup.core¶
hooks¶
- class mmselfsup.core.hooks.DeepClusterHook(extractor, clustering, unif_sampling, reweight, reweight_pow, init_memory=False, initial=True, interval=1, dist_mode=True, data_loaders=None)[source]¶
Hook for DeepCluster.
This hook includes the global clustering process in DC.
- Parameters
extractor (dict) – Config dict for feature extraction.
clustering (dict) – Config dict that specifies the clustering algorithm.
unif_sampling (bool) – Whether to apply uniform sampling.
reweight (bool) – Whether to apply loss re-weighting.
reweight_pow (float) – The power of re-weighting.
init_memory (bool) – Whether to initialize memory banks used in ODC. Defaults to False.
initial (bool) – Whether to call the hook initially. Defaults to True.
interval (int) – Frequency of epochs to call the hook. Defaults to 1.
dist_mode (bool) – Use distributed training or not. Defaults to True.
data_loaders (DataLoader) – A PyTorch dataloader. Defaults to None.
- class mmselfsup.core.hooks.DenseCLHook(start_iters=1000, **kwargs)[source]¶
Hook for DenseCL.
This hook includes the loss_lambda warmup in DenseCL. Borrowed from the authors’ code: https://github.com/WXinlong/DenseCL.
- Parameters
start_iters (int, optional) – The number of warmup iterations to set loss_lambda=0. Defaults to 1000.
- class mmselfsup.core.hooks.DistOptimizerHook(update_interval=1, grad_clip=None, coalesce=True, bucket_size_mb=-1, frozen_layers_cfg={})[source]¶
Optimizer hook for distributed training.
This hook can accumulate gradients every n intervals and freeze some layers for some iterations at the beginning. An illustrative config fragment is given after the parameter list.
- Parameters
update_interval (int, optional) – The update interval of the weights, set > 1 to accumulate the grad. Defaults to 1.
grad_clip (dict, optional) – Dict to config the value of grad clip. E.g., grad_clip = dict(max_norm=10). Defaults to None.
coalesce (bool, optional) – Whether allreduce parameters as a whole. Defaults to True.
bucket_size_mb (int, optional) – Size of bucket, the unit is MB. Defaults to -1.
frozen_layers_cfg (dict, optional) – Dict to config frozen layers. The key-value pair is layer name and its frozen iters. If frozen, the layer gradient would be set to None. Defaults to dict().
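As an illustration only (the values are made up, and the exact wiring of optimizer_config may differ between configs), a fragment like the following would accumulate gradients over two iterations and freeze a hypothetical layer for the first 1000 iterations:
optimizer_config = dict(
    update_interval=2,                  # accumulate gradients over 2 iterations per weight update
    grad_clip=dict(max_norm=10),        # optional gradient clipping
    frozen_layers_cfg=dict(head=1000))  # layer name -> iterations to freeze it ('head' is a hypothetical name)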
- class mmselfsup.core.hooks.GradAccumFp16OptimizerHook(update_interval=1, frozen_layers_cfg={}, **kwargs)[source]¶
Fp16 optimizer hook (using PyTorch’s implementation).
This hook can accumulate gradients every n intervals and freeze some layers for some iterations at the beginning. If you are using PyTorch >= 1.6, torch.cuda.amp is used as the backend to take care of the optimization procedure.
- Parameters
update_interval (int, optional) – The update interval of the weights, set > 1 to accumulate the grad. Defaults to 1.
frozen_layers_cfg (dict, optional) – Dict to config frozen layers. The key-value pair is layer name and its frozen iters. If frozen, the layer gradient would be set to None. Defaults to dict().
- after_train_iter(runner)[source]¶
Backward optimization steps for Mixed Precision Training. For dynamic loss scaling, please refer to https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.
Scale the loss by a scale factor.
Backward the loss to obtain the gradients.
Unscale the optimizer’s gradient tensors.
Call optimizer.step() and update scale factor.
Save loss_scaler state_dict for resume purpose.
- class mmselfsup.core.hooks.InterCLRHook(extractor, clustering, centroids_update_interval, deal_with_small_clusters_interval, evaluate_interval, warmup_epochs=0, init_memory=True, initial=True, online_labels=True, interval=1, dist_mode=True, data_loaders=None)[source]¶
Hook for InterCLR.
This hook includes the clustering process in InterCLR.
- Parameters
extractor (dict) – Config dict for feature extraction.
clustering (dict) – Config dict that specifies the clustering algorithm.
centroids_update_interval (int) – Frequency of iterations to update centroids.
deal_with_small_clusters_interval (int) – Frequency of iterations to deal with small clusters.
evaluate_interval (int) – Frequency of iterations to evaluate clusters.
warmup_epochs (int, optional) – The number of warmup epochs to set intra_loss_weight=1 and inter_loss_weight=0. Defaults to 0.
init_memory (bool) – Whether to initialize memory banks used in online labels. Defaults to True.
initial (bool) – Whether to call the hook initially. Defaults to True.
online_labels (bool) – Whether to use online labels. Defaults to True.
interval (int) – Frequency of epochs to call the hook. Defaults to 1.
dist_mode (bool) – Use distributed training or not. Defaults to True.
data_loaders (DataLoader) – A PyTorch dataloader. Defaults to None.
- class mmselfsup.core.hooks.MomentumUpdateHook(end_momentum=1.0, update_interval=1, **kwargs)[source]¶
Hook for updating the momentum parameter, used by BYOL, MoCoV3, etc.
This hook adjusts the momentum following:
m = 1 - (1 - m_0) * (cos(pi * k / K) + 1) / 2,
where k is the current step and K is the total number of steps. A small numeric illustration is given after the parameter list.
- Parameters
end_momentum (float) – The final momentum coefficient for the target network. Defaults to 1.
update_interval (int, optional) – The momentum update interval of the weights. Defaults to 1.
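A small numeric illustration of this schedule (m_0 = 0.996 is an assumed starting momentum, not a library default):
import math

def momentum_at_step(k, K, m_0=0.996, end_momentum=1.0):
    # m = end - (end - m_0) * (cos(pi * k / K) + 1) / 2
    return end_momentum - (end_momentum - m_0) * (math.cos(math.pi * k / K) + 1) / 2

print(momentum_at_step(0, 100))    # 0.996 at the first step
print(momentum_at_step(50, 100))   # ~0.998 halfway through training
print(momentum_at_step(100, 100))  # 1.0 at the final step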
- class mmselfsup.core.hooks.ODCHook(centroids_update_interval, deal_with_small_clusters_interval, evaluate_interval, reweight, reweight_pow, dist_mode=True)[source]¶
Hook for ODC.
This hook includes the online clustering process in ODC.
- Parameters
centroids_update_interval (int) – Frequency of iterations to update centroids.
deal_with_small_clusters_interval (int) – Frequency of iterations to deal with small clusters.
evaluate_interval (int) – Frequency of iterations to evaluate clusters.
reweight (bool) – Whether to perform loss re-weighting.
reweight_pow (float) – The power of re-weighting.
dist_mode (bool) – Use distributed training or not. Defaults to True.
- class mmselfsup.core.hooks.SimSiamHook(fix_pred_lr, lr, adjust_by_epoch=True, **kwargs)[source]¶
Hook for SimSiam.
This hook fixes the learning rate of the predictor for SimSiam.
- Parameters
fix_pred_lr (bool) – whether to fix the lr of predictor or not.
lr (float) – the value of fixed lr.
adjust_by_epoch (bool, optional) – whether to set lr by epoch or iter. Defaults to True.
- class mmselfsup.core.hooks.StepFixCosineAnnealingLrUpdaterHook(min_lr: Optional[float] = None, min_lr_ratio: Optional[float] = None, **kwargs)[source]¶
- class mmselfsup.core.hooks.SwAVHook(batch_size, epoch_queue_starts=15, crops_for_assign=[0, 1], feat_dim=128, queue_length=0, interval=1, **kwargs)[source]¶
Hook for SwAV.
This hook builds the queue in SwAV according to epoch_queue_starts. The queue will be saved in runner.work_dir, or loaded at the start epoch if that folder already contains saved queues. An illustrative config entry is given after the parameter list.
- Parameters
batch_size (int) – the batch size per GPU for computing.
epoch_queue_starts (int, optional) – from this epoch, starts to use the queue. Defaults to 15.
crops_for_assign (list[int], optional) – list of crops id used for computing assignments. Defaults to [0, 1].
feat_dim (int, optional) – feature dimension of output vector. Defaults to 128.
queue_length (int, optional) – length of the queue (0 for no queue). Defaults to 0.
interval (int, optional) – the interval to save the queue. Defaults to 1.
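An illustrative custom_hooks entry registering this hook; the numbers mirror the defaults documented above (queue_length is an assumed value) and are not tied to any particular config file.
custom_hooks = [
    dict(
        type='SwAVHook',
        batch_size=32,              # batch size per GPU
        epoch_queue_starts=15,      # start using the queue from this epoch
        crops_for_assign=[0, 1],    # crop ids used for computing assignments
        feat_dim=128,               # feature dimension of the output vector
        queue_length=3840)          # total queue length (0 disables the queue)
]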
optimizer¶
- class mmselfsup.core.optimizer.DefaultOptimizerConstructor(optimizer_cfg, paramwise_cfg=None)[source]¶
Rewrote the default constructor for optimizers. By default each parameter shares the same optimizer settings, and we provide an argument paramwise_cfg to specify parameter-wise settings. It is a dict and may contain the following fields.
- Parameters
model (nn.Module) – The model with parameters to be optimized.
optimizer_cfg (dict) – The config dict of the optimizer.
- Positional fields are
type: class name of the optimizer.
- Optional fields are
any arguments of the corresponding optimizer type, e.g., lr, weight_decay, momentum, etc.
paramwise_cfg (dict, optional) – Parameter-wise options. Defaults to None.
- Example 1:
>>> model = torch.nn.modules.Conv1d(1, 1, 1)
>>> optimizer_cfg = dict(type='SGD', lr=0.01, momentum=0.9,
>>>                      weight_decay=0.0001)
>>> paramwise_cfg = {'bias': dict(weight_decay=0., lars_exclude=True)}
>>> optim_builder = DefaultOptimizerConstructor(
>>>     optimizer_cfg, paramwise_cfg)
>>> optimizer = optim_builder(model)
- class mmselfsup.core.optimizer.LARS(params, lr=<required parameter>, momentum=0, weight_decay=0, dampening=0, eta=0.001, nesterov=False, eps=1e-08)[source]¶
Implements layer-wise adaptive rate scaling for SGD.
- Parameters
params (iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Base learning rate.
momentum (float, optional) – Momentum factor. Defaults to 0 (‘m’).
weight_decay (float, optional) – Weight decay (L2 penalty). Defaults to 0 (‘beta’).
dampening (float, optional) – Dampening for momentum. Defaults to 0.
eta (float, optional) – LARS coefficient. Defaults to 0.001.
nesterov (bool, optional) – Enables Nesterov momentum. Defaults to False.
eps (float, optional) – A small number to avoid dividing by zero. Defaults to 1e-08.
Based on Algorithm 1 of the paper Large Batch Training of Convolutional Networks by You, Gitman, and Ginsburg.
Example
>>> optimizer = LARS(model.parameters(), lr=0.1, momentum=0.9,
>>>                  weight_decay=1e-4, eta=1e-3)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
- class mmselfsup.core.optimizer.TransformerFinetuneConstructor(optimizer_cfg, paramwise_cfg=None)[source]¶
Rewrote the default constructor for optimizers.
By default each parameter shares the same optimizer settings, and we provide an argument paramwise_cfg to specify parameter-wise settings. In addition, we provide two optional parameters, model_type and layer_decay, to set the commonly used layer-wise learning rate decay schedule. Currently, we only support the layer-wise learning rate schedule for swin and vit.
- Parameters
optimizer_cfg (dict) – The config dict of the optimizer. Positional fields are
type: class name of the optimizer.
- Optional fields are
any arguments of the corresponding optimizer type, e.g., lr, weight_decay, momentum, model_type, layer_decay, etc.
paramwise_cfg (dict, optional) – Parameter-wise options. Defaults to None.
- Example 1:
>>> model = torch.nn.modules.Conv1d(1, 1, 1)
>>> optimizer_cfg = dict(type='SGD', lr=0.01, momentum=0.9,
>>>                      weight_decay=0.0001, model_type='vit')
>>> paramwise_cfg = {'bias': dict(weight_decay=0., lars_exclude=True)}
>>> optim_builder = TransformerFinetuneConstructor(
>>>     optimizer_cfg, paramwise_cfg)
>>> optimizer = optim_builder(model)
- mmselfsup.core.optimizer.build_optimizer(model, optimizer_cfg)[源代码]¶
Build optimizer from configs.
- 参数
model (nn.Module) – The model with parameters to be optimized.
optimizer_cfg (dict) – The config dict of the optimizer. Positional fields are:
type: class name of the optimizer.
lr: base learning rate.
- Optional fields are:
any arguments of the corresponding optimizer type, e.g., weight_decay, momentum, etc.
paramwise_options: a dict with regular expression as keys to match parameter names and a dict containing options as values. Options include 6 fields: lr, lr_mult, momentum, momentum_mult, weight_decay, weight_decay_mult.
- 返回
The initialized optimizer.
- 返回类型
torch.optim.Optimizer
示例
>>> model = torch.nn.modules.Conv1d(1, 1, 1)
>>> paramwise_options = {
>>>     '(bn|gn)(\d+)?.(weight|bias)': dict(weight_decay_mult=0.1),
>>>     '\Ahead.': dict(lr_mult=10, momentum=0)}
>>> optimizer_cfg = dict(type='SGD', lr=0.01, momentum=0.9,
>>>                      weight_decay=0.0001,
>>>                      paramwise_options=paramwise_options)
>>> optimizer = build_optimizer(model, optimizer_cfg)
mmselfsup.datasets¶
data_sources¶
- class mmselfsup.datasets.data_sources.BaseDataSource(data_prefix, classes=None, ann_file=None, test_mode=False, color_type='color', channel_order='rgb', file_client_args={'backend': 'disk'})[源代码]¶
Datasource base class to load dataset information.
- 参数
data_prefix (str) – the prefix of data path.
classes (str | Sequence[str], optional) – Specify classes to load.
ann_file (str | None) – the annotation file. When ann_file is str, the subclass is expected to read from the ann_file. When ann_file is None, the subclass is expected to read according to data_prefix.
test_mode (bool) – in train mode or test mode. Defaults to False.
color_type (str) – The flag argument for mmcv.imfrombytes(). Defaults to color.
channel_order (str) – The channel order of images when loaded. Defaults to rgb.
file_client_args (dict) – Arguments to instantiate a FileClient. See mmcv.fileio.FileClient for details. Defaults to dict(backend='disk').
- get_cat_ids(idx)[源代码]¶
Get category id by index.
- 参数
idx (int) – Index of data.
- 返回
Image category of specified index.
- 返回类型
int
- classmethod get_classes(classes=None)[源代码]¶
Get class names of current dataset.
- 参数
classes (Sequence[str] | str | None) – If classes is None, use default CLASSES defined by builtin dataset. If classes is a string, take it as a file name. The file contains the name of classes where each line contains one class name. If classes is a tuple or list, override the CLASSES defined by the dataset.
- 返回
Names of categories of the dataset.
- 返回类型
tuple[str] or list[str]
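As a small illustration of how get_classes resolves its argument (the class names and the file name below are made-up placeholders):

>>> from mmselfsup.datasets.data_sources import ImageNet
>>> ImageNet.get_classes(['cat', 'dog'])  # an explicit list overrides the built-in CLASSES
['cat', 'dog']
>>> # Passing a string is treated as a file path with one class name per line,
>>> # e.g. ImageNet.get_classes('my_classes.txt')  (hypothetical file)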
- class mmselfsup.datasets.data_sources.CIFAR10(data_prefix, classes=None, ann_file=None, test_mode=False, color_type='color', channel_order='rgb', file_client_args={'backend': 'disk'})[源代码]¶
CIFAR10 Dataset.
This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/cifar.py
- class mmselfsup.datasets.data_sources.CIFAR100(data_prefix, classes=None, ann_file=None, test_mode=False, color_type='color', channel_order='rgb', file_client_args={'backend': 'disk'})[源代码]¶
CIFAR100 Dataset.
- class mmselfsup.datasets.data_sources.ImageList(data_prefix, classes=None, ann_file=None, test_mode=False, color_type='color', channel_order='rgb', file_client_args={'backend': 'disk'})[源代码]¶
The implementation for loading any image list file.
The ImageList can load an annotation file or a list of files and merge all data records to one list. If data is unlabeled, the gt_label will be set -1.
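For illustration, a minimal sketch of an annotation file and the corresponding data source config; the paths and file names are assumptions, not fixed conventions:

# train.txt -- one image per line, with an optional integer label after the path.
# Lines without a label are treated as unlabeled, so gt_label is set to -1.
#   images/0001.jpg 0
#   images/0002.jpg 1
#   images/0003.jpg
data_source = dict(
    type='ImageList',
    data_prefix='data/custom',         # hypothetical dataset root
    ann_file='data/custom/train.txt')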
- class mmselfsup.datasets.data_sources.ImageNet(data_prefix, classes=None, ann_file=None, test_mode=False, color_type='color', channel_order='rgb', file_client_args={'backend': 'disk'})[源代码]¶
ImageNet Dataset.
This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/imagenet.py
- class mmselfsup.datasets.data_sources.ImageNet21k(data_prefix, classes=None, ann_file=None, multi_label=False, recursion_subdir=False, test_mode=False)[源代码]¶
ImageNet21k Dataset. Since the ImageNet21k dataset is extremely big, containing 21k+ classes and 1.4B files, this class improves the following points on the basis of the ImageNet class, in order to save memory usage and time:
Delete the samples attribute.
Use __slots__ to create a Data_item to replace dict.
Move setting the info dict from the load_annotations function to the prepare_data function.
Use int instead of np.array(…, np.int64).
- 参数
data_prefix (str) – the prefix of data path
ann_file (str | None) – the annotation file. When ann_file is str, the subclass is expected to read from the ann_file. When ann_file is None, the subclass is expected to read according to data_prefix
test_mode (bool) – in train mode or test mode
multi_label (bool) – use multi label or not.
recursion_subdir (bool) – whether to use pictures in sub-directories that meet the conditions, in the folder under the category directory.
pipelines¶
- class mmselfsup.datasets.pipelines.BEiTMaskGenerator(input_size: int, num_masking_patches: int, min_num_patches: int = 4, max_num_patches: Optional[int] = None, min_aspect: float = 0.3, max_aspect: Optional[float] = None)[源代码]¶
Generate mask for image.
This module is borrowed from https://github.com/microsoft/unilm/tree/master/beit
- 参数
input_size (int) – The size of input image.
num_masking_patches (int) – The number of patches to be masked.
min_num_patches (int) – The minimum number of patches to be masked in the process of generating mask. Defaults to 4.
max_num_patches (int, optional) – The maximum number of patches to be masked in the process of generating mask. Defaults to None.
min_aspect (float, optional) – The minimum aspect ratio of mask blocks. Defaults to 0.3.
max_aspect (float, optional) – The maximum aspect ratio of mask blocks. Defaults to None.
- class mmselfsup.datasets.pipelines.GaussianBlur(sigma_min, sigma_max, p=0.5)[源代码]¶
GaussianBlur augmentation referring to SimCLR (https://arxiv.org/abs/2002.05709).
- 参数
sigma_min (float) – The minimum parameter of Gaussian kernel std.
sigma_max (float) – The maximum parameter of Gaussian kernel std.
p (float, optional) – Probability. Defaults to 0.5.
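As a usage sketch, GaussianBlur is typically wrapped in RandomAppliedTrans inside a training pipeline so that the blur is applied with some probability; the sigma range and probabilities below follow the common SimCLR/BYOL-style setting and are assumptions rather than fixed values:

# Hypothetical fragment of a contrastive pre-training pipeline.
train_pipeline = [
    dict(type='RandomAppliedTrans',
         transforms=[dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0)],
         p=0.5),
    dict(type='Solarization', threshold=128, p=0.2),
]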
- class mmselfsup.datasets.pipelines.Lighting(alphastd=0.1)[源代码]¶
Lighting noise (AlexNet-style PCA-based noise).
- 参数
alphastd (float, optional) – The parameter for Lighting. Defaults to 0.1.
- class mmselfsup.datasets.pipelines.MaskFeatMaskGenerator(mask_window_size: int = 14, mask_ratio: float = 0.4, min_num_patches: int = 15, max_num_patches: Optional[int] = None, min_aspect: float = 0.3, max_aspect: Optional[float] = None)[源代码]¶
Generate random block mask for each image.
This module is borrowed from https://github.com/facebookresearch/SlowFast/blob/main/slowfast/datasets/transform.py
- 参数
mask_window_size (int) – Size of input image. Defaults to 14.
mask_ratio (float) – The mask ratio of image. Defaults to 0.4.
min_num_patches (int) – Minimum number of patches that require masking. Defaults to 15.
max_num_patches (int, optional) – Maximum number of patches that require masking. Defaults to None.
min_aspect (float) – Minimum aspect ratio of masked patches. Defaults to 0.3.
max_aspect (float, optional) – Maximum aspect ratio of masked patches. Defaults to None.
- class mmselfsup.datasets.pipelines.RandomAppliedTrans(transforms, p=0.5)[源代码]¶
Randomly applied transformations.
- 参数
transforms (list[dict]) – List of transformations in dictionaries.
p (float, optional) – Probability. Defaults to 0.5.
- class mmselfsup.datasets.pipelines.RandomAug(input_size=None, color_jitter=None, auto_augment=None, interpolation=None, re_prob=None, re_mode=None, re_count=None, mean=None, std=None)[源代码]¶
RandAugment data augmentation method based on “RandAugment: Practical automated data augmentation with a reduced search space”.
This code is borrowed from https://github.com/pengzhiliang/MAE-pytorch.
- class mmselfsup.datasets.pipelines.SimMIMMaskGenerator(input_size: int = 192, mask_patch_size: int = 32, model_patch_size: int = 4, mask_ratio: float = 0.6)[源代码]¶
Generate random block mask for each Image.
This module is used in SimMIM to generate masks.
- 参数
input_size (int) – Size of input image. Defaults to 192.
mask_patch_size (int) – Size of each block mask. Defaults to 32.
model_patch_size (int) – Patch size of each token. Defaults to 4.
mask_ratio (float) – The mask ratio of image. Defaults to 0.6.
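A minimal sketch of how this generator may appear as one step of a SimMIM-style pre-training pipeline (surrounding transforms are omitted; the values simply repeat the documented defaults):

# Hypothetical pipeline fragment for SimMIM-style masked image modeling.
train_pipeline = [
    dict(type='SimMIMMaskGenerator',
         input_size=192,        # must match the image size produced by earlier transforms
         mask_patch_size=32,
         model_patch_size=4,    # patch size of the backbone tokens
         mask_ratio=0.6),
]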
- class mmselfsup.datasets.pipelines.Solarization(threshold=128, p=0.5)[源代码]¶
Solarization augmentation referring to BYOL (https://arxiv.org/abs/2006.07733).
- 参数
threshold (float, optional) – The solarization threshold. Defaults to 128.
p (float, optional) – Probability. Defaults to 0.5.
samplers¶
- class mmselfsup.datasets.samplers.DistributedGivenIterationSampler(dataset, total_iter, batch_size, num_replicas=None, rank=None, last_iter=- 1)[源代码]¶
- class mmselfsup.datasets.samplers.DistributedGroupSampler(dataset, samples_per_gpu=1, num_replicas=None, rank=None)[源代码]¶
Sampler that restricts data loading to a subset of the dataset.
It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a DistributedSampler instance as a DataLoader sampler, and load a subset of the original dataset that is exclusive to it.
注解
Dataset is assumed to be of constant size.
- 参数
dataset – Dataset used for sampling.
num_replicas (optional) – Number of processes participating in distributed training.
rank (optional) – Rank of the current process within num_replicas.
datasets¶
- class mmselfsup.datasets.BaseDataset(data_source, pipeline, prefetch=False)[源代码]¶
Base dataset class.
The base dataset can be inherited by different algorithm’s datasets. After __init__, the data source and pipeline will be built. Besides, the algorithm specific dataset implements different operations after obtaining images from data sources.
- 参数
data_source (dict) – Data source defined in mmselfsup.datasets.data_sources.
pipeline (list[dict]) – A list of dict, where each element represents an operation defined in mmselfsup.datasets.pipelines.
prefetch (bool, optional) – Whether to prefetch data. Defaults to False.
- class mmselfsup.datasets.ConcatDataset(datasets)[源代码]¶
A wrapper of concatenated dataset.
Same as torch.utils.data.dataset.ConcatDataset, but it also concatenates the group flag for image aspect ratio.
- 参数
datasets (list[Dataset]) – A list of datasets.
- class mmselfsup.datasets.DeepClusterDataset(data_source, pipeline, prefetch=False)[源代码]¶
Dataset for DC and ODC.
The dataset initializes clustering labels and assigns them during training.
- 参数
data_source (dict) – Data source defined in mmselfsup.datasets.data_sources.
pipeline (list[dict]) – A list of dict, where each element represents an operation defined in mmselfsup.datasets.pipelines.
prefetch (bool, optional) – Whether to prefetch data. Defaults to False.
- class mmselfsup.datasets.MultiViewDataset(data_source, num_views, pipelines, prefetch=False)[源代码]¶
The dataset outputs multiple views of an image.
The number of views in the output dict depends on num_views. The image can be processed by one pipeline or multiple pipelines.
- 参数
data_source (dict) – Data source defined in mmselfsup.datasets.data_sources.
num_views (list) – The number of different views.
pipelines (list[list[dict]]) – A list of pipelines, where each pipeline contains elements that represents an operation defined in mmselfsup.datasets.pipelines.
prefetch (bool, optional) – Whether to prefetch data. Defaults to False.
实际案例
>>> dataset = MultiViewDataset(data_source, [2], [pipeline])
>>> output = dataset[idx]
The output has 2 views processed by one pipeline.
>>> dataset = MultiViewDataset(
>>>     data_source, [2, 6], [pipeline1, pipeline2])
>>> output = dataset[idx]
The output has 8 views processed by two pipelines: the first two views are processed by pipeline1 and the remaining six views by pipeline2.
- class mmselfsup.datasets.RelativeLocDataset(data_source, pipeline, format_pipeline, prefetch=False)[源代码]¶
Dataset for relative patch location.
The dataset crops image into several patches and concatenates every surrounding patch with center one. Finally it also outputs corresponding labels 0, 1, 2, 3, 4, 5, 6, 7.
- 参数
data_source (dict) – Data source defined in mmselfsup.datasets.data_sources.
pipeline (list[dict]) – A list of dict, where each element represents an operation defined in mmselfsup.datasets.pipelines.
format_pipeline (list[dict]) – A list of dict, it converts input format from PIL.Image to Tensor. The operation is defined in mmselfsup.datasets.pipelines.
prefetch (bool, optional) – Whether to prefetch data. Defaults to False.
- class mmselfsup.datasets.RepeatDataset(dataset, times)[源代码]¶
A wrapper of repeated dataset.
The length of repeated dataset will be times larger than the original dataset. This is useful when the data loading time is long but the dataset is small. Using RepeatDataset can reduce the data loading time between epochs.
- 参数
dataset (Dataset) – The dataset to be repeated.
times (int) – Repeat times.
- class mmselfsup.datasets.RotationPredDataset(data_source, pipeline, prefetch=False)[源代码]¶
Dataset for rotation prediction.
The dataset rotates the image with 0, 90, 180, and 270 degrees and outputs labels 0, 1, 2, 3 correspondingly.
- 参数
data_source (dict) – Data source defined in mmselfsup.datasets.data_sources.
pipeline (list[dict]) – A list of dict, where each element represents an operation defined in mmselfsup.datasets.pipelines.
prefetch (bool, optional) – Whether to prefetch data. Defaults to False.
- class mmselfsup.datasets.SingleViewDataset(data_source, pipeline, prefetch=False)[源代码]¶
The dataset outputs one view of an image, containing some other information such as label, idx, etc.
- 参数
data_source (dict) – Data source defined in mmselfsup.datasets.data_sources.
pipeline (list[dict]) – A list of dict, where each element represents an operation defined in mmselfsup.datasets.pipelines.
prefetch (bool, optional) – Whether to prefetch data. Defaults to False.
- evaluate(results, logger=None, topk=(1, 5))[源代码]¶
The evaluation function to output accuracy.
- 参数
results (dict) – The key-value pair is the output head name and corresponding prediction values.
logger (logging.Logger | str | None, optional) – The defined logger to be used. Defaults to None.
topk (tuple(int)) – The output includes topk accuracy.
- mmselfsup.datasets.build_dataloader(dataset, imgs_per_gpu=None, samples_per_gpu=None, workers_per_gpu=1, num_gpus=1, dist=True, shuffle=True, replace=False, seed=None, pin_memory=True, persistent_workers=True, **kwargs)[源代码]¶
Build PyTorch DataLoader.
In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.
- 参数
dataset (Dataset) – A PyTorch dataset.
imgs_per_gpu (int) – (Deprecated, please use samples_per_gpu) Number of images on each GPU, i.e., batch size of each GPU. Defaults to None.
samples_per_gpu (int) – Number of images on each GPU, i.e., batch size of each GPU. Defaults to None.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU. persistent_workers option needs num_workers > 0. Defaults to 1.
num_gpus (int) – Number of GPUs. Only used in non-distributed training.
dist (bool) – Distributed training/test or not. Defaults to True.
shuffle (bool) – Whether to shuffle the data at every epoch. Defaults to True.
replace (bool) – Replace or not in random shuffle. It works on when shuffle is True. Defaults to False.
seed (int) – set seed for dataloader.
pin_memory (bool, optional) – If True, the data loader will copy Tensors into CUDA pinned memory before returning them. Defaults to True.
persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the worker Dataset instances alive. The argument only takes effect with PyTorch>=1.7.0. Defaults to True.
kwargs – any keyword argument to be used to initialize DataLoader
- 返回
A PyTorch dataloader.
- 返回类型
DataLoader
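A minimal usage sketch, assuming a dataset has already been constructed (for example with mmselfsup.datasets.build_dataset); the batch-size and worker numbers are placeholders:

from mmselfsup.datasets import build_dataloader

# 'dataset' is any built mmselfsup dataset instance.
data_loader = build_dataloader(
    dataset,
    samples_per_gpu=32,   # batch size per GPU (placeholder value)
    workers_per_gpu=4,
    dist=False,           # single-process, non-distributed loading
    shuffle=True,
    seed=0)
for batch in data_loader:
    ...                   # each batch is a dict produced by the dataset pipeline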
mmselfsup.models¶
algorithms¶
- class mmselfsup.models.algorithms.BYOL(backbone, neck=None, head=None, base_momentum=0.996, init_cfg=None, **kwargs)[源代码]¶
BYOL.
Implementation of Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. The momentum adjustment is in core/hooks/byol_hook.py.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
base_momentum (float) – The base momentum coefficient for the target network. Defaults to 0.996.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- class mmselfsup.models.algorithms.BarlowTwins(backbone: Optional[dict] = None, neck: Optional[dict] = None, head: Optional[dict] = None, init_cfg: Optional[dict] = None, **kwargs)[源代码]¶
BarlowTwins.
Implementation of Barlow Twins: Self-Supervised Learning via Redundancy Reduction. Part of the code is borrowed from: https://github.com/facebookresearch/barlowtwins/blob/main/main.py.
- 参数
backbone (dict) – Config dict for module of backbone. Defaults to None.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
init_cfg (dict) – Config dict for weight initialization. Defaults to None.
- class mmselfsup.models.algorithms.BaseModel(init_cfg=None)[源代码]¶
Base model class for self-supervised learning.
- abstract extract_feat(imgs)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images. Typically these should be mean centered and std scaled.
- forward(img, mode='train', **kwargs)[源代码]¶
Forward function of model.
Calls either forward_train, forward_test or extract_feat function according to the mode.
- forward_test(imgs, **kwargs)[源代码]¶
- 参数
img (Tensor) – List of tensors. Typically these should be mean centered and std scaled.
kwargs (keyword arguments) – Specific to concrete implementation.
- abstract forward_train(imgs, **kwargs)[源代码]¶
- 参数
img (list[Tensor]) – List of tensors. Typically these should be mean centered and std scaled.
kwargs (keyword arguments) – Specific to concrete implementation.
- train_step(data, optimizer)[源代码]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating are also defined in this method, such as GAN.
- 参数
data (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of the runner is passed to train_step(). This argument is unused and reserved.
- 返回
- Dict of outputs. The following fields are contained.
loss (torch.Tensor): A tensor for back propagation, which can be a weighted sum of multiple losses.
log_vars (dict): Dict contains all the variables to be sent to the logger.
num_samples (int): Indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- 返回类型
dict
- val_step(data, optimizer)[源代码]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- class mmselfsup.models.algorithms.CAE(backbone: Optional[dict] = None, neck: Optional[dict] = None, head: Optional[dict] = None, base_momentum: float = 0.0, init_cfg: Optional[dict] = None, **kwargs)[源代码]¶
CAE.
Implementation of Context Autoencoder for Self-Supervised Representation Learning.
- 参数
backbone (dict, optional) – Config dict for module of backbone.
neck (dict, optional) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict, optional) – Config dict for module of loss functions. Defaults to None.
base_momentum (float) – The base momentum coefficient for the target network. Defaults to 0.0.
init_cfg (dict, optional) – the config to control the initialization.
- extract_feat(img: torch.Tensor, mask: torch.Tensor) → torch.Tensor[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images. Typically these should be mean centered and std scaled.
- class mmselfsup.models.algorithms.Classification(backbone, with_sobel=False, head=None, train_cfg=None, init_cfg=None)[源代码]¶
Simple image classification.
- 参数
backbone (dict) – Config dict for module of backbone.
with_sobel (bool) – Whether to apply a Sobel filter. Defaults to False.
head (dict) – Config dict for module of loss functions. Defaults to None.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward_test(img, **kwargs)[源代码]¶
Forward computation during test.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
A dictionary of output features.
- 返回类型
dict[str, Tensor]
- forward_train(img, label, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
label (Tensor) – Ground-truth labels.
kwargs – Any keyword arguments to be used to forward.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- class mmselfsup.models.algorithms.DeepCluster(backbone, with_sobel=True, neck=None, head=None, init_cfg=None)[源代码]¶
DeepCluster.
Implementation of Deep Clustering for Unsupervised Learning of Visual Features. The clustering operation is in core/hooks/deepcluster_hook.py.
- 参数
backbone (dict) – Config dict for module of backbone.
with_sobel (bool) – Whether to apply a Sobel filter on images. Defaults to True.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward_test(img, **kwargs)[源代码]¶
Forward computation during test.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
A dictionary of output features.
- 返回类型
dict[str, Tensor]
- forward_train(img, pseudo_label, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
pseudo_label (Tensor) – Label assignments.
kwargs – Any keyword arguments to be used to forward.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- class mmselfsup.models.algorithms.DenseCL(backbone, neck=None, head=None, queue_len=65536, feat_dim=128, momentum=0.999, loss_lambda=0.5, init_cfg=None, **kwargs)[源代码]¶
DenseCL.
Implementation of Dense Contrastive Learning for Self-Supervised Visual Pre-Training. Borrowed from the authors’ code: https://github.com/WXinlong/DenseCL. The loss_lambda warmup is in core/hooks/densecl_hook.py.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
queue_len (int) – Number of negative keys maintained in the queue. Defaults to 65536.
feat_dim (int) – Dimension of compact feature vectors. Defaults to 128.
momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.999.
loss_lambda (float) – Loss weight for the single and dense contrastive loss. Defaults to 0.5.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward_test(img, **kwargs)[源代码]¶
Forward computation during test.
- 参数
img (Tensor) – Input of two concatenated images of shape (N, 2, C, H, W). Typically these should be mean centered and std scaled.
- 返回
A dictionary of normalized output features.
- 返回类型
dict(Tensor)
- class mmselfsup.models.algorithms.InterCLRMoCo(backbone, neck=None, head=None, queue_len=65536, feat_dim=128, momentum=0.999, memory_bank=None, online_labels=True, neg_num=16384, neg_sampling='semihard', semihard_neg_pool_num=128000, semieasy_neg_pool_num=128000, intra_cos_marign_loss=False, intra_cos_margin=0, intra_arc_marign_loss=False, intra_arc_margin=0, inter_cos_marign_loss=True, inter_cos_margin=- 0.5, inter_arc_marign_loss=False, inter_arc_margin=0, intra_loss_weight=0.75, inter_loss_weight=0.25, share_neck=True, num_classes=10000, init_cfg=None, **kwargs)[源代码]¶
MoCo-InterCLR.
Official implementation of Delving into Inter-Image Invariance for Unsupervised Visual Representations. The clustering operation is in core/hooks/interclr_hook.py.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
queue_len (int) – Number of negative keys maintained in the queue. Defaults to 65536.
feat_dim (int) – Dimension of compact feature vectors. Defaults to 128.
momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.999.
memory_bank (dict) – Config dict for module of memory banks. Defaults to None.
online_labels (bool) – Whether to use online labels. Defaults to True.
neg_num (int) – Number of negative samples for inter-image branch. Defaults to 16384.
neg_sampling (str) – Negative sampling strategy. Support ‘hard’, ‘semihard’, ‘random’, ‘semieasy’. Defaults to ‘semihard’.
semihard_neg_pool_num (int) – Number of negative samples for semi-hard nearest neighbor pool. Defaults to 128000.
semieasy_neg_pool_num (int) – Number of negative samples for semi-easy nearest neighbor pool. Defaults to 128000.
intra_cos_marign_loss (bool) – Whether to use a cosine margin for the intra-image branch. Defaults to False.
intra_cos_margin (float) – Intra-image cosine margin. Defaults to 0.
intra_arc_marign_loss (bool) – Whether to use an arc margin for the intra-image branch. Defaults to False.
intra_arc_margin (float) – Intra-image arc margin. Defaults to 0.
inter_cos_marign_loss (bool) – Whether to use a cosine margin for the inter-image branch. Defaults to True.
inter_cos_margin (float) – Inter-image cosine margin. Defaults to -0.5.
inter_arc_marign_loss (bool) – Whether to use an arc margin for the inter-image branch. Defaults to False.
inter_arc_margin (float) – Inter-image arc margin. Defaults to 0.
intra_loss_weight (float) – Loss weight for intra-image branch. Defaults to 0.75.
inter_loss_weight (float) – Loss weight for inter-image branch. Defaults to 0.25.
share_neck (bool) – Whether to share the neck for intra- and inter-image branches. Defaults to True.
num_classes (int) – Number of clusters. Defaults to 10000.
- contrast_inter(q, idx)[源代码]¶
Inter-image invariance learning.
- 参数
q (Tensor) – Query features with shape (N, C).
idx (Tensor) – Index corresponding to each query.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- contrast_intra(q, k)[源代码]¶
Intra-image invariance learning.
- 参数
q (Tensor) – Query features with shape (N, C).
k (Tensor) – Key features with shape (N, C).
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward_train(img, idx, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (list[Tensor]) – A list of input images with shape (N, C, H, W). Typically these should be mean centered and std scaled.
idx (Tensor) – Index corresponding to each image.
kwargs – Any keyword arguments to be used to forward.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- class mmselfsup.models.algorithms.MAE(backbone: dict, neck: dict, head: dict, init_cfg: Optional[dict] = None)[源代码]¶
MAE.
Implementation of Masked Autoencoders Are Scalable Vision Learners.
- 参数
backbone (dict) – Config dict for encoder. Defaults to None.
neck (dict) – Config dict for encoder. Defaults to None.
head (dict) – Config dict for loss functions. Defaults to None.
init_cfg (dict, optional) – Config dict for weight initialization. Defaults to None.
- extract_feat(img: torch.Tensor) → Tuple[torch.Tensor][源代码]¶
Function to extract features from backbone.
- 参数
img (torch.Tensor) – Input images of shape (N, C, H, W).
- 返回
backbone outputs.
- 返回类型
Tuple[torch.Tensor]
- forward_test(img: torch.Tensor, **kwargs) → Tuple[torch.Tensor, torch.Tensor][源代码]¶
Forward computation during testing.
- 参数
img (torch.Tensor) – Input images of shape (N, C, H, W).
kwargs – Any keyword arguments to be used to forward.
- 返回
- Output of model test.
mask: Mask used to mask image.
pred: The output of neck.
- 返回类型
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.algorithms.MMClsImageClassifierWrapper(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, pretrained: Optional[str] = None, train_cfg: Optional[dict] = None, init_cfg: Optional[dict] = None)[源代码]¶
Workaround to use models from mmclassification.
Since the output of the classifier from mmclassification is not compatible with mmselfsup's evaluation function, we rewrite some key components from mmclassification.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict, optional) – Config dict for module of neck. Defaults to None.
head (dict, optional) – Config dict for module of loss functions. Defaults to None.
pretrained (str, optional) – The path of pre-trained checkpoint. Defaults to None.
train_cfg (dict, optional) – Config dict for pre-processing utils, e.g. mixup. Defaults to None.
init_cfg (dict, optional) – Config dict for initialization. Defaults to None.
- forward(img, mode='train', **kwargs)[源代码]¶
Forward function of model.
Calls either forward_train, forward_test or extract_feat function according to the mode.
- forward_test(imgs, **kwargs)[源代码]¶
- 参数
imgs (List[Tensor]) – the outer list indicates test-time augmentations and inner Tensor should have a shape NxCxHxW, which contains all images in the batch.
- forward_train(img, label, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.
label (Tensor) – It should be of shape (N, 1) encoding the ground-truth label of input images for the single-label task. It should be of shape (N, C) encoding the ground-truth label of input images for the multi-label task.
- 返回
a dictionary of loss components
- 返回类型
dict[str, Tensor]
- class mmselfsup.models.algorithms.MaskFeat(backbone: dict, head: dict, hog_para: dict, init_cfg: Optional[dict] = None)[源代码]¶
MaskFeat.
Implementation of Masked Feature Prediction for Self-Supervised Visual Pre-Training.
- 参数
backbone (dict) – Config dict for encoder.
head (dict) – Config dict for loss functions.
hog_para (dict) – Config dict for the HOG layer. dict['nbins'] (int): Number of bins. Defaults to 9. dict['pool'] (float): Number of cells. Defaults to 8. dict['gaussian_window'] (int): Size of the gaussian kernel. Defaults to 16.
init_cfg (dict) – Config dict for weight initialization. Defaults to None.
- class mmselfsup.models.algorithms.MoCo(backbone, neck=None, head=None, queue_len=65536, feat_dim=128, momentum=0.999, init_cfg=None, **kwargs)[源代码]¶
MoCo.
Implementation of Momentum Contrast for Unsupervised Visual Representation Learning. Part of the code is borrowed from: https://github.com/facebookresearch/moco/blob/master/moco/builder.py.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
queue_len (int) – Number of negative keys maintained in the queue. Defaults to 65536.
feat_dim (int) – Dimension of compact feature vectors. Defaults to 128.
momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.999.
- class mmselfsup.models.algorithms.MoCoV3(backbone, neck, head, base_momentum=0.99, init_cfg=None, **kwargs)[源代码]¶
MoCo v3.
Implementation of An Empirical Study of Training Self-Supervised Vision Transformers.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
base_momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.99.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images. Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- class mmselfsup.models.algorithms.NPID(backbone, neck=None, head=None, memory_bank=None, neg_num=65536, ensure_neg=False, init_cfg=None)[源代码]¶
NPID.
Implementation of Unsupervised Feature Learning via Non-parametric Instance Discrimination.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
memory_bank (dict) – Config dict for module of memory banks. Defaults to None.
neg_num (int) – Number of negative samples for each image. Defaults to 65536.
ensure_neg (bool) – If False, there is a small probability that negative samples contain positive ones. Defaults to False.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward_train(img, idx, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
idx (Tensor) – Index corresponding to each image.
kwargs – Any keyword arguments to be used to forward.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- class mmselfsup.models.algorithms.ODC(backbone, with_sobel=False, neck=None, head=None, memory_bank=None, init_cfg=None)[源代码]¶
ODC.
Official implementation of Online Deep Clustering for Unsupervised Representation Learning. The operation w.r.t. memory bank and loss re-weighting is in
core/hooks/odc_hook.py.
- 参数
backbone (dict) – Config dict for module of backbone.
with_sobel (bool) – Whether to apply a Sobel filter on images. Defaults to False.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
memory_bank (dict) – Module of memory banks. Defaults to None.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward_test(img, **kwargs)[源代码]¶
Forward computation during test.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
A dictionary of output features.
- 返回类型
dict[str, Tensor]
- forward_train(img, idx, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
idx (Tensor) – Index corresponding to each image.
kwargs – Any keyword arguments to be used to forward.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- class mmselfsup.models.algorithms.RelativeLoc(backbone, neck=None, head=None, init_cfg=None)[源代码]¶
Relative patch location.
Implementation of Unsupervised Visual Representation Learning by Context Prediction.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward(img, patch_label=None, mode='train', **kwargs)[源代码]¶
Forward function to select mode and modify the input image shape.
- 参数
img (Tensor) – Input images, the shape depends on mode. Typically these should be mean centered and std scaled.
- forward_test(img, **kwargs)[源代码]¶
Forward computation during test.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
A dictionary of output features.
- 返回类型
dict[str, Tensor]
- forward_train(img, patch_label, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
patch_label (Tensor) – Labels for the relative patch locations.
kwargs – Any keyword arguments to be used to forward.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
- class mmselfsup.models.algorithms.RotationPred(backbone, head=None, init_cfg=None)[源代码]¶
Rotation prediction.
Implementation of Unsupervised Representation Learning by Predicting Image Rotations.
- 参数
backbone (dict) – Config dict for module of backbone.
head (dict) – Config dict for module of loss functions. Defaults to None.
- extract_feat(img)[源代码]¶
Function to extract features from backbone.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
backbone outputs.
- 返回类型
tuple[Tensor]
- forward(img, rot_label=None, mode='train', **kwargs)[源代码]¶
Forward function to select mode and modify the input image shape.
- 参数
img (Tensor) – Input images, the shape depends on mode. Typically these should be mean centered and std scaled.
- forward_test(img, **kwargs)[源代码]¶
Forward computation during test.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
- 返回
A dictionary of output features.
- 返回类型
dict[str, Tensor]
- forward_train(img, rot_label, **kwargs)[源代码]¶
Forward computation during training.
- 参数
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
rot_label (Tensor) – Labels for the rotations.
kwargs – Any keyword arguments to be used to forward.
- 返回
A dictionary of loss components.
- 返回类型
dict[str, Tensor]
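For reference, a minimal sketch of a config for this algorithm built only from components documented on this page; the backbone depth, channel number, and other values are illustrative assumptions:

from mmselfsup.models import build_algorithm

# Hypothetical RotationPred config: 4 classes for 0/90/180/270 degree rotations.
algorithm_cfg = dict(
    type='RotationPred',
    backbone=dict(
        type='ResNet',
        depth=50,
        in_channels=3,
        out_indices=[4],          # take the last stage output
        norm_cfg=dict(type='BN')),
    head=dict(
        type='ClsHead',
        with_avg_pool=True,
        in_channels=2048,
        num_classes=4))
model = build_algorithm(algorithm_cfg)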
- class mmselfsup.models.algorithms.SimCLR(backbone, neck=None, head=None, init_cfg=None)[源代码]¶
SimCLR.
Implementation of A Simple Framework for Contrastive Learning of Visual Representations.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
- class mmselfsup.models.algorithms.SimMIM(backbone: dict, neck: dict, head: dict, init_cfg: Optional[dict] = None)[源代码]¶
SimMIM.
Implementation of SimMIM: A Simple Framework for Masked Image Modeling.
- 参数
backbone (dict) – Config dict for encoder. Defaults to None.
neck (dict) – Config dict for encoder. Defaults to None.
head (dict) – Config dict for loss functions. Defaults to None.
init_cfg (dict, optional) – Config dict for weight initialization. Defaults to None.
- class mmselfsup.models.algorithms.SimSiam(backbone, neck=None, head=None, init_cfg=None, **kwargs)[源代码]¶
SimSiam.
Implementation of Exploring Simple Siamese Representation Learning. The operation of fixing learning rate of predictor is in core/hooks/simsiam_hook.py.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
- class mmselfsup.models.algorithms.SwAV(backbone, neck=None, head=None, init_cfg=None, **kwargs)[源代码]¶
SwAV.
Implementation of Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. The queue is built in core/hooks/swav_hook.py.
- 参数
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors. Defaults to None.
head (dict) – Config dict for module of loss functions. Defaults to None.
backbones¶
- class mmselfsup.models.backbones.CAEViT(arch: str = 'b', img_size: int = 224, patch_size: int = 16, out_indices: int = - 1, drop_rate: float = 0, drop_path_rate: float = 0, qkv_bias: bool = True, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', init_values: Optional[float] = None, patch_cfg: dict = {}, layer_cfgs: dict = {}, init_cfg: Optional[dict] = None)[源代码]¶
Vision Transformer for CAE pre-training.
Rewritten version of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- 参数
arch (str | dict) – Vision Transformer architecture. Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
init_values (float, optional) – The init value of gamma in TransformerEncoderLayer.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.backbones.MAEViT(arch='b', img_size=224, patch_size=16, out_indices=- 1, drop_rate=0, drop_path_rate=0, norm_cfg={'eps': 1e-06, 'type': 'LN'}, final_norm=True, output_cls_token=True, interpolate_mode='bicubic', patch_cfg={}, layer_cfgs={}, mask_ratio=0.75, init_cfg=None)[源代码]¶
Vision Transformer for MAE pre-training.
A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- 参数
arch (str | dict) – Vision Transformer architecture Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
mask_ratio (bool) – The ratio of total number of patches to be masked. Defaults to 0.75.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(x)[源代码]¶
Forward computation.
- 参数
x (tensor | tuple[tensor]) – x could be a Torch.tensor or a tuple of Torch.tensor, containing input data for forward computation.
- random_masking(x, mask_ratio=0.75)[源代码]¶
Generate the mask for MAE Pre-training.
- 参数
x (torch.tensor) – Image with data augmentation applied.
mask_ratio (float) – The mask ratio of total patches. Defaults to 0.75.
- 返回
- masked image, mask and the ids
to restore original image.
x_masked (Tensor): masked image.
mask (Tensor): mask used to mask image.
ids_restore (Tensor): ids to restore original image.
- 返回类型
tuple[Tensor, Tensor, Tensor]
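A rough usage sketch of the masking behaviour described above; the returned tuple structure follows the random_masking documentation, and the exact shapes should be treated as assumptions:

import torch
from mmselfsup.models.backbones import MAEViT

backbone = MAEViT(arch='b', patch_size=16, mask_ratio=0.75)
img = torch.randn(2, 3, 224, 224)
# The forward pass patch-embeds the images, randomly drops ~75% of the patches,
# and encodes only the visible ones.
latent, mask, ids_restore = backbone(img)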
- class mmselfsup.models.backbones.MIMVisionTransformer(arch='b', img_size=224, patch_size=16, out_indices=- 1, use_window=False, drop_rate=0, drop_path_rate=0, qkv_bias=True, norm_cfg={'eps': 1e-06, 'type': 'LN'}, final_norm=True, output_cls_token=True, interpolate_mode='bicubic', init_values=0.0, patch_cfg={}, layer_cfgs={}, finetune=True, init_cfg=None)[源代码]¶
Vision Transformer for MIM-style model (Mask Image Modeling) classification (fine-tuning or linear probe).
A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- 参数
arch (str | dict) – Vision Transformer architecture Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
finetune (bool) – Whether or not do fine-tuning. Defaults to True.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.backbones.MaskFeatViT(arch: Union[str, dict] = 'b', img_size: Union[Tuple[int, int], int] = 224, patch_size: int = 16, out_indices: int = - 1, drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', patch_cfg: dict = {}, layer_cfgs: dict = {}, init_cfg: Optional[dict] = None)[源代码]¶
Vision Transformer for MaskFeat pre-training.
A PyTorch implementation of: Masked Feature Prediction for Self-Supervised Visual Pre-Training.
- 参数
arch (str | dict) – Vision Transformer architecture. Default: ‘b’.
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.backbones.ResNeXt(depth, groups=32, width_per_group=4, **kwargs)[源代码]¶
ResNeXt backbone.
Please refer to the paper for details.
As the behavior of the forward function in MMSelfSup is different from MMCls, we register our own ResNeXt, inheriting from mmselfsup.models.backbones.ResNet.
- 参数
depth (int) – Network depth, from {50, 101, 152}.
groups (int) – Groups of conv2 in Bottleneck. Defaults to 32.
width_per_group (int) – Width per group of conv2 in Bottleneck. Defaults to 4.
in_channels (int) – Number of input image channels. Defaults to 3.
stem_channels (int) – Output channels of the stem layer. Defaults to 64.
num_stages (int) – Stages of the network. Defaults to 4.
strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned; if multiple stages are specified, a tuple of tensors will be returned. Defaults to (3, ).
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Defaults to False.
avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
conv_cfg (dict | None) – The config dict for conv layers. Defaults to None.
norm_cfg (dict) – The config dict for norm layers.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Defaults to False.
示例
>>> from mmselfsup.models import ResNeXt
>>> import torch
>>> self = ResNeXt(depth=50)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 8, 8)
(1, 512, 4, 4)
(1, 1024, 2, 2)
(1, 2048, 1, 1)
- class mmselfsup.models.backbones.ResNet(depth, in_channels=3, stem_channels=64, base_channels=64, expansion=None, num_stages=4, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), out_indices=(4), style='pytorch', deep_stem=False, avg_down=False, frozen_stages=- 1, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, norm_eval=False, with_cp=False, zero_init_residual=False, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}], drop_path_rate=0.0, **kwargs)[源代码]¶
ResNet backbone.
Please refer to the paper for details.
- 参数
depth (int) – Network depth, from {18, 34, 50, 101, 152}.
in_channels (int) – Number of input image channels. Defaults to 3.
stem_channels (int) – Output channels of the stem layer. Defaults to 64.
base_channels (int) – Middle channels of the first stage. Defaults to 64.
num_stages (int) – Stages of the network. Defaults to 4.
strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
out_indices (Sequence[int]) – Output from which stages. Defaults to (4, ).
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Defaults to False.
avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
conv_cfg (dict | None) – The config dict for conv layers. Defaults to None.
norm_cfg (dict) – The config dict for norm layers.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Defaults to False.
drop_path_rate (float) – Probability of the path to be zeroed. Defaults to 0.1.
示例
>>> from mmselfsup.models import ResNet
>>> import torch
>>> self = ResNet(depth=18)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 64, 8, 8)
(1, 128, 4, 4)
(1, 256, 2, 2)
(1, 512, 1, 1)
- class mmselfsup.models.backbones.ResNetV1d(**kwargs)[源代码]¶
ResNetV1d variant described in Bag of Tricks.
Compared with default ResNet(ResNetV1b), ResNetV1d replaces the 7x7 conv in the input stem with three 3x3 convs. And in the downsampling block, a 2x2 avg_pool with stride 2 is added before conv, whose stride is changed to 1.
- class mmselfsup.models.backbones.SimMIMSwinTransformer(arch: Union[str, dict] = 'T', img_size: Union[Tuple[int, int], int] = 224, in_channels: int = 3, drop_rate: float = 0.0, drop_path_rate: float = 0.1, out_indices: tuple = (3), use_abs_pos_embed: bool = False, with_cp: bool = False, frozen_stages: bool = - 1, norm_eval: bool = False, norm_cfg: dict = {'type': 'LN'}, stage_cfgs: Union[Sequence, dict] = {}, patch_cfg: dict = {}, init_cfg: Optional[dict] = None)[源代码]¶
Swin Transformer for SimMIM.
- 参数
arch (str | dict) – Swin Transformer architecture Defaults to ‘T’.
img_size (int | tuple) – The size of input image. Defaults to 224.
in_channels (int) – The num of input channels. Defaults to 3.
drop_rate (float) – Dropout rate after embedding. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.1.
out_indices (tuple) – Layers to be outputted. Defaults to (3, ).
use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
norm_cfg (dict) – Config dict for normalization layer at the end of the backbone. Defaults to dict(type=’LN’)
stage_cfgs (Sequence | dict) – Extra config dict for each stage. Defaults to empty dict.
patch_cfg (dict) – Extra config dict for patch embedding. Defaults to empty dict.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.
- forward(x: torch.Tensor, mask: torch.Tensor) → Sequence[torch.Tensor][源代码]¶
Generate features for masked images.
This function generates mask images and get the hidden features for them.
- 参数
x (torch.Tensor) – Input images.
mask (torch.Tensor) – Masks used to construct masked images.
- 返回
A tuple containing features from multi-stages.
- 返回类型
tuple
- class mmselfsup.models.backbones.VisionTransformer(stop_grad_conv1=False, frozen_stages=- 1, norm_eval=False, init_cfg=None, **kwargs)[源代码]¶
Vision Transformer.
A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Part of the code is modified from: https://github.com/facebookresearch/moco-v3/blob/main/vits.py.
- 参数
stop_grad_conv1 (bool, optional) – whether to stop the gradient of convolution layer in PatchEmbed. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
heads¶
- class mmselfsup.models.heads.CAEHead(tokenizer_path: str, lambd: float, init_cfg: Optional[dict] = None)[源代码]¶
Pretrain Head for CAE.
Compute the align loss and the main loss. In addition, this head also generates the prediction target generated by dalle.
- 参数
tokenizer_path (str) – The path of the tokenizer.
lambd (float) – The weight for the align loss.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(img_target: torch.Tensor, outputs: torch.Tensor, latent_pred: torch.Tensor, latent_target: torch.Tensor, mask: torch.Tensor) → dict[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.heads.ClsHead(with_avg_pool=False, in_channels=2048, num_classes=1000, vit_backbone=False, init_cfg=[{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[源代码]¶
Simplest classifier head, with only one fc layer.
- 参数
with_avg_pool (bool) – Whether to apply the average pooling after neck. Defaults to False.
in_channels (int) – Number of input channels. Defaults to 2048.
num_classes (int) – Number of classes. Defaults to 1000.
init_cfg (dict or list[dict], optional) – Initialization config dict.
- class mmselfsup.models.heads.ContrastiveHead(temperature=0.1)[源代码]¶
Head for contrastive learning.
The contrastive loss is implemented in this head and is used in SimCLR, MoCo, DenseCL, etc.
- 参数
temperature (float) – The temperature hyper-parameter that controls the concentration level of the distribution. Defaults to 0.1.
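A small sketch of how the head is typically fed. It assumes the head takes the positive-pair similarity of shape (N, 1) and the negative similarities of shape (N, K) and returns a loss dict, which is how MoCo-style algorithms use it; treat the exact interface as an assumption:
>>> from mmselfsup.models.heads import ContrastiveHead
>>> import torch
>>> head = ContrastiveHead(temperature=0.1)
>>> pos = torch.randn(8, 1)   # similarity with the positive key
>>> neg = torch.randn(8, 16)  # similarities with negative keys
>>> losses = head(pos, neg)   # assumed forward(pos, neg) interface
>>> print(losses['loss'])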
- class mmselfsup.models.heads.LatentClsHead(in_channels: int, num_classes: int, init_cfg: dict = {'layer': 'Linear', 'std': 0.01, 'type': 'Normal'})[源代码]¶
Head for latent feature classification.
- 参数
in_channels (int) – Number of input channels.
num_classes (int) – Number of classes.
init_cfg (dict or list[dict], optional) – Initialization config dict.
- class mmselfsup.models.heads.LatentCrossCorrelationHead(in_channels: int, lambd: float = 0.0051)[源代码]¶
Head for latent feature cross correlation. Part of the code is borrowed from: https://github.com/facebookresearch/barlowtwins/blob/main/main.py.
- 参数
in_channels (int) – Number of input channels.
lambd (float) – Weight on off-diagonal terms. Defaults to 0.0051.
- class mmselfsup.models.heads.LatentPredictHead(predictor: dict)[源代码]¶
Head for latent feature prediction.
This head builds a predictor, which can be any registered neck component. For example, BYOL and SimSiam call this head and build NonLinearNeck. It also implements similarity loss between two forward features.
- 参数
predictor (dict) – Config dict for the predictor.
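A construction sketch with an illustrative NonLinearNeck predictor config; the channel numbers are placeholders rather than a verified BYOL/SimSiam config, and plain BN1d is used so the sketch also builds outside a distributed run:
>>> from mmselfsup.models.heads import LatentPredictHead
>>> predictor = dict(
...     type='NonLinearNeck',
...     in_channels=256,
...     hid_channels=4096,
...     out_channels=256,
...     with_avg_pool=False,
...     norm_cfg=dict(type='BN1d'))
>>> head = LatentPredictHead(predictor=predictor)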
- class mmselfsup.models.heads.MAEFinetuneHead(embed_dim, num_classes=1000, label_smooth_val=0.1)[源代码]¶
Fine-tuning head for MAE.
- 参数
embed_dim (int) – The dim of the feature before the classifier head.
num_classes (int) – The total classes. Defaults to 1000.
label_smooth_val (float) – The degree of label smoothing. Defaults to 0.1.
- class mmselfsup.models.heads.MAELinprobeHead(embed_dim, num_classes=1000)[源代码]¶
Linear probing head for MAE.
- 参数
embed_dim (int) – The dim of the feature before the classifier head.
num_classes (int) – The total classes. Defaults to 1000.
- class mmselfsup.models.heads.MAEPretrainHead(norm_pix: bool = False, patch_size: int = 16)[源代码]¶
Pre-training head for MAE.
- 参数
norm_pix (bool) – Whether or not to normalize the target. Defaults to False.
patch_size (int) – Patch size. Defaults to 16.
- forward(x: torch.Tensor, pred: torch.Tensor, mask: torch.Tensor) → dict[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.heads.MaskFeatFinetuneHead(embed_dim: int, num_classes: int = 1000, label_smooth_val: float = 0.1)[源代码]¶
Fine-tuning head for MaskFeat.
- 参数
embed_dim (int) – The dim of the feature before the classifier head.
num_classes (int) – The total classes. Defaults to 1000.
label_smooth_val (float) – The degree of label smoothing. Defaults to 0.1.
- class mmselfsup.models.heads.MaskFeatPretrainHead(embed_dim: int = 768, hog_dim: int = 108)[源代码]¶
Pre-training head for MaskFeat.
- 参数
embed_dim (int) – The dim of the feature before the classifier head. Defaults to 768.
hog_dim (int) – The dim of the hog feature. Defaults to 108.
- forward(latent: torch.Tensor, hog: torch.Tensor, mask: torch.Tensor) → dict[源代码]¶
Forward function of the MaskFeat pre-training head.
- 参数
latent (torch.Tensor) – Input latent of shape (N, 1+L, C).
hog (torch.Tensor) – Input hog feature of shape (N, L, C).
mask (torch.Tensor) – Input mask of shape (N, H, W).
- 返回
A dictionary of loss components.
- 返回类型
Dict[str, torch.Tensor]
- loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → dict[源代码]¶
Compute the loss.
- 参数
pred (torch.Tensor) – Input prediction of shape (N, L, C).
target (torch.Tensor) – Input target of shape (N, L, C).
mask (torch.Tensor) – Input mask of shape (N, L, 1).
- 返回
A dictionary of loss components.
- 返回类型
dict[str, torch.Tensor]
- class mmselfsup.models.heads.MoCoV3Head(predictor, temperature=1.0)[源代码]¶
Head for MoCo v3 algorithms.
This head builds a predictor, which can be any registered neck component. It also implements latent contrastive loss between two forward features. Part of the code is modified from: https://github.com/facebookresearch/moco-v3/blob/main/moco/builder.py.
- 参数
predictor (dict) – Config dict for module of predictor.
temperature (float) – The temperature hyper-parameter that controls the concentration level of the distribution. Defaults to 1.0.
- class mmselfsup.models.heads.MultiClsHead(pool_type='adaptive', in_indices=(0), with_last_layer_unpool=False, backbone='resnet50', norm_cfg={'type': 'BN'}, num_classes=1000, init_cfg=[{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[源代码]¶
Multiple classifier heads.
This head inputs feature maps from different stages of backbone, average pools each feature map to around 9000 dimensions, and then appends a linear classifier at each stage to predict corresponding class scores.
- 参数
pool_type (str) – ‘adaptive’ or ‘specified’. If set to ‘adaptive’, use adaptive average pooling, otherwise use specified pooling params.
in_indices (Sequence[int]) – Input from which stages.
with_last_layer_unpool (bool) – Whether to unpool the features from last layer. Defaults to False.
backbone (str) – Specify which backbone to use. Defaults to ‘resnet50’.
norm_cfg (dict) – dictionary to construct and config norm layer.
num_classes (int) – Number of classes. Defaults to 1000.
init_cfg (dict or list[dict], optional) – Initialization config dict.
- class mmselfsup.models.heads.SimMIMHead(patch_size: int, encoder_in_channels: int)[源代码]¶
Pretrain Head for SimMIM.
- 参数
patch_size (int) – Patch size of each token.
encoder_in_channels (int) – Number of input channels for encoder.
- forward(x: torch.Tensor, x_rec: torch.Tensor, mask: torch.Tensor) → dict[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.heads.SwAVHead(feat_dim, sinkhorn_iterations=3, epsilon=0.05, temperature=0.1, crops_for_assign=[0, 1], num_crops=[2], num_prototypes=3000, init_cfg=None)[源代码]¶
The head for SwAV.
This head contains clustering and sinkhorn algorithms to compute Q codes. Part of the code is borrowed from: https://github.com/facebookresearch/swav. The queue is built in core/hooks/swav_hook.py.
- 参数
feat_dim (int) – feature dimension of the prototypes.
sinkhorn_iterations (int) – number of iterations in Sinkhorn-Knopp algorithm. Defaults to 3.
epsilon (float) – regularization parameter for Sinkhorn-Knopp algorithm. Defaults to 0.05.
temperature (float) – temperature parameter in training loss. Defaults to 0.1.
crops_for_assign (list[int]) – list of crops id used for computing assignments. Defaults to [0, 1].
num_crops (list[int]) – list of number of crops. Defaults to [2].
num_prototypes (int) – number of prototypes. Defaults to 3000.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
memories¶
- class mmselfsup.models.memories.InterCLRMemory(length, feat_dim, momentum, num_classes, min_cluster, **kwargs)[源代码]¶
Memory bank for InterCLR.
- 参数
length (int) – Number of features stored in the memory bank.
feat_dim (int) – Dimension of stored features.
momentum (float) – Momentum coefficient for updating features.
num_classes (int) – Number of clusters.
min_cluster (int) – Minimal cluster size.
- class mmselfsup.models.memories.ODCMemory(length, feat_dim, momentum, num_classes, min_cluster, **kwargs)[源代码]¶
Memory module for ODC.
This module includes the samples memory and the centroids memory in ODC. The samples memory stores features and pseudo-labels of all samples in the dataset; while the centroids memory stores features of cluster centroids.
- 参数
length (int) – Number of features stored in samples memory.
feat_dim (int) – Dimension of stored features.
momentum (float) – Momentum coefficient for updating features.
num_classes (int) – Number of clusters.
min_cluster (int) – Minimal cluster size.
- class mmselfsup.models.memories.SimpleMemory(length, feat_dim, momentum, **kwargs)[源代码]¶
Simple memory bank (e.g., for NPID).
This module includes the memory bank that stores running average features of all samples in the dataset.
- 参数
length (int) – Number of features stored in the memory bank.
feat_dim (int) – Dimension of stored features.
momentum (float) – Momentum coefficient for updating features.
necks¶
- class mmselfsup.models.necks.CAENeck(patch_size: int = 16, num_classes: int = 8192, embed_dims: int = 768, regressor_depth: int = 6, decoder_depth: int = 8, num_heads: int = 12, mlp_ratio: int = 4, qkv_bias: bool = True, qk_scale: Optional[float] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_values: Optional[float] = None, mask_tokens_num: int = 75, init_cfg: Optional[dict] = None)[源代码]¶
Neck for CAE Pre-training.
This module constructs the latent prediction regressor and the decoder for the latent prediction and final prediction.
- 参数
patch_size (int) – The patch size of each token. Defaults to 16.
num_classes (int) – The number of classes for final prediction. Defaults to 8192.
embed_dims (int) – The embed dims of latent feature in regressor and decoder. Defaults to 768.
regressor_depth (int) – The number of regressor blocks. Defaults to 6.
decoder_depth (int) – The number of decoder blocks. Defaults to 8.
num_heads (int) – The number of head in multi-head attention. Defaults to 12.
mlp_ratio (int) – The expand ratio of latent features in MLP. Defaults to 4.
qkv_bias (bool) – Whether or not to use qkv bias. Defaults to True.
qk_scale (float, optional) – The scale applied to the results of qk. Defaults to None.
drop_rate (float) – The dropout rate. Defaults to 0.
attn_drop_rate (float) – The dropout rate in attention block. Defaults to 0.
drop_path_rate (float) – The stochastic depth rate. Defaults to 0.
norm_cfg (dict) – The config of normalization layer. Defaults to dict(type=’LN’, eps=1e-6).
init_values (float, optional) – The init value of gamma. Defaults to None.
mask_tokens_num (int) – The number of mask tokens. Defaults to 75.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(x_unmasked: torch.Tensor, pos_embed_masked: torch.Tensor, pos_embed_unmasked: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][源代码]¶
Get the latent prediction and final prediction.
- 参数
x_unmasked (torch.Tensor) – Features of unmasked tokens.
pos_embed_masked (torch.Tensor) – Position embedding of masked tokens.
pos_embed_unmasked (torch.Tensor) – Position embedding of unmasked tokens.
- 返回
Final prediction and latent prediction.
- 返回类型
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.necks.DenseCLNeck(in_channels, hid_channels, out_channels, num_grid=None, init_cfg=None)[源代码]¶
The non-linear neck of DenseCL.
Single and dense neck in parallel: fc-relu-fc, conv-relu-conv. Borrowed from the authors’ code: https://github.com/WXinlong/DenseCL.
- 参数
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
num_grid (int) – The grid size of dense features. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.necks.LinearNeck(in_channels, out_channels, with_avg_pool=True, init_cfg=None)[源代码]¶
The linear neck: fc only.
- 参数
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
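A minimal sketch; like the other necks, LinearNeck is assumed to take a tuple of backbone feature maps and return a list with one flattened embedding:
>>> from mmselfsup.models.necks import LinearNeck
>>> import torch
>>> neck = LinearNeck(in_channels=512, out_channels=128, with_avg_pool=True)
>>> feats = (torch.rand(2, 512, 7, 7), )  # e.g. the last stage of a small ResNet
>>> out = neck(feats)
>>> print(tuple(out[0].shape))  # expected (2, 128) under the assumptions above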
- class mmselfsup.models.necks.MAEPretrainDecoder(num_patches=196, patch_size=16, in_chans=3, embed_dim=1024, decoder_embed_dim=512, decoder_depth=8, decoder_num_heads=16, mlp_ratio=4.0, norm_cfg={'eps': 1e-06, 'type': 'LN'})[源代码]¶
Decoder for MAE Pre-training.
- 参数
num_patches (int) – The number of total patches. Defaults to 196.
patch_size (int) – Image patch size. Defaults to 16.
in_chans (int) – The channel of input image. Defaults to 3.
embed_dim (int) – Encoder’s embedding dimension. Defaults to 1024.
decoder_embed_dim (int) – Decoder’s embedding dimension. Defaults to 512.
decoder_depth (int) – The depth of decoder. Defaults to 8.
decoder_num_heads (int) – Number of attention heads of decoder. Defaults to 16.
mlp_ratio (int) – Ratio of mlp hidden dim to decoder’s embedding dim. Defaults to 4.
norm_cfg (dict) – Normalization layer. Defaults to LayerNorm.
Some of the code is borrowed from https://github.com/facebookresearch/mae.
示例
>>> from mmselfsup.models import MAEPretrainDecoder
>>> import torch
>>> self = MAEPretrainDecoder()
>>> self.eval()
>>> inputs = torch.rand(1, 50, 1024)
>>> ids_restore = torch.arange(0, 196).unsqueeze(0)
>>> level_outputs = self.forward(inputs, ids_restore)
>>> print(tuple(level_outputs.shape))
(1, 196, 768)
- forward(x, ids_restore)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.necks.MoCoV2Neck(in_channels, hid_channels, out_channels, with_avg_pool=True, init_cfg=None)[源代码]¶
The non-linear neck of MoCo v2: fc-relu-fc.
- 参数
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.necks.NonLinearNeck(in_channels, hid_channels, out_channels, num_layers=2, with_bias=False, with_last_bn=True, with_last_bn_affine=True, with_last_bias=False, with_avg_pool=True, vit_backbone=False, norm_cfg={'type': 'SyncBN'}, init_cfg=[{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[源代码]¶
The non-linear neck.
Structure: fc-bn-[relu-fc-bn] where the substructure in [] can be repeated. For the default setting, the substructure is repeated once. The neck can be used in many algorithms, e.g., SimCLR, BYOL, SimSiam.
- 参数
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
num_layers (int) – Number of fc layers. Defaults to 2.
with_bias (bool) – Whether to use bias in fc layers (except for the last). Defaults to False.
with_last_bn (bool) – Whether to add the last BN layer. Defaults to True.
with_last_bn_affine (bool) – Whether to have learnable affine parameters in the last BN layer (set False for SimSiam). Defaults to True.
with_last_bias (bool) – Whether to use bias in the last fc layer. Defaults to False.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type=’SyncBN’).
init_cfg (dict or list[dict], optional) – Initialization config dict.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.necks.ODCNeck(in_channels, hid_channels, out_channels, with_avg_pool=True, norm_cfg={'type': 'SyncBN'}, init_cfg=[{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[源代码]¶
The non-linear neck of ODC: fc-bn-relu-dropout-fc-relu.
- 参数
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type=’SyncBN’).
init_cfg (dict or list[dict], optional) – Initialization config dict.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.necks.RelativeLocNeck(in_channels, out_channels, with_avg_pool=True, norm_cfg={'type': 'BN1d'}, init_cfg=[{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[源代码]¶
The neck of relative patch location: fc-bn-relu-dropout.
- 参数
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type=’BN1d’).
init_cfg (dict or list[dict], optional) – Initialization config dict.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.necks.SimMIMNeck(in_channels: int, encoder_stride: int)[源代码]¶
Pre-train Neck For SimMIM.
This neck reconstructs the original image from the shrunk feature map.
- 参数
in_channels (int) – Channel dimension of the feature map.
encoder_stride (int) – The total stride of the encoder.
- forward(x: torch.Tensor) → torch.Tensor[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.necks.SwAVNeck(in_channels, hid_channels, out_channels, with_avg_pool=True, with_l2norm=True, norm_cfg={'type': 'SyncBN'}, init_cfg=[{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[源代码]¶
The non-linear neck of SwAV: fc-bn-relu-fc-normalization.
- 参数
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
with_l2norm (bool) – whether to normalize the output after projection. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type=’SyncBN’).
init_cfg (dict or list[dict], optional) – Initialization config dict.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
utils¶
- class mmselfsup.models.utils.Accuracy(topk=(1))[源代码]¶
Implementation of accuracy computation.
- forward(pred, target)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.utils.CAETransformerRegressorLayer(embed_dims: int, num_heads: int, feedforward_channels: int, num_fcs: int = 2, qkv_bias: bool = False, qk_scale: Optional[float] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, init_values: float = 0.0, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'})[源代码]¶
Transformer layer for the regressor of CAE.
This module is different from conventional transformer encoder layer, for its queries are the masked tokens, but its keys and values are the concatenation of the masked and unmasked tokens.
- 参数
embed_dims (int) – The feature dimension.
num_heads (int) – The number of heads in multi-head attention.
feedforward_channels (int) – The hidden dimension of FFNs.
num_fcs (int, optional) – The number of fully-connected layers in FFNs. Default: 2.
qkv_bias (bool) – If True, add a learnable bias to q, k, v. Defaults to False.
qk_scale (float, optional) – Override the default qk scale of head_dim ** -0.5 if set. Defaults to None.
drop_rate (float) – The dropout rate. Defaults to 0.0.
attn_drop_rate (float) – The drop out rate for attention output weights. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
init_values (float) – The init values of gamma. Defaults to 0.0.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
- forward(x_q: torch.Tensor, x_kv: torch.Tensor, pos_q: torch.Tensor, pos_k: torch.Tensor) → torch.Tensor[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.utils.Encoder(n_hid: int = 256, n_blk_per_group: int = 2, input_channels: int = 3, vocab_size: int = 8192, device: torch.device = device(type='cpu'), requires_grad: bool = False, use_mixed_precision: bool = True)[源代码]¶
- forward(x: torch.Tensor) → torch.Tensor[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.utils.ExtractProcess[源代码]¶
Global average-pooled feature extraction process.
This process extracts the global average-pooled features from the last layer of resnet backbone.
- class mmselfsup.models.utils.GatherLayer(*args, **kwargs)[源代码]¶
Gather tensors from all process, supporting backward propagation.
- static backward(ctx, *grads)[源代码]¶
Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as forward() returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.
- static forward(ctx, input)[源代码]¶
Performs the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used in jvp.
- class mmselfsup.models.utils.MultiExtractProcess(pool_type='specified', backbone='resnet50', layer_indices=(0, 1, 2, 3, 4))[源代码]¶
Multi-stage intermediate feature extraction process for extract.py and tsne_visualization.py in tools.
This process extracts feature maps from different stages of backbone, and average pools each feature map to around 9000 dimensions.
- 参数
pool_type (str) – Pooling type in MultiPooling. Options are “adaptive” and “specified”. Defaults to “specified”.
backbone (str) – Backbone type; currently only “resnet50” is supported. Defaults to “resnet50”.
layer_indices (Sequence[int]) – Output from which stages. 0 for stem, 1, 2, 3, 4 for res layers. Defaults to (0, 1, 2, 3, 4).
- class mmselfsup.models.utils.MultiPooling(pool_type='adaptive', in_indices=(0), backbone='resnet50')[源代码]¶
Pooling layers for features from multiple depth.
- 参数
pool_type (str) – Pooling type for the feature map. Options are ‘adaptive’ and ‘specified’. Defaults to ‘adaptive’.
in_indices (Sequence[int]) – Output from which backbone stages. Defaults to (0, ).
backbone (str) – The selected backbone. Defaults to ‘resnet50’.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
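A sketch using the adaptive mode, assuming the input is the list of five ResNet-50 stage features (stem plus four res layers); adaptive pooling makes the exact spatial sizes uncritical:
>>> from mmselfsup.models.utils import MultiPooling
>>> import torch
>>> pooling = MultiPooling(pool_type='adaptive', in_indices=(0, 1, 2, 3, 4), backbone='resnet50')
>>> channels = (64, 256, 512, 1024, 2048)   # assumed ResNet-50 stage channels
>>> sizes = (56, 56, 28, 14, 7)             # assumed stage resolutions for a 224x224 input
>>> feats = [torch.rand(1, c, s, s) for c, s in zip(channels, sizes)]
>>> outs = pooling(feats)
>>> for out in outs:
...     print(tuple(out.shape))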
- class mmselfsup.models.utils.MultiPrototypes(output_dim, num_prototypes)[源代码]¶
Multi-prototypes for SwAV head.
- 参数
output_dim (int) – The output dim from SwAV neck.
num_prototypes (list[int]) – The number of prototypes needed.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
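A minimal sketch; the module is assumed to return one score tensor per prototype set:
>>> from mmselfsup.models.utils import MultiPrototypes
>>> import torch
>>> prototypes = MultiPrototypes(output_dim=128, num_prototypes=[3000, 3000])
>>> x = torch.rand(4, 128)  # projected features from the SwAV neck
>>> scores = prototypes(x)
>>> print(len(scores), tuple(scores[0].shape))  # expected: 2 (4, 3000) under the assumption above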
- class mmselfsup.models.utils.MultiheadAttention(embed_dims: int, num_heads: int, input_dims: Optional[int] = None, attn_drop: float = 0.0, proj_drop: float = 0.0, qkv_bias: bool = True, qk_scale: Optional[float] = None, proj_bias: bool = True, init_cfg: Optional[dict] = None)[源代码]¶
Multi-head Attention Module.
This module rewrites the MultiheadAttention by replacing the qkv bias with a customized qkv bias, in addition to removing the drop path layer.
- 参数
embed_dims (int) – The embedding dimension.
num_heads (int) – Parallel attention heads.
input_dims (int, optional) – The input dimension; if None, embed_dims is used. Defaults to None.
attn_drop (float) – Dropout rate of the dropout layer after the attention calculation of query and key. Defaults to 0.
proj_drop (float) – Dropout rate of the dropout layer after the output projection. Defaults to 0.
dropout_layer (dict) – The dropout config before adding the shortcut. Defaults to dict(type='Dropout', drop_prob=0.).
qkv_bias (bool) – If True, add a learnable bias to q, k, v. Defaults to True.
qk_scale (float, optional) – Override the default qk scale of head_dim ** -0.5 if set. Defaults to None.
proj_bias (bool) – Whether to add a learnable bias to the output projection. Defaults to True.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.
- forward(x: torch.Tensor) → torch.Tensor[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.utils.Sobel[源代码]¶
Sobel layer.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.utils.TransformerEncoderLayer(embed_dims: int, num_heads: int, feedforward_channels: int, window_size: Optional[int] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, num_fcs: int = 2, qkv_bias: bool = True, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'type': 'LN'}, init_values: float = 0.0, init_cfg: Optional[dict] = None)[源代码]¶
Implements one encoder layer in Vision Transformer.
This module is a rewritten version of the TransformerEncoderLayer in MMClassification, adding gamma and relative position bias to the Attention module.
- 参数
embed_dims (int) – The feature dimension.
num_heads (int) – Parallel attention heads.
feedforward_channels (int) – The hidden dimension for FFNs.
drop_rate (float) – Probability of an element to be zeroed after the feed forward layer. Defaults to 0.
attn_drop_rate (float) – The drop out rate for attention output weights. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
num_fcs (int) – The number of fully-connected layers for FFNs. Defaults to 2.
qkv_bias (bool) – enable bias for qkv if True. Defaults to True.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(x: torch.Tensor) → torch.Tensor[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- mmselfsup.models.utils.accuracy(pred, target, topk=1)[源代码]¶
Compute accuracy of predictions.
- 参数
pred (Tensor) – The output of the model.
target (Tensor) – The labels of data.
topk (int | list[int]) – Top-k metric selection. Defaults to 1.
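A small sketch; the function is assumed to return the top-k accuracy in percent (a single value when topk is an int):
>>> import torch
>>> from mmselfsup.models.utils import accuracy
>>> pred = torch.tensor([[0.1, 0.9], [0.8, 0.2]])  # class scores for two samples
>>> target = torch.tensor([1, 0])                  # both predictions are correct
>>> accuracy(pred, target, topk=1)                 # expected: 100.0 (as a tensor)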
- mmselfsup.models.utils.build_2d_sincos_position_embedding(patches_resolution, embed_dims, temperature=10000.0, cls_token=False)[源代码]¶
This function builds the position embedding so that the model can obtain the position information of the image patches.
- mmselfsup.models.utils.knn_classifier(train_features, train_labels, test_features, test_labels, k, T, num_classes=1000)[源代码]¶
Compute accuracy of knn classifier predictions.
- 参数
train_features (Tensor) – Extracted features in the training set.
train_labels (Tensor) – Labels in the training set.
test_features (Tensor) – Extracted features in the testing set.
test_labels (Tensor) – Labels in the testing set.
k (int) – Number of NN to use.
T (float) – Temperature used in the voting coefficient.
num_classes (int) – Number of classes. Defaults to 1000.
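A toy sketch with random features. In practice the features are usually L2-normalized embeddings extracted from the backbone, and the return value is assumed to be the top-1/top-5 accuracies; the sizes below are illustrative:
>>> import torch
>>> import torch.nn.functional as F
>>> from mmselfsup.models.utils import knn_classifier
>>> train_feats = F.normalize(torch.randn(1000, 64), dim=1)
>>> train_labels = torch.randint(0, 10, (1000, ))
>>> test_feats = F.normalize(torch.randn(200, 64), dim=1)
>>> test_labels = torch.randint(0, 10, (200, ))
>>> top1, top5 = knn_classifier(train_feats, train_labels, test_feats,
...                             test_labels, k=20, T=0.07, num_classes=10)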
mmselfsup.utils¶
- class mmselfsup.utils.AliasMethod(probs)[源代码]¶
The alias method for sampling.
- 参数
probs (Tensor) – Sampling probabilities.
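A small sketch; the sampler is assumed to expose a draw(N) method that returns N indices sampled according to probs (this is how NPID uses it, but treat the method name as an assumption):
>>> import torch
>>> from mmselfsup.utils import AliasMethod
>>> probs = torch.ones(100)   # uniform sampling over 100 entries
>>> sampler = AliasMethod(probs)
>>> idx = sampler.draw(16)    # assumed interface: 16 sampled indices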
- class mmselfsup.utils.Extractor(dataset, samples_per_gpu, workers_per_gpu, dist_mode=False, persistent_workers=True, **kwargs)[源代码]¶
Feature extractor.
- 参数
dataset (Dataset | dict) – A PyTorch dataset or dict that indicates the dataset.
samples_per_gpu (int) – Number of images on each GPU, i.e., batch size of each GPU.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.
dist_mode (bool) – Use distributed extraction or not. Defaults to False.
persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once, which keeps the worker Dataset instances alive. This argument only takes effect with PyTorch>=1.7.0. Defaults to True.
- mmselfsup.utils.batch_shuffle_ddp(x)[源代码]¶
Batch shuffle, for making use of BatchNorm.
Only supports DistributedDataParallel (DDP) models.
- mmselfsup.utils.batch_unshuffle_ddp(x, idx_unshuffle)[源代码]¶
Undo batch shuffle.
Only supports DistributedDataParallel (DDP) models.
- mmselfsup.utils.concat_all_gather(tensor)[源代码]¶
Performs all_gather operation on the provided tensors.
Warning: torch.distributed.all_gather has no gradient.
- mmselfsup.utils.dist_forward_collect(func, data_loader, rank, length, ret_rank=- 1)[源代码]¶
Forward and collect network outputs in a distributed manner.
This function performs forward propagation and collects outputs. It can be used to collect results, features, losses, etc.
- 参数
func (function) – The function to process data. The output must be a dictionary of CPU tensors.
data_loader (Dataloader) – the torch Dataloader to yield data.
rank (int) – This process id.
length (int) – Expected length of output arrays.
ret_rank (int) – The process that returns. Other processes will return None.
- 返回
The concatenated outputs.
- 返回类型
results_all (dict(np.ndarray))
- mmselfsup.utils.distributed_sinkhorn(out, sinkhorn_iterations, world_size, epsilon)[源代码]¶
Apply the distributed sinkhorn optimization on the scores matrix to find the assignments.
- mmselfsup.utils.find_latest_checkpoint(path, suffix='pth')[源代码]¶
Find the latest checkpoint from the working directory.
- 参数
path (str) – The path to find checkpoints.
suffix (str) – File extension. Defaults to pth.
- 返回
File path of the latest checkpoint.
- 返回类型
latest_path(str | None)
引用
- 1
https://github.com/microsoft/SoftTeacher/blob/main/ssod/utils/patch.py
- 2
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/utils/misc.py#L7
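A minimal sketch with a hypothetical work directory:
>>> from mmselfsup.utils import find_latest_checkpoint
>>> latest = find_latest_checkpoint('./work_dirs/selfsup_example')  # hypothetical path
>>> print(latest)  # e.g. './work_dirs/selfsup_example/epoch_100.pth', or None if nothing is found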
- mmselfsup.utils.gather_tensors_batch(input_array, part_size=100, ret_rank=- 1)[源代码]¶
Batch-wise gathering to avoid CUDA out of memory.
- mmselfsup.utils.get_root_logger(log_file=None, log_level=20)[源代码]¶
Get root logger.
- 参数
log_file (str, optional) – File path of log. Defaults to None.
log_level (int, optional) – The level of logger. Defaults to logging.INFO.
- 返回
The obtained logger.
- 返回类型
logging.Logger
- mmselfsup.utils.nondist_forward_collect(func, data_loader, length)[源代码]¶
Forward and collect network outputs.
This function performs forward propagation and collects outputs. It can be used to collect results, features, losses, etc.
- 参数
func (function) – The function to process data. The output must be a dictionary of CPU tensors.
data_loader (Dataloader) – the torch Dataloader to yield data.
length (int) – Expected length of output arrays.
- 返回
The concatenated outputs.
- 返回类型
results_all (dict(np.ndarray))
- mmselfsup.utils.sync_random_seed(seed=None, device='cuda')[源代码]¶
Make sure different ranks share the same seed. All workers must call this function, otherwise it will deadlock. This method is generally used in DistributedSampler, because the seed should be identical across all processes in the distributed group.
In distributed sampling, different ranks should sample non-overlapped data in the dataset. Therefore, this function is used to make sure that each rank shuffles the data indices in the same order based on the same seed. Then different ranks could use different indices to select non-overlapped data from the same data list.
- 参数
seed (int, optional) – The seed to be synchronized. Defaults to None.
device (str) – The device where the seed will be put on. Defaults to ‘cuda’.
- 返回
Seed to be used.
- 返回类型
int
引用
- 1
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/core/utils/dist_utils.py