mmselfsup.datasets¶
datasets¶
- class mmselfsup.datasets.DeepClusterImageNet(ann_file: str = '', metainfo: Optional[dict] = None, data_root: str = '', data_prefix: Union[str, dict] = '', **kwargs)[source]¶
ImageNet Dataset.
The dataset inherits the ImageNet dataset from MMClassification, because the DeepCluster and Online Deep Clustering algorithms need to initialize clustering labels and assign them during training.
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (str | dict) – Prefix for training data. Defaults to ''.
**kwargs – Other keyword arguments in CustomDataset and BaseDataset.
- class mmselfsup.datasets.ImageList(ann_file: str, metainfo: Optional[dict] = None, data_root: str = '', data_prefix: Union[str, dict] = '', **kwargs)[source]¶
The dataset implementation for loading any image list file.
The ImageList can load an annotation file or a list of files and merge all data records into one list. If the data is unlabeled, gt_label will be set to -1.
An annotation file should be provided, and each line indicates a sample.
The sample files:
    data_prefix/
    ├── folder_1
    │   ├── xxx.png
    │   ├── xxy.png
    │   └── ...
    └── folder_2
        ├── 123.png
        ├── nsdf3.png
        └── ...
1. If data is labeled, the annotation file is as follows (the first column is the image path and the second column is the index of the category):
    folder_1/xxx.png 0
    folder_1/xxy.png 1
    folder_2/123.png 5
    folder_2/nsdf3.png 3
    ...
2. If data is unlabeled, the annotation file is:
    folder_1/xxx.png
    folder_1/xxy.png
    folder_2/123.png
    folder_2/nsdf3.png
    ...
- Parameters
ann_file (str) – Annotation file path.
metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (str | dict) – Prefix for training data. Defaults to ''.
**kwargs – Other keyword arguments in CustomDataset and BaseDataset.
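The two annotation layouts described for ImageList can be parsed with a few lines of plain Python. This is a minimal sketch and not the ImageList implementation: `parse_annotation_lines` is a hypothetical helper, and it assumes whitespace-separated columns and image paths without spaces.

```python
from typing import List, Tuple


def parse_annotation_lines(lines: List[str]) -> List[Tuple[str, int]]:
    """Turn annotation lines into (image_path, gt_label) records.

    Hypothetical helper: lines with a single column are treated as
    unlabeled and receive gt_label -1.
    """
    records = []
    for line in lines:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        path = parts[0]
        label = int(parts[1]) if len(parts) > 1 else -1
        records.append((path, label))
    return records


labeled = parse_annotation_lines(['folder_1/xxx.png 0', 'folder_2/123.png 5'])
unlabeled = parse_annotation_lines(['folder_1/xxx.png', 'folder_2/123.png'])
```

Both calls return one merged list of records, which mirrors the "merge all data records into one list" behavior described above.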
- class mmselfsup.datasets.Places205(ann_file: str = '', metainfo: Optional[dict] = None, data_root: str = '', data_prefix: Union[str, dict] = '', **kwargs)[source]¶
Places205 Dataset.
The dataset supports two kinds of annotation format. More details can be found in CustomDataset.
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (str | dict) – Prefix for training data. Defaults to ''.
**kwargs – Other keyword arguments in CustomDataset and BaseDataset.
transforms¶
- class mmselfsup.datasets.transforms.BEiTMaskGenerator(input_size: int, num_masking_patches: int, min_num_patches: int = 4, max_num_patches: Optional[int] = None, min_aspect: float = 0.3, max_aspect: Optional[float] = None)[source]¶
Generate mask for image.
Added Keys:
mask
This module is borrowed from https://github.com/microsoft/unilm/tree/master/beit
- Parameters
input_size (int) – The size of input image.
num_masking_patches (int) – The number of patches to be masked.
min_num_patches (int) – The minimum number of patches to be masked in the process of generating mask. Defaults to 4.
max_num_patches (int, optional) – The maximum number of patches to be masked in the process of generating mask. Defaults to None.
min_aspect (float, optional) – The minimum aspect ratio of mask blocks. Defaults to 0.3.
max_aspect (float, optional) – The maximum aspect ratio of mask blocks. Defaults to None.
- class mmselfsup.datasets.transforms.ColorJitter(brightness: Union[float, List[float]] = 0, contrast: Union[float, List[float]] = 0, saturation: Union[float, List[float]] = 0, hue: Union[float, List[float]] = 0, backend: str = 'pillow')[source]¶
Randomly change the brightness, contrast, saturation and hue of an image.
Modified from https://github.com/pytorch/vision/blob/main/torchvision/transforms/transforms.py
Required Keys:
img
Modified Keys:
img
- Parameters
brightness (float or tuple of float (min, max)) – How much to jitter brightness. brightness_factor is chosen uniformly from [max(0, 1 - brightness), 1 + brightness] or the given [min, max]. Should be non-negative numbers.
contrast (float or tuple of float (min, max)) – How much to jitter contrast. contrast_factor is chosen uniformly from [max(0, 1 - contrast), 1 + contrast] or the given [min, max]. Should be non-negative numbers.
saturation (float or tuple of float (min, max)) – How much to jitter saturation. saturation_factor is chosen uniformly from [max(0, 1 - saturation), 1 + saturation] or the given [min, max]. Should be non-negative numbers.
hue (float or tuple of float (min, max)) – How much to jitter hue. hue_factor is chosen uniformly from [-hue, hue] or the given [min, max]. Should have 0 <= hue <= 0.5 or -0.5 <= min <= max <= 0.5. To jitter hue, the pixel values of the input image have to be non-negative for conversion to HSV space; thus it does not work if you normalize your image to an interval with negative values, or use an interpolation that generates negative values before using this function.
backend (str) – The type of image processing backend. Options are cv2, pillow. Defaults to pillow.
- static get_params(brightness: Optional[List[float]], contrast: Optional[List[float]], saturation: Optional[List[float]], hue: Optional[List[float]]) → Tuple[numpy.ndarray, Optional[float], Optional[float], Optional[float], Optional[float]][source]¶
Get the parameters for the randomized transform to be applied on image.
- Parameters
brightness (tuple of float (min, max), optional) – The range from which the brightness_factor is chosen uniformly. Pass None to turn off the transformation.
contrast (tuple of float (min, max), optional) – The range from which the contrast_factor is chosen uniformly. Pass None to turn off the transformation.
saturation (tuple of float (min, max), optional) – The range from which the saturation_factor is chosen uniformly. Pass None to turn off the transformation.
hue (tuple of float (min, max), optional) – The range from which the hue_factor is chosen uniformly. Pass None to turn off the transformation.
- Returns
The parameters used to apply the randomized transform along with their random order.
- Return type
tuple
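The sampling described for ColorJitter can be illustrated without any image library. The sketch below is an assumption-laden illustration, not the library code: `jitter_range` and `sample_params` are hypothetical names, and Python's `random` module stands in for whatever random source the transform actually uses.

```python
import random
from typing import Optional, Tuple


def jitter_range(value: float, center: float = 1.0) -> Optional[Tuple[float, float]]:
    """Expand a scalar strength into [max(0, center - value), center + value]."""
    if value < 0:
        raise ValueError('jitter strength must be non-negative')
    if value == 0:
        return None  # None turns the corresponding transformation off
    return (max(0.0, center - value), center + value)


def sample_params(brightness, contrast, saturation, hue, rng):
    """Sample one factor per enabled range, plus a random application order."""
    order = list(range(4))
    rng.shuffle(order)
    factors = [
        None if r is None else rng.uniform(r[0], r[1])
        for r in (brightness, contrast, saturation, hue)
    ]
    return (order, *factors)


rng = random.Random(0)
params = sample_params(
    jitter_range(0.4),  # brightness factor drawn from [0.6, 1.4]
    jitter_range(0.4),  # contrast factor drawn from [0.6, 1.4]
    None,               # saturation jitter disabled
    (-0.1, 0.1),        # hue range given directly as [min, max]
    rng,
)
```

The first element of `params` is the shuffled order of the four operations; the remaining four are the sampled factors, with `None` for any disabled transformation.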
- class mmselfsup.datasets.transforms.MAERandomResizedCrop(size, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=<InterpolationMode.BILINEAR: 'bilinear'>, antialias: Optional[bool] = None)[source]¶
RandomResizedCrop for matching TF/TPU implementation: no for-loop is used.
This may lead to results different from torchvision’s version. It follows BYOL’s TF code: https://github.com/deepmind/deepmind-research/blob/master/byol/utils/dataset.py#L206
- forward(results: dict) → dict[source]¶
The forward function of MAERandomResizedCrop.
- Parameters
results (dict) – The results dict containing the image and all information related to the image.
- Returns
The results dict containing the cropped image and all information related to the image.
- Return type
dict
- static get_params(img: PIL.Image.Image, scale: tuple, ratio: tuple) → Tuple[source]¶
Get parameters for crop for a random sized crop.
- Parameters
img (PIL Image or Tensor) – Input image.
scale (list) – Range of the crop size relative to the original image size.
ratio (list) – Range of the aspect ratio of the crop relative to the original aspect ratio.
- Returns
params (i, j, h, w) to be passed to crop for a random sized crop.
- Return type
tuple
- class mmselfsup.datasets.transforms.MultiView(transforms: List[List[Union[dict, Callable[[dict], dict]]]], num_views: Union[int, List[int]])[source]¶
A transform wrapper for multiple views of an image.
- Parameters
transforms (list[dict | callable], optional) – Sequence of transform object or config dict to be wrapped.
mapping (dict) – A dict that defines the input key mapping. The keys correspond to the inner keys (i.e., kwargs of the transform method), and should be of string type. The values correspond to the outer keys (i.e., the keys of the data/results), and should have a type of string, list or dict. None means not applying input mapping. Default: None.
allow_nonexist_keys (bool) – If False, the outer keys in the mapping must exist in the input data, or an exception will be raised. Default: False.
Examples
>>> # Example 1: MultiViews 1 pipeline with 2 views
>>> pipeline = [
>>>     dict(type='MultiView',
>>>          num_views=2,
>>>          transforms=[
>>>              [
>>>                  dict(type='Resize', scale=224)],
>>>          ])
>>> ]
>>> # Example 2: MultiViews 2 pipelines, the first with 2 views,
>>> # the second with 6 views
>>> pipeline = [
>>>     dict(type='MultiView',
>>>          num_views=[2, 6],
>>>          transforms=[
>>>              [
>>>                  dict(type='Resize', scale=224)],
>>>              [
>>>                  dict(type='Resize', scale=224),
>>>                  dict(type='RandomSolarize')],
>>>          ])
>>> ]
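How num_views pairs up with the transforms list can be sketched in plain Python. This is an illustration under assumptions, not the MultiView source: `multi_view`, `double` and `add_one` are hypothetical names, transforms are modeled as plain callables on a results dict, and storing the views as a list under the `img` key is an assumption.

```python
import copy
from typing import Callable, Dict, List, Union


def multi_view(results: Dict,
               pipelines: List[List[Callable[[Dict], Dict]]],
               num_views: Union[int, List[int]]) -> Dict:
    """Apply pipeline i to a fresh copy of the input num_views[i] times."""
    if isinstance(num_views, int):
        num_views = [num_views]
    assert len(num_views) == len(pipelines)
    views = []
    for pipeline, n in zip(pipelines, num_views):
        for _ in range(n):
            out = copy.deepcopy(results)  # every view starts from the same input
            for transform in pipeline:
                out = transform(out)
            views.append(out['img'])
    packed = dict(results)
    packed['img'] = views  # assumption: views are collected under 'img'
    return packed


def double(results: Dict) -> Dict:
    """Stand-in transform: multiply the 'image' by two."""
    results['img'] = results['img'] * 2
    return results


def add_one(results: Dict) -> Dict:
    """Stand-in transform: add one to the 'image'."""
    results['img'] = results['img'] + 1
    return results


out = multi_view({'img': 3}, [[double], [double, add_one]], num_views=[2, 3])
```

With `num_views=[2, 3]`, the first pipeline contributes two views and the second contributes three, in that order.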
- class mmselfsup.datasets.transforms.PackSelfSupInputs(key: str = 'img', algorithm_keys: List[str] = [], pseudo_label_keys: List[str] = [], meta_keys: List[str] = [])[source]¶
Pack data into the format compatible with the inputs of algorithm.
Required Keys:
img
Added Keys:
data_samples
inputs
- Parameters
key (str) – The key of image inputted into the model. Defaults to ‘img’.
algorithm_keys (List[str]) – Keys of elements related to algorithms, e.g. mask. Defaults to [].
pseudo_label_keys (List[str]) – Keys set to be the attributes of pseudo_label. Defaults to [].
meta_keys (List[str]) – The keys of meta info of an image. Defaults to [].
- classmethod set_algorithm_keys(data_sample: mmselfsup.structures.selfsup_data_sample.SelfSupDataSample, key: str, results: dict) → None[source]¶
Set the algorithm keys of SelfSupDataSample.
- Parameters
data_sample (SelfSupDataSample) – An instance of SelfSupDataSample.
key (str) – The key, which may be used by the algorithm, such as gt_label, sample_idx, mask, pred_label. For more keys, please refer to the attribute of SelfSupDataSample.
results (dict) – The results from the data pipeline.
- transform(results: Dict) → Dict[torch.Tensor, mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶
Method to pack the data.
- Parameters
results (Dict) – Result dict from the data pipeline.
- Returns
inputs (List[torch.Tensor]): The forward data of models.
data_samples (SelfSupDataSample): The annotation info of the forward data.
- Return type
Dict
- class mmselfsup.datasets.transforms.RandomCrop(size: Union[int, Sequence[int]], padding: Optional[Union[int, Sequence[int]]] = None, pad_if_needed: bool = False, pad_val: Union[numbers.Number, Sequence[numbers.Number]] = 0, padding_mode: str = 'constant')[source]¶
Crop the given Image at a random location.
Required Keys:
img
Modified Keys:
img
img_shape
- Parameters
size (int or Sequence) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop (size, size) is made.
padding (int or Sequence, optional) – Optional padding on each border of the image. If a sequence of length 4 is provided, it is used to pad left, top, right, bottom borders respectively. If a sequence of length 2 is provided, it is used to pad left/right, top/bottom borders, respectively. Default: None, which means no padding.
pad_if_needed (boolean) – It will pad the image if smaller than the desired size to avoid raising an exception. Since cropping is done after padding, the padding seems to be done at a random offset. Default: False.
pad_val (Number | Sequence[Number]) – Pixel value for constant fill. If a tuple of length 3, it is used to fill the R, G, B channels respectively. Default: 0.
padding_mode (str) –
Type of padding. Defaults to “constant”. Should be one of the following:
constant: Pads with a constant value, this value is specified with pad_val.
edge: pads with the last value at the edge of the image.
reflect: Pads with reflection of image without repeating the last value on the edge. For example, padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, 3, 4, 3, 2].
symmetric: Pads with reflection of image repeating the last value on the edge. For example, padding [1, 2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, 2, 3, 4, 4, 3].
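The four padding modes can be checked on a one-dimensional list. The following is a minimal sketch that reproduces the reflect and symmetric examples given above; `pad_1d` is a hypothetical helper, and it assumes the pad width is smaller than the input length.

```python
from typing import List


def pad_1d(values: List[int], pad: int, mode: str, pad_val: int = 0) -> List[int]:
    """Pad a 1-D list on both sides using one of the four padding modes."""
    n = len(values)
    if mode == 'constant':
        left = [pad_val] * pad
        right = [pad_val] * pad
    elif mode == 'edge':
        left = [values[0]] * pad
        right = [values[-1]] * pad
    elif mode == 'reflect':
        # Mirror around the edge element without repeating it.
        left = [values[i] for i in range(pad, 0, -1)]
        right = [values[n - 1 - i] for i in range(1, pad + 1)]
    elif mode == 'symmetric':
        # Mirror around the border, repeating the edge element.
        left = [values[i] for i in range(pad - 1, -1, -1)]
        right = [values[n - 1 - i] for i in range(pad)]
    else:
        raise ValueError(f'unknown padding mode: {mode}')
    return left + values + right


reflect = pad_1d([1, 2, 3, 4], 2, 'reflect')      # [3, 2, 1, 2, 3, 4, 3, 2]
symmetric = pad_1d([1, 2, 3, 4], 2, 'symmetric')  # [2, 1, 1, 2, 3, 4, 4, 3]
```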
- static get_params(img: numpy.ndarray, output_size: Tuple) → Tuple[source]¶
Get parameters for crop for a random crop.
- Parameters
img (np.ndarray) – Image to be cropped.
output_size (Tuple) – Expected output size of the crop.
- Returns
Params (xmin, ymin, target_height, target_width) to be passed to crop for random crop.
- Return type
tuple
- class mmselfsup.datasets.transforms.RandomGaussianBlur(sigma_min: float, sigma_max: float, prob: Optional[float] = 0.5)[source]¶
Gaussian blur augmentation as used in SimCLR.
Required Keys:
img
Modified Keys:
img
- Parameters
sigma_min (float) – The minimum parameter of Gaussian kernel std.
sigma_max (float) – The maximum parameter of Gaussian kernel std.
prob (float, optional) – Probability. Defaults to 0.5.
- class mmselfsup.datasets.transforms.RandomPatchWithLabels[source]¶
Relative patch location.
Required Keys:
img
Modified Keys:
img
Added Keys:
patch_label
patch_box
unpatched_img
Crops the image into several patches and concatenates every surrounding patch with the center one. Finally, it gives labels 0, 1, 2, 3, 4, 5, 6, 7 and patch positions.
- class mmselfsup.datasets.transforms.RandomResizedCrop(size: Union[int, Sequence[int]], scale: Tuple = (0.08, 1.0), ratio: Tuple = (0.75, 1.3333333333333333), max_attempts: int = 10, interpolation: str = 'bilinear', backend: str = 'cv2')[source]¶
Crop the given image to random size and aspect ratio.
A crop of random size (default: of 0.08 to 1.0) of the original size and a random aspect ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This crop is finally resized to given size.
Required Keys:
img
Modified Keys:
img
img_shape
- Parameters
size (Sequence | int) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop (size, size) is made.
scale (Tuple) – Range of the random size of the cropped image compared to the original image. Defaults to (0.08, 1.0).
ratio (Tuple) – Range of the random aspect ratio of the cropped image compared to the original image. Defaults to (3. / 4., 4. / 3.).
max_attempts (int) – Maximum number of attempts before falling back to Central Crop. Defaults to 10.
interpolation (str) – Interpolation method, accepted values are ‘nearest’, ‘bilinear’, ‘bicubic’, ‘area’, ‘lanczos’. Defaults to ‘bilinear’.
backend (str) – The image resize backend type, accepted values are cv2 and pillow. Defaults to cv2.
- static get_params(img: numpy.ndarray, scale: Tuple, ratio: Tuple, max_attempts: int = 10) → Tuple[int, int, int, int][source]¶
Get parameters for crop for a random sized crop.
- Parameters
img (np.ndarray) – Image to be cropped.
scale (Tuple) – Range of the random size of the cropped image compared to the original image size.
ratio (Tuple) – Range of the random aspect ratio of the cropped image compared to the original image area.
max_attempts (int) – Maximum number of attempts before falling back to central crop. Defaults to 10.
- Returns
Params (ymin, xmin, ymax, xmax) to be passed to crop for a random sized crop.
- Return type
tuple
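The interplay of scale, ratio and max_attempts can be sketched as follows. This is not the library code: `sample_crop_box` is a hypothetical function, the log-uniform sampling of the aspect ratio follows the common torchvision-style recipe and is an assumption here, and the returned ymax and xmax are assumed to be inclusive.

```python
import math
import random


def sample_crop_box(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                    max_attempts=10, rng=None):
    """Return (ymin, xmin, ymax, xmax) of a random crop inside the image."""
    rng = rng or random.Random()
    area = height * width
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(max_attempts):
        target_area = rng.uniform(*scale) * area
        aspect = math.exp(rng.uniform(*log_ratio))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            ymin = rng.randint(0, height - h)
            xmin = rng.randint(0, width - w)
            return ymin, xmin, ymin + h - 1, xmin + w - 1
    # Fallback: central crop whose aspect ratio is clamped to the ratio range.
    in_ratio = width / height
    if in_ratio < ratio[0]:
        w, h = width, int(round(width / ratio[0]))
    elif in_ratio > ratio[1]:
        h, w = height, int(round(height * ratio[1]))
    else:
        w, h = width, height
    ymin = (height - h) // 2
    xmin = (width - w) // 2
    return ymin, xmin, ymin + h - 1, xmin + w - 1


box = sample_crop_box(224, 224, rng=random.Random(0))
fallback = sample_crop_box(100, 400, max_attempts=0)
```

Passing `max_attempts=0` skips the sampling loop, which makes the central-crop fallback easy to inspect in isolation.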
- class mmselfsup.datasets.transforms.RandomResizedCropAndInterpolationWithTwoPic(size: Union[tuple, int], second_size=None, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation='bilinear', second_interpolation='lanczos')[source]¶
Crop the given PIL Image to random size and aspect ratio with random interpolation.
Required Keys:
img
Modified Keys:
img
Added Keys:
target_img
This module is borrowed from https://github.com/microsoft/unilm/tree/master/beit.
A crop of random size (default: of 0.08 to 1.0) of the original size and a random aspect ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This crop is finally resized to given size. This is popularly used to train the Inception networks. This module first crops the image and resizes the crop to two different sizes.
- Parameters
size (Union[tuple, int]) – Expected output size of each edge of the first image.
second_size (Union[tuple, int], optional) – Expected output size of each edge of the second image.
scale (tuple[float, float]) – Range of size of the origin size cropped. Defaults to (0.08, 1.0).
ratio (tuple[float, float]) – Range of aspect ratio of the origin aspect ratio cropped. Defaults to (3./4., 4./3.).
interpolation (str) – The interpolation for the first image. Defaults to bilinear.
second_interpolation (str) – The interpolation for the second image. Defaults to lanczos.
- static get_params(img: numpy.ndarray, scale: tuple, ratio: tuple) → Sequence[int][source]¶
Get parameters for crop for a random sized crop.
- Parameters
img (np.ndarray) – Image to be cropped.
scale (tuple) – Range of the crop size relative to the original image size.
ratio (tuple) – Range of the aspect ratio of the crop relative to the original aspect ratio.
- Returns
params (i, j, h, w) to be passed to crop for a random sized crop.
- Return type
tuple
- transform(results: dict) → dict[source]¶
Crop the given image and resize it to two different sizes.
This module crops the given image randomly and resizes the crop to two different sizes. This is popularly used in BEiT-style masked image modeling, where an off-the-shelf model is used to provide the target.
- Parameters
results (dict) – Results from previous pipeline.
- Returns
Results after applying this transformation.
- Return type
dict
- class mmselfsup.datasets.transforms.RandomRotation(degrees: Union[int, Sequence[int]], interpolation: str = 'nearest', expand: bool = False, center: Optional[Tuple[float]] = None, fill: int = 0)[source]¶
Rotate the image by angle.
Required Keys:
img
Modified Keys:
img
- Parameters
degrees (sequence | int) – Range of degrees to select from. If degrees is an int instead of sequence like (min, max), the range of degrees will be (-degrees, +degrees).
interpolation (str, optional) – Interpolation method, accepted values are ‘nearest’, ‘bilinear’, ‘bicubic’, ‘area’, ‘lanczos’. Defaults to ‘nearest’.
expand (bool, optional) – Optional expansion flag. If true, expands the output to make it large enough to hold the entire rotated image. If false or omitted, make the output image the same size as the input image. Note that the expand flag assumes rotation around the center and no translation. Defaults to False.
center (Tuple[float], optional) – Center point (w, h) of the rotation in the source image. If not specified, the center of the image will be used. Defaults to None.
fill (int, optional) – Pixel fill value for the area outside the rotated image. Defaults to 0.
- class mmselfsup.datasets.transforms.RandomSolarize(threshold: int = 128, prob: float = 0.5)[source]¶
Solarization augmentation as used in BYOL.
Required Keys:
img
Modified Keys:
img
- Parameters
threshold (float, optional) – The solarization threshold. Defaults to 128.
prob (float, optional) – Probability. Defaults to 0.5.
- class mmselfsup.datasets.transforms.RotationWithLabels[source]¶
Rotation prediction.
Required Keys:
img
Modified Keys:
img
Added Keys:
rot_label
Rotate each image by 0, 90, 180, and 270 degrees and give labels 0, 1, 2, 3 correspondingly.
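The rotation pretext task can be illustrated on a tiny nested-list "image". This is a sketch only: `rotate90` and `rotation_with_labels` are hypothetical helpers, and the counter-clockwise rotation direction is an assumption, since the description does not state it.

```python
from typing import List, Tuple

Image = List[List[int]]


def rotate90(img: Image) -> Image:
    """Rotate a 2-D nested list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rotation_with_labels(img: Image) -> Tuple[List[Image], List[int]]:
    """Return the four rotated copies of img and their labels 0, 1, 2, 3."""
    views, labels = [], []
    current = [list(row) for row in img]
    for label in range(4):  # label k corresponds to a rotation of k * 90 degrees
        views.append(current)
        labels.append(label)
        current = rotate90(current)
    return views, labels


views, labels = rotation_with_labels([[1, 2], [3, 4]])
```

A model trained on these pairs has to predict which of the four rotations was applied, which is the rot_label added by the transform.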
- class mmselfsup.datasets.transforms.SimMIMMaskGenerator(input_size: int = 192, mask_patch_size: int = 32, model_patch_size: int = 4, mask_ratio: float = 0.6)[source]¶
Generate random block mask for each Image.
Added Keys:
mask
This module is used in SimMIM to generate masks.
- Parameters
input_size (int) – Size of input image. Defaults to 192.
mask_patch_size (int) – Size of each block mask. Defaults to 32.
model_patch_size (int) – Patch size of each token. Defaults to 4.
mask_ratio (float) – The mask ratio of image. Defaults to 0.6.
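The relationship between the four parameters can be made concrete with a small sketch. It is not the library implementation: `simmim_mask` is a hypothetical function, and both the rounding up of the number of masked blocks and the output resolution of input_size / model_patch_size are assumptions based on how SimMIM-style block masking is usually described.

```python
import math
import random
from typing import List


def simmim_mask(input_size: int = 192, mask_patch_size: int = 32,
                model_patch_size: int = 4, mask_ratio: float = 0.6,
                rng=None) -> List[List[int]]:
    """Return a 0/1 mask at token resolution (1 means masked)."""
    rng = rng or random.Random()
    assert input_size % mask_patch_size == 0
    assert mask_patch_size % model_patch_size == 0
    grid = input_size // mask_patch_size          # mask blocks per side
    scale = mask_patch_size // model_patch_size   # tokens per block side
    num_blocks = grid * grid
    num_masked = int(math.ceil(num_blocks * mask_ratio))
    masked = set(rng.sample(range(num_blocks), num_masked))
    coarse = [[1 if r * grid + c in masked else 0 for c in range(grid)]
              for r in range(grid)]
    # Upsample every block to scale x scale tokens.
    fine = []
    for row in coarse:
        fine_row = [v for v in row for _ in range(scale)]
        fine.extend(list(fine_row) for _ in range(scale))
    return fine


mask = simmim_mask(rng=random.Random(0))
```

Under these assumptions the defaults give a 6 x 6 grid of blocks, 22 of which are masked, expanded to a 48 x 48 token mask.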
samplers¶
- class mmselfsup.datasets.samplers.DeepClusterSampler(dataset: Sized, shuffle: bool = True, seed: Optional[int] = None, replace: bool = False, round_up: bool = True)[source]¶
The sampler inherits DefaultSampler from mmengine.
This sampler supports setting replace to True when sampling indices. In addition, it defines the function set_uniform_indices, which is applied in DeepClusterHook.
- Parameters
dataset (Sized) – The dataset.
shuffle (bool) – Whether shuffle the dataset or not. Defaults to True.
seed (int, optional) – Random seed used to shuffle the sampler if shuffle=True. This number should be identical across all processes in the distributed group. Defaults to None.
replace (bool) – Whether to sample with replacement in the random shuffle. It only takes effect when shuffle is True. Defaults to False.
round_up (bool) – Whether to add extra samples to make the number of samples evenly divisible by the world size. Defaults to True.
mmselfsup.engine¶
hooks¶
- class mmselfsup.engine.hooks.DeepClusterHook(extract_dataloader: dict, clustering: dict, unif_sampling: bool, reweight: bool, reweight_pow: float, init_memory: bool = False, initial: bool = True, interval: int = 1, seed: Optional[int] = None)[source]¶
Hook for DeepCluster.
This hook includes the global clustering process in DC.
- Parameters
extract_dataloader (dict) – Config dict for the dataloader used in feature extraction.
clustering (dict) – Config dict that specifies the clustering algorithm.
unif_sampling (bool) – Whether to apply uniform sampling.
reweight (bool) – Whether to apply loss re-weighting.
reweight_pow (float) – The power of re-weighting.
init_memory (bool) – Whether to initialize memory banks used in ODC. Defaults to False.
initial (bool) – Whether to call the hook initially. Defaults to True.
interval (int) – Frequency of epochs to call the hook. Defaults to 1.
seed (int, optional) – Random seed. Defaults to None.
- set_reweight(runner, labels: numpy.ndarray, reweight_pow: float = 0.5)[source]¶
Loss re-weighting.
Re-weighting the loss according to the number of samples in each class.
- Parameters
runner (mmengine.Runner) – mmengine Runner.
labels (numpy.ndarray) – Label assignments.
reweight_pow (float, optional) – The power of re-weighting. Defaults to 0.5.
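One plausible reading of "re-weighting the loss according to the number of samples in each class" is an inverse-frequency weight raised to reweight_pow. The sketch below encodes that reading only; the exact formula, the normalization to a sum of one, and the names `class_weights` and `eps` are assumptions, not taken from the hook's source.

```python
from collections import Counter
from typing import List, Sequence


def class_weights(labels: Sequence[int], num_classes: int,
                  reweight_pow: float = 0.5, eps: float = 1e-10) -> List[float]:
    """Inverse-frequency class weights, normalized to sum to one."""
    counts = Counter(labels)
    inverse = [(1.0 / (counts.get(c, 0) + eps)) ** reweight_pow
               for c in range(num_classes)]
    total = sum(inverse)
    return [value / total for value in inverse]


# Class 0 has four samples, class 1 has one sample.
weights = class_weights([0, 0, 0, 0, 1], num_classes=2, reweight_pow=0.5)
```

With reweight_pow=0.5 the rare class receives twice the weight of the class that is four times as frequent; a power of 0 would give uniform weights, and a power of 1 would give plain inverse frequency.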
- class mmselfsup.engine.hooks.DenseCLHook(start_iters: int = 1000)[source]¶
Hook for DenseCL.
This hook includes loss_lambda warmup in DenseCL. Borrowed from the authors’ code: https://github.com/WXinlong/DenseCL.
- Parameters
start_iters (int) – The number of warmup iterations to set loss_lambda=0. Defaults to 1000.
- class mmselfsup.engine.hooks.ODCHook(centroids_update_interval: int, deal_with_small_clusters_interval: int, evaluate_interval: int, reweight: bool, reweight_pow: float, dist_mode: bool = True)[source]¶
Hook for ODC.
This hook includes the online clustering process in ODC.
- Parameters
centroids_update_interval (int) – Frequency of iterations to update centroids.
deal_with_small_clusters_interval (int) – Frequency of iterations to deal with small clusters.
evaluate_interval (int) – Frequency of iterations to evaluate clusters.
reweight (bool) – Whether to perform loss re-weighting.
reweight_pow (float) – The power of re-weighting.
dist_mode (bool) – Use distributed training or not. Defaults to True.
- after_train_iter(runner, batch_idx: int, data_batch: Optional[Sequence[dict]] = None, outputs: Optional[dict] = None) → None[source]¶
Update cluster centroids and the loss_weight.
- set_reweight(runner, labels: Optional[numpy.ndarray] = None, reweight_pow: float = 0.5)[source]¶
Loss re-weighting.
Re-weighting the loss according to the number of samples in each class.
- Parameters
runner (mmengine.Runner) – mmengine Runner.
labels (numpy.ndarray) – Label assignments.
reweight_pow (float, optional) – The power of re-weighting. Defaults to 0.5.
- class mmselfsup.engine.hooks.SimSiamHook(fix_pred_lr: bool, lr: float, adjust_by_epoch: Optional[bool] = True)[source]¶
Hook for SimSiam.
This hook is for SimSiam to fix learning rate of predictor.
- Parameters
fix_pred_lr (bool) – whether to fix the lr of predictor or not.
lr (float) – the value of fixed lr.
adjust_by_epoch (bool, optional) – whether to set lr by epoch or iter. Defaults to True.
- class mmselfsup.engine.hooks.SwAVHook(batch_size: int, epoch_queue_starts: Optional[int] = 15, crops_for_assign: Optional[List[int]] = [0, 1], feat_dim: Optional[int] = 128, queue_length: Optional[int] = 0, interval: Optional[int] = 1, frozen_layers_cfg: Optional[Dict] = {})[source]¶
Hook for SwAV.
This hook builds the queue in SwAV according to epoch_queue_starts. The queue will be saved in runner.work_dir or loaded at the start epoch if the folder already contains saved queues.
- Parameters
batch_size (int) – the batch size per GPU for computing.
epoch_queue_starts (int, optional) – from this epoch, starts to use the queue. Defaults to 15.
crops_for_assign (list[int], optional) – list of crops id used for computing assignments. Defaults to [0, 1].
feat_dim (int, optional) – feature dimension of output vector. Defaults to 128.
queue_length (int, optional) – length of the queue (0 for no queue). Defaults to 0.
interval (int, optional) – the interval to save the queue. Defaults to 1.
frozen_layers_cfg (dict, optional) – Dict to config frozen layers. The key-value pair is layer name and its frozen iters. If frozen, the layers don’t need gradient. Defaults to dict().
optimizers¶
- class mmselfsup.engine.optimizers.LARS(params: Iterable, lr: float, momentum: float = 0, weight_decay: float = 0, dampening: float = 0, eta: float = 0.001, nesterov: bool = False, eps: float = 1e-08)[source]¶
Implements layer-wise adaptive rate scaling for SGD.
Based on Algorithm 1 of the paper Large Batch Training of Convolutional Networks by You, Gitman, and Ginsburg.
- Parameters
params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Base learning rate.
momentum (float) – Momentum factor. Defaults to 0.
weight_decay (float) – Weight decay (L2 penalty). Defaults to 0.
dampening (float) – Dampening for momentum. Defaults to 0.
eta (float) – LARS coefficient. Defaults to 0.001.
nesterov (bool) – Enables Nesterov momentum. Defaults to False.
eps (float) – A small number to avoid dividing by zero. Defaults to 1e-8.
Example
>>> optimizer = LARS(model.parameters(), lr=0.1, momentum=0.9,
>>>                  weight_decay=1e-4, eta=1e-3)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
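The role of eta and eps can be seen in a single layer-wise update written out by hand. This is a sketch of the commonly cited LARS trust ratio, not the optimizer's source: `lars_step` is a hypothetical function, the exact placement of weight_decay in the ratio is an assumption, and momentum, dampening and Nesterov are left out for brevity.

```python
import math
from typing import List


def lars_step(weights: List[float], grads: List[float], lr: float,
              weight_decay: float = 0.0, eta: float = 0.001,
              eps: float = 1e-8) -> List[float]:
    """One momentum-free LARS update for the parameters of a single layer."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm > 0 and g_norm > 0:
        # Layer-wise trust ratio: scale the step by ||w|| / ||g||.
        local_lr = eta * w_norm / (g_norm + weight_decay * w_norm + eps)
    else:
        local_lr = 1.0
    return [w - lr * local_lr * (g + weight_decay * w)
            for w, g in zip(weights, grads)]


# ||w|| = 5 and ||g|| = 1, so the local rate is eta * 5 = 0.5.
updated = lars_step([3.0, 4.0], [0.6, 0.8], lr=1.0, eta=0.1, eps=0.0)
```

Because the ratio is computed per layer, layers with large weights and small gradients take proportionally larger steps than a global learning rate would allow.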
- class mmselfsup.engine.optimizers.LearningRateDecayOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: Optional[dict] = None)[source]¶
Different learning rates are set for different layers of backbone.
Note: Currently, this optimizer constructor is built for ViT and Swin.
In addition to applying layer-wise learning rate decay schedule, the paramwise_cfg only supports weight decay customization.
- add_params(params: List[dict], module: torch.nn.modules.module.Module, optimizer_cfg: dict, **kwargs) → None[source]¶
Add all parameters of module to the params list.
The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.
- Parameters
params (List[dict]) – A list of param groups, it will be modified in place.
module (nn.Module) – The module to be added.
optimizer_cfg (dict) – The configuration of optimizer.
prefix (str) – The prefix of the module.
mmselfsup.evaluation¶
functional¶
- mmselfsup.evaluation.functional.knn_eval(train_features: torch.Tensor, train_labels: torch.Tensor, test_features: torch.Tensor, test_labels: torch.Tensor, k: int, T: float, num_classes: int = 1000) → Tuple[float, float][source]¶
Compute accuracy of knn classifier predictions.
- Parameters
train_features (Tensor) – Extracted features in the training set.
train_labels (Tensor) – Labels in the training set.
test_features (Tensor) – Extracted features in the testing set.
test_labels (Tensor) – Labels in the testing set.
k (int) – Number of NN to use.
T (float) – Temperature used in the voting coefficient.
num_classes (int) – Number of classes. Defaults to 1000.
- Returns
The top1 and top5 accuracy.
- Return type
Tuple[float, float]
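The roles of k and T can be illustrated with a weighted nearest-neighbor vote written in plain Python. This is a sketch under assumptions rather than the knn_eval implementation: `knn_rank_classes` and `topk_accuracy` are hypothetical names, features are assumed to be L2-normalized so that the dot product is a cosine similarity, each neighbor votes with weight exp(similarity / T), and accuracy is returned as a fraction.

```python
import math
from typing import List, Sequence


def knn_rank_classes(train_features: Sequence[Sequence[float]],
                     train_labels: Sequence[int], query: Sequence[float],
                     k: int, T: float, num_classes: int) -> List[int]:
    """Rank classes for one query by temperature-weighted votes of its k neighbors."""
    sims = [sum(a * b for a, b in zip(query, feat)) for feat in train_features]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = [0.0] * num_classes
    for i in nearest:
        votes[train_labels[i]] += math.exp(sims[i] / T)
    return sorted(range(num_classes), key=lambda c: votes[c], reverse=True)


def topk_accuracy(train_features, train_labels, test_features, test_labels,
                  k: int, T: float, num_classes: int, topk: int = 1) -> float:
    """Fraction of test samples whose label is among the topk ranked classes."""
    hits = 0
    for query, label in zip(test_features, test_labels):
        ranked = knn_rank_classes(train_features, train_labels, query,
                                  k, T, num_classes)
        hits += label in ranked[:topk]
    return hits / len(test_labels)


train_x = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
train_y = [0, 0, 1]
top1 = topk_accuracy(train_x, train_y, [[1.0, 0.0], [0.0, 1.0]], [0, 1],
                     k=2, T=0.07, num_classes=2)
```

A small temperature such as 0.07 makes the closest neighbor dominate the vote, which is why the second query is still classified correctly although one of its two neighbors belongs to the other class.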
mmselfsup.models¶
algorithms¶
- class mmselfsup.models.algorithms.BEiT(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
BEiT v1/v2.
Implementation of BEiT: BERT Pre-Training of Image Transformers and BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers.
- loss(batch_inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
batch_inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.BYOL(backbone: dict, neck: dict, head: dict, base_momentum: float = 0.996, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
BYOL.
Implementation of Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
base_momentum (float) – The base momentum coefficient for the target network. Defaults to 0.996.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
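The base_momentum parameter controls how slowly the target network follows the online network. A minimal sketch of such an exponential moving average update is shown below; `ema_update` is a hypothetical helper operating on flat lists of floats, and any schedule that changes the momentum during training is deliberately left out.

```python
from typing import List


def ema_update(target: List[float], online: List[float],
               momentum: float = 0.996) -> List[float]:
    """Move each target parameter a small step toward its online counterpart."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target, online)]


# With momentum 0.9 the target keeps 90% of its old value.
new_target = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.9)
```

A momentum close to 1, such as the default 0.996, means the target network changes very little per step.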
- extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.BarlowTwins(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
BarlowTwins.
Implementation of Barlow Twins: Self-Supervised Learning via Redundancy Reduction. Part of the code is borrowed from: https://github.com/facebookresearch/barlowtwins/blob/main/main.py.
- extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
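The redundancy-reduction objective named in the Barlow Twins title can be sketched in plain Python. This is a minimal illustration of the objective as described in the paper, not the mmselfsup implementation: the function names are hypothetical, the embeddings are nested lists rather than tensors, and the off-diagonal weight lambd=0.0051 is an assumed illustrative default.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(z: Matrix) -> Matrix:
    """Normalize each feature to zero mean and unit variance over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / (std + 1e-12)
    return out


def barlow_twins_loss(z1: Matrix, z2: Matrix, lambd: float = 0.0051) -> float:
    """Push the cross-correlation matrix of two views towards the identity."""
    n, d = len(z1), len(z1[0])
    a, b = standardize(z1), standardize(z2)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Cross-correlation between feature i of view 1 and feature j of view 2.
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2    # invariance term
            else:
                loss += lambd * c_ij ** 2    # redundancy-reduction term
    return loss


view = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
swapped = [[row[1], row[0]] for row in view]
loss_identical = barlow_twins_loss(view, view)
loss_swapped = barlow_twins_loss(view, swapped)
```

Two identical views with decorrelated features give a loss close to zero, while swapping the feature columns of one view moves the correlation off the diagonal and raises the loss.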
- class mmselfsup.models.algorithms.BaseModel(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
BaseModel for SelfSup.
All algorithms should inherit this module.
- Parameters
backbone (dict) – The backbone module. See mmcls.models.backbones.
neck (dict, optional) – The neck module to process features from the backbone. See mmcls.models.necks. Defaults to None.
head (dict, optional) – The head module to make predictions and calculate the loss from processed features. See mmcls.models.heads. Note that if the head is not set, almost no methods can be used except extract_feat(). Defaults to None.
target_generator (dict, optional) – The target_generator module to generate targets for self-supervised learning optimization, such as HOG or features extracted from other modules (DALL-E, CLIP), etc. Defaults to None.
pretrained (str, optional) – The pretrained checkpoint path. Both local and remote paths are supported. Defaults to None.
data_preprocessor (Union[dict, nn.Module], optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (dict, optional) – The config to control the initialization. Defaults to None.
- extract_feat(inputs: torch.Tensor)[source]¶
Extract features from the input tensor with shape (N, C, …).
This is an abstract method; subclasses should override it if needed.
- Parameters
inputs (Tensor) – A batch of inputs. The shape should be (num_samples, num_channels, *img_shape).
- Returns
The output of the specified stage. The output depends on the concrete implementation.
- Return type
tuple | Tensor
- forward(inputs: torch.Tensor, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, mode: str = 'tensor')[source]¶
Returns losses or predictions of training, validation, testing, and simple inference process.
This method overrides the abstract method in BaseModel.
- Parameters
inputs (torch.Tensor) – Batch input tensor collated by data_preprocessor.
data_samples (List[BaseDataElement], optional) – Data samples collated by data_preprocessor.
mode (str) – mode should be one of loss, predict and tensor.
loss: Called by train_step and returns a loss dict used for logging.
predict: Called by val_step and test_step and returns a list of BaseDataElement results used for computing metrics.
tensor: Called by custom use to get Tensor type results.
- Returns
If mode == loss, return a dict of loss tensors used for backward and logging.
If mode == predict, return a list of BaseDataElement for computing metrics and getting inference results.
If mode == tensor, return a tensor, a tuple of tensors, or a dict of tensors for custom use.
- Return type
ForwardResults (dict or list)
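The three-way dispatch on mode described above can be sketched as follows. This is a minimal illustration in plain Python, not the mmselfsup implementation: ToyModel and its stub methods are hypothetical, and the real methods operate on tensors and SelfSupDataSample objects.

```python
from typing import Any, List, Optional


class ToyModel:
    """Hypothetical stand-in that mimics the three forward modes."""

    def extract_feat(self, inputs: List[float]) -> tuple:
        # Stub "backbone": wrap the inputs in a tuple of features.
        return (list(inputs),)

    def loss(self, inputs: List[float], data_samples: Optional[list]) -> dict:
        # Stub loss: the mean of the inputs, returned as a dict of components.
        return {"loss": sum(inputs) / len(inputs)}

    def predict(self, inputs: List[float], data_samples: Optional[list]) -> list:
        # Stub prediction: one result per input element.
        return [{"pred": x} for x in inputs]

    def forward(self, inputs: List[float],
                data_samples: Optional[list] = None,
                mode: str = "tensor") -> Any:
        if mode == "tensor":
            return self.extract_feat(inputs)
        if mode == "loss":
            return self.loss(inputs, data_samples)
        if mode == "predict":
            return self.predict(inputs, data_samples)
        raise RuntimeError(f'Invalid mode "{mode}".')


model = ToyModel()
tensor_out = model.forward([1.0, 3.0])                # tuple of features
loss_out = model.forward([1.0, 3.0], mode="loss")     # dict of loss components
pred_out = model.forward([1.0, 3.0], mode="predict")  # list of results
```

Keeping a single forward entry point with a mode switch lets the training loop, the evaluation loop, and ad-hoc feature extraction share one model object.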
- loss(inputs: torch.Tensor, data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]) → dict[source]¶
Calculate losses from a batch of inputs and data samples.
This is an abstract method; subclasses should override it if needed.
- Parameters
inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[SelfSupDataSample]) – The annotation data of every sample.
- Returns
A dictionary of loss components.
- Return type
dict[str, Tensor]
- predict(inputs: tuple, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶
Predict results from the extracted features.
This method returns the logits before the loss, which are used to compute all kinds of metrics. This is an abstract method; subclasses should override it if needed.
- Parameters
inputs (tuple) – The features extracted from the backbone.
data_samples (List[BaseDataElement], optional) – The annotation data of every sample. Defaults to None.
**kwargs – Other keyword arguments accepted by the predict method of head.
- property with_head: bool¶
Check if the model has a head module.
- property with_neck: bool¶
Check if the model has a neck module.
- property with_target_generator: bool¶
Check if the model has a target_generator module.
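The three with_* properties above amount to a check that an optional component exists and is not None. A minimal sketch, assuming a hypothetical ToyModel that stores its components as plain attributes (the real class builds them from config dicts):

```python
from typing import Optional


class ToyModel:
    """Hypothetical model with optional neck, head and target_generator."""

    def __init__(self,
                 neck: Optional[object] = None,
                 head: Optional[object] = None,
                 target_generator: Optional[object] = None) -> None:
        # Only attach the components that were actually provided.
        if neck is not None:
            self.neck = neck
        if head is not None:
            self.head = head
        if target_generator is not None:
            self.target_generator = target_generator

    @property
    def with_neck(self) -> bool:
        return getattr(self, "neck", None) is not None

    @property
    def with_head(self) -> bool:
        return getattr(self, "head", None) is not None

    @property
    def with_target_generator(self) -> bool:
        return getattr(self, "target_generator", None) is not None


backbone_only = ToyModel()
with_head_only = ToyModel(head=object())
```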
- class mmselfsup.models.algorithms.CAE(backbone: dict, neck: dict, head: dict, target_generator: Optional[dict] = None, base_momentum: float = 0.0, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
CAE.
Implementation of Context Autoencoder for Self-Supervised Representation Learning.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of neck.
head (dict) – Config dict for module of head functions.
target_generator (dict, optional) – The target_generator module to generate targets for self-supervised learning optimization, such as HOG or features extracted from other modules (DALL-E, CLIP), etc. Defaults to None.
base_momentum (float) – The base momentum coefficient for the target network. Defaults to 0.0.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.DeepCluster(backbone: dict, neck: dict, head: dict, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
DeepCluster.
Implementation of Deep Clustering for Unsupervised Learning of Visual Features. The clustering operation is in engine/hooks/deepcluster_hook.py.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
pretrained (str, optional) – The pretrained checkpoint path. Both local and remote paths are supported. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
- extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶
The forward function in testing.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
List[SelfSupDataSample]
- class mmselfsup.models.algorithms.DenseCL(backbone: dict, neck: dict, head: dict, queue_len: int = 65536, feat_dim: int = 128, momentum: float = 0.999, loss_lambda: float = 0.5, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
DenseCL.
Implementation of Dense Contrastive Learning for Self-Supervised Visual Pre-Training. Borrowed from the authors’ code: https://github.com/WXinlong/DenseCL. The loss_lambda warmup is in engine/hooks/densecl_hook.py.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
queue_len (int) – Number of negative keys maintained in the queue. Defaults to 65536.
feat_dim (int) – Dimension of compact feature vectors. Defaults to 128.
momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.999.
loss_lambda (float) – Loss weight for the single and dense contrastive loss. Defaults to 0.5.
pretrained (str, optional) – The pretrained checkpoint path. Both local and remote paths are supported. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
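The role of loss_lambda can be sketched as a weighting between the two contrastive terms. This is a minimal sketch, assuming the convex combination (1 - loss_lambda) * single + loss_lambda * dense described in the DenseCL paper; the function name and the scalar inputs are hypothetical, and the warmup handled by the hook is not modelled.

```python
from typing import Dict


def combine_densecl_losses(loss_single: float,
                           loss_dense: float,
                           loss_lambda: float = 0.5) -> Dict[str, float]:
    """Weight the global (single) and dense contrastive losses."""
    if not 0.0 <= loss_lambda <= 1.0:
        raise ValueError("loss_lambda is assumed to lie in [0, 1]")
    return {
        "loss_single": (1.0 - loss_lambda) * loss_single,
        "loss_dense": loss_lambda * loss_dense,
    }


balanced = combine_densecl_losses(2.0, 4.0, loss_lambda=0.5)
single_only = combine_densecl_losses(2.0, 4.0, loss_lambda=0.0)
```

With loss_lambda set to 0.0 only the single (global) term contributes, and with the default 0.5 both terms are weighted equally.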
- extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample[source]¶
Predict results from the extracted features.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
SelfSupDataSample
- class mmselfsup.models.algorithms.EVA(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
EVA.
Implementation of EVA: Exploring the Limits of Masked Visual Representation Learning at Scale.
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.MAE(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
MAE.
Implementation of Masked Autoencoders Are Scalable Vision Learners.
- extract_feat(inputs: List[torch.Tensor], data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwarg) → Tuple[torch.Tensor][source]¶
The forward function to extract features from neck.
- Parameters
inputs (List[torch.Tensor]) – The input images.
- Returns
Neck outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- reconstruct(features: torch.Tensor, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample[source]¶
The function is for image reconstruction.
- Parameters
features (torch.Tensor) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
SelfSupDataSample
- class mmselfsup.models.algorithms.MILAN(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
MILAN.
Implementation of MILAN: Masked Image Pretraining on Language Assisted Representation.
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.MaskFeat(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
MaskFeat.
Implementation of Masked Feature Prediction for Self-Supervised Visual Pre-Training.
- extract_feat(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], compute_hog: bool = True, **kwarg) → Tuple[torch.Tensor][source]¶
The forward function to extract features from neck.
- Parameters
inputs (List[torch.Tensor]) – The input images and mask.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
compute_hog (bool) – Whether to compute HOG during extraction. If True, the batch size of inputs needs to be 1. Defaults to True.
- Returns
Neck outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- reconstruct(features: List[torch.Tensor], data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample[source]¶
The function is for image reconstruction.
- Parameters
features (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
SelfSupDataSample
- class mmselfsup.models.algorithms.MixMIM(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
MixMIM.
Implementation of MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning.
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.MoCo(backbone: dict, neck: dict, head: dict, queue_len: int = 65536, feat_dim: int = 128, momentum: float = 0.999, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
MoCo.
Implementation of Momentum Contrast for Unsupervised Visual Representation Learning. Part of the code is borrowed from: https://github.com/facebookresearch/moco/blob/master/moco/builder.py.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
queue_len (int) – Number of negative keys maintained in the queue. Defaults to 65536.
feat_dim (int) – Dimension of compact feature vectors. Defaults to 128.
momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.999.
pretrained (str, optional) – The pretrained checkpoint path. Both local and remote paths are supported. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
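The queue_len and momentum parameters correspond to two mechanisms that can be sketched in plain Python: a fixed-length queue of negative keys and a momentum (exponential moving average) update of the key encoder. This is a minimal sketch rather than the mmselfsup code; KeyQueue and momentum_update are hypothetical names, and parameters and keys are plain lists of floats instead of tensors.

```python
from collections import deque
from typing import Deque, List


def momentum_update(key_params: List[float],
                    query_params: List[float],
                    momentum: float = 0.999) -> List[float]:
    """Move the key-encoder parameters slowly towards the query encoder."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


class KeyQueue:
    """Fixed-length FIFO queue of negative keys."""

    def __init__(self, queue_len: int) -> None:
        self.keys: Deque[List[float]] = deque(maxlen=queue_len)

    def enqueue(self, batch_keys: List[List[float]]) -> None:
        # With maxlen set, the oldest keys are dropped automatically.
        self.keys.extend(batch_keys)


updated = momentum_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
queue = KeyQueue(queue_len=4)
queue.enqueue([[0.1], [0.2], [0.3]])
queue.enqueue([[0.4], [0.5]])
remaining = list(queue.keys)
```

A momentum close to 1 keeps the keys in the queue consistent across iterations, because the encoder that produced them changes only slowly.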
- extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.MoCoV3(backbone: dict, neck: dict, head: dict, base_momentum: float = 0.99, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
MoCo v3.
Implementation of An Empirical Study of Training Self-Supervised Vision Transformers.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
base_momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.99.
pretrained (str, optional) – The pretrained checkpoint path. Both local and remote paths are supported. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
- extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.NPID(backbone: dict, neck: dict, head: dict, memory_bank: dict, neg_num: int = 65536, ensure_neg: bool = False, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
NPID.
Implementation of Unsupervised Feature Learning via Non-parametric Instance Discrimination.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
memory_bank (dict) – Config dict for module of memory bank.
neg_num (int) – Number of negative samples for each image. Defaults to 65536.
ensure_neg (bool) – If False, there is a small probability that negative samples contain positive ones. Defaults to False.
pretrained (str, optional) – The pretrained checkpoint path. Both local and remote paths are supported. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
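The interaction between neg_num and ensure_neg can be sketched as index sampling from a memory bank. This is a minimal sketch under stated assumptions: sample_negatives is a hypothetical helper, negatives are drawn uniformly with Python's random module, and the bank is represented only by its size.

```python
import random
from typing import List, Optional


def sample_negatives(pos_idx: int,
                     bank_size: int,
                     neg_num: int,
                     ensure_neg: bool = False,
                     rng: Optional[random.Random] = None) -> List[int]:
    """Draw neg_num memory-bank indices as negatives for one image."""
    rng = rng or random.Random()
    neg = [rng.randrange(bank_size) for _ in range(neg_num)]
    if ensure_neg:
        if bank_size < 2:
            raise ValueError("need at least two entries to exclude the positive")
        # Redraw every index that collides with the positive sample.
        for i, idx in enumerate(neg):
            while idx == pos_idx:
                idx = rng.randrange(bank_size)
            neg[i] = idx
    return neg


negatives = sample_negatives(pos_idx=3, bank_size=10, neg_num=1000,
                             ensure_neg=True, rng=random.Random(0))
```

Without ensure_neg, each draw hits the positive index with probability 1 / bank_size, which is small for a large bank; enabling the flag removes those collisions at the cost of extra redraws.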
- extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, Tensor]
- class mmselfsup.models.algorithms.ODC(backbone: dict, neck: dict, head: dict, memory_bank: dict, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
ODC.
Official implementation of Online Deep Clustering for Unsupervised Representation Learning. The operation w.r.t. memory bank and loss re-weighting is in engine/hooks/odc_hook.py.
- Parameters
backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
memory_bank (dict) – Config dict for module of memory bank.
pretrained (str, optional) – The pretrained checkpoint path. Both local and remote paths are supported. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no type is specified, “SelfSupDataPreprocessor” is used as the type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
- extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶
The forward function in testing.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
List[SelfSupDataSample]
- class mmselfsup.models.algorithms.PixMIM(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
The official implementation of PixMIM.
Implementation of PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling.
Please refer to MAE for these initialization arguments.
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.RelativeLoc(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
Relative patch location.
Implementation of Unsupervised Visual Representation Learning by Context Prediction.
- extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶
The forward function in testing.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
List[SelfSupDataSample]
- class mmselfsup.models.algorithms.RotationPred(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
Rotation prediction.
Implementation of Unsupervised Representation Learning by Predicting Image Rotations.
- extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶
The forward function in testing.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
List[SelfSupDataSample]
- class mmselfsup.models.algorithms.SimCLR(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
SimCLR.
Implementation of A Simple Framework for Contrastive Learning of Visual Representations.
- extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.SimMIM(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
SimMIM.
Implementation of SimMIM: A Simple Framework for Masked Image Modeling.
- extract_feat(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwarg) → torch.Tensor[source]¶
The forward function to extract features.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The reconstructed images.
- Return type
torch.Tensor
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, Tensor]
- reconstruct(features: torch.Tensor, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample[source]¶
The function is for image reconstruction.
- Parameters
features (torch.Tensor) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
The prediction from model.
- Return type
SelfSupDataSample
- class mmselfsup.models.algorithms.SimSiam(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
SimSiam.
Implementation of Exploring Simple Siamese Representation Learning. The operation of fixing learning rate of predictor is in engine/hooks/simsiam_hook.py.
- extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
The forward function in training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
- class mmselfsup.models.algorithms.SwAV(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶
SwAV.
Implementation of Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. The queue is built in engine/hooks/swav_hook.py.
- extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶
Function to extract features from backbone.
- Parameters
inputs (List[torch.Tensor]) – The input images.
- Returns
Backbone outputs.
- Return type
Tuple[torch.Tensor]
- loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶
Forward computation during training.
- Parameters
inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
- Returns
A dictionary of loss components.
- Return type
Dict[str, torch.Tensor]
backbones¶
- class mmselfsup.models.backbones.BEiTViT(arch: str = 'base', img_size: int = 224, patch_size: int = 16, in_channels: int = 3, out_indices: int = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, avg_token: bool = False, frozen_stages: int = - 1, output_cls_token: bool = True, use_abs_pos_emb: bool = False, use_rel_pos_bias: bool = False, use_shared_rel_pos_bias: bool = True, layer_scale_init_value: int = 0.1, interpolate_mode: str = 'bicubic', patch_cfg: dict = {'padding': 0}, layer_cfgs: dict = {}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Vision Transformer for BEiT pre-training.
Rewritten version of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Parameters
arch (str | dict) – Vision Transformer architecture. If a string, choose from ‘small’, ‘base’ and ‘large’. If a dict, it should have the following keys:
embed_dims (int): The dimensions of embedding.
num_layers (int): The number of transformer encoder layers.
num_heads (int): The number of heads in attention modules.
feedforward_channels (int): The hidden dimensions in feedforward modules.
Defaults to ‘base’.
img_size (int | tuple) – The expected input image shape. Because we support dynamic input shape, just set the argument to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 16.
in_channels (int) – The num of input channels. Defaults to 3.
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
with_cls_token (bool) – Whether concatenating class token into image tokens as transformer input. Defaults to True.
avg_token (bool) – Whether or not to use the mean patch token for classification. If True, the model will only take the average of all patch tokens. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
output_cls_token (bool) – Whether to output the cls_token. If set to True, with_cls_token must be True. Defaults to True.
use_abs_pos_emb (bool) – Whether or not to use absolute position embedding. Defaults to False.
use_rel_pos_bias (bool) – Whether or not to use relative position bias. Defaults to False.
use_shared_rel_pos_bias (bool) – Whether or not to use shared relative position bias. Defaults to True.
layer_scale_init_value (float) – The initialization value for the learnable scaling of attention and FFN. Defaults to 0.1.
interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
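As a rough illustration of the dict form of arch, the sketch below builds a dictionary with the four documented keys and checks it before use. The helper check_arch is hypothetical (it is not part of MMSelfSup), and the numbers are the commonly cited ViT-Base sizes, used here only as assumed example values.

```python
# Hypothetical helper, not part of MMSelfSup: validates a custom ``arch`` dict.
REQUIRED_KEYS = {'embed_dims', 'num_layers', 'num_heads', 'feedforward_channels'}


def check_arch(arch: dict) -> dict:
    """Return ``arch`` unchanged if it has the documented keys and sane sizes."""
    missing = REQUIRED_KEYS - set(arch)
    if missing:
        raise KeyError(f'arch is missing keys: {sorted(missing)}')
    # Multi-head attention splits the embedding evenly across heads.
    if arch['embed_dims'] % arch['num_heads'] != 0:
        raise ValueError('embed_dims must be divisible by num_heads')
    return arch


# Assumed example values: the commonly cited ViT-Base sizes.
custom_arch = check_arch(
    dict(embed_dims=768, num_layers=12, num_heads=12,
         feedforward_channels=3072))
```

A dict validated this way could then be passed as the arch argument in place of a string such as ‘base’.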
- forward(x: torch.Tensor, mask: torch.Tensor) → Tuple[torch.Tensor][source]¶
The BEiT style forward function.
- Parameters
x (torch.Tensor) – Input images, which is of shape (B x C x H x W).
mask (torch.Tensor) – Mask for input, which is of shape (B x patch_resolution[0] x patch_resolution[1]).
- Returns
Hidden features.
- Return type
Tuple[torch.Tensor]
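The expected mask shape follows from the image and patch sizes. The sketch below works it out for a square input, assuming non-overlapping patches with no padding. The function names are illustrative and not part of the library.

```python
from typing import Tuple


def patch_resolution(img_size: int, patch_size: int) -> Tuple[int, int]:
    """Number of patches along height and width for a square image."""
    if img_size % patch_size != 0:
        raise ValueError('img_size should be divisible by patch_size')
    side = img_size // patch_size
    return side, side


def mask_shape(batch_size: int, img_size: int = 224,
               patch_size: int = 16) -> Tuple[int, int, int]:
    """Expected mask shape: (B, patch_resolution[0], patch_resolution[1])."""
    rows, cols = patch_resolution(img_size, patch_size)
    return batch_size, rows, cols


print(mask_shape(batch_size=2))  # (2, 14, 14)
```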
- class mmselfsup.models.backbones.CAEViT(arch: str = 'b', img_size: int = 224, patch_size: int = 16, out_indices: int = - 1, drop_rate: float = 0, drop_path_rate: float = 0, qkv_bias: bool = True, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', init_values: Optional[float] = None, patch_cfg: dict = {}, layer_cfgs: dict = {}, init_cfg: Optional[dict] = None)[source]¶
Vision Transformer for CAE pre-training.
Rewritten version of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Parameters
arch (str | dict) – Vision Transformer architecture. Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set to True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”.
init_values (float, optional) – The init value of gamma in TransformerEncoderLayer.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(img: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Generate features for masked images.
This function generates masked images and gets the hidden features for visible patches.
- Parameters
img (torch.Tensor) – Input images, which is of shape B x C x H x W.
mask (torch.Tensor) – Mask for input, which is of shape B x L.
- Returns
Hidden features.
- Return type
torch.Tensor
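To make the role of a B x L mask concrete, here is a minimal pure-Python sketch that keeps only the visible tokens of each sample. It assumes True marks a masked patch and that every sample masks the same number of patches. Both are assumptions of this sketch, and select_visible is an illustrative name rather than a library function.

```python
from typing import List


def select_visible(tokens: List[List[str]],
                   mask: List[List[bool]]) -> List[List[str]]:
    """Keep tokens whose mask entry is False, sample by sample."""
    visible = [[t for t, m in zip(row, mask_row) if not m]
               for row, mask_row in zip(tokens, mask)]
    # A dense B x L_visible result needs the same count in every sample.
    counts = {len(row) for row in visible}
    if len(counts) > 1:
        raise ValueError('every sample must mask the same number of patches')
    return visible


tokens = [['a0', 'a1', 'a2', 'a3'], ['b0', 'b1', 'b2', 'b3']]
mask = [[True, False, True, False], [False, False, True, True]]
print(select_visible(tokens, mask))  # [['a1', 'a3'], ['b0', 'b1']]
```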
- class mmselfsup.models.backbones.MAEViT(arch: Union[str, dict] = 'b', img_size: int = 224, patch_size: int = 16, out_indices: Union[Sequence, int] = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', patch_cfg: dict = {}, layer_cfgs: dict = {}, mask_ratio: float = 0.75, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Vision Transformer for MAE pre-training.
A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. This module implements the patch masking in MAE and initializes the position embedding with sine-cosine position embedding.
- Parameters
arch (str | dict) – Vision Transformer architecture. Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set to True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
mask_ratio (float) – The ratio of total number of patches to be masked. Defaults to 0.75.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.
- forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
Generate features for masked images.
This function generates a mask, masks some patches randomly and gets the hidden features for visible patches.
- Parameters
x (torch.Tensor) – Input images, which is of shape B x C x H x W.
- Returns
Hidden features, mask and the ids to restore original image.
x (torch.Tensor): hidden features, which is of shape B x (L * mask_ratio) x C.
mask (torch.Tensor): mask used to mask image.
ids_restore (torch.Tensor): ids to restore original image.
- Return type
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
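One way to read ids_restore is as the index that puts a shuffled sequence back into its original order. The sketch below shows that idea for a single sample with string tokens. It assumes the visible tokens come first in the shuffled order and that masked positions are filled with a placeholder. The function name and the placeholder are illustrative, not taken from the library.

```python
from typing import List


def restore_sequence(visible: List[str], ids_restore: List[int],
                     mask_token: str = '[MASK]') -> List[str]:
    """Rebuild the full-length sequence from visible tokens and ids_restore."""
    num_masked = len(ids_restore) - len(visible)
    # Shuffled order: visible tokens first, then placeholders for masked ones.
    shuffled = visible + [mask_token] * num_masked
    # ids_restore[i] is where original token i sits in the shuffled order.
    return [shuffled[j] for j in ids_restore]


# Original order is p0..p3; assume p2 and p0 were kept (in that shuffled order).
print(restore_sequence(['p2', 'p0'], ids_restore=[1, 2, 0, 3]))
# ['p0', '[MASK]', 'p2', '[MASK]']
```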
- random_masking(x: torch.Tensor, mask_ratio: float = 0.75) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
Generate the mask for MAE Pre-training.
- Parameters
x (torch.Tensor) – Image with data augmentation applied, which is of shape B x L x C.
mask_ratio (float) – The mask ratio of total patches. Defaults to 0.75.
- Returns
Masked image, mask and the ids to restore original image.
x_masked (torch.Tensor): masked image.
mask (torch.Tensor): mask used to mask image.
ids_restore (torch.Tensor): ids to restore original image.
- Return type
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
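The following is a minimal single-sample sketch of random masking by sorting random noise, written with the standard library only. It is an illustration under stated assumptions (no batch dimension, string tokens, 0 = kept and 1 = masked), not the library's implementation, and the seed argument is added here only to make the example reproducible.

```python
import random
from typing import List, Tuple


def random_masking(tokens: List[str], mask_ratio: float = 0.75,
                   seed: int = 0) -> Tuple[List[str], List[int], List[int]]:
    """Keep a random subset of tokens; return (kept, mask, ids_restore)."""
    rng = random.Random(seed)
    num_tokens = len(tokens)
    len_keep = int(num_tokens * (1 - mask_ratio))

    # One noise value per token; sorting the noise gives a random permutation.
    noise = [rng.random() for _ in range(num_tokens)]
    ids_shuffle = sorted(range(num_tokens), key=lambda i: noise[i])
    # ids_restore[i] is the position of original token i in the shuffled order.
    ids_restore = sorted(range(num_tokens), key=lambda i: ids_shuffle[i])

    ids_keep = ids_shuffle[:len_keep]
    kept = [tokens[i] for i in ids_keep]

    # In shuffled order: 0 = kept, 1 = masked. Gather back to original order.
    mask_shuffled = [0] * len_keep + [1] * (num_tokens - len_keep)
    mask = [mask_shuffled[ids_restore[i]] for i in range(num_tokens)]
    return kept, mask, ids_restore
```

With eight tokens and the default ratio of 0.75, this sketch keeps int(8 * 0.25) = 2 tokens and marks the other six as masked.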
- class mmselfsup.models.backbones.MILANViT(arch: Union[str, dict] = 'b', img_size: int = 224, patch_size: int = 16, out_indices: Union[Sequence, int] = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', patch_cfg: dict = {}, layer_cfgs: dict = {}, mask_ratio: float = 0.75, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
MILANViT.
Implementation of the encoder for MILAN: Masked Image Pretraining on Language Assisted Representation. This module inherits from MAEViT and only overrides the forward function, replacing random masking with attention masking.
- attention_masking(x: torch.Tensor, mask_ratio: float, importance: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶
Generate attention mask for MILAN.
This is what is different from MAEViT, which uses random masking. Attention masking generates attention mask for MILAN, according to importance. The higher the importance, the more likely the patch is kept.
- Parameters
x (torch.Tensor) – Input images, which is of shape B x L x C.
mask_ratio (float) – The ratio of patches to be masked.
importance (torch.Tensor) – Importance of each patch, which is of shape B x L.
- Returns
masked image, mask, the ids to restore original image, ids of the shuffled patches, ids of the kept patches, ids of the removed patches.
- Return type
Tuple[torch.Tensor, …]
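A small sketch of importance-driven masking for one sample is given below. It draws the kept patches by weighted sampling without replacement, using Efraimidis-Spirakis random keys as a stand-in technique. The source does not state which sampling routine the library uses, so the sampling method, the seed argument and the reduced return value are all assumptions of this sketch.

```python
import random
from typing import List, Tuple


def attention_masking(importance: List[float], mask_ratio: float,
                      seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Return (mask, ids_keep, ids_dump) for one sample.

    ``importance`` must contain strictly positive weights.
    """
    rng = random.Random(seed)
    num_patches = len(importance)
    len_keep = int(num_patches * (1 - mask_ratio))
    # Efraimidis-Spirakis keys: larger weights tend to give larger keys,
    # so important patches are more likely to be sorted to the front.
    keys = [rng.random() ** (1.0 / w) for w in importance]
    ids_shuffle = sorted(range(num_patches), key=lambda i: keys[i],
                         reverse=True)
    ids_keep = ids_shuffle[:len_keep]
    ids_dump = ids_shuffle[len_keep:]
    kept = set(ids_keep)
    mask = [0 if i in kept else 1 for i in range(num_patches)]  # 1 = masked
    return mask, ids_keep, ids_dump
```

Unlike uniform random masking, the chance that a patch survives here grows with its importance weight, which matches the behaviour described above.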
- forward(x: torch.Tensor, importance: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
Generate features for masked images.
This function generates a mask, masks some patches randomly and gets the hidden features for visible patches. The mask is generated by importance. The higher the importance, the more likely the patch is kept. The importance is calculated by CLIP. The higher the CLIP score, the more likely the patch is kept. The CLIP score is calculated by cross attention between the class token and all other tokens from the last layer.
- Parameters
x (torch.Tensor) – Input images, which is of shape B x C x H x W.
importance (torch.Tensor) – Importance of each patch, which is of shape B x L.
- Returns
masked image, the ids to restore original image, ids of the kept patches, ids of the removed patches.
x (torch.Tensor): hidden features, which is of shape B x (L * mask_ratio) x C.
ids_restore (torch.Tensor): ids to restore original image.
ids_keep (torch.Tensor): ids of the kept patches.
ids_dump (torch.Tensor): ids of the removed patches.
- Return type
Tuple[torch.Tensor, …]
- class mmselfsup.models.backbones.MaskFeatViT(arch: Union[str, dict] = 'b', img_size: int = 224, patch_size: int = 16, out_indices: Union[Sequence, int] = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', patch_cfg: dict = {}, layer_cfgs: dict = {}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Vision Transformer for MaskFeat pre-training.
A PyTorch implementation of: Masked Feature Prediction for Self-Supervised Visual Pre-Training.
- Parameters
arch (str | dict) – Vision Transformer architecture. Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set to True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.backbones.MixMIMTransformerPretrain(arch: Union[str, dict] = 'base', mlp_ratio: float = 4, img_size: int = 224, patch_size: int = 4, in_channels: int = 3, window_size: List = [14, 14, 14, 7], qkv_bias: bool = True, patch_cfg: dict = {}, norm_cfg: dict = {'type': 'LN'}, drop_rate: float = 0.0, drop_path_rate: float = 0.0, attn_drop_rate: float = 0.0, use_checkpoint: bool = False, range_mask_ratio: float = 0.0, init_cfg: Optional[dict] = None)[source]¶
MixMIM backbone during pretraining.
A PyTorch implementation of: MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning (https://arxiv.org/abs/2205.13137).
- Parameters
arch (str | dict) – MixMIM architecture. If a string, choose from ‘base’, ‘large’ and ‘huge’. If a dict, it should have the following keys:
embed_dims (int): The dimensions of embedding.
depths (int): The number of transformer encoder layers.
num_heads (int): The number of heads in attention modules.
Defaults to ‘base’.
mlp_ratio (int) – The mlp ratio in FFN. Defaults to 4.
img_size (int | tuple) – The expected input image shape. Because we support dynamic input shape, just set the argument to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 4.
in_channels (int) – The num of input channels. Defaults to 3.
window_size (list) – The height and width of the window.
qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.
patch_cfg (dict) – Extra config dict for patch embedding. Defaults to an empty dict.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
attn_drop_rate (float) – Attention drop rate. Defaults to 0.
use_checkpoint (bool) – Whether to use checkpointing to reduce GPU memory cost. Defaults to False.
range_mask_ratio (float) – The range of mask ratio. Defaults to 0.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(x: torch.Tensor, mask_ratio=0.5)[source]¶
Generate features for masked images.
This function generates a mask, masks some patches randomly and gets the hidden features for visible patches.
- Parameters
x (torch.Tensor) – Input images, which is of shape B x C x H x W.
- Returns
x (torch.Tensor): hidden features, which is of shape B x L x C.
mask_s4 (torch.Tensor): the mask tensor for the last layer.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- random_masking(x: torch.Tensor, mask_ratio: float = 0.5)[source]¶
Generate the mask for MixMIM Pretraining.
- Parameters
x (torch.Tensor) – Image with data augmentation applied, which is of shape B x L x C.
mask_ratio (float) – The mask ratio of total patches. Defaults to 0.5.
- Returns
mask_s1 (torch.Tensor): mask with stride of self.encoder_stride // 8.
mask_s2 (torch.Tensor): mask with stride of self.encoder_stride // 4.
mask_s3 (torch.Tensor): mask with stride of self.encoder_stride // 2.
mask (torch.Tensor): mask with stride of self.encoder_stride.
- Return type
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
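The four returned masks describe the same masking pattern at different resolutions. A minimal sketch of that relationship is shown below, assuming the finer masks are nearest-neighbour upsamplings of the coarsest one. This is an assumption made for illustration, and upsample_nearest is a hypothetical helper, not a library function.

```python
from typing import List

Mask = List[List[int]]


def upsample_nearest(mask: Mask, factor: int) -> Mask:
    """Repeat every mask entry ``factor`` times along both axes."""
    out = []
    for row in mask:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


# Coarsest mask (1 = masked), defined on a toy 2 x 2 grid.
coarse = [[1, 0],
          [0, 1]]
mask_s3 = upsample_nearest(coarse, 2)   # 4 x 4
mask_s2 = upsample_nearest(coarse, 4)   # 8 x 8
mask_s1 = upsample_nearest(coarse, 8)   # 16 x 16
```

Because each coarse entry is simply repeated, the fraction of masked positions is the same at every scale.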
- class mmselfsup.models.backbones.MoCoV3ViT(stop_grad_conv1: bool = False, frozen_stages: int = - 1, norm_eval: bool = False, init_cfg: Optional[Union[dict, List[dict]]] = None, **kwargs)[source]¶
Vision Transformer.
A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Part of the code is modified from: https://github.com/facebookresearch/moco-v3/blob/main/vits.py.
- Parameters
stop_grad_conv1 (bool) – whether to stop the gradient of convolution layer in PatchEmbed. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.backbones.ResNeXt(depth: int, groups: int = 32, width_per_group: int = 4, **kwargs)[source]¶
ResNeXt backbone.
Please refer to the paper for details.
As the behavior of the forward function in MMSelfSup is different from MMCls, we register our own ResNeXt, inheriting from mmselfsup.models.backbones.ResNet.
- Parameters
depth (int) – Network depth, from {50, 101, 152}.
groups (int) – Groups of conv2 in Bottleneck. Defaults to 32.
width_per_group (int) – Width per group of conv2 in Bottleneck. Defaults to 4.
in_channels (int) – Number of input image channels. Defaults to 3.
stem_channels (int) – Output channels of the stem layer. Defaults to 64.
num_stages (int) – Stages of the network. Defaults to 4.
strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned; if multiple stages are specified, a tuple of tensors is returned. Defaults to (3, ).
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
deep_stem (bool) – Replace the 7x7 conv in the input stem with three 3x3 convs. Defaults to False.
avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
conv_cfg (dict | None) – The config dict for conv layers. Defaults to None.
norm_cfg (dict) – The config dict for norm layers.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Defaults to False.
Example
>>> from mmselfsup.models import ResNeXt
>>> import torch
>>> self = ResNeXt(depth=50)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 8, 8)
(1, 512, 4, 4)
(1, 1024, 2, 2)
(1, 2048, 1, 1)
- class mmselfsup.models.backbones.ResNet(depth: int, in_channels: int = 3, stem_channels: int = 64, base_channels: int = 64, expansion: Optional[int] = None, num_stages: int = 4, strides: Tuple[int] = (1, 2, 2, 2), dilations: Tuple[int] = (1, 1, 1, 1), out_indices: Tuple[int] = (4), style: str = 'pytorch', deep_stem: bool = False, avg_down: bool = False, frozen_stages: int = - 1, conv_cfg: Optional[dict] = None, norm_cfg: Optional[dict] = {'requires_grad': True, 'type': 'BN'}, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = False, init_cfg: Optional[dict] = [{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}], drop_path_rate: float = 0.0, **kwargs)[source]¶
ResNet backbone.
Please refer to the paper for details.
- Parameters
depth (int) – Network depth, from {18, 34, 50, 101, 152}.
in_channels (int) – Number of input image channels. Defaults to 3.
stem_channels (int) – Output channels of the stem layer. Defaults to 64.
base_channels (int) – Middle channels of the first stage. Defaults to 64.
num_stages (int) – Stages of the network. Defaults to 4.
strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
out_indices (Sequence[int]) – Output from which stages. Defaults to (4, ).
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
deep_stem (bool) – Replace the 7x7 conv in the input stem with three 3x3 convs. Defaults to False.
avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
conv_cfg (dict | None) – The config dict for conv layers. Defaults to None.
norm_cfg (dict) – The config dict for norm layers.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Defaults to False.
drop_path_rate (float) – Probability of the path to be zeroed. Defaults to 0.0.
Example
>>> from mmselfsup.models import ResNet
>>> import torch
>>> self = ResNet(depth=18)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 64, 8, 8)
(1, 128, 4, 4)
(1, 256, 2, 2)
(1, 512, 1, 1)
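The shapes printed in the example can be reproduced with simple arithmetic. The sketch below assumes a stem that downsamples by 4 (a stride-2 conv followed by a stride-2 max pool), the default strides (1, 2, 2, 2), an input size divisible by 32, and a block expansion of 1 for the basic blocks of depth 18 or 4 for bottleneck blocks. The function is illustrative only and does not call the library.

```python
from typing import List, Sequence, Tuple


def resnet_stage_shapes(input_size: int, expansion: int,
                        base_channels: int = 64,
                        strides: Sequence[int] = (1, 2, 2, 2),
                        stem_stride: int = 4) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) for the output of each stage."""
    # Assumed stem: stride-2 conv followed by a stride-2 max pool.
    size = input_size // stem_stride
    shapes = []
    for i, stride in enumerate(strides):
        size //= stride
        # Channels double at every stage and are scaled by the block expansion.
        channels = base_channels * (2 ** i) * expansion
        shapes.append((channels, size, size))
    return shapes


print(resnet_stage_shapes(32, expansion=1))
# [(64, 8, 8), (128, 4, 4), (256, 2, 2), (512, 1, 1)]
```

Calling the same function with expansion=4 gives the channel counts 256, 512, 1024 and 2048 that appear in the ResNeXt example.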
- class mmselfsup.models.backbones.ResNetSobel(**kwargs)[source]¶
ResNet with Sobel layer.
This variant is used in clustering-based methods like DeepCluster to avoid color shortcut.
- class mmselfsup.models.backbones.ResNetV1d(**kwargs)[source]¶
ResNetV1d variant described in Bag of Tricks.
Compared with default ResNet(ResNetV1b), ResNetV1d replaces the 7x7 conv in the input stem with three 3x3 convs. And in the downsampling block, a 2x2 avg_pool with stride 2 is added before conv, whose stride is changed to 1.
- class mmselfsup.models.backbones.SimMIMSwinTransformer(arch: Union[str, dict] = 'T', img_size: Union[Tuple[int, int], int] = 224, in_channels: int = 3, drop_rate: float = 0.0, drop_path_rate: float = 0.1, out_indices: tuple = (3), use_abs_pos_embed: bool = False, with_cp: bool = False, frozen_stages: bool = - 1, norm_eval: bool = False, norm_cfg: dict = {'type': 'LN'}, stage_cfgs: Union[Sequence, dict] = {}, patch_cfg: dict = {}, pad_small_map: bool = False, init_cfg: Optional[dict] = None)[source]¶
Swin Transformer for SimMIM.
- Parameters
arch (str | dict) – Swin Transformer architecture. Defaults to ‘T’.
img_size (int | tuple) – The size of input image. Defaults to 224.
in_channels (int) – The num of input channels. Defaults to 3.
drop_rate (float) – Dropout rate after embedding. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.1.
out_indices (tuple) – Layers to be outputted. Defaults to (3, ).
use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
norm_cfg (dict) – Config dict for normalization layer at end of backbone. Defaults to dict(type=’LN’).
stage_cfgs (Sequence | dict) – Extra config dict for each stage. Defaults to empty dict.
patch_cfg (dict) – Extra config dict for patch embedding. Defaults to empty dict.
pad_small_map (bool) – If True, pad the small feature map to the window size, which is commonly used in detection and segmentation. If False, avoid shifting window and shrink the window size to the size of feature map, which is commonly used in classification. Defaults to False.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.
- forward(x: torch.Tensor, mask: torch.Tensor) → Sequence[torch.Tensor][source]¶
Generate features for masked images.
This function generates masked images and gets the hidden features for them.
- Parameters
x (torch.Tensor) – Input images.
mask (torch.Tensor) – Masks used to construct masked images.
- Returns
A tuple containing features from multi-stages.
- Return type
tuple
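As a hedged illustration of how a mask can be used to construct masked inputs, the sketch below replaces the embeddings of masked patches with a mask token through the blend x * (1 - w) + m * w. It works on plain lists for one sample. The assumption that masking happens on patch embeddings with a mask token follows the SimMIM paper's description, and apply_mask_token is an illustrative name, not the library's code.

```python
from typing import List


def apply_mask_token(tokens: List[List[float]], mask: List[int],
                     mask_token: List[float]) -> List[List[float]]:
    """Blend each patch embedding with the mask token: x * (1 - w) + m * w."""
    out = []
    for token, w in zip(tokens, mask):
        # w = 1 swaps in the mask token, w = 0 leaves the embedding untouched.
        out.append([x * (1 - w) + m * w for x, m in zip(token, mask_token)])
    return out


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked = apply_mask_token(tokens, mask=[0, 1, 0], mask_token=[0.0, 0.0])
print(masked)  # [[1.0, 2.0], [0.0, 0.0], [5.0, 6.0]]
```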
necks¶
- class mmselfsup.models.necks.AvgPool2dNeck(output_size: int = 1)[source]¶
The average pooling 2d neck.
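With the default output_size of 1, average pooling reduces each channel to a single value. A minimal sketch of that reduction on nested lists (channels x height x width, one sample) follows. It is only meant to show the arithmetic and is not the neck's implementation.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # channels x height x width


def global_avg_pool(feature_map: FeatureMap) -> List[float]:
    """Average each channel over its spatial positions (output_size = 1)."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


print(global_avg_pool([[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]))
# [2.5, 2.0]
```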
- class mmselfsup.models.necks.BEiTV2Neck(num_layers: int = 2, early_layers: int = 9, backbone_arch: str = 'base', drop_rate: float = 0.0, drop_path_rate: float = 0.0, layer_scale_init_value: float = 0.1, use_rel_pos_bias: bool = False, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = {'bias': 0, 'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'})[source]¶
Neck for BEiTV2 Pre-training.
This module constructs the decoder for the final prediction.
- Parameters
num_layers (int) – Number of encoder layers of neck. Defaults to 2.
early_layers (int) – The layer index of the early output from the backbone. Defaults to 9.
backbone_arch (str) – Vision Transformer architecture. Defaults to base.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
layer_scale_init_value (float) – The initialization value for the learnable scaling of attention and FFN. Defaults to 0.1.
use_rel_pos_bias (bool) – Whether to use unique relative position bias. If False, use the shared relative position bias defined in the backbone. Defaults to False.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(inputs: Tuple[torch.Tensor], rel_pos_bias: torch.Tensor, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶
Get the latent prediction and final prediction.
- Parameters
inputs (Tuple[torch.Tensor]) – Features of tokens.
rel_pos_bias (torch.Tensor) – Shared relative position bias table.
- Returns
x: The final layer features from backbone, which are normed in BEiTV2Neck.
x_cls_pt: The early state features from backbone, which consist of the final layer cls_token and early state patch_tokens from backbone and are sent to PatchAggregation layers in the neck.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.necks.CAENeck(patch_size: int = 16, num_classes: int = 8192, embed_dims: int = 768, regressor_depth: int = 6, decoder_depth: int = 8, num_heads: int = 12, mlp_ratio: int = 4, qkv_bias: bool = True, qk_scale: Optional[float] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_values: Optional[float] = None, mask_tokens_num: int = 75, init_cfg: Optional[dict] = None)[source]¶
Neck for CAE Pre-training.
This module constructs the latent prediction regressor and the decoder for the latent prediction and final prediction.
- Parameters
patch_size (int) – The patch size of each token. Defaults to 16.
num_classes (int) – The number of classes for final prediction. Defaults to 8192.
embed_dims (int) – The embed dims of latent feature in regressor and decoder. Defaults to 768.
regressor_depth (int) – The number of regressor blocks. Defaults to 6.
decoder_depth (int) – The number of decoder blocks. Defaults to 8.
num_heads (int) – The number of head in multi-head attention. Defaults to 12.
mlp_ratio (int) – The expand ratio of latent features in MLP. Defaults to 4.
qkv_bias (bool) – Whether or not to use qkv bias. Defaults to True.
qk_scale (float, optional) – The scale applied to the results of qk. Defaults to None.
drop_rate (float) – The dropout rate. Defaults to 0.
attn_drop_rate (float) – The dropout rate in attention block. Defaults to 0.
norm_cfg (dict) – The config of normalization layer. Defaults to dict(type=’LN’, eps=1e-6).
init_values (float, optional) – The init value of gamma. Defaults to None.
mask_tokens_num (int) – The number of mask tokens. Defaults to 75.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- forward(x_unmasked: torch.Tensor, pos_embed_masked: torch.Tensor, pos_embed_unmasked: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Get the latent prediction and final prediction.
- Parameters
x_unmasked (torch.Tensor) – Features of unmasked tokens.
pos_embed_masked (torch.Tensor) – Position embedding of masked tokens.
pos_embed_unmasked (torch.Tensor) – Position embedding of unmasked tokens.
- Returns
Final prediction and latent prediction.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.necks.ClsBatchNormNeck(input_features: int, affine: bool = False, eps: float = 1e-06, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Normalize cls token across batch before head.
This module was proposed in MAE and is used when running linear probing.
- Parameters
input_features (int) – The dimension of features.
affine (bool) – A boolean value that, when set to True, gives this module learnable affine parameters. Defaults to False.
eps (float) – A value added to the denominator for numerical stability. Defaults to 1e-6.
init_cfg (Dict or List[Dict], optional) – Config dict for weight initialization. Defaults to None.
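As a rough illustration of what "normalize cls token across batch" means with affine=False, the sketch below standardizes each feature dimension over the batch in plain Python. The helper name `batch_norm_cls` is hypothetical, and the biased variance is an assumption borrowed from standard batch normalization; it is not taken from the mmselfsup source.

```python
import math


def batch_norm_cls(cls_tokens, eps=1e-6):
    """Standardize each feature dimension across the batch (no affine).

    cls_tokens: list of B feature vectors, each of length `input_features`.
    Hypothetical helper for illustration only.
    """
    batch = len(cls_tokens)
    dims = len(cls_tokens[0])
    out = [[0.0] * dims for _ in range(batch)]
    for d in range(dims):
        column = [row[d] for row in cls_tokens]
        mean = sum(column) / batch
        # Biased variance, as used by batch normalization (assumption).
        var = sum((v - mean) ** 2 for v in column) / batch
        for b in range(batch):
            out[b][d] = (cls_tokens[b][d] - mean) / math.sqrt(var + eps)
    return out


cls_tokens = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normed = batch_norm_cls(cls_tokens)
```

After this step both feature dimensions have zero mean over the batch, even though their original scales differ by a factor of ten.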
- class mmselfsup.models.necks.DenseCLNeck(in_channels: int, hid_channels: int, out_channels: int, num_grid: Optional[int] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
The non-linear neck of DenseCL.
Single and dense neck in parallel: fc-relu-fc, conv-relu-conv. Borrowed from the authors’ code.
- Parameters
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
num_grid (int) – The grid size of dense features. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- forward(x: List[torch.Tensor]) → List[torch.Tensor][source]¶
Forward function of neck.
- Parameters
x (List[torch.Tensor]) – feature map of backbone.
- Returns
The global feature vectors and dense feature vectors.
- avgpooled_x: Global feature vectors.
- x: Dense feature vectors.
- avgpooled_x2: Dense feature vectors for queue.
- Return type
List[torch.Tensor, torch.Tensor, torch.Tensor]
- class mmselfsup.models.necks.LinearNeck(in_channels: int, out_channels: int, with_avg_pool: bool = True, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
The linear neck: fc only.
- Parameters
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.necks.MAEPretrainDecoder(num_patches: int = 196, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 1024, decoder_embed_dim: int = 512, decoder_depth: int = 8, decoder_num_heads: int = 16, mlp_ratio: int = 4, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, predict_feature_dim: Optional[float] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Decoder for MAE Pre-training.
Some of the code is borrowed from https://github.com/facebookresearch/mae.
- Parameters
num_patches (int) – The number of total patches. Defaults to 196.
patch_size (int) – Image patch size. Defaults to 16.
in_chans (int) – The channel of input image. Defaults to 3.
embed_dim (int) – Encoder’s embedding dimension. Defaults to 1024.
decoder_embed_dim (int) – Decoder’s embedding dimension. Defaults to 512.
decoder_depth (int) – The depth of decoder. Defaults to 8.
decoder_num_heads (int) – Number of attention heads of decoder. Defaults to 16.
mlp_ratio (int) – Ratio of mlp hidden dim to decoder’s embedding dim. Defaults to 4.
norm_cfg (dict) – Normalization layer. Defaults to LayerNorm.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.
Example
>>> from mmselfsup.models import MAEPretrainDecoder
>>> import torch
>>> self = MAEPretrainDecoder()
>>> self.eval()
>>> inputs = torch.rand(1, 50, 1024)
>>> ids_restore = torch.arange(0, 196).unsqueeze(0)
>>> level_outputs = self.forward(inputs, ids_restore)
>>> print(tuple(level_outputs.shape))
(1, 196, 768)
- property decoder_norm¶
The normalization layer of decoder.
- forward(x: torch.Tensor, ids_restore: torch.Tensor) → torch.Tensor[source]¶
The forward function.
The process combines the visible patches' feature vectors with the mask tokens to produce output feature vectors, which will be used for reconstruction.
- Parameters
x (torch.Tensor) – hidden features, which is of shape B x (L * mask_ratio) x C.
ids_restore (torch.Tensor) – ids to restore original image.
- Returns
The reconstructed feature vectors, which is of shape B x (num_patches) x C.
- Return type
torch.Tensor
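The tensor shapes in the example for this class follow from simple arithmetic, sketched below. The helper `mae_decoder_shapes` is hypothetical; the 224-pixel input, the 75% mask ratio, and the extra cls token are assumptions chosen to match the shapes (1, 50, 1024) and (1, 196, 768) shown in the example.

```python
def mae_decoder_shapes(img_size=224, patch_size=16, in_chans=3, mask_ratio=0.75):
    """Return (num_patches, encoder_tokens, pred_dim) for an MAE-style setup.

    Hypothetical helper; img_size and mask_ratio are illustrative assumptions.
    """
    num_patches = (img_size // patch_size) ** 2       # tokens per image
    num_visible = int(num_patches * (1 - mask_ratio))  # tokens kept by the encoder
    encoder_tokens = num_visible + 1                   # plus one cls token (assumption)
    pred_dim = patch_size ** 2 * in_chans              # pixels predicted per patch
    return num_patches, encoder_tokens, pred_dim


num_patches, encoder_tokens, pred_dim = mae_decoder_shapes()
print(num_patches, encoder_tokens, pred_dim)
```

Under these assumptions the decoder receives 50 tokens per image and predicts 768 values for each of the 196 patches.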
- class mmselfsup.models.necks.MILANPretrainDecoder(num_patches: int = 196, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 1024, decoder_embed_dim: int = 512, decoder_depth: int = 8, decoder_num_heads: int = 16, predict_feature_dim: int = 512, mlp_ratio: int = 4, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Prompt decoder for MILAN.
This decoder is used in MILAN pretraining. It does not update the visible tokens from the encoder.
- Parameters
num_patches (int) – The number of total patches. Defaults to 196.
patch_size (int) – Image patch size. Defaults to 16.
in_chans (int) – The channel of input image. Defaults to 3.
embed_dim (int) – Encoder’s embedding dimension. Defaults to 1024.
decoder_embed_dim (int) – Decoder’s embedding dimension. Defaults to 512.
decoder_depth (int) – The depth of decoder. Defaults to 8.
decoder_num_heads (int) – Number of attention heads of decoder. Defaults to 16.
predict_feature_dim (int) – The dimension of the feature to be predicted. Defaults to 512.
mlp_ratio (int) – Ratio of mlp hidden dim to decoder’s embedding dim. Defaults to 4.
norm_cfg (dict) – Normalization layer. Defaults to LayerNorm.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.
- forward(x: torch.Tensor, ids_restore: torch.Tensor, ids_keep: torch.Tensor, ids_dump: torch.Tensor) → torch.Tensor[source]¶
Forward function.
- Parameters
x (torch.Tensor) – The input features, which is of shape (N, L, C).
ids_restore (torch.Tensor) – The indices to restore these tokens to the original image.
ids_keep (torch.Tensor) – The indices of tokens to be kept.
ids_dump (torch.Tensor) – The indices of tokens to be masked.
- Returns
The reconstructed features, which is of shape (N, L, C).
- Return type
torch.Tensor
- class mmselfsup.models.necks.MixMIMPretrainDecoder(num_patches: int = 196, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 1024, encoder_stride: int = 32, decoder_embed_dim: int = 512, decoder_depth: int = 8, decoder_num_heads: int = 16, mlp_ratio: int = 4, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Decoder for MixMIM Pretraining.
Some of the code is borrowed from https://github.com/Sense-X/MixMIM.
- Parameters
num_patches (int) – The number of total patches. Defaults to 196.
patch_size (int) – Image patch size. Defaults to 16.
in_chans (int) – The channel of input image. Defaults to 3.
embed_dim (int) – Encoder’s embedding dimension. Defaults to 1024.
encoder_stride (int) – The output stride of MixMIM backbone. Defaults to 32.
decoder_embed_dim (int) – Decoder’s embedding dimension. Defaults to 512.
decoder_depth (int) – The depth of decoder. Defaults to 8.
decoder_num_heads (int) – Number of attention heads of decoder. Defaults to 16.
mlp_ratio (int) – Ratio of mlp hidden dim to decoder’s embedding dim. Defaults to 4.
norm_cfg (dict) – Normalization layer. Defaults to LayerNorm.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.
- forward(x: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Forward function.
- Parameters
x (torch.Tensor) – The input features, which is of shape (N, L, C).
mask (torch.Tensor) – The tensor to indicate which tokens are masked.
- Returns
The reconstructed features, which is of shape (N, L, C).
- Return type
torch.Tensor
- class mmselfsup.models.necks.MoCoV2Neck(in_channels: int, hid_channels: int, out_channels: int, with_avg_pool: bool = True, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
The non-linear neck of MoCo v2: fc-relu-fc.
- Parameters
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.necks.NonLinearNeck(in_channels: int, hid_channels: int, out_channels: int, num_layers: int = 2, with_bias: bool = False, with_last_bn: bool = True, with_last_bn_affine: bool = True, with_last_bias: bool = False, with_avg_pool: bool = True, vit_backbone: bool = False, norm_cfg: dict = {'type': 'SyncBN'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶
The non-linear neck.
Structure: fc-bn-[relu-fc-bn] where the substructure in [] can be repeated. For the default setting, it is repeated once. The neck can be used in many algorithms, e.g., SimCLR, BYOL, SimSiam.
- Parameters
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
num_layers (int) – Number of fc layers. Defaults to 2.
with_bias (bool) – Whether to use bias in fc layers (except for the last). Defaults to False.
with_last_bn (bool) – Whether to add the last BN layer. Defaults to True.
with_last_bn_affine (bool) – Whether to have learnable affine parameters in the last BN layer (set False for SimSiam). Defaults to True.
with_last_bias (bool) – Whether to use bias in the last fc layer. Defaults to False.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
vit_backbone (bool) – The key to indicate whether the upstream backbone is ViT. Defaults to False.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='SyncBN').
init_cfg (dict or list[dict], optional) – Initialization config dict.
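To make the fc-bn-[relu-fc-bn] pattern concrete, the following sketch lists the layer order for a given num_layers. The function `non_linear_neck_layout` is a hypothetical helper, and the reading that with_last_bn only controls the BN after the final fc layer is an assumption based on the parameter descriptions above.

```python
def non_linear_neck_layout(num_layers=2, with_last_bn=True):
    """Describe the layer order of an fc-bn-[relu-fc-bn] neck as a string.

    Hypothetical helper for illustration; it builds no real layers.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    layers = ["fc", "bn"]
    for i in range(1, num_layers):
        layers += ["relu", "fc"]
        is_last = i == num_layers - 1
        # Assumption: with_last_bn only affects the BN after the final fc.
        if not is_last or with_last_bn:
            layers.append("bn")
    return "-".join(layers)


default_layout = non_linear_neck_layout()
no_last_bn_layout = non_linear_neck_layout(num_layers=2, with_last_bn=False)
three_layer_layout = non_linear_neck_layout(num_layers=3)
```

With the default num_layers=2 the bracketed substructure appears once, which matches the description of the default setting.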
- class mmselfsup.models.necks.ODCNeck(in_channels: int, hid_channels: int, out_channels: int, with_avg_pool: bool = True, norm_cfg: dict = {'type': 'SyncBN'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶
The non-linear neck of ODC: fc-bn-relu-dropout-fc-relu.
- Parameters
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='SyncBN').
init_cfg (dict or list[dict], optional) – Initialization config dict.
- class mmselfsup.models.necks.RelativeLocNeck(in_channels: int, out_channels: int, with_avg_pool: bool = True, norm_cfg: dict = {'type': 'BN1d'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶
The neck of relative patch location: fc-bn-relu-dropout.
- Parameters
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='BN1d').
init_cfg (dict or list[dict], optional) – Initialization config dict.
- class mmselfsup.models.necks.SimMIMNeck(in_channels: int, encoder_stride: int)[source]¶
Pre-train Neck For SimMIM.
This neck reconstructs the original image from the shrunk feature map.
- Parameters
in_channels (int) – Channel dimension of the feature map.
encoder_stride (int) – The total stride of the encoder.
- class mmselfsup.models.necks.SwAVNeck(in_channels: int, hid_channels: int, out_channels: int, with_avg_pool: bool = True, with_l2norm: bool = True, norm_cfg: dict = {'type': 'SyncBN'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶
The non-linear neck of SwAV: fc-bn-relu-fc-normalization.
- Parameters
in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
with_l2norm (bool) – Whether to normalize the output after projection. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='SyncBN').
init_cfg (dict or list[dict], optional) – Initialization config dict.
heads¶
- class mmselfsup.models.heads.BEiTV1Head(embed_dims: int, num_embed: int, loss: dict, init_cfg: Optional[Union[dict, List[dict]]] = {'bias': 0, 'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'})[source]¶
Pretrain Head for BEiT v1.
Compute the logits and the cross entropy loss.
- Parameters
embed_dims (int) – The dimension of embedding.
num_embed (int) – The number of classification types.
loss (dict) – The config of loss.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to dict(type='TruncNormal', layer='Linear', std=0.02, bias=0).
- class mmselfsup.models.heads.BEiTV2Head(embed_dims: int, num_embed: int, loss: dict, init_cfg: Optional[Union[dict, List[dict]]] = {'bias': 0, 'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'})[source]¶
Pretrain Head for BEiT v2.
Compute the logits and the cross entropy loss.
- Parameters
embed_dims (int) – The dimension of embedding.
num_embed (int) – The number of classification types.
loss (dict) – The config of loss.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to dict(type='TruncNormal', layer='Linear', std=0.02, bias=0).
- forward(feats: torch.Tensor, feats_cls_pt: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Generate loss.
- Parameters
feats (torch.Tensor) – Features from backbone.
feats_cls_pt (torch.Tensor) – Features from class late layers for pretraining.
target (torch.Tensor) – Target generated by target_generator.
mask (torch.Tensor) – Generated mask for pretraining.
- class mmselfsup.models.heads.CAEHead(loss: dict, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Pretrain Head for CAE.
Compute the align loss and the main loss. In addition, this head also produces the prediction target generated by dalle.
- Parameters
loss (dict) – The config of loss.
tokenizer_path (str) – The path of the tokenizer.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.
- forward(logits: torch.Tensor, logits_target: torch.Tensor, latent_pred: torch.Tensor, latent_target: torch.Tensor, mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Generate loss.
- Parameters
logits (torch.Tensor) – Logits generated by decoder.
logits_target (torch.Tensor) – Target generated by dalle for decoder prediction.
latent_pred (torch.Tensor) – Latent prediction by regressor.
latent_target (torch.Tensor) – Target for latent prediction, generated by teacher.
- Returns
- The tuple of loss.
loss_main (torch.Tensor): Cross entropy loss.
loss_align (torch.Tensor): MSE loss.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.heads.ClsHead(loss: dict, with_avg_pool: bool = False, in_channels: int = 2048, num_classes: int = 1000, vit_backbone: bool = False, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶
Simplest classifier head, with only one fc layer.
- Parameters
loss (dict) – Config of the loss.
with_avg_pool (bool) – Whether to apply the average pooling after neck. Defaults to False.
in_channels (int) – Number of input channels. Defaults to 2048.
num_classes (int) – Number of classes. Defaults to 1000.
init_cfg (Dict or List[Dict], optional) – Initialization config dict.
- forward(x: Union[List[torch.Tensor], Tuple[torch.Tensor]], label: torch.Tensor) → torch.Tensor[source]¶
Get the loss.
- Parameters
x (List[Tensor] | Tuple[Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
label (torch.Tensor) – The label for cross entropy loss.
- Returns
The cross entropy loss.
- Return type
torch.Tensor
- logits(x: Union[List[torch.Tensor], Tuple[torch.Tensor]]) → List[torch.Tensor][source]¶
Get the logits before the cross_entropy loss.
This module is used to obtain the logits before the loss.
- Parameters
x (List[Tensor] | Tuple[Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
- Returns
A list of class scores.
- Return type
List[Tensor]
- class mmselfsup.models.heads.ContrastiveHead(loss: dict, temperature: float = 0.1)[source]¶
Head for contrastive learning.
The contrastive loss is implemented in this head and is used in SimCLR, MoCo, DenseCL, etc.
- Parameters
loss (dict) – Config dict for module of loss functions.
temperature (float) – The temperature hyper-parameter that controls the concentration level of the distribution. Defaults to 0.1.
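The role of the temperature can be sketched with a plain-Python InfoNCE computation for one sample. This is a minimal sketch, assuming the head scales positive and negative similarities by the temperature and applies cross entropy with the positive at index 0; the function name `info_nce` is hypothetical and the real head delegates the loss to the configured loss module.

```python
import math


def info_nce(pos, negs, temperature=0.1):
    """Cross entropy over [pos, *negs] / temperature with the positive as target.

    pos: similarity to the positive key; negs: similarities to negative keys.
    Hypothetical helper for illustration only.
    """
    logits = [pos / temperature] + [n / temperature for n in negs]
    peak = max(logits)
    # Numerically stable log-sum-exp.
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


loss_easy = info_nce(0.9, [0.1, -0.2, 0.0])   # positive clearly stands out
loss_hard = info_nce(0.2, [0.1, 0.3, 0.0])    # positive is hard to separate
loss_tie = info_nce(0.5, [0.5])               # two equal logits
```

A lower temperature sharpens the softmax, so small similarity gaps between the positive and the negatives have a larger effect on the loss.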
- class mmselfsup.models.heads.LatentCrossCorrelationHead(in_channels: int, loss: dict)[source]¶
Head for latent feature cross correlation.
Part of the code is borrowed from script.
- Parameters
in_channels (int) – Number of input channels.
loss (dict) – Config dict for module of loss functions.
- class mmselfsup.models.heads.LatentPredictHead(loss: dict, predictor: dict)[source]¶
Head for latent feature prediction.
This head builds a predictor, which can be any registered neck component. For example, BYOL and SimSiam call this head and build NonLinearNeck. It also implements similarity loss between two forward features.
- Parameters
loss (dict) – Config dict for the loss.
predictor (dict) – Config dict for the predictor.
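A config fragment along these lines shows how the predictor can be expressed as a nested neck config. It is a sketch only: the keyword names come from the NonLinearNeck and CosineSimilarityLoss signatures documented on this page, while the channel sizes and flag values are illustrative assumptions rather than a recommended recipe.

```python
# Hypothetical head config for a BYOL-style setup; values are illustrative.
head = dict(
    type='LatentPredictHead',
    loss=dict(type='CosineSimilarityLoss'),
    predictor=dict(
        type='NonLinearNeck',
        in_channels=256,
        hid_channels=4096,
        out_channels=256,
        num_layers=2,
        with_bias=True,
        with_last_bn=False,
        with_avg_pool=False,  # input is already a pooled vector (assumption)
    ),
)
```

Whether the type strings resolve depends on the registries available in the installed version, so treat the fragment as a starting point to adapt.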
- class mmselfsup.models.heads.MAEPretrainHead(loss: dict, norm_pix: bool = False, patch_size: int = 16)[source]¶
Pre-training head for MAE.
- Parameters
loss (dict) – Config of loss.
norm_pix (bool) – Whether or not to normalize the target. Defaults to False.
patch_size (int) – Patch size. Defaults to 16.
- construct_target(target: torch.Tensor) → torch.Tensor[source]¶
Construct the reconstruction target.
In addition to splitting images into tokens, this module will also normalize the image according to norm_pix.
- Parameters
target (torch.Tensor) – Image with the shape of B x 3 x H x W
- Returns
Tokenized images with the shape of B x L x C
- Return type
torch.Tensor
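The two steps described for construct_target, tokenizing and optional per-patch normalization, can be sketched in plain Python for a single image. The helpers `patchify` and `normalize_token` are hypothetical; the value ordering inside a token and the use of the unbiased variance with eps=1e-6 are assumptions modeled on the reference MAE code, not confirmed from this page.

```python
import math


def patchify(img, patch_size):
    """Split a C x H x W image (nested lists) into L tokens of p*p*C values."""
    channels, height, width = len(img), len(img[0]), len(img[0][0])
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Assumed ordering inside a token: row, column, then channel.
            token = [img[c][top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)
                     for c in range(channels)]
            tokens.append(token)
    return tokens


def normalize_token(token, eps=1e-6):
    """Normalize one token by its own mean and (unbiased) variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / (len(token) - 1)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


# One-channel 4 x 4 image split into 2 x 2 patches gives L = 4 tokens.
img = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
tokens = patchify(img, patch_size=2)
normalized = [normalize_token(t) for t in tokens]
```

Normalizing each patch separately removes local brightness and contrast, so the reconstruction target emphasizes structure within the patch.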
- forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Forward function of MAE head.
- Parameters
pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.
- Returns
The reconstruction loss.
- Return type
torch.Tensor
- class mmselfsup.models.heads.MILANPretrainHead(loss: dict)[source]¶
MILAN pretrain head.
- Parameters
loss (dict) – Config of loss.
- forward(pred: torch.Tensor, target: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – Predicted features, of shape (N, L, D).
target (torch.Tensor) – Target features, of shape (N, L, D).
mask (torch.Tensor, optional) – The mask of the target image. Defaults to None.
- Returns
The reconstruction loss.
- Return type
torch.Tensor
- class mmselfsup.models.heads.MaskFeatPretrainHead(loss: dict)[source]¶
Pre-training head for MaskFeat.
It computes reconstruction loss between prediction and target in masked region.
- Parameters
loss (dict) – Config dict for module of loss functions.
- forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Forward head.
- Parameters
pred (torch.Tensor) – Predictions, which is of shape B x (1 + L) x C.
target (torch.Tensor) – HOG features, which is of shape B x L x C.
mask (torch.Tensor) – The mask of the HOG features, which is of shape B x H x W.
- Returns
The loss tensor.
- Return type
torch.Tensor
- class mmselfsup.models.heads.MixMIMPretrainHead(loss: dict, norm_pix: bool = False, patch_size: int = 16)[source]¶
MixMIM pretrain head.
- Parameters
loss (dict) – Config of loss.
norm_pix (bool) – Whether or not to normalize the target. Defaults to False.
patch_size (int) – Patch size. Defaults to 16.
- forward(x_rec: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Forward function of MixMIM head.
- Parameters
x_rec (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.
- Returns
The reconstruction loss.
- Return type
torch.Tensor
- class mmselfsup.models.heads.MoCoV3Head(predictor: dict, loss: dict, temperature: float = 1.0)[source]¶
Head for MoCo v3 algorithms.
This head builds a predictor, which can be any registered neck component. It also implements latent contrastive loss between two forward features. Part of the code is modified from: https://github.com/facebookresearch/moco-v3/blob/main/moco/builder.py.
- Parameters
predictor (dict) – Config dict for module of predictor.
loss (dict) – Config dict for module of loss functions.
temperature (float) – The temperature hyper-parameter that controls the concentration level of the distribution. Defaults to 1.0.
- class mmselfsup.models.heads.MultiClsHead(backbone: str = 'resnet50', in_indices: Sequence[int] = (0, 1, 2, 3, 4), pool_type: str = 'adaptive', num_classes: int = 1000, loss: dict = {'loss_weight': 1.0, 'type': 'mmcls.CrossEntropyLoss'}, with_last_layer_unpool: bool = False, cal_acc: bool = False, topk: Union[int, Tuple[int]] = (1), norm_cfg: dict = {'type': 'BN'}, init_cfg: Union[dict, List[dict]] = [{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶
Multiple classifier heads.
This head inputs feature maps from different stages of backbone, average pools each feature map to around 9000 dimensions, and then appends a linear classifier at each stage to predict corresponding class scores.
- Parameters
backbone (str) – Specify which backbone to use. Only ResNet50 is supported. Defaults to 'resnet50'.
in_indices (Sequence[int]) – Input from which stages. Defaults to (0, 1, 2, 3, 4).
pool_type (str) – 'adaptive' or 'specified'. If set to 'adaptive', use adaptive average pooling, otherwise use specified pooling params. Defaults to 'adaptive'.
num_classes (int) – Number of classes. Defaults to 1000.
loss (dict) – The dict of loss information. Defaults to dict(type='mmcls.CrossEntropyLoss', loss_weight=1.0).
with_last_layer_unpool (bool) – Whether to unpool the features from last layer. Defaults to False.
cal_acc (bool) – Whether to calculate accuracy during training. If you use batch augmentations like Mixup and CutMix during training, it is pointless to calculate accuracy. Defaults to False.
topk (int | Tuple[int]) – Top-k accuracy. Defaults to (1, ).
norm_cfg (dict) – Dict to construct and config norm layer. Defaults to dict(type='BN').
init_cfg (dict or List[dict]) – Initialization config dict. Defaults to [dict(type='Normal', std=0.01, layer='Linear'), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])].
- forward(feats: Union[list, tuple]) → list[source]¶
Compute multi-head scores.
- Parameters
feats (Sequence[torch.Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
- Returns
A list of class scores.
- Return type
List[torch.Tensor]
- loss(feats: Sequence[torch.Tensor], data_samples: List[mmcls.structures.cls_data_sample.ClsDataSample], **kwargs) → dict[source]¶
Calculate losses from the extracted features.
- Parameters
feats (Sequence[torch.Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
data_samples (List[ClsDataSample]) – The annotation data of every sample, which carries the ground truth label.
- Returns
Dict of loss and accuracy.
- Return type
Dict[str, torch.Tensor]
- predict(feats: Sequence[torch.Tensor], data_samples: List[mmcls.structures.cls_data_sample.ClsDataSample]) → List[mmcls.structures.cls_data_sample.ClsDataSample][source]¶
Inference without augmentation.
- Parameters
feats (tuple[Tensor]) – The extracted features.
data_samples (List[BaseDataElement], optional) – The annotation data of every sample. If not None, set pred_label of the input data samples.
- Returns
The data samples containing annotation, prediction, etc.
- Return type
List[BaseDataElement]
- class mmselfsup.models.heads.SimMIMHead(patch_size: int, loss: dict)[source]¶
Pretrain Head for SimMIM.
- Parameters
patch_size (int) – Patch size of each token.
loss (dict) – The config for loss.
- forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Forward function of SimMIM head.
This method will expand mask to the size of the original image.
- Parameters
pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.
- Returns
The reconstruction loss.
- Return type
torch.Tensor
losses¶
- class mmselfsup.models.losses.BEiTLoss[source]¶
Loss function for BEiT.
BEiTLoss supports two different logits sharing one target, as in BEiT v2.
- forward(logits: Union[Tuple[torch.Tensor], torch.Tensor], target: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Forward function of BEiT Loss.
- Parameters
logits (torch.Tensor) – The outputs from the decoder.
target (torch.Tensor) – The targets generated by dalle.
- Returns
The main loss.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.losses.CAELoss(lambd: float)[source]¶
Loss function for CAE.
Compute the align loss and the main loss.
- Parameters
lambd (float) – The weight for the align loss.
- forward(logits: torch.Tensor, target: torch.Tensor, latent_pred: torch.Tensor, latent_target: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Forward function of CAE Loss.
- Parameters
logits (torch.Tensor) – The outputs from the decoder.
target (torch.Tensor) – The targets generated by dalle.
latent_pred (torch.Tensor) – The latent prediction from the regressor.
latent_target (torch.Tensor) – The latent target from the teacher network.
- Returns
The main loss and align loss.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.losses.CosineSimilarityLoss(shift_factor: float = 0.0, scale_factor: float = 1.0)[source]¶
Cosine similarity loss function.
Compute the similarity between two features and optimize that similarity as loss.
- Parameters
shift_factor (float) – The shift factor of cosine similarity. Default: 0.0.
scale_factor (float) – The scale factor of cosine similarity. Default: 1.0.
- forward(pred: torch.Tensor, target: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶
Forward function of cosine similarity loss.
- Parameters
pred (torch.Tensor) – The predicted features.
target (torch.Tensor) – The target features.
- Returns
The cosine similarity loss.
- Return type
torch.Tensor
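One plausible reading of shift_factor and scale_factor is loss = shift_factor - scale_factor * cos(pred, target), averaged over samples or over the masked positions. The sketch below implements that reading in plain Python; the function name and the exact formula are assumptions inferred from the parameter names, not a copy of the library code.

```python
import math


def cosine_similarity_loss(pred, target, shift_factor=0.0, scale_factor=1.0,
                           mask=None):
    """Mean of shift_factor - scale_factor * cosine similarity (assumed form).

    pred, target: lists of feature vectors; mask: optional list of 0/1 weights.
    """
    def l2_normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    losses = []
    for p, t in zip(pred, target):
        cos = sum(a * b for a, b in zip(l2_normalize(p), l2_normalize(t)))
        losses.append(shift_factor - scale_factor * cos)
    if mask is None:
        return sum(losses) / len(losses)
    # With a mask, average only over the selected positions.
    return sum(l * m for l, m in zip(losses, mask)) / sum(mask)


same = cosine_similarity_loss([[3.0, 4.0]], [[6.0, 8.0]], 2.0, 2.0)
orthogonal = cosine_similarity_loss([[1.0, 0.0]], [[0.0, 1.0]], 2.0, 2.0)
masked = cosine_similarity_loss([[1.0, 0.0], [1.0, 0.0]],
                                [[1.0, 0.0], [0.0, 1.0]], 2.0, 2.0, mask=[0, 1])
```

With shift_factor=2 and scale_factor=2 the loss takes the familiar 2 - 2 * cos form, which is zero for aligned features and 2 for orthogonal ones.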
- class mmselfsup.models.losses.CrossCorrelationLoss(lambd: float = 0.0051)[source]¶
Cross correlation loss function.
Compute the on-diagonal and off-diagonal loss.
- Parameters
lambd (float) – The weight for the off-diag loss.
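The on-diagonal and off-diagonal terms can be illustrated on a small cross-correlation matrix. This is a minimal sketch in the style of Barlow Twins, assuming the on-diagonal term pulls entries toward 1 and the off-diagonal term pushes entries toward 0; the function name is hypothetical and the matrix is passed in directly rather than computed from features.

```python
def cross_correlation_loss(corr, lambd=0.0051):
    """on_diag + lambd * off_diag for a square cross-correlation matrix.

    corr: nested list of shape D x D. Hypothetical helper for illustration.
    """
    size = len(corr)
    # Diagonal entries should be close to 1 (features agree across views).
    on_diag = sum((corr[i][i] - 1.0) ** 2 for i in range(size))
    # Off-diagonal entries should be close to 0 (features are decorrelated).
    off_diag = sum(corr[i][j] ** 2
                   for i in range(size) for j in range(size) if i != j)
    return on_diag + lambd * off_diag


perfect = cross_correlation_loss([[1.0, 0.0], [0.0, 1.0]])
redundant = cross_correlation_loss([[1.0, 0.5], [0.5, 1.0]])
```

The small default weight keeps the many off-diagonal entries from dominating the few diagonal ones.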
- class mmselfsup.models.losses.MAEReconstructionLoss[source]¶
Loss function for MAE.
Compute the loss in masked region.
- forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Forward function of MAE Loss.
- Parameters
pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.
- Returns
The reconstruction loss.
- Return type
torch.Tensor
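Computing the loss "in masked region" can be sketched as a mean squared error per token, weighted by the mask and divided by the number of masked tokens. The helper `masked_mse` is hypothetical and works on one sample in plain Python; the convention that 1 marks a masked token is an assumption.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked tokens only.

    pred, target: L tokens of C values each; mask: L entries, 1 = masked.
    Hypothetical helper for illustration only.
    """
    per_token = [
        sum((p - t) ** 2 for p, t in zip(pred_tok, target_tok)) / len(pred_tok)
        for pred_tok, target_tok in zip(pred, target)
    ]
    # Visible tokens (mask == 0) contribute nothing to the loss.
    return sum(l * m for l, m in zip(per_token, mask)) / sum(mask)


pred = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss_first_only = masked_mse(pred, target, mask=[1, 0, 0])
loss_first_and_last = masked_mse(pred, target, mask=[1, 0, 1])
```

Errors on visible tokens are ignored entirely, so the model is only rewarded for reconstructing what it could not see.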
- class mmselfsup.models.losses.PixelReconstructionLoss(criterion: str, channel: Optional[int] = None)[source]¶
Loss for the reconstruction of pixel in Masked Image Modeling.
This module measures the distance between the target image and the reconstructed image and computes the loss to optimize the model. Currently, this module only provides L1 and L2 losses to penalize the reconstruction error. In addition, a mask can be passed to the forward function to apply the loss only on the region selected by the mask, like that in MAE.
- Parameters
criterion (str) – The loss used to penalize the reconstruction error. Currently, only L1 and L2 losses are supported.
channel (int, optional) – The number of channels to average the reconstruction loss. If not None, the reconstruction loss will be divided by the channel. Defaults to None.
- forward(pred: torch.Tensor, target: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶
Forward function to compute the reconstruction loss.
- Parameters
pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.
- Returns
The reconstruction loss.
- Return type
torch.Tensor
- class mmselfsup.models.losses.SimMIMReconstructionLoss(encoder_in_channels: int)[source]¶
Loss function for SimMIM.
Compute the loss in masked region.
- Parameters
encoder_in_channels (int) – Number of input channels for encoder.
- forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Forward function of SimMIM Loss.
- Parameters
pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.
- Returns
The reconstruction loss.
- Return type
torch.Tensor
- class mmselfsup.models.losses.SwAVLoss(feat_dim: int, sinkhorn_iterations: int = 3, epsilon: float = 0.05, temperature: float = 0.1, crops_for_assign: List[int] = [0, 1], num_crops: List[int] = [2], num_prototypes: int = 3000, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
The Loss for SwAV.
This Loss contains clustering and sinkhorn algorithms to compute Q codes. Part of the code is borrowed from script. The queue is built in engine/hooks/swav_hook.py.
- Parameters
feat_dim (int) – feature dimension of the prototypes.
sinkhorn_iterations (int) – number of iterations in Sinkhorn-Knopp algorithm. Defaults to 3.
epsilon (float) – regularization parameter for Sinkhorn-Knopp algorithm. Defaults to 0.05.
temperature (float) – temperature parameter in training loss. Defaults to 0.1.
crops_for_assign (List[int]) – list of crops id used for computing assignments. Defaults to [0, 1].
num_crops (List[int]) – list of number of crops. Defaults to [2].
num_prototypes (int) – number of prototypes. Defaults to 3000.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.
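The roles of epsilon and sinkhorn_iterations can be seen in a small plain-Python version of the Sinkhorn-Knopp normalization. This is a sketch under assumptions: the function name is hypothetical, it runs on a single process without any distributed reduction, and it follows the commonly published form of the algorithm rather than the exact library code.

```python
import math


def sinkhorn(scores, epsilon=0.05, iterations=3):
    """Turn B x K prototype scores into B x K soft assignments (Q codes).

    Hypothetical single-process helper for illustration only.
    """
    num_samples = len(scores)
    num_protos = len(scores[0])
    # Work on the transposed K x B matrix, scaled so all entries sum to 1.
    q = [[math.exp(scores[b][k] / epsilon) for b in range(num_samples)]
         for k in range(num_protos)]
    total = sum(sum(row) for row in q)
    q = [[v / total for v in row] for row in q]
    for _ in range(iterations):
        # Each prototype (row) receives total mass 1 / K.
        for k in range(num_protos):
            row_sum = sum(q[k])
            q[k] = [v / row_sum / num_protos for v in q[k]]
        # Each sample (column) receives total mass 1 / B.
        for b in range(num_samples):
            col_sum = sum(q[k][b] for k in range(num_protos))
            for k in range(num_protos):
                q[k][b] = q[k][b] / col_sum / num_samples
    # Rescale so every sample's assignment sums to 1 and transpose back.
    return [[q[k][b] * num_samples for k in range(num_protos)]
            for b in range(num_samples)]


scores = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.1, 0.2, 0.7], [0.8, 0.3, 0.1]]
codes = sinkhorn(scores)
```

A smaller epsilon makes the assignments sharper, while more iterations bring the prototype usage closer to uniform across the batch.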
memories¶
- class mmselfsup.models.memories.ODCMemory(length: int, feat_dim: int, momentum: float, num_classes: int, min_cluster: int, **kwargs)[source]¶
Memory module for ODC.
This module includes the samples memory and the centroids memory in ODC. The samples memory stores features and pseudo-labels of all samples in the dataset; while the centroids memory stores features of cluster centroids.
- Parameters
length (int) – Number of features stored in the samples memory.
feat_dim (int) – Dimension of stored features.
momentum (float) – Momentum coefficient for updating features.
num_classes (int) – Number of clusters.
min_cluster (int) – Minimal cluster size.
- class mmselfsup.models.memories.SimpleMemory(length: int, feat_dim: int, momentum: float, **kwargs)[source]¶
Simple feature memory bank.
This module includes the memory bank that stores running average features of all samples in the dataset. It is used in algorithms like NPID.
- Parameters
length (int) – Number of features stored in the memory bank.
feat_dim (int) – Dimension of stored features.
momentum (float) – Momentum coefficient for updating features.
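The running-average behaviour described for the memory bank can be sketched in plain Python. This is only an illustration, not the mmselfsup implementation: `ToyMemoryBank`, `l2_normalize`, and the choice that `momentum` weights the incoming feature are assumptions made for this sketch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0 else [v / norm for v in vec]


class ToyMemoryBank:
    """Hypothetical stand-in for a feature memory bank; not the mmselfsup class."""

    def __init__(self, length, feat_dim, momentum):
        self.momentum = momentum
        # One stored feature per dataset sample, initialised to zeros here.
        self.bank = [[0.0] * feat_dim for _ in range(length)]

    def update(self, idx, feature):
        """Blend a new feature into slot `idx` and re-normalise the result."""
        old = self.bank[idx]
        new = l2_normalize(feature)
        # Assumed convention: `momentum` weights the incoming feature.
        mixed = [(1 - self.momentum) * o + self.momentum * n
                 for o, n in zip(old, new)]
        self.bank[idx] = l2_normalize(mixed)


memory = ToyMemoryBank(length=4, feat_dim=2, momentum=0.5)
memory.update(1, [3.0, 4.0])  # first visit of sample 1
memory.update(1, [0.0, 2.0])  # second visit pulls the entry toward (0, 1)
```

Only the slot of the visited sample changes; all other entries keep their previous value.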
target_generators¶
- class mmselfsup.models.target_generators.CLIPGenerator(tokenizer_path: str)[source]¶
Get the features and attention from the last layer of CLIP.
This module is used to generate target features in masked image modeling.
- Parameters
tokenizer_path (str) – The path of the checkpoint of CLIP.
- forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Get the features and attention from the last layer of CLIP.
- Parameters
x (torch.Tensor) – The input image, which is of shape (N, 3, H, W).
- Returns
The features and attention from the last layer of CLIP, which are of shape (N, L, C) and (N, L, L), respectively.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- class mmselfsup.models.target_generators.Encoder(n_hid: int = 256, n_blk_per_group: int = 2, input_channels: int = 3, vocab_size: int = 8192, device: torch.device = device(type='cpu'), requires_grad: bool = False, use_mixed_precision: bool = True, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
- forward(x: torch.Tensor) → torch.Tensor[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmselfsup.models.target_generators.HOGGenerator(nbins: int = 9, pool: int = 8, gaussian_window: int = 16)[source]¶
Generate HOG feature for images.
This module is used in MaskFeat to generate HOG feature. The code is modified from file slowfast/models/operators.py. Here is the link of HOG wikipedia.
- Parameters
nbins (int) – Number of bins. Defaults to 9.
pool (int) – Number of cells. Defaults to 8.
gaussian_window (int) – Size of gaussian kernel. Defaults to 16.
- forward(x: torch.Tensor) → torch.Tensor[source]¶
Generate HOG features for each batch of images.
- Parameters
x (torch.Tensor) – Input images of shape (N, 3, H, W).
- Returns
HOG features.
- Return type
torch.Tensor
- class mmselfsup.models.target_generators.LowFreqTargetGenerator(radius: int, img_size: Union[int, Tuple[int, int]])[source]¶
Generate low-frequency target for images.
This module is used in PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling to remove high-frequency information from images.
- Parameters
radius (int) – radius of low pass filter.
img_size (Union[int, Tuple[int, int]]) – size of input images.
- class mmselfsup.models.target_generators.VQKD(encoder_config: dict, decoder_config: Optional[dict] = None, num_embed: int = 8192, embed_dims: int = 32, decay: float = 0.99, beta: float = 1.0, quantize_kmeans_init: bool = True, init_cfg: Optional[dict] = None)[source]¶
Vector-Quantized Knowledge Distillation.
The module only contains the encoder and VectorQuantizer parts. Modified from https://github.com/microsoft/unilm/blob/master/beit2/modeling_vqkd.py
- Parameters
encoder_config (dict) – The config of encoder.
decoder_config (dict, optional) – The config of decoder. Currently, VQKD only supports building the encoder. Defaults to None.
num_embed (int) – Number of embedding vectors in the codebook. Defaults to 8192.
embed_dims (int) – The dimension of embedding vectors in the codebook. Defaults to 32.
decay (float) – The decay parameter of EMA. Defaults to 0.99.
beta (float) – The multiplier for VectorQuantizer loss. Defaults to 1.
quantize_kmeans_init (bool) – Whether to use k-means to initialize the VectorQuantizer. Defaults to True.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.
- encode(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
Encode the input images and get corresponding results.
utils¶
- class mmselfsup.models.utils.CAEDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶
Image pre-processor for CAE.
Compared with the mmselfsup.SelfSupDataPreprocessor, this module will normalize the prediction image and target image with different normalization parameters.
- forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶
Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.
- Parameters
data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.
- Returns
Data in the same format as the model input.
- Return type
Tuple[torch.Tensor, Optional[list]]
- class mmselfsup.models.utils.CAETransformerRegressorLayer(embed_dims: int, num_heads: int, feedforward_channels: int, num_fcs: int = 2, qkv_bias: bool = False, qk_scale: Optional[float] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, init_values: float = 0.0, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'})[source]¶
Transformer layer for the regressor of CAE.
This module is different from conventional transformer encoder layer, for its queries are the masked tokens, but its keys and values are the concatenation of the masked and unmasked tokens.
- Parameters
embed_dims (int) – The feature dimension.
num_heads (int) – The number of heads in multi-head attention.
feedforward_channels (int) – The hidden dimension of FFNs.
num_fcs (int, optional) – The number of fully-connected layers in FFNs. Defaults to 2.
qkv_bias (bool) – If True, add a learnable bias to q, k, v. Defaults to False.
qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.
drop_rate (float) – The dropout rate. Defaults to 0.0.
attn_drop_rate (float) – The dropout rate for attention output weights. Defaults to 0.0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.0.
init_values (float) – The init values of gamma. Defaults to 0.0.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).
- class mmselfsup.models.utils.CosineEMA(model: torch.nn.modules.module.Module, momentum: float = 0.996, end_momentum: float = 1.0, interval: int = 1, device: Optional[torch.device] = None, update_buffers: bool = False)[source]¶
CosineEMA is implemented for updating the momentum parameter, used in BYOL, MoCoV3, etc.
The momentum parameter is updated with cosine annealing, following:
\[m = m_1 - (m_1 - m_0) \cdot \frac{\cos(\pi k / K) + 1}{2}\]
where \(m_0\) is the initial momentum, \(m_1\) is the end momentum, \(k\) is the current step, and \(K\) is the total number of steps.
- Parameters
model (nn.Module) – The model to be averaged.
momentum (float) – The momentum used for updating the EMA parameters. The EMA parameters are updated with the formula: averaged_param = momentum * averaged_param + (1-momentum) * source_param. Defaults to 0.996.
end_momentum (float) – The end momentum value for cosine annealing. Defaults to 1.
interval (int) – Interval between two updates. Defaults to 1.
device (torch.device, optional) – If provided, the averaged model will be stored on the device. Defaults to None.
update_buffers (bool) – If True, it will compute running averages for both the parameters and the buffers of the model. Defaults to False.
- avg_func(averaged_param: torch.Tensor, source_param: torch.Tensor, steps: int) → None[source]¶
Compute the moving average of the parameters using the cosine momentum strategy.
- Parameters
averaged_param (Tensor) – The averaged parameters.
source_param (Tensor) – The source parameters.
steps (int) – The number of times the parameters have been updated.
- Returns
The averaged parameters.
- Return type
Tensor
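The cosine momentum schedule and the EMA update formula documented for CosineEMA can be checked with a few lines of plain Python. This is a minimal sketch: `cosine_momentum` and `ema_update` are hypothetical helper names, and parameters are modelled as flat lists of floats rather than tensors.

```python
import math


def cosine_momentum(step, total_steps, base_momentum=0.996, end_momentum=1.0):
    """Momentum at `step`, annealed from `base_momentum` to `end_momentum`."""
    cosine = (math.cos(math.pi * step / total_steps) + 1) / 2
    return end_momentum - (end_momentum - base_momentum) * cosine


def ema_update(averaged, source, momentum):
    """averaged_param = momentum * averaged_param + (1 - momentum) * source_param."""
    return [momentum * a + (1 - momentum) * s for a, s in zip(averaged, source)]


total = 100
# Momentum for every step k = 0 .. K.
schedule = [cosine_momentum(k, total) for k in range(total + 1)]
```

The schedule starts at the base momentum for k = 0 and reaches the end momentum at k = K, so the averaged model tracks the source model more and more slowly as training progresses.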
- class mmselfsup.models.utils.Extractor(extract_dataloader: Union[torch.utils.data.dataloader.DataLoader, dict], seed: Optional[int] = None, dist_mode: bool = False, pool_cfg: Optional[dict] = None, **kwargs)[source]¶
Feature extractor.
The extractor supports building its own DataLoader, customized models, and pooling types. It also has distributed and non-distributed modes.
- Parameters
extract_dataloader (DataLoader or dict) – A DataLoader object, or a dict to build one.
seed (int, optional) – Random seed. Defaults to None.
dist_mode (bool) – Use distributed extraction or not. Defaults to False.
pool_cfg (dict, optional) – The configs of pooling. Defaults to dict(type='AvgPool2d', output_size=1).
- class mmselfsup.models.utils.GatherLayer(*args, **kwargs)[source]¶
Gather tensors from all processes, supporting backward propagation.
- static backward(ctx: Any, *grads: torch.Tensor) → torch.Tensor[source]¶
Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.
- static forward(ctx: Any, input: torch.Tensor) → Tuple[List][source]¶
Performs the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used in jvp.
- class mmselfsup.models.utils.MultiPooling(pool_type: str = 'adaptive', in_indices: tuple = (0), backbone: str = 'resnet50')[source]¶
Pooling layers for features from multiple depths.
- Parameters
pool_type (str) – Pooling type for the feature map. Options are ‘adaptive’ and ‘specified’. Defaults to ‘adaptive’.
in_indices (Sequence[int]) – Output from which backbone stages. Defaults to (0, ).
backbone (str) – The selected backbone. Defaults to ‘resnet50’.
- class mmselfsup.models.utils.MultiPrototypes(output_dim: int, num_prototypes: List[int])[source]¶
Multi-prototypes for SwAV head.
- Parameters
output_dim (int) – The output dim from SwAV neck.
num_prototypes (List[int]) – The number of prototypes needed.
- class mmselfsup.models.utils.MultiheadAttention(embed_dims: int, num_heads: int, input_dims: Optional[int] = None, attn_drop: float = 0.0, proj_drop: float = 0.0, qkv_bias: bool = True, qk_scale: Optional[float] = None, proj_bias: bool = True, init_cfg: Optional[dict] = None)[source]¶
Multi-head Attention Module.
This module rewrites the MultiheadAttention by replacing qkv bias with customized qkv bias, in addition to removing the drop path layer.
- Parameters
embed_dims (int) – The embedding dimension.
num_heads (int) – Parallel attention heads.
input_dims (int, optional) – The input dimension, and if None, use embed_dims. Defaults to None.
attn_drop (float) – Dropout rate of the dropout layer after the attention calculation of query and key. Defaults to 0.
proj_drop (float) – Dropout rate of the dropout layer after the output projection. Defaults to 0.
dropout_layer (dict) – The dropout config before adding the shortcut. Defaults to dict(type='Dropout', drop_prob=0.).
qkv_bias (bool) – If True, add a learnable bias to q, k, v. Defaults to True.
qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.
proj_bias (bool) – If True, add a learnable bias to the output projection. Defaults to True.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.
- class mmselfsup.models.utils.NormEMAVectorQuantizer(num_embed: int, embed_dims: int, beta: float, decay: float = 0.99, statistic_code_usage: bool = True, kmeans_init: bool = True, codebook_init_path: Optional[str] = None)[source]¶
Normed EMA vector quantizer module.
- Parameters
num_embed (int) – Number of embedding vectors in the codebook.
embed_dims (int) – The dimension of embedding vectors in the codebook.
beta (float) – The multiplier for VectorQuantizer embedding loss.
decay (float) – The decay parameter of EMA. Defaults to 0.99.
statistic_code_usage (bool) – Whether to use cluster_size to record statistic. Defaults to True.
kmeans_init (bool) – Whether to use k-means to initialize the VectorQuantizer. Defaults to True.
codebook_init_path (str) – The initialization checkpoint for codebook. Defaults to None.
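The idea behind a normed EMA vector quantizer can be illustrated with a small dependency-free sketch: features and codes are L2-normalized, the nearest code is found by cosine similarity, and the selected code is moved toward the feature with an exponential moving average. The helper names `l2_normalize`, `nearest_code`, and `ema_update_code` are hypothetical, and the per-sample update is a simplification made for this sketch; it does not claim to reproduce how the module aggregates statistics over a batch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0 else [v / norm for v in vec]


def nearest_code(feature, codebook):
    """Index of the codebook entry with the highest cosine similarity."""
    feat = l2_normalize(feature)
    sims = [sum(f * c for f, c in zip(feat, code)) for code in codebook]
    return max(range(len(codebook)), key=sims.__getitem__)


def ema_update_code(codebook, index, feature, decay=0.99):
    """Move one code toward a feature with EMA, then re-normalise it."""
    feat = l2_normalize(feature)
    mixed = [decay * c + (1 - decay) * f
             for c, f in zip(codebook[index], feat)]
    codebook[index] = l2_normalize(mixed)


# A toy codebook with three unit-length codes of dimension 2.
codebook = [l2_normalize(c) for c in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0])]
idx = nearest_code([0.2, 0.9], codebook)
ema_update_code(codebook, idx, [0.2, 0.9], decay=0.9)
```

A larger `decay` keeps each code closer to its previous value, which is why values near 1 such as the documented 0.99 change the codebook slowly.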
- class mmselfsup.models.utils.PromptTransformerEncoderLayer(embed_dims: int, num_heads: int, feedforward_channels=<class 'int'>, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, num_fcs: int = 2, qkv_bias: bool = True, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶
Prompt Transformer Encoder Layer for MILAN.
This module is specific for the prompt encoder in MILAN. It will not update the visible tokens from the encoder.
- Parameters
embed_dims (int) – The feature dimension.
num_heads (int) – Parallel attention heads.
feedforward_channels (int) – The hidden dimension for FFNs.
drop_rate (float) – Probability of an element to be zeroed after the feed forward layer. Defaults to 0.0.
attn_drop_rate (float) – The drop out rate for attention layer. Defaults to 0.0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.0.
num_fcs (int) – The number of fully-connected layers for FFNs. Defaults to 2.
qkv_bias (bool) – Enable bias for qkv if True. Defaults to True.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
batch_first (bool) – Key, Query and Value are shape of (batch, n, embed_dim) or (n, batch, embed_dim). Defaults to False.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.
- forward(x: torch.Tensor, visible_tokens: torch.Tensor, ids_restore: torch.Tensor) → torch.Tensor[source]¶
Forward function for PromptMultiheadAttention.
- Parameters
x (torch.Tensor) – Mask token features with shape N x L_m x C.
visible_tokens (torch.Tensor) – The visible tokens features from encoder with shape N x L_v x C.
ids_restore (torch.Tensor) – The ids of all tokens in the original image with shape N x L.
- Returns
Output features with shape N x L x C.
- Return type
torch.Tensor
- class mmselfsup.models.utils.RelativeLocDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶
Image pre-processor for Relative Location.
- forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶
Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.
- Parameters
data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.
- Returns
Data in the same format as the model input.
- Return type
Tuple[torch.Tensor, Optional[list]]
- class mmselfsup.models.utils.RotationPredDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶
Image pre-processor for Rotation Prediction.
- forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶
Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.
- Parameters
data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.
- Returns
Data in the same format as the model input.
- Return type
Tuple[torch.Tensor, Optional[list]]
- class mmselfsup.models.utils.SelfSupDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶
Image pre-processor for operations, like normalization and bgr to rgb.
Compared with the mmengine.ImgDataPreprocessor, this module treats each item in inputs of input data as a list, instead of torch.Tensor.
- forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶
Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.
- Parameters
data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.
- Returns
Data in the same format as the model input.
- Return type
Tuple[torch.Tensor, Optional[list]]
- class mmselfsup.models.utils.TransformerEncoderLayer(embed_dims: int, num_heads: int, feedforward_channels: int, window_size: Optional[int] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, num_fcs: int = 2, qkv_bias: bool = True, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'type': 'LN'}, init_values: float = 0.0, init_cfg: Optional[dict] = None)[source]¶
Implements one encoder layer in Vision Transformer.
This module is the rewritten version of the TransformerEncoderLayer in MMClassification by adding the gamma and relative position bias in Attention module.
- Parameters
embed_dims (int) – The feature dimension.
num_heads (int) – Parallel attention heads.
feedforward_channels (int) – The hidden dimension for FFNs.
drop_rate (float) – Probability of an element to be zeroed after the feed forward layer. Defaults to 0.
attn_drop_rate (float) – The drop out rate for attention output weights. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
num_fcs (int) – The number of fully-connected layers for FFNs. Defaults to 2.
qkv_bias (bool) – Enable bias for qkv if True. Defaults to True.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
init_values (float) – The init values of gamma. Defaults to 0.0.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
- class mmselfsup.models.utils.TwoNormDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, second_mean: Optional[Sequence[Union[int, float]]] = None, second_std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶
Image pre-processor for CAE, BEiT v1/v2, etc.
Compared with the mmselfsup.SelfSupDataPreprocessor, this module will normalize the prediction image and target image with different normalization parameters.
- Parameters
mean (Sequence[float or int], optional) – The pixel mean of image channels. If bgr_to_rgb=True it means the mean value of R, G, B channels. If the length of mean is 1, it means all channels have the same mean value, or the input is a gray image. If it is not specified, images will not be normalized. Defaults to None.
std (Sequence[float or int], optional) – The pixel standard deviation of image channels. If bgr_to_rgb=True it means the standard deviation of R, G, B channels. If the length of std is 1, it means all channels have the same standard deviation, or the input is a gray image. If it is not specified, images will not be normalized. Defaults to None.
second_mean (Sequence[float or int], optional) – Same meaning as mean, but it can be customized for the target image. Defaults to None.
second_std (Sequence[float or int], optional) – Same meaning as std, but it can be customized for the target image. Defaults to None.
pad_size_divisor (int) – The size of padded image should be divisible by pad_size_divisor. Defaults to 1.
pad_value (float or int) – The padded pixel value. Defaults to 0.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.
rgb_to_bgr (bool) – Whether to convert image from RGB to BGR. Defaults to False.
non_blocking (bool) – Whether to block the current process when transferring data to device.
- forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶
Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.
- Parameters
data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.
- Returns
Data in the same format as the model input.
- Return type
Tuple[torch.Tensor, Optional[list]]
- class mmselfsup.models.utils.VideoDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, format_shape: str = 'NCHW')[source]¶
Video pre-processor for operations, like normalization and bgr to rgb conversion.
Compared with the mmaction.ActionDataPreprocessor, this module treats each item in inputs of input data as a list, instead of torch.Tensor.
- Parameters
mean (Sequence[float or int], optional) – The pixel mean of channels of images or stacked optical flow. Defaults to None.
std (Sequence[float or int], optional) – The pixel standard deviation of channels of images or stacked optical flow. Defaults to None.
pad_size_divisor (int) – The size of padded image should be divisible by pad_size_divisor. Defaults to 1.
pad_value (float or int) – The padded pixel value. Defaults to 0.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.
format_shape (str) – Format shape of input data. Defaults to 'NCHW'.
- forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶
Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.
- Parameters
data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.
- Returns
Data in the same format as the model input.
- Return type
Tuple[List[torch.Tensor], Optional[list]]
- mmselfsup.models.utils.build_2d_sincos_position_embedding(patches_resolution: Union[int, Sequence[int]], embed_dims: int, temperature: Optional[int] = 10000.0, cls_token: Optional[bool] = False) → torch.Tensor[source]¶
The function is to build position embedding for model to obtain the position information of the image patches.
- Parameters
patches_resolution (Union[int, Sequence[int]]) – The resolution of each patch.
embed_dims (int) – The dimension of the embedding vector.
temperature (int, optional) – The temperature parameter. Defaults to 10000.
cls_token (bool, optional) – Whether to concatenate class token. Defaults to False.
- Returns
The position embedding vector.
- Return type
torch.Tensor
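A fixed 2D sine-cosine position embedding of the kind this function builds can be sketched in plain Python. The sketch makes several assumptions that the description above does not state: the grid is given as a number of patch rows and columns, the embedding is split into four equal parts (sine and cosine of the column index, then sine and cosine of the row index), and no class token is prepended. The name `sincos_2d` is hypothetical.

```python
import math


def sincos_2d(height, width, embed_dims, temperature=10000.0):
    """Return a (height * width) x embed_dims table of fixed position codes."""
    if embed_dims % 4 != 0:
        raise ValueError('embed_dims must be divisible by 4')
    pos_dim = embed_dims // 4
    # Geometric frequency ladder, as in 1D transformer position encodings.
    omega = [1.0 / temperature ** (i / pos_dim) for i in range(pos_dim)]
    table = []
    for row in range(height):
        for col in range(width):
            # Assumed layout: [sin(col), cos(col), sin(row), cos(row)].
            table.append(
                [math.sin(col * w) for w in omega]
                + [math.cos(col * w) for w in omega]
                + [math.sin(row * w) for w in omega]
                + [math.cos(row * w) for w in omega])
    return table


pos_embed = sincos_2d(height=2, width=3, embed_dims=8)
```

Because the table depends only on the grid size, the embedding dimension, and the temperature, it can be computed once and kept fixed during training.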
- mmselfsup.models.utils.build_clip_model(state_dict: dict, finetune: bool = False, average_targets: int = 1) → torch.nn.modules.module.Module[source]¶
Build the CLIP model.
- Parameters
state_dict (dict) – The pretrained state dict.
finetune (bool) – Whether to fine-tune the model.
average_targets (bool) – Whether to average the target.
- Returns
The CLIP model.
- Return type
nn.Module
mmselfsup.structures¶
- class mmselfsup.structures.SelfSupDataSample(*, metainfo: Optional[dict] = None, **kwargs)[source]¶
A data structure interface of MMSelfSup. It is used as an interface between different components.
Meta field:
img_shape (Tuple): The shape of the corresponding input image. Used for visualization.
ori_shape (Tuple): The original shape of the corresponding image. Used for visualization.
img_path (str): The path of original image.
Data field:
gt_label (LabelData): The ground truth label of an image.
sample_idx (InstanceData): The idx of an image in the dataset.
mask (BaseDataElement): Mask used in masked image modeling.
pred_label (LabelData): The predicted label.
pseudo_label (InstanceData): Label used in pretext task, e.g. Relative Location.
Examples
>>> import torch
>>> import numpy as np
>>> from mmengine.structures import InstanceData, LabelData, PixelData
>>> from mmselfsup.structures import SelfSupDataSample
>>> data_sample = SelfSupDataSample()
>>> gt_label = LabelData()
>>> gt_label.value = [1]
>>> data_sample.gt_label = gt_label
>>> len(data_sample.gt_label)
1
>>> print(data_sample)
<SelfSupDataSample(
    META INFORMATION
    DATA FIELDS
    gt_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            value: [1]
        ) at 0x7f15c08f9d10>
    _gt_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            value: [1]
        ) at 0x7f15c08f9d10>
) at 0x7f15c077ef10>
>>> idx = InstanceData()
>>> idx.value = [0]
>>> data_sample = SelfSupDataSample(idx=idx)
>>> assert 'idx' in data_sample
>>> data_sample = SelfSupDataSample()
>>> mask = dict(value=np.random.rand(48, 48))
>>> mask = PixelData(**mask)
>>> data_sample.mask = mask
>>> assert 'mask' in data_sample
>>> assert 'value' in data_sample.mask
>>> data_sample = SelfSupDataSample()
>>> pred_label = dict(pred_label=[3])
>>> pred_label = LabelData(**pred_label)
>>> data_sample.pred_label = pred_label
>>> print(data_sample)
<SelfSupDataSample(
    META INFORMATION
    DATA FIELDS
    _pred_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            pred_label: [3]
        ) at 0x7f15c06a3990>
    pred_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            pred_label: [3]
        ) at 0x7f15c06a3990>
) at 0x7f15c07b8bd0>
mmselfsup.visualization¶
- class mmselfsup.visualization.SelfSupVisualizer(name: str = 'visualizer', image: Optional[numpy.ndarray] = None, vis_backends: Optional[List[Dict]] = None, save_dir: Optional[str] = None, line_width: Union[int, float] = 3, alpha: Union[int, float] = 0.8)[source]¶
MMSelfSup Visualizer.
- Parameters
name (str) – Name of the instance. Defaults to ‘visualizer’.
image (np.ndarray, optional) – the origin image to draw. The format should be RGB. Defaults to None.
vis_backends (list, optional) – Visual backend config list. Defaults to None.
save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.
line_width (int, float) – The linewidth of lines. Defaults to 3.
alpha (int, float) – The transparency of boxes or mask. Defaults to 0.8.
Examples
>>> import numpy as np
>>> import torch
>>> from mmengine.structures import InstanceData
>>> from mmselfsup.structures import SelfSupDataSample
>>> from mmselfsup.visualization import SelfSupVisualizer
>>> selfsup_visualizer = SelfSupVisualizer()
>>> image = np.random.randint(0, 256,
...                           size=(10, 12, 3)).astype('uint8')
>>> pseudo_label = InstanceData()
>>> pseudo_label.patch_box = torch.Tensor([[1, 2, 2, 5]])
>>> gt_selfsup_data_sample = SelfSupDataSample()
>>> gt_selfsup_data_sample.pseudo_label = pseudo_label
>>> selfsup_visualizer.add_datasample('image', image,
...                                   gt_selfsup_data_sample)
>>> selfsup_visualizer.add_datasample(
...     'image', image, gt_selfsup_data_sample,
...     out_file='out_file.jpg')
>>> selfsup_visualizer.add_datasample(
...     'image', image, gt_selfsup_data_sample,
...     show=True)
>>> pseudo_label = InstanceData()
>>> pseudo_label.patch_box = torch.Tensor([[1, 2, 2, 5]])
>>> pred_selfsup_data_sample = SelfSupDataSample()
>>> pred_selfsup_data_sample.pseudo_label = pseudo_label
>>> selfsup_visualizer.add_datasample('image', image,
...                                   gt_selfsup_data_sample,
...                                   pred_selfsup_data_sample)
- add_datasample(name: str, image: numpy.ndarray, gt_sample: Optional[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample] = None, pred_sample: Optional[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample] = None, draw_gt: bool = True, draw_pred: bool = True, show: bool = False, wait_time: float = 0, out_file: Optional[str] = None, step: int = 0) → None[source]¶
Draw datasample and save to all backends.
If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.
If show is True, all storage backends are ignored, and the images will be displayed in a local window.
If out_file is specified, the drawn image will be saved to out_file. It is usually used when the display is not available.
- Parameters
name (str) – The image identifier.
image (np.ndarray) – The image to draw.
gt_sample (SelfSupDataSample, optional) – GT SelfSupDataSample. Defaults to None.
pred_sample (SelfSupDataSample, optional) – Prediction SelfSupDataSample. Defaults to None.
draw_gt (bool) – Whether to draw GT SelfSupDataSample. Defaults to True.
draw_pred (bool) – Whether to draw Prediction SelfSupDataSample. Defaults to True.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (float) – The interval of show (s). Defaults to 0.
out_file (str) – Path to output file. Defaults to None.
step (int) – Global step value to record. Defaults to 0.
mmselfsup.utils¶
- class mmselfsup.utils.AliasMethod(probs: torch.Tensor)[source]¶
The alias method for sampling.
- Parameters
probs (torch.Tensor) – Sampling probabilities.
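For readers unfamiliar with the technique, the alias method can be sketched with the standard library alone. The sketch below follows Vose's table construction and is not the mmselfsup implementation; `build_alias_table` and `alias_draw` are hypothetical names, and the input is assumed to be a list of probabilities that sums to 1.

```python
import random


def build_alias_table(probs):
    """Precompute (prob, alias) tables from a normalised probability list."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        sm, lg = small.pop(), large.pop()
        prob[sm], alias[sm] = scaled[sm], lg
        # The large bucket donates mass to fill the small one up to 1.
        scaled[lg] -= 1.0 - scaled[sm]
        (small if scaled[lg] < 1.0 else large).append(lg)
    for i in small + large:  # leftovers equal 1 up to rounding error
        prob[i] = 1.0
    return prob, alias


def alias_draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


rng = random.Random(0)
prob, alias = build_alias_table([0.5, 0.3, 0.2])
counts = [0, 0, 0]
for _ in range(20000):
    counts[alias_draw(prob, alias, rng)] += 1
```

Building the tables costs O(n) once; every later draw needs only one random bucket index and one comparison, which is what makes the method attractive when many samples are drawn from the same distribution.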
- mmselfsup.utils.batch_shuffle_ddp(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Batch shuffle, for making use of BatchNorm.
- Parameters
x (torch.Tensor) – Data in each GPU.
- Returns
- Output of shuffle operation.
x_gather[idx_this]: Shuffled data.
idx_unshuffle: Index for restoring.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- mmselfsup.utils.batch_unshuffle_ddp(x: torch.Tensor, idx_unshuffle: torch.Tensor) → torch.Tensor[source]¶
Undo batch shuffle.
- Parameters
x (torch.Tensor) – Data in each GPU.
idx_unshuffle (torch.Tensor) – Index for restoring.
- Returns
Output of unshuffle operation.
- Return type
torch.Tensor
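The relationship between the shuffle indices and the restoring indices can be shown with a single-process sketch. It leaves out everything distributed (gathering the batch from all GPUs and selecting the slice for the current rank), and `shuffle_with_restore` and `unshuffle` are hypothetical names used only here.

```python
import random


def shuffle_with_restore(batch, rng):
    """Return the shuffled batch and the indices that undo the shuffle."""
    idx_shuffle = list(range(len(batch)))
    rng.shuffle(idx_shuffle)
    # Inverting the permutation is an argsort of the shuffle indices.
    idx_unshuffle = sorted(range(len(batch)), key=idx_shuffle.__getitem__)
    return [batch[i] for i in idx_shuffle], idx_unshuffle


def unshuffle(batch, idx_unshuffle):
    """Restore the original sample order."""
    return [batch[i] for i in idx_unshuffle]


samples = ['a', 'b', 'c', 'd', 'e']
shuffled, idx_unshuffle = shuffle_with_restore(samples, random.Random(0))
restored = unshuffle(shuffled, idx_unshuffle)
```

Indexing the shuffled batch with `idx_unshuffle` returns every sample to its original position, which is the contract between the two functions documented above.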
- mmselfsup.utils.concat_all_gather(tensor: torch.Tensor) → torch.Tensor[source]¶
Performs all_gather operation on the provided tensors.
- Parameters
tensor (torch.Tensor) – Tensor to be broadcast from current process.
- Returns
The concatenated tensor.
- Return type
torch.Tensor
- mmselfsup.utils.dist_forward_collect(func: object, data_loader: torch.utils.data.dataloader.DataLoader, length: int) → dict[source]¶
Forward and collect network outputs in a distributed manner.
This function performs forward propagation and collects outputs. It can be used to collect results, features, losses, etc.
- Parameters
func (function) – The function to process data.
data_loader (DataLoader) – The torch DataLoader to yield data.
length (int) – Expected length of output arrays.
- Returns
The collected outputs.
- Return type
Dict[str, torch.Tensor]
- mmselfsup.utils.distributed_sinkhorn(out: torch.Tensor, sinkhorn_iterations: int, world_size: int, epsilon: float) → torch.Tensor[source]¶
Apply the distributed Sinkhorn optimization on the scores matrix to find the assignments.
- Parameters
out (torch.Tensor) – The scores matrix.
sinkhorn_iterations (int) – Number of iterations in Sinkhorn-Knopp algorithm.
world_size (int) – The world size of the process group.
epsilon (float) – Regularization parameter for Sinkhorn-Knopp algorithm.
- Returns
Output of sinkhorn algorithm.
- Return type
torch.Tensor
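The Sinkhorn-Knopp iteration itself can be sketched without any distributed machinery. The version below is a single-process, pure-Python illustration that assumes a world size of 1 and a scores matrix of shape (samples, prototypes); it is not the library's code, and the function name sinkhorn is hypothetical.

```python
import math


def sinkhorn(scores, iterations, epsilon):
    """Turn a B x K scores matrix into B x K soft assignments."""
    num_samples = len(scores)
    num_prototypes = len(scores[0])
    # Work on the transposed matrix Q of shape K x B.
    q = [[math.exp(scores[b][k] / epsilon) for b in range(num_samples)]
         for k in range(num_prototypes)]
    total = sum(sum(row) for row in q)
    q = [[v / total for v in row] for row in q]
    for _ in range(iterations):
        # Row step: every prototype receives total mass 1 / K.
        row_sums = [sum(row) for row in q]
        q = [[v / row_sums[k] / num_prototypes for v in q[k]]
             for k in range(num_prototypes)]
        # Column step: every sample receives total mass 1 / B.
        col_sums = [sum(q[k][b] for k in range(num_prototypes))
                    for b in range(num_samples)]
        q = [[q[k][b] / col_sums[b] / num_samples for b in range(num_samples)]
             for k in range(num_prototypes)]
    # Rescale so each sample's assignment sums to 1, then transpose back.
    return [[q[k][b] * num_samples for k in range(num_prototypes)]
            for b in range(num_samples)]


assignments = sinkhorn(
    [[0.1, 0.3, 0.2], [0.4, 0.1, 0.2], [0.2, 0.2, 0.5], [0.3, 0.1, 0.1]],
    iterations=3,
    epsilon=0.05,
)
```

With at least one iteration, each row of assignments sums to 1. A smaller epsilon sharpens the assignments; in a distributed setting the row and total sums would additionally need to be reduced across processes.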
- mmselfsup.utils.get_model(model: torch.nn.modules.module.Module) → mmengine.model.base_model.base_model.BaseModel[source]¶
Get model if the input model is a model wrapper.
- Parameters
model (nn.Module) – A model may be a model wrapper.
- Returns
The model without model wrapper.
- Return type
BaseModel
- mmselfsup.utils.nondist_forward_collect(func: object, data_loader: torch.utils.data.dataloader.DataLoader, length: int) → dict[source]¶
Forward and collect network outputs.
This function performs forward propagation and collects outputs. It can be used to collect results, features, losses, etc.
- Parameters
func (function) – The function to process data.
data_loader (DataLoader) – The torch DataLoader to yield data.
length (int) – Expected length of output arrays.
- Returns
The concatenated outputs.
- Return type
Dict[str, torch.Tensor]
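The collect pattern described above (run func on every batch, then concatenate the per-key outputs and check them against length) can be sketched with plain Python containers. This is an illustrative stand-in rather than the library's implementation: forward_collect and double_features are hypothetical names, and lists replace tensors and the DataLoader.

```python
def forward_collect(func, data_loader, length):
    """Run `func` on every batch and concatenate the outputs per key."""
    results = [func(batch) for batch in data_loader]
    collected = {}
    for key in results[0]:
        collected[key] = [row for batch_result in results
                          for row in batch_result[key]]
        # The concatenated output must cover every sample exactly once.
        assert len(collected[key]) == length
    return collected


def double_features(batch):
    """Hypothetical per-batch function returning a dict of outputs."""
    return {"feature": [x * 2 for x in batch], "idx": list(batch)}


loader = [[1, 2], [3, 4], [5]]  # three batches, five samples in total
collected = forward_collect(double_features, loader, length=5)
```

The length check catches a mismatch between the loader and the expected dataset size early, for example when the last batch is dropped.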
- mmselfsup.utils.register_all_modules(init_default_scope: bool = True) → None[source]¶
Register all modules in mmselfsup into the registries.
- Parameters
init_default_scope (bool) – Whether to initialize the mmselfsup default scope. When init_default_scope=True, the global default scope will be set to mmselfsup, and all registries will build modules from mmselfsup’s registry node. Defaults to True. To learn more about the registry, refer to https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/registry.md