mmselfsup.datasets¶

datasets¶

class mmselfsup.datasets.DeepClusterImageNet(ann_file: str = '', metainfo: Optional[dict] = None, data_root: str = '', data_prefix: Union[str, dict] = '', **kwargs)[source]¶

ImageNet Dataset.

The dataset inherits from the ImageNet dataset in MMClassification because the DeepCluster and Online Deep Clustering algorithms need to initialize clustering labels and assign them during training.

Parameters

ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (str | dict) – Prefix for training data. Defaults to ''.
**kwargs – Other keyword arguments in CustomDataset and BaseDataset.

assign_labels(labels: list) → None[source]¶

Assign new labels to self.clustering_labels.

Parameters: labels (list) – The new labels.
Returns: None

prepare_data(idx: int) → Any[source]¶

Get data processed by self.pipeline.

Parameters: idx (int) – The index of data_info.
Returns: Depends on self.pipeline.
Return type: Any

class mmselfsup.datasets.ImageList(ann_file: str, metainfo: Optional[dict] = None, data_root: str = '', data_prefix: Union[str, dict] = '', **kwargs)[source]¶

The dataset implementation for loading any image list file.

The ImageList can load an annotation file or a list of files and merge all data records into one list. If the data is unlabeled, gt_label will be set to -1.

An annotation file should be provided, with one sample per line.

Example layout of the sample files:

data_prefix/
├── folder_1
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
└── folder_2
    ├── 123.png
    ├── nsdf3.png
    └── ...

1. If data is labeled, the annotation file is (the first column is the image path and the second column is the category index):

    folder_1/xxx.png 0
    folder_1/xxy.png 1
    folder_2/123.png 5
    folder_2/nsdf3.png 3
    ...

2. If data is unlabeled, the annotation file is:

    folder_1/xxx.png
    folder_1/xxy.png
    folder_2/123.png
    folder_2/nsdf3.png
    ...
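
The two annotation formats above can be told apart line by line. The following is a minimal sketch, not the library's load_data_list() implementation: the helper name parse_annotation_lines and the img_path key are assumptions made for illustration, while gt_label = -1 for unlabeled samples follows the description above.

```python
from typing import Dict, List


def parse_annotation_lines(lines: List[str]) -> List[Dict]:
    """Parse 'path [label]' lines; unlabeled samples get gt_label = -1."""
    data_list = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.rsplit(' ', 1)
        if len(parts) == 2 and parts[1].lstrip('-').isdigit():
            # Labeled format: image path followed by a category index.
            path, label = parts[0], int(parts[1])
        else:
            # Unlabeled format: the whole line is the image path.
            path, label = line, -1
        data_list.append({'img_path': path, 'gt_label': label})
    return data_list


labeled = parse_annotation_lines(['folder_1/xxx.png 0', 'folder_2/123.png 5'])
unlabeled = parse_annotation_lines(['folder_1/xxx.png', 'folder_2/nsdf3.png'])
```

Here labeled carries the category indices 0 and 5, while every record in unlabeled has gt_label equal to -1.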

Parameters

ann_file (str) – Annotation file path.
metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (str | dict) – Prefix for training data. Defaults to ''.
**kwargs – Other keyword arguments in CustomDataset and BaseDataset.

load_data_list() → List[dict][source]¶

Override load_data_list() to support annotation files with unlabeled data.

Returns: A list of data information.
Return type: List[dict]

class mmselfsup.datasets.Places205(ann_file: str = '', metainfo: Optional[dict] = None, data_root: str = '', data_prefix: Union[str, dict] = '', **kwargs)[source]¶

Places205 Dataset.

The dataset supports two kinds of annotation format. More details can be found in CustomDataset.

Parameters

ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (str | dict) – Prefix for training data. Defaults to ''.
**kwargs – Other keyword arguments in CustomDataset and BaseDataset.

mmselfsup.datasets.build_dataset(cfg)[source]¶: Build dataset.

transforms¶

class mmselfsup.datasets.transforms.BEiTMaskGenerator(input_size: int, num_masking_patches: int, min_num_patches: int = 4, max_num_patches: Optional[int] = None, min_aspect: float = 0.3, max_aspect: Optional[float] = None)[source]¶

Generate mask for image.

Added Keys:

mask

This module is borrowed from https://github.com/microsoft/unilm/tree/master/beit

Parameters

input_size (int) – The size of input image.
num_masking_patches (int) – The number of patches to be masked.
min_num_patches (int) – The minimum number of patches to be masked in the process of generating mask. Defaults to 4.
max_num_patches (int, optional) – The maximum number of patches to be masked in the process of generating mask. Defaults to None.
min_aspect (float, optional) – The minimum aspect ratio of mask blocks. Defaults to 0.3.
max_aspect (float, optional) – The maximum aspect ratio of mask blocks. Defaults to None.

get_shape() → Tuple[int, int][source]¶

Get the shape of mask.

Returns: The shape of mask.
Return type: Tuple[int, int]

transform(results: dict) → dict[source]¶

Method to generate random block mask for each Image in BEiT.

Parameters: results (dict) – Result dict from previous pipeline.
Returns: Result dict with added key mask.
Return type: dict
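
Block-wise masking of this kind can be sketched in plain Python. This is a simplified illustration, not the BEiTMaskGenerator code: it assumes the mask lives on a square grid of patches, omits max_num_patches and max_aspect (the upper aspect bound is taken as 1 / min_aspect), and the function name beit_style_mask is hypothetical.

```python
import math
import random
from typing import List


def beit_style_mask(grid: int, num_masking_patches: int,
                    min_num_patches: int = 4, min_aspect: float = 0.3,
                    seed: int = 0) -> List[List[int]]:
    """Place random rectangular blocks until the patch budget is used up."""
    rng = random.Random(seed)
    log_lo, log_hi = math.log(min_aspect), math.log(1 / min_aspect)
    mask = [[0] * grid for _ in range(grid)]
    count = 0
    while count < num_masking_patches:
        budget = num_masking_patches - count
        delta = 0
        for _ in range(10):  # a few attempts to place one block
            area = rng.uniform(min_num_patches, budget)
            aspect = math.exp(rng.uniform(log_lo, log_hi))
            h = int(round(math.sqrt(area * aspect)))
            w = int(round(math.sqrt(area / aspect)))
            if 0 < h < grid and 0 < w < grid:
                top = rng.randint(0, grid - h)
                left = rng.randint(0, grid - w)
                # Only cells that are not masked yet count against the budget.
                cells = [(i, j) for i in range(top, top + h)
                         for j in range(left, left + w) if mask[i][j] == 0]
                if 0 < len(cells) <= budget:
                    for i, j in cells:
                        mask[i][j] = 1
                    delta = len(cells)
                    break
        if delta == 0:
            break  # no block fit within the budget; stop early
        count += delta
    return mask


mask = beit_style_mask(grid=14, num_masking_patches=75)
```

The number of masked patches never exceeds num_masking_patches, but it can fall short when no further block fits.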

class mmselfsup.datasets.transforms.ColorJitter(brightness: Union[float, List[float]] = 0, contrast: Union[float, List[float]] = 0, saturation: Union[float, List[float]] = 0, hue: Union[float, List[float]] = 0, backend: str = 'pillow')[source]¶

Randomly change the brightness, contrast, saturation and hue of an image.

Modified from https://github.com/pytorch/vision/blob/main/torchvision/transforms/transforms.py

Required Keys:

img

Modified Keys:

img

Parameters

brightness (float or tuple of float (min, max)) – How much to jitter brightness. brightness_factor is chosen uniformly from [max(0, 1 - brightness), 1 + brightness] or the given [min, max]. Should be non-negative numbers.
contrast (float or tuple of float (min, max)) – How much to jitter contrast. contrast_factor is chosen uniformly from [max(0, 1 - contrast), 1 + contrast] or the given [min, max]. Should be non-negative numbers.
saturation (float or tuple of float (min, max)) – How much to jitter saturation. saturation_factor is chosen uniformly from [max(0, 1 - saturation), 1 + saturation] or the given [min, max]. Should be non-negative numbers.
hue (float or tuple of float (min, max)) – How much to jitter hue. hue_factor is chosen uniformly from [-hue, hue] or the given [min, max]. Should have 0 <= hue <= 0.5 or -0.5 <= min <= max <= 0.5. To jitter hue, the pixel values of the input image have to be non-negative for conversion to HSV space; thus it does not work if you normalize your image to an interval with negative values, or use an interpolation that generates negative values before using this function.
backend (str) – The type of image processing backend. Options are cv2, pillow. Defaults to pillow.

static get_params(brightness: Optional[List[float]], contrast: Optional[List[float]], saturation: Optional[List[float]], hue: Optional[List[float]]) → Tuple[numpy.ndarray, Optional[float], Optional[float], Optional[float], Optional[float]][source]¶

Get the parameters for the randomized transform to be applied on image.

Parameters

brightness (tuple of float (min, max), optional) – The range from which the brightness_factor is chosen uniformly. Pass None to turn off the transformation.
contrast (tuple of float (min, max), optional) – The range from which the contrast_factor is chosen uniformly. Pass None to turn off the transformation.
saturation (tuple of float (min, max), optional) – The range from which the saturation_factor is chosen uniformly. Pass None to turn off the transformation.
hue (tuple of float (min, max), optional) – The range from which the hue_factor is chosen uniformly. Pass None to turn off the transformation.

Returns

The parameters used to apply the randomized transform, along with their random order.

Return type

tuple
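
The sampling rules described for the four factors can be sketched as follows. This is an illustration under stated assumptions rather than the library code: the helpers to_range and sample_jitter_params are hypothetical names, and the random order is represented as a shuffled list of the indices 0 to 3.

```python
import random
from typing import Optional, Tuple


def to_range(value: float, center: float = 1.0) -> Optional[Tuple[float, float]]:
    """Turn a scalar jitter strength into a (min, max) range; 0 disables it."""
    if value == 0:
        return None
    low = center - value
    if center == 1.0:
        low = max(0.0, low)  # brightness/contrast/saturation stay >= 0
    return (low, center + value)


def sample_jitter_params(brightness, contrast, saturation, hue, rng):
    """Draw a random operation order and one factor per enabled range."""
    order = rng.sample(range(4), 4)  # random order of the four operations
    factors = [None if r is None else rng.uniform(*r)
               for r in (brightness, contrast, saturation, hue)]
    return (order, *factors)


rng = random.Random(0)
params = sample_jitter_params(to_range(0.4), to_range(0.4), None,
                              to_range(0.1, center=0.0), rng)
```

With these inputs the brightness and contrast factors fall in [0.6, 1.4], saturation is switched off (None), and the hue factor falls in [-0.1, 0.1].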

transform(results: dict) → dict[source]¶

Randomly change the brightness, contrast, saturation and hue of an image.

Parameters: results (dict) – The results dict from previous pipeline.
Returns: Results after applying this transformation.
Return type: dict

class mmselfsup.datasets.transforms.MAERandomResizedCrop(size, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=<InterpolationMode.BILINEAR: 'bilinear'>, antialias: Optional[bool] = None)[source]¶

RandomResizedCrop for matching TF/TPU implementation: no for-loop is used.

This may lead to results different from torchvision’s version. It follows BYOL’s TF code: https://github.com/deepmind/deepmind-research/blob/master/byol/utils/dataset.py#L206

forward(results: dict) → dict[source]¶

The forward function of MAERandomResizedCrop.

Parameters: results (dict) – The results dict containing the image and all information related to it.
Returns: The results dict containing the cropped image and all information related to it.
Return type: dict

static get_params(img: PIL.Image.Image, scale: tuple, ratio: tuple) → Tuple[source]¶

Get parameters for crop for a random sized crop.

Parameters

img (PIL Image or Tensor) – Input image.
scale (list) – Range of the crop area relative to the original image area.
ratio (list) – Range of the aspect ratio of the crop.

Returns

params (i, j, h, w) to be passed to crop for a random sized crop.

Return type

tuple
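
A loop-free way to pick crop parameters is to sample once and clamp the result to the image bounds. The sketch below illustrates that idea in plain Python; it is an assumption about the general approach, not a copy of get_params, and single_shot_crop_params is a hypothetical name.

```python
import math
import random
from typing import Tuple


def single_shot_crop_params(height: int, width: int,
                            scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                            rng=random) -> Tuple[int, int, int, int]:
    """Sample (i, j, h, w) once, clamping instead of re-sampling."""
    area = height * width
    target_area = area * rng.uniform(*scale)
    # Sample the aspect ratio uniformly in log space.
    aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
    # Clamp to the image size rather than retrying in a loop.
    w = min(int(round(math.sqrt(target_area * aspect))), width)
    h = min(int(round(math.sqrt(target_area / aspect))), height)
    i = rng.randint(0, height - h)  # top offset
    j = rng.randint(0, width - w)   # left offset
    return i, j, h, w


i, j, h, w = single_shot_crop_params(224, 160, rng=random.Random(0))
```

Because an oversized crop is clamped, the effective aspect ratio can leave the requested range, which is one reason results may differ from a retry-based sampler.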

class mmselfsup.datasets.transforms.MultiView(transforms: List[List[Union[dict, Callable[[dict], dict]]]], num_views: Union[int, List[int]])[source]¶

A transform wrapper for multiple views of an image.

Parameters

transforms (list[list[dict | callable]]) – A list of pipelines. Each pipeline is a sequence of transform objects or config dicts and produces one group of views.
num_views (int | list[int]) – The number of views generated by each pipeline. If a list is given, its length should match the number of pipelines.

Examples

>>> # Example 1: MultiViews 1 pipeline with 2 views
>>> pipeline = [
>>>     dict(type='MultiView',
>>>         num_views=2,
>>>         transforms=[
>>>             [
>>>                dict(type='Resize', scale=224)],
>>>         ])
>>> ]
>>> # Example 2: MultiViews 2 pipelines, the first with 2 views,
>>> # the second with 6 views
>>> pipeline = [
>>>     dict(type='MultiView',
>>>         num_views=[2, 6],
>>>         transforms=[
>>>             [
>>>                dict(type='Resize', scale=224)],
>>>             [
>>>                dict(type='Resize', scale=224),
>>>                dict(type='RandomSolarize')],
>>>         ])
>>> ]

transform(results: dict) → dict[source]¶

Apply transformation to inputs.

Parameters: results (dict) – Result dict from previous pipelines.
Returns: Transformed results.
Return type: dict
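
The relationship between transforms and num_views can be sketched with plain callables. This is a simplified illustration, not the MultiView implementation: the function multi_view and the toy transforms add_one and double are hypothetical, and the image is stood in for by an integer.

```python
import copy
from typing import Callable, Dict, List, Union


def multi_view(results: Dict, pipelines: List[List[Callable[[Dict], Dict]]],
               num_views: Union[int, List[int]]) -> Dict:
    """Run pipeline k num_views[k] times and collect every resulting image."""
    if isinstance(num_views, int):
        num_views = [num_views]
    assert len(num_views) == len(pipelines)
    views = []
    for pipeline, n in zip(pipelines, num_views):
        for _ in range(n):
            out = copy.deepcopy(results)  # each view starts from the same input
            for transform in pipeline:
                out = transform(out)
            views.append(out['img'])
    results['img'] = views
    return results


def add_one(r: Dict) -> Dict:
    r['img'] += 1
    return r


def double(r: Dict) -> Dict:
    r['img'] *= 2
    return r


out = multi_view({'img': 1}, [[add_one], [add_one, double]], num_views=[2, 3])
```

The first pipeline contributes two views and the second contributes three, so out['img'] holds five entries in pipeline order.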

class mmselfsup.datasets.transforms.PackSelfSupInputs(key: str = 'img', algorithm_keys: List[str] = [], pseudo_label_keys: List[str] = [], meta_keys: List[str] = [])[source]¶

Pack data into the format compatible with the inputs of algorithm.

Required Keys:

img

Added Keys:

data_samples
inputs

Parameters

key (str) – The key of the image that is fed into the model. Defaults to ‘img’.
algorithm_keys (List[str]) – Keys of elements related to algorithms, e.g. mask. Defaults to [].
pseudo_label_keys (List[str]) – Keys set to be the attributes of pseudo_label. Defaults to [].
meta_keys (List[str]) – The keys of meta info of an image. Defaults to [].

classmethod set_algorithm_keys(data_sample: mmselfsup.structures.selfsup_data_sample.SelfSupDataSample, key: str, results: dict) → None[source]¶

Set the algorithm keys of SelfSupDataSample.

Parameters

data_sample (SelfSupDataSample) – An instance of SelfSupDataSample.
key (str) – The key, which may be used by the algorithm, such as gt_label, sample_idx, mask, pred_label. For more keys, please refer to the attribute of SelfSupDataSample.
results (dict) – The results from the data pipeline.

transform(results: Dict) → Dict[torch.Tensor, mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶

Method to pack the data.

Parameters

results (Dict) – Result dict from the data pipeline.

Returns

inputs (List[torch.Tensor]): The forward data of models.
data_samples (SelfSupDataSample): The annotation info of the forward data.

Return type

Dict

class mmselfsup.datasets.transforms.RandomCrop(size: Union[int, Sequence[int]], padding: Optional[Union[int, Sequence[int]]] = None, pad_if_needed: bool = False, pad_val: Union[numbers.Number, Sequence[numbers.Number]] = 0, padding_mode: str = 'constant')[source]¶

Crop the given Image at a random location.

Required Keys:

img

Modified Keys:

img
img_shape

Parameters

size (int or Sequence) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop (size, size) is made.
padding (int or Sequence, optional) – Optional padding on each border of the image. If a sequence of length 4 is provided, it is used to pad left, top, right, bottom borders respectively. If a sequence of length 2 is provided, it is used to pad left/right, top/bottom borders, respectively. Default: None, which means no padding.
pad_if_needed (boolean) – It will pad the image if smaller than the desired size to avoid raising an exception. Since cropping is done after padding, the padding seems to be done at a random offset. Default: False.
pad_val (Number | Sequence[Number]) – Pixel value for constant fill. If a tuple of length 3, it is used to fill the R, G, B channels respectively. Default: 0.
padding_mode (str) –
Type of padding. Defaults to “constant”. Should be one of the following:
- constant: Pads with a constant value, this value is specified with pad_val.
- edge: Pads with the last value at the edge of the image.
- reflect: Pads with reflection of image without repeating the last value on the edge. For example, padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, 3, 4, 3, 2].
- symmetric: Pads with reflection of image repeating the last value on the edge. For example, padding [1, 2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, 2, 3, 4, 4, 3].
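
The four padding modes are easiest to compare on a one-dimensional sequence. The helper below is a minimal sketch written for this comparison only (pad_1d is a hypothetical name, not part of the library), and it assumes 0 < pad < len(seq).

```python
from typing import List


def pad_1d(seq: List[int], pad: int, mode: str = 'constant',
           pad_val: int = 0) -> List[int]:
    """Pad a 1-D sequence on both sides; assumes 0 < pad < len(seq)."""
    if mode == 'constant':
        left = right = [pad_val] * pad
    elif mode == 'edge':
        left, right = [seq[0]] * pad, [seq[-1]] * pad
    elif mode == 'reflect':      # mirror without repeating the edge value
        left, right = seq[1:pad + 1][::-1], seq[-pad - 1:-1][::-1]
    elif mode == 'symmetric':    # mirror and repeat the edge value
        left, right = seq[:pad][::-1], seq[-pad:][::-1]
    else:
        raise ValueError(f'unknown padding mode: {mode}')
    return left + seq + right


reflect = pad_1d([1, 2, 3, 4], 2, 'reflect')      # [3, 2, 1, 2, 3, 4, 3, 2]
symmetric = pad_1d([1, 2, 3, 4], 2, 'symmetric')  # [2, 1, 1, 2, 3, 4, 4, 3]
```

The two results match the reflect and symmetric examples given in the padding_mode description.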

static get_params(img: numpy.ndarray, output_size: Tuple) → Tuple[source]¶

Get parameters for crop for a random crop.

Parameters

img (np.ndarray) – Image to be cropped.
output_size (Tuple) – Expected output size of the crop.

Returns

Params (xmin, ymin, target_height, target_width) to be passed to crop for random crop.

Return type

tuple

transform(results: dict) → dict[source]¶

Randomly crop the image.

Parameters: results (dict) – Result dict from previous pipeline.
Returns: Result dict with the transformed image.
Return type: dict

class mmselfsup.datasets.transforms.RandomGaussianBlur(sigma_min: float, sigma_max: float, prob: Optional[float] = 0.5)[source]¶

Gaussian blur augmentation, as used in SimCLR.

Paper link.

Required Keys:

img

Modified Keys:

img

Parameters

sigma_min (float) – The minimum parameter of Gaussian kernel std.
sigma_max (float) – The maximum parameter of Gaussian kernel std.
prob (float, optional) – Probability. Defaults to 0.5.

transform(results: dict) → dict[source]¶

Apply GaussianBlur augmentation to the given image.

Parameters: results (dict) – Results from previous pipeline.
Returns: Results after applying this transformation.
Return type: dict

class mmselfsup.datasets.transforms.RandomPatchWithLabels[source]¶

Relative patch location.

Required Keys:

img

Modified Keys:

img

Added Keys:

patch_label
patch_box
unpatched_img

Crops the image into several patches and concatenates every surrounding patch with the center one. Finally, it assigns the labels 0, 1, 2, 3, 4, 5, 6, 7 along with the patch positions.

transform(results: dict) → dict[source]¶

Apply random patch augmentation to the given image.

Parameters: results (dict) – Results from previous pipeline.
Returns: Results after applying this transformation.
Return type: dict
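
The eight labels correspond to the eight neighbours of the center patch in a 3 x 3 grid. The sketch below enumerates them; the row-major ordering of the labels and the name surrounding_patch_labels are assumptions made for illustration and may not match the ordering used by the transform.

```python
from typing import List, Tuple


def surrounding_patch_labels(grid: int = 3) -> List[Tuple[Tuple[int, int], int]]:
    """Pair each non-center grid position with a label, in row-major order."""
    center = (grid // 2, grid // 2)
    positions = [(r, c) for r in range(grid) for c in range(grid)
                 if (r, c) != center]
    return [(pos, label) for label, pos in enumerate(positions)]


pairs = surrounding_patch_labels()
```

Each returned pair names a (row, column) position of a surrounding patch and the label a model would be trained to predict for it.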

class mmselfsup.datasets.transforms.RandomResizedCrop(size: Union[int, Sequence[int]], scale: Tuple = (0.08, 1.0), ratio: Tuple = (0.75, 1.3333333333333333), max_attempts: int = 10, interpolation: str = 'bilinear', backend: str = 'cv2')[source]¶

Crop the given image to random size and aspect ratio.

A crop covering a random fraction (default: 0.08 to 1.0) of the original image area, with a random aspect ratio (default: 3/4 to 4/3), is made. This crop is then resized to the given size.

Required Keys:

img

Modified Keys:

img
img_shape

Parameters

size (Sequence | int) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop (size, size) is made.
scale (Tuple) – Range of the random size of the cropped image compared to the original image. Defaults to (0.08, 1.0).
ratio (Tuple) – Range of the random aspect ratio of the cropped image compared to the original image. Defaults to (3. / 4., 4. / 3.).
max_attempts (int) – Maximum number of attempts before falling back to Central Crop. Defaults to 10.
interpolation (str) – Interpolation method, accepted values are ‘nearest’, ‘bilinear’, ‘bicubic’, ‘area’, ‘lanczos’. Defaults to ‘bilinear’.
backend (str) – The image resize backend type, accepted values are cv2 and pillow. Defaults to cv2.

static get_params(img: numpy.ndarray, scale: Tuple, ratio: Tuple, max_attempts: int = 10) → Tuple[int, int, int, int][source]¶

Get parameters for crop for a random sized crop.

Parameters

img (np.ndarray) – Image to be cropped.
scale (Tuple) – Range of the random size of the cropped image compared to the original image size.
ratio (Tuple) – Range of the random aspect ratio of the cropped image compared to the original image area.
max_attempts (int) – Maximum number of attempts before falling back to central crop. Defaults to 10.

Returns

Params (ymin, xmin, ymax, xmax) to be passed to crop for a random sized crop.

Return type

tuple

transform(results: dict) → dict[source]¶

Randomly crop the image and resize the image to the target size.

Parameters: results (dict) – Result dict from previous pipeline.
Returns: Result dict with the transformed image.
Return type: dict
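
The interaction between max_attempts and the central-crop fallback can be sketched in plain Python. This is an illustration under assumptions, not the library code: resized_crop_params is a hypothetical name, and the fallback clamps the image's aspect ratio into the allowed range before centering the crop.

```python
import math
import random


def resized_crop_params(height, width, scale=(0.08, 1.0),
                        ratio=(3 / 4, 4 / 3), max_attempts=10, rng=random):
    """Return (ymin, xmin, ymax, xmax), falling back to a central crop."""
    area = height * width
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(max_attempts):
        target_area = rng.uniform(*scale) * area
        aspect = math.exp(rng.uniform(*log_ratio))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            ymin = rng.randint(0, height - h)
            xmin = rng.randint(0, width - w)
            return ymin, xmin, ymin + h - 1, xmin + w - 1
    # Fallback: central crop with the aspect ratio clamped into `ratio`.
    in_ratio = width / height
    if in_ratio < min(ratio):
        w, h = width, int(round(width / min(ratio)))
    elif in_ratio > max(ratio):
        h, w = height, int(round(height * max(ratio)))
    else:
        w, h = width, height
    ymin, xmin = (height - h) // 2, (width - w) // 2
    return ymin, xmin, ymin + h - 1, xmin + w - 1


box = resized_crop_params(100, 400, max_attempts=0)  # forces the fallback
```

Setting max_attempts to 0 skips the random sampling entirely, which makes the fallback path easy to inspect.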

class mmselfsup.datasets.transforms.RandomResizedCropAndInterpolationWithTwoPic(size: Union[tuple, int], second_size=None, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation='bilinear', second_interpolation='lanczos')[source]¶

Crop the given PIL Image to random size and aspect ratio with random interpolation.

Required Keys:

img

Modified Keys:

img

Added Keys:

target_img

This module is borrowed from https://github.com/microsoft/unilm/tree/master/beit.

A crop covering a random fraction (default: 0.08 to 1.0) of the original image area, with a random aspect ratio (default: 3/4 to 4/3), is made. This crop is then resized to the given size. This is popularly used to train the Inception networks. This module first crops the image and then resizes the crop to two different sizes.

Parameters

size (Union[tuple, int]) – Expected output size of each edge of the first image.
second_size (Union[tuple, int], optional) – Expected output size of each edge of the second image.
scale (tuple[float, float]) – Range of the crop area relative to the original image area. Defaults to (0.08, 1.0).
ratio (tuple[float, float]) – Range of the aspect ratio of the crop. Defaults to (3./4., 4./3.).
interpolation (str) – The interpolation for the first image. Defaults to bilinear.
second_interpolation (str) – The interpolation for the second image. Defaults to lanczos.

static get_params(img: numpy.ndarray, scale: tuple, ratio: tuple) → Sequence[int][source]¶

Get parameters for crop for a random sized crop.

Parameters

img (np.ndarray) – Image to be cropped.
scale (tuple) – Range of the crop area relative to the original image area.
ratio (tuple) – Range of the aspect ratio of the crop.

Returns

params (i, j, h, w) to be passed to crop for a random sized crop.

Return type

tuple

transform(results: dict) → dict[source]¶

Crop the given image and resize it to two different sizes.

This module crops the given image randomly and resizes the crop to two different sizes. This is popularly used in BEiT-style masked image modeling, where an off-the-shelf model is used to provide the target.

Parameters: results (dict) – Results from previous pipeline.
Returns: Results after applying this transformation.
Return type: dict

class mmselfsup.datasets.transforms.RandomRotation(degrees: Union[int, Sequence[int]], interpolation: str = 'nearest', expand: bool = False, center: Optional[Tuple[float]] = None, fill: int = 0)[source]¶

Rotate the image by angle.

Required Keys:

img

Modified Keys:

img

Parameters

degrees (sequence | int) – Range of degrees to select from. If degrees is an int instead of sequence like (min, max), the range of degrees will be (-degrees, +degrees).
interpolation (str, optional) – Interpolation method, accepted values are ‘nearest’, ‘bilinear’, ‘bicubic’, ‘area’, ‘lanczos’. Defaults to ‘nearest’.
expand (bool, optional) – Optional expansion flag. If true, expands the output to make it large enough to hold the entire rotated image. If false or omitted, make the output image the same size as the input image. Note that the expand flag assumes rotation around the center and no translation. Defaults to False.
center (Tuple[float], optional) – Center point (w, h) of the rotation in the source image. If not specified, the center of the image will be used. Defaults to None.
fill (int, optional) – Pixel fill value for the area outside the rotated image. Defaults to 0.

static get_params(degrees: List[float]) → float[source]¶

Get parameters for rotate for a random rotation.

Parameters

degrees (List[float]) – Range of degrees to select from.

Returns

angle parameter to be passed to rotate for random rotation.

Return type

float

transform(results: dict) → dict[source]¶

Randomly rotate the image.

Parameters: results (dict) – Result dict from previous pipeline.
Returns: Result dict with the transformed image.
Return type: dict

class mmselfsup.datasets.transforms.RandomSolarize(threshold: int = 128, prob: float = 0.5)[source]¶

Solarization augmentation, as used in BYOL.

Paper link.

Required Keys:

img

Modified Keys:

img

Parameters

threshold (int, optional) – The solarization threshold. Defaults to 128.
prob (float, optional) – Probability. Defaults to 0.5.

transform(results: dict) → dict[source]¶

Apply Solarize augmentation to the given image.

Parameters: results (dict) – Results from previous pipeline.
Returns: Results after applying this transformation.
Return type: dict
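
Solarization inverts the bright part of an image. The sketch below shows the idea on a flat list of 8-bit pixel values; it is an illustration rather than the transform's code, and it assumes that values at or above the threshold are the ones inverted. The names solarize and maybe_solarize are hypothetical.

```python
import random
from typing import List


def solarize(pixels: List[int], threshold: int = 128) -> List[int]:
    """Invert every 8-bit pixel value at or above the threshold."""
    return [p if p < threshold else 255 - p for p in pixels]


def maybe_solarize(pixels: List[int], threshold: int = 128,
                   prob: float = 0.5, rng=random) -> List[int]:
    """Apply solarization with probability prob, else return a copy."""
    if rng.random() < prob:
        return solarize(pixels, threshold)
    return list(pixels)


result = solarize([0, 100, 128, 200, 255])  # [0, 100, 127, 55, 0]
```

Dark pixels pass through unchanged, while bright pixels are mapped to 255 minus their value.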

class mmselfsup.datasets.transforms.RotationWithLabels[source]¶

Rotation prediction.

Required Keys:

img

Modified Keys:

img

Added Keys:

rot_label

Rotate each image by 0, 90, 180, and 270 degrees and assign the labels 0, 1, 2, 3 correspondingly.

transform(results: dict) → dict[source]¶

Apply rotation augmentation to the given image.

Parameters: results (dict) – Results from previous pipeline.
Returns: Results after applying this transformation.
Return type: dict
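
The pairing of rotated views with labels can be sketched on a small nested list. This is an illustration only: the counter-clockwise rotation direction and the helper names rot90 and rotations_with_labels are assumptions, not taken from the transform.

```python
from typing import List, Tuple

Image = List[List[int]]


def rot90(img: Image) -> Image:
    """Rotate a 2-D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rotations_with_labels(img: Image) -> List[Tuple[Image, int]]:
    """Return the 0, 90, 180 and 270 degree views with labels 0 to 3."""
    views, current = [], [list(row) for row in img]
    for label in range(4):
        views.append((current, label))
        current = rot90(current)
    return views


views = rotations_with_labels([[1, 2], [3, 4]])
```

Label k marks the view rotated by k quarter turns, so a classifier trained on these pairs learns to predict the rotation.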

class mmselfsup.datasets.transforms.SimMIMMaskGenerator(input_size: int = 192, mask_patch_size: int = 32, model_patch_size: int = 4, mask_ratio: float = 0.6)[source]¶

Generate random block mask for each Image.

Added Keys:

mask

This module is used in SimMIM to generate masks.

Parameters

input_size (int) – Size of input image. Defaults to 192.
mask_patch_size (int) – Size of each block mask. Defaults to 32.
model_patch_size (int) – Patch size of each token. Defaults to 4.
mask_ratio (float) – The mask ratio of image. Defaults to 0.6.

transform(results: dict) → dict[source]¶

Method to generate random block mask for each Image in SimMIM.

Parameters: results (dict) – Result dict from previous pipeline.
Returns: Result dict with added key mask.
Return type: dict
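
The way the four parameters interact can be sketched in plain Python. This is an illustration under assumptions, not the SimMIMMaskGenerator code: it assumes the mask is drawn on a coarse grid of mask_patch_size cells and then upsampled to the model's token grid, and simmim_mask is a hypothetical name.

```python
import math
import random
from typing import List


def simmim_mask(input_size: int = 192, mask_patch_size: int = 32,
                model_patch_size: int = 4, mask_ratio: float = 0.6,
                rng=random) -> List[List[int]]:
    """Mask random coarse cells, then expand them to model-token resolution."""
    rand_size = input_size // mask_patch_size    # side of the coarse grid
    scale = mask_patch_size // model_patch_size  # tokens per coarse cell side
    token_count = rand_size ** 2
    mask_count = math.ceil(token_count * mask_ratio)
    masked = set(rng.sample(range(token_count), mask_count))
    coarse = [[1 if r * rand_size + c in masked else 0
               for c in range(rand_size)] for r in range(rand_size)]
    # Upsample each coarse cell to a scale x scale block of model tokens.
    side = rand_size * scale
    return [[coarse[r // scale][c // scale] for c in range(side)]
            for r in range(side)]


mask = simmim_mask(rng=random.Random(0))
```

With the default arguments the coarse grid is 6 x 6, 22 of its 36 cells are masked, and the returned mask is 48 x 48.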

samplers¶

class mmselfsup.datasets.samplers.DeepClusterSampler(dataset: Sized, shuffle: bool = True, seed: Optional[int] = None, replace: bool = False, round_up: bool = True)[source]¶

The sampler inherits DefaultSampler from mmengine.

This sampler supports setting replace to True when drawing indices. In addition, it defines the function set_uniform_indices, which is used by DeepClusterHook.

Parameters

dataset (Sized) – The dataset.
shuffle (bool) – Whether to shuffle the dataset or not. Defaults to True.
seed (int, optional) – Random seed used to shuffle the sampler if shuffle=True. This number should be identical across all processes in the distributed group. Defaults to None.
replace (bool) – Whether to sample with replacement during random shuffling. It only takes effect when shuffle is True. Defaults to False.
round_up (bool) – Whether to add extra samples to make the number of samples evenly divisible by the world size. Defaults to True.

set_uniform_indices(labels: list, num_classes: int) → None[source]¶

The function is applied in DeepClusterHook for uniform sampling.

Parameters

labels (list) – The updated labels after clustering.
num_classes (int) – number of clusters.

Returns

None
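
Uniform sampling over clusters means drawing roughly the same number of indices from every non-empty cluster, oversampling the small ones. The sketch below illustrates one way to do that; the exact per-cluster quota and the name uniform_cluster_indices are assumptions and may differ from what set_uniform_indices does.

```python
import random
from typing import List


def uniform_cluster_indices(labels: List[int], num_classes: int,
                            rng=random) -> List[int]:
    """Draw about len(labels) indices, spread evenly across clusters."""
    n = len(labels)
    per_cluster = n // num_classes + 1
    buckets = [[] for _ in range(num_classes)]
    for idx, label in enumerate(labels):
        buckets[label].append(idx)
    indices = []
    for bucket in buckets:
        if not bucket:
            continue  # empty clusters contribute nothing
        if len(bucket) <= per_cluster:
            # Small cluster: oversample with replacement.
            indices.extend(rng.choices(bucket, k=per_cluster))
        else:
            # Large cluster: subsample without replacement.
            indices.extend(rng.sample(bucket, per_cluster))
    rng.shuffle(indices)
    return indices[:n]


labels = [0] * 8 + [1] * 2
sampled = uniform_cluster_indices(labels, num_classes=2, rng=random.Random(0))
```

Although cluster 1 holds only two of the ten samples, it supplies a much larger share of the sampled indices.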

mmselfsup.engine¶

hooks¶

class mmselfsup.engine.hooks.DeepClusterHook(extract_dataloader: dict, clustering: dict, unif_sampling: bool, reweight: bool, reweight_pow: float, init_memory: bool = False, initial: bool = True, interval: int = 1, seed: Optional[int] = None)[source]¶

Hook for DeepCluster.

This hook includes the global clustering process in DeepCluster (DC).

Parameters

extract_dataloader (dict) – Config dict for the dataloader used in feature extraction.
clustering (dict) – Config dict that specifies the clustering algorithm.
unif_sampling (bool) – Whether to apply uniform sampling.
reweight (bool) – Whether to apply loss re-weighting.
reweight_pow (float) – The power of re-weighting.
init_memory (bool) – Whether to initialize memory banks used in ODC. Defaults to False.
initial (bool) – Whether to call the hook initially. Defaults to True.
interval (int) – Frequency of epochs to call the hook. Defaults to 1.
seed (int, optional) – Random seed. Defaults to None.

after_train_epoch(runner) → None[source]¶: Run cluster after indicated epoch.

before_train(runner) → None[source]¶: Run cluster before training.

deepcluster(runner) → None[source]¶: Call cluster algorithm.

evaluate(runner, new_labels: numpy.ndarray) → None[source]¶: Evaluate with labels histogram.

set_reweight(runner, labels: numpy.ndarray, reweight_pow: float = 0.5)[source]¶

Loss re-weighting.

Re-weighting the loss according to the number of samples in each class.

Parameters

runner (mmengine.Runner) – mmengine Runner.
labels (numpy.ndarray) – Label assignments.
reweight_pow (float, optional) – The power of re-weighting. Defaults to 0.5.
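
Re-weighting by cluster size can be sketched as an inverse-frequency weight raised to reweight_pow and then normalised. The following is a minimal illustration, assuming that formulation; the name cluster_loss_weights and the small constant guarding against division by zero are choices made here, not confirmed details of set_reweight.

```python
from typing import List


def cluster_loss_weights(labels: List[int], num_classes: int,
                         reweight_pow: float = 0.5) -> List[float]:
    """Weight each class by (1 / count) ** reweight_pow, normalised to sum 1."""
    hist = [0] * num_classes
    for label in labels:
        hist[label] += 1
    # The small constant only guards against division by zero.
    inv = [(1.0 / (count + 1e-10)) ** reweight_pow for count in hist]
    total = sum(inv)
    return [v / total for v in inv]


weights = cluster_loss_weights([0, 0, 0, 0, 1], num_classes=2)
```

With four samples in class 0 and one in class 1, the weights come out close to 1/3 and 2/3, so the smaller cluster contributes more per sample to the loss.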

class mmselfsup.engine.hooks.DenseCLHook(start_iters: int = 1000)[source]¶

Hook for DenseCL.

This hook includes loss_lambda warmup in DenseCL. Borrowed from the authors’ code: https://github.com/WXinlong/DenseCL.

Parameters: start_iters (int) – The number of warmup iterations to set loss_lambda=0. Defaults to 1000.

before_train(runner) → None[source]¶: Obtain loss_lambda from algorithm.

before_train_iter(runner, batch_idx: int, data_batch: Optional[Sequence[dict]] = None) → None[source]¶: Adjust loss_lambda every train iter.
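
The warmup described above amounts to a step function on the iteration counter. A minimal sketch, with the hypothetical helper name warmup_loss_lambda and a target value passed in by the caller:

```python
def warmup_loss_lambda(cur_iter: int, start_iters: int,
                       target_lambda: float) -> float:
    """Return 0 during warmup and the configured value afterwards."""
    return target_lambda if cur_iter >= start_iters else 0.0


schedule = [warmup_loss_lambda(i, start_iters=1000, target_lambda=0.5)
            for i in (0, 999, 1000, 5000)]
```

Here schedule is [0.0, 0.0, 0.5, 0.5]: the dense loss term is switched off for the first 1000 iterations and restored afterwards. The value 0.5 is only an example.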

class mmselfsup.engine.hooks.ODCHook(centroids_update_interval: int, deal_with_small_clusters_interval: int, evaluate_interval: int, reweight: bool, reweight_pow: float, dist_mode: bool = True)[source]¶

Hook for ODC.

This hook includes the online clustering process in ODC.

Parameters

centroids_update_interval (int) – Frequency of iterations to update centroids.
deal_with_small_clusters_interval (int) – Frequency of iterations to deal with small clusters.
evaluate_interval (int) – Frequency of iterations to evaluate clusters.
reweight (bool) – Whether to perform loss re-weighting.
reweight_pow (float) – The power of re-weighting.
dist_mode (bool) – Use distributed training or not. Defaults to True.

after_train_epoch(runner) → None[source]¶: Save cluster.

after_train_iter(runner, batch_idx: int, data_batch: Optional[Sequence[dict]] = None, outputs: Optional[dict] = None) → None[source]¶: Update cluster centroids and the loss_weight.

evaluate(runner, new_labels: numpy.ndarray) → None[source]¶: Evaluate with labels histogram.

set_reweight(runner, labels: Optional[numpy.ndarray] = None, reweight_pow: float = 0.5)[source]¶

Loss re-weighting.

Re-weighting the loss according to the number of samples in each class.

Parameters

runner (mmengine.Runner) – mmengine Runner.
labels (numpy.ndarray) – Label assignments.
reweight_pow (float, optional) – The power of re-weighting. Defaults to 0.5.

class mmselfsup.engine.hooks.SimSiamHook(fix_pred_lr: bool, lr: float, adjust_by_epoch: Optional[bool] = True)[source]¶

Hook for SimSiam.

This hook is for SimSiam to fix learning rate of predictor.

Parameters

fix_pred_lr (bool) – Whether to fix the lr of the predictor or not.
lr (float) – The value of the fixed lr.
adjust_by_epoch (bool, optional) – Whether to set the lr by epoch or by iter. Defaults to True.

before_train_epoch(runner) → None[source]¶: fix lr of predictor by epoch.

before_train_iter(runner, batch_idx: int, data_batch: Optional[Sequence[dict]] = None) → None[source]¶: fix lr of predictor by iter.
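
Fixing the predictor's learning rate comes down to overwriting the lr of selected optimizer parameter groups after the scheduler has run. The sketch below uses plain dicts in place of real param groups; the fix_lr flag used to mark the predictor's group and the name fix_predictor_lr are assumptions made for illustration.

```python
from typing import Dict, List


def fix_predictor_lr(param_groups: List[Dict], lr: float) -> None:
    """Reset lr in place for every group flagged with fix_lr=True."""
    for group in param_groups:
        if group.get('fix_lr', False):
            group['lr'] = lr


groups = [
    {'name': 'backbone', 'lr': 0.012},                   # follows the schedule
    {'name': 'predictor', 'lr': 0.012, 'fix_lr': True},  # kept constant
]
fix_predictor_lr(groups, lr=0.05)
```

After the call, only the flagged group has its lr reset to 0.05; the other group keeps whatever the scheduler assigned.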

class mmselfsup.engine.hooks.SwAVHook(batch_size: int, epoch_queue_starts: Optional[int] = 15, crops_for_assign: Optional[List[int]] = [0, 1], feat_dim: Optional[int] = 128, queue_length: Optional[int] = 0, interval: Optional[int] = 1, frozen_layers_cfg: Optional[Dict] = {})[source]¶

Hook for SwAV.

This hook builds the queue in SwAV according to epoch_queue_starts. The queue will be saved in runner.work_dir or loaded at start epoch if the path folder has queues saved before.

Parameters

batch_size (int) – The batch size per GPU for computing.
epoch_queue_starts (int, optional) – The epoch from which the queue starts to be used. Defaults to 15.
crops_for_assign (list[int], optional) – List of crop ids used for computing assignments. Defaults to [0, 1].
feat_dim (int, optional) – Feature dimension of the output vector. Defaults to 128.
queue_length (int, optional) – Length of the queue (0 for no queue). Defaults to 0.
interval (int, optional) – The interval to save the queue. Defaults to 1.
frozen_layers_cfg (dict, optional) – Dict to config frozen layers. The key-value pair is layer name and its frozen iters. If frozen, the layers don’t need gradient. Defaults to dict().

after_train_epoch(runner) → None[source]¶: Save the queues locally.

before_run(runner) → None[source]¶: Check whether the queues exist locally or not.

before_train_epoch(runner) → None[source]¶: Check the queues’ state.

before_train_iter(runner, batch_idx: int, data_batch: Optional[Sequence[dict]] = None) → None[source]¶: Freeze layers before specific iters according to the config.

optimizers¶

class mmselfsup.engine.optimizers.LARS(params: Iterable, lr: float, momentum: float = 0, weight_decay: float = 0, dampening: float = 0, eta: float = 0.001, nesterov: bool = False, eps: float = 1e-08)[source]¶

Implements layer-wise adaptive rate scaling for SGD.

Based on Algorithm 1 of the paper Large Batch Training of Convolutional Networks by You, Gitman, and Ginsburg.

Parameters

params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Base learning rate.
momentum (float) – Momentum factor. Defaults to 0.
weight_decay (float) – Weight decay (L2 penalty). Defaults to 0.
dampening (float) – Dampening for momentum. Defaults to 0.
eta (float) – LARS coefficient. Defaults to 0.001.
nesterov (bool) – Enables Nesterov momentum. Defaults to False.
eps (float) – A small number to avoid dividing by zero. Defaults to 1e-8.

Example

>>> optimizer = LARS(model.parameters(), lr=0.1, momentum=0.9,
>>>                  weight_decay=1e-4, eta=1e-3)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()

step(closure=None) → torch.Tensor[source]¶

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
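
The core of layer-wise adaptive rate scaling is a per-layer trust ratio built from the weight and gradient norms. The sketch below shows one common formulation for a single parameter vector; it omits momentum, dampening and Nesterov handling, and it is an illustration rather than the code of this class. The name lars_update is hypothetical.

```python
import math
from typing import List


def lars_update(weights: List[float], grads: List[float], lr: float,
                weight_decay: float = 0.0, eta: float = 0.001,
                eps: float = 1e-8) -> List[float]:
    """One momentum-free LARS step for a single layer's parameters."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    # Layer-wise trust ratio: large weights with small gradients move more.
    local_lr = eta * w_norm / (g_norm + weight_decay * w_norm + eps)
    step = lr * local_lr
    return [w - step * (g + weight_decay * w) for w, g in zip(weights, grads)]


new_weights = lars_update([3.0, 4.0], [0.0, 5.0], lr=0.1)
```

In this example both norms are 5, so the local rate is close to eta and the second weight moves by about 5e-4.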

class mmselfsup.engine.optimizers.LearningRateDecayOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: Optional[dict] = None)[source]¶

Different learning rates are set for different layers of the backbone.

Note: Currently, this optimizer constructor is built for ViT and Swin.

In addition to applying the layer-wise learning rate decay schedule, paramwise_cfg supports only weight decay customization.
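A common way to express layer-wise learning rate decay is to multiply the base learning rate by a power of a decay rate that depends on the layer depth. The sketch below assumes that formulation; `layer_lr_scales`, `num_layers`, and `decay_rate` are illustrative names, not the constructor's confirmed interface.

```python
from typing import List


def layer_lr_scales(num_layers: int, decay_rate: float) -> List[float]:
    """Per-layer LR multipliers for a backbone with ``num_layers`` blocks.

    Index 0 stands for the embedding layer and the last index for the layers
    after the final block (assumed layout, for illustration only).
    """
    total = num_layers + 2  # embedding + blocks + final layers
    # Layers closer to the input get smaller multipliers.
    return [decay_rate ** (total - 1 - layer_id) for layer_id in range(total)]


scales = layer_lr_scales(num_layers=12, decay_rate=0.65)
```

With this scheme the last entry is 1.0, so the final layers train at the base learning rate while earlier layers are updated more conservatively.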

add_params(params: List[dict], module: torch.nn.modules.module.Module, optimizer_cfg: dict, **kwargs) → None[source]¶

Add all parameters of module to the params list.

The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.

Parameters

params (List[dict]) – A list of param groups. It will be modified in place.
module (nn.Module) – The module to be added.
optimizer_cfg (dict) – The configuration of optimizer.
prefix (str) – The prefix of the module.

mmselfsup.evaluation¶

functional¶

mmselfsup.evaluation.functional.knn_eval(train_features: torch.Tensor, train_labels: torch.Tensor, test_features: torch.Tensor, test_labels: torch.Tensor, k: int, T: float, num_classes: int = 1000) → Tuple[float, float][source]¶

Compute accuracy of knn classifier predictions.

Parameters

train_features (Tensor) – Extracted features in the training set.
train_labels (Tensor) – Labels in the training set.
test_features (Tensor) – Extracted features in the testing set.
test_labels (Tensor) – Labels in the testing set.
k (int) – Number of NN to use.
T (float) – Temperature used in the voting coefficient.
num_classes (int) – Number of classes. Defaults to 1000.

Returns

The top1 and top5 accuracy.

Return type

Tuple[float, float]
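The roles of k and T can be illustrated with a minimal weighted k-NN sketch in plain Python. It assumes dot-product similarity and exponential vote weights exp(similarity / T); `knn_predict` is a hypothetical helper, not the library function, and the actual knn_eval works on tensors and also reports top-5 accuracy.

```python
import math
from collections import defaultdict
from typing import List, Sequence


def knn_predict(train_features: Sequence[Sequence[float]],
                train_labels: Sequence[int],
                query: Sequence[float], k: int, T: float) -> int:
    """Predict a label by temperature-weighted voting of the k nearest samples."""
    sims = [sum(a * b for a, b in zip(feat, query)) for feat in train_features]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # A smaller T sharpens the weights toward the most similar neighbours.
        votes[train_labels[i]] += math.exp(sims[i] / T)
    return max(votes, key=votes.get)


train_features = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
train_labels = [0, 1, 1]
test_features = [[1.0, 0.0], [0.6, 0.8]]
test_labels = [0, 1]

preds: List[int] = [
    knn_predict(train_features, train_labels, q, k=2, T=0.07)
    for q in test_features
]
top1 = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
```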

mmselfsup.models¶

algorithms¶

class mmselfsup.models.algorithms.BEiT(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

BEiT v1/v2.

Implementation of BEiT: BERT Pre-Training of Image Transformers and BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers.

loss(batch_inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

batch_inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.BYOL(backbone: dict, neck: dict, head: dict, base_momentum: float = 0.996, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

BYOL.

Implementation of Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
base_momentum (float) – The base momentum coefficient for the target network. Defaults to 0.996.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
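The base_momentum argument controls an exponential moving average (EMA) of the online network's weights. A minimal sketch of that update on plain lists follows; `ema_update` is a hypothetical helper, and the sketch keeps the momentum fixed rather than modelling any schedule applied during training.

```python
from typing import List


def ema_update(target: List[float], online: List[float],
               momentum: float) -> List[float]:
    """Move target weights toward online weights by a factor (1 - momentum)."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target, online)]


# With momentum 0.996 the target network changes very slowly.
target = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.996)
```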

extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters: inputs (List[torch.Tensor]) – The input images.
Returns: Backbone outputs.
Return type: Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.BarlowTwins(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

BarlowTwins.

Implementation of Barlow Twins: Self-Supervised Learning via Redundancy Reduction. Part of the code is borrowed from: https://github.com/facebookresearch/barlowtwins/blob/main/main.py.
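The redundancy-reduction objective from the paper can be sketched in plain Python: standardize each feature over the batch, build the cross-correlation matrix of the two views, then push its diagonal toward 1 and its off-diagonal entries toward 0. The function names and the value lambd=0.0051 are assumptions made for this illustration, not the configuration used by this class.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(z: Matrix) -> Matrix:
    """Zero-mean, unit-variance features over the batch dimension.

    Assumes every feature column has non-zero variance.
    """
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def barlow_twins_loss(z1: Matrix, z2: Matrix, lambd: float = 0.0051) -> float:
    """Sum of on-diagonal and weighted off-diagonal cross-correlation terms."""
    a, b = standardize(z1), standardize(z2)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            # Diagonal entries should be 1, off-diagonal entries should be 0.
            loss += (c_ij - 1.0) ** 2 if i == j else lambd * c_ij ** 2
    return loss


# Identical views with decorrelated features give a loss of (numerically) zero.
z = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
loss_same = barlow_twins_loss(z, z)
```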

extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

Backbone outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.BaseModel(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

BaseModel for SelfSup.

All algorithms should inherit this module.

Parameters

backbone (dict) – The backbone module. See mmcls.models.backbones.
neck (dict, optional) – The neck module to process features from backbone. See mmcls.models.necks. Defaults to None.
head (dict, optional) – The head module to do prediction and calculate loss from processed features. See mmcls.models.heads. Notice that if the head is not set, almost all methods cannot be used except extract_feat(). Defaults to None.
target_generator (dict, optional) – The target_generator module to generate targets for self-supervised learning optimization, such as HOG or features extracted from other modules (DALL-E, CLIP), etc. Defaults to None.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (Union[dict, nn.Module], optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (dict, optional) – The config to control the initialization. Defaults to None.

extract_feat(inputs: torch.Tensor)[source]¶

Extract features from the input tensor with shape (N, C, …).

This is an abstract method; subclasses should override it if needed.

Parameters: inputs (Tensor) – A batch of inputs. The shape of it should be (num_samples, num_channels, *img_shape).
Returns: The output of specified stage. The output depends on detailed implementation.
Return type: tuple | Tensor

forward(inputs: torch.Tensor, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, mode: str = 'tensor')[source]¶

Returns losses or predictions of training, validation, testing, and simple inference process.

This module overwrites the abstract method in BaseModel.

Parameters

inputs (torch.Tensor) – Batch input tensor collated by data_preprocessor.
data_samples (List[BaseDataElement], optional) – Data samples collated by data_preprocessor.
mode (str) – One of loss, predict and tensor. Defaults to tensor.
- loss: Called by train_step; returns a loss dict used for logging.
- predict: Called by val_step and test_step; returns a list of BaseDataElement results used for computing metrics.
- tensor: Called by custom code to get Tensor-type results.

Returns

If mode == loss, return a dict of loss tensors used for backward and logging.
If mode == predict, return a list of BaseDataElement for computing metrics and getting inference results.
If mode == tensor, return a tensor, a tuple of tensors, or a dict of tensors for custom use.

Return type

ForwardResults (dict or list)
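The three modes amount to a simple dispatch on the mode string. The toy class below sketches that control flow only; `ToyModel` and its return values are hypothetical stand-ins, and the real model works on tensors and SelfSupDataSample objects.

```python
from typing import Any, List, Optional


class ToyModel:
    """Hypothetical stand-in showing how ``forward`` routes by ``mode``."""

    def extract_feat(self, inputs: List[float]) -> tuple:
        return (sum(inputs),)

    def loss(self, inputs: List[float], data_samples: Optional[list]) -> dict:
        return {'loss': 0.0}

    def predict(self, inputs: List[float], data_samples: Optional[list]) -> list:
        return ['prediction']

    def forward(self, inputs: List[float],
                data_samples: Optional[list] = None,
                mode: str = 'tensor') -> Any:
        if mode == 'tensor':
            return self.extract_feat(inputs)
        if mode == 'loss':
            return self.loss(inputs, data_samples)
        if mode == 'predict':
            return self.predict(inputs, data_samples)
        raise RuntimeError(f'Invalid mode "{mode}".')


model = ToyModel()
feats = model.forward([1.0, 2.0])               # default mode is 'tensor'
losses = model.forward([1.0, 2.0], mode='loss')
preds = model.forward([1.0, 2.0], mode='predict')
```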

loss(inputs: torch.Tensor, data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]) → dict[source]¶

Calculate losses from a batch of inputs and data samples.

This is an abstract method; subclasses should override it if needed.

Parameters

inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[SelfSupDataSample]) – The annotation data of every sample.

Returns

A dictionary of loss components.

Return type

dict[str, Tensor]

predict(inputs: tuple, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶

Predict results from the extracted features.

This method returns the logits before the loss, which are used to compute all kinds of metrics. This is an abstract method; subclasses should override it if needed.

Parameters

inputs (tuple) – The features extracted from the backbone.
data_samples (List[BaseDataElement], optional) – The annotation data of every sample. Defaults to None.
**kwargs – Other keyword arguments accepted by the predict method of head.

property with_head: bool¶: Check if the model has a head module.

property with_neck: bool¶: Check if the model has a neck module.

property with_target_generator: bool¶: Check if the model has a target_generator module.

class mmselfsup.models.algorithms.CAE(backbone: dict, neck: dict, head: dict, target_generator: Optional[dict] = None, base_momentum: float = 0.0, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

CAE.

Implementation of Context Autoencoder for Self-Supervised Representation Learning.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of neck.
head (dict) – Config dict for module of head functions.
target_generator (dict, optional) – The target_generator module to generate targets for self-supervised learning optimization, such as HOG or features extracted from other modules (DALL-E, CLIP), etc. Defaults to None.
base_momentum (float) – The base momentum coefficient for the target network. Defaults to 0.0.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.

init_weights() → None[source]¶: Initialize weights.

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

momentum_update() → None[source]¶: Momentum update of the teacher network.

class mmselfsup.models.algorithms.DeepCluster(backbone: dict, neck: dict, head: dict, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

DeepCluster.

Implementation of Deep Clustering for Unsupervised Learning of Visual Features. The clustering operation is in engine/hooks/deepcluster_hook.py.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.

extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

Backbone outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶

The forward function in testing.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

List[SelfSupDataSample]

class mmselfsup.models.algorithms.DenseCL(backbone: dict, neck: dict, head: dict, queue_len: int = 65536, feat_dim: int = 128, momentum: float = 0.999, loss_lambda: float = 0.5, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

DenseCL.

Implementation of Dense Contrastive Learning for Self-Supervised Visual Pre-Training. Borrowed from the authors’ code: https://github.com/WXinlong/DenseCL. The loss_lambda warmup is in engine/hooks/densecl_hook.py.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
queue_len (int) – Number of negative keys maintained in the queue. Defaults to 65536.
feat_dim (int) – Dimension of compact feature vectors. Defaults to 128.
momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.999.
loss_lambda (float) – Loss weight for the single and dense contrastive loss. Defaults to 0.5.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
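In the DenseCL paper, loss_lambda balances the global (single) and dense contrastive terms as (1 - lambda) * L_single + lambda * L_dense. The sketch below shows that weighting on plain floats; `combine_densecl_losses` and the dictionary keys are illustrative assumptions.

```python
from typing import Dict


def combine_densecl_losses(loss_single: float, loss_dense: float,
                           loss_lambda: float = 0.5) -> Dict[str, float]:
    """Weight the single and dense contrastive losses with ``loss_lambda``."""
    return {
        'loss_single': (1.0 - loss_lambda) * loss_single,
        'loss_dense': loss_lambda * loss_dense,
    }


losses = combine_densecl_losses(2.0, 4.0, loss_lambda=0.25)
total = sum(losses.values())
```

Setting loss_lambda to 0 keeps only the global term, which is why a warmup that starts from 0 trains with the global loss alone at first.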

extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

Backbone outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample [source]¶

Predict results from the extracted features.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

SelfSupDataSample

class mmselfsup.models.algorithms.EVA(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

EVA.

Implementation of EVA: Exploring the Limits of Masked Visual Representation Learning at Scale.

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.MAE(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

MAE.

Implementation of Masked Autoencoders Are Scalable Vision Learners.

extract_feat(inputs: List[torch.Tensor], data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwarg) → Tuple[torch.Tensor][source]¶

The forward function to extract features from neck.

Parameters: inputs (List[torch.Tensor]) – The input images.
Returns: Neck outputs.
Return type: Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

reconstruct(features: torch.Tensor, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample [source]¶

The function is for image reconstruction.

Parameters

features (torch.Tensor) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

SelfSupDataSample

class mmselfsup.models.algorithms.MILAN(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

MILAN.

Implementation of MILAN: Masked Image Pretraining on Language Assisted Representation.

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.MaskFeat(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

MaskFeat.

Implementation of Masked Feature Prediction for Self-Supervised Visual Pre-Training.

extract_feat(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], compute_hog: bool = True, **kwarg) → Tuple[torch.Tensor][source]¶

The forward function to extract features from neck.

Parameters

inputs (List[torch.Tensor]) – The input images and mask.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.
compute_hog (bool) – Whether to compute HOG during extraction. If True, the batch size of inputs needs to be 1. Defaults to True.

Returns

Neck outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

reconstruct(features: List[torch.Tensor], data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample [source]¶

The function is for image reconstruction.

Parameters

features (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

SelfSupDataSample

class mmselfsup.models.algorithms.MixMIM(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

MixMIM.

Implementation of MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning.

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.MoCo(backbone: dict, neck: dict, head: dict, queue_len: int = 65536, feat_dim: int = 128, momentum: float = 0.999, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

MoCo.

Implementation of Momentum Contrast for Unsupervised Visual Representation Learning. Part of the code is borrowed from: https://github.com/facebookresearch/moco/blob/master/moco/builder.py.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
queue_len (int) – Number of negative keys maintained in the queue. Defaults to 65536.
feat_dim (int) – Dimension of compact feature vectors. Defaults to 128.
momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.999.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
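The queue of negative keys governed by queue_len and feat_dim behaves like a fixed-size ring buffer: each batch of new keys overwrites the oldest entries. The sketch below uses plain lists; `KeyQueue` is a hypothetical class, and the divisibility check is an assumption borrowed from the reference MoCo code for simplicity.

```python
from typing import List


class KeyQueue:
    """Fixed-size ring buffer of key features (illustrative only)."""

    def __init__(self, queue_len: int, feat_dim: int) -> None:
        self.queue: List[List[float]] = [[0.0] * feat_dim
                                         for _ in range(queue_len)]
        self.ptr = 0  # position of the oldest entries

    def dequeue_and_enqueue(self, keys: List[List[float]]) -> None:
        queue_len = len(self.queue)
        # Assumed for simplicity: the batch size divides the queue length.
        assert queue_len % len(keys) == 0
        for offset, key in enumerate(keys):
            self.queue[self.ptr + offset] = list(key)
        self.ptr = (self.ptr + len(keys)) % queue_len


q = KeyQueue(queue_len=4, feat_dim=2)
q.dequeue_and_enqueue([[1.0, 1.0], [2.0, 2.0]])
q.dequeue_and_enqueue([[3.0, 3.0], [4.0, 4.0]])
q.dequeue_and_enqueue([[5.0, 5.0], [6.0, 6.0]])  # overwrites the oldest keys
```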

extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

Backbone outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.MoCoV3(backbone: dict, neck: dict, head: dict, base_momentum: float = 0.99, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

MoCo v3.

Implementation of An Empirical Study of Training Self-Supervised Vision Transformers.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
base_momentum (float) – Momentum coefficient for the momentum-updated encoder. Defaults to 0.99.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.

extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

Backbone outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.NPID(backbone: dict, neck: dict, head: dict, memory_bank: dict, neg_num: int = 65536, ensure_neg: bool = False, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

NPID.

Implementation of Unsupervised Feature Learning via Non-parametric Instance Discrimination.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
memory_bank (dict) – Config dict for module of memory bank.
neg_num (int) – Number of negative samples for each image. Defaults to 65536.
ensure_neg (bool) – If False, there is a small probability that negative samples contain positive ones. Defaults to False.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.
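The effect of ensure_neg can be illustrated with a small sampling sketch: negatives are drawn from the memory bank indices, and when ensure_neg is set, any draw that hits the positive index is redrawn. The helper `sample_negatives` and uniform sampling are assumptions for illustration, not the memory bank's actual sampler.

```python
import random
from typing import List, Optional


def sample_negatives(pos_idx: int, bank_size: int, neg_num: int,
                     ensure_neg: bool = False,
                     rng: Optional[random.Random] = None) -> List[int]:
    """Draw ``neg_num`` memory-bank indices to use as negatives."""
    rng = rng or random.Random()
    neg = [rng.randrange(bank_size) for _ in range(neg_num)]
    if ensure_neg:
        if bank_size < 2:
            raise ValueError('ensure_neg needs at least two bank entries.')
        for i in range(neg_num):
            # Redraw until the sample differs from the positive index.
            while neg[i] == pos_idx:
                neg[i] = rng.randrange(bank_size)
    return neg


neg = sample_negatives(pos_idx=3, bank_size=5, neg_num=100,
                       ensure_neg=True, rng=random.Random(0))
```

Without ensure_neg the positive index can appear among the negatives by chance, which becomes rarer as the memory bank grows.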

extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

Backbone outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, Tensor]

class mmselfsup.models.algorithms.ODC(backbone: dict, neck: dict, head: dict, memory_bank: dict, pretrained: Optional[str] = None, data_preprocessor: Optional[dict] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

ODC.

Official implementation of Online Deep Clustering for Unsupervised Representation Learning. The operation w.r.t. memory bank and loss re-weighting is in engine/hooks/odc_hook.py.

Parameters

backbone (dict) – Config dict for module of backbone.
neck (dict) – Config dict for module of deep features to compact feature vectors.
head (dict) – Config dict for module of head functions.
memory_bank (dict) – Config dict for module of memory bank.
pretrained (str, optional) – The pretrained checkpoint path, support local path and remote path. Defaults to None.
data_preprocessor (dict, optional) – The config for preprocessing input data. If None or no specified type, it will use “SelfSupDataPreprocessor” as type. See SelfSupDataPreprocessor for more details. Defaults to None.
init_cfg (Union[List[dict], dict], optional) – Config dict for weight initialization. Defaults to None.

extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

Backbone outputs.

Return type

Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶

The forward function in testing.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

List[SelfSupDataSample]

class mmselfsup.models.algorithms.PixMIM(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

The official implementation of PixMIM.

Implementation of PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling.

Please refer to MAE for these initialization arguments.

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.RelativeLoc(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

Relative patch location.

Implementation of Unsupervised Visual Representation Learning by Context Prediction.

extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters: inputs (List[torch.Tensor]) – The input images.
Returns: Backbone outputs.
Return type: Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶

The forward function in testing.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

List[SelfSupDataSample]

class mmselfsup.models.algorithms.RotationPred(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

Rotation prediction.

Implementation of Unsupervised Representation Learning by Predicting Image Rotations.
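The pretext task can be pictured with a small dependency-free sketch: each image is rotated by 0, 90, 180 and 270 degrees, and the rotation index serves as the classification label. The rotation direction, the label order and the helper names below are assumptions made for illustration, and nested lists stand in for image tensors.

```python
from typing import List, Tuple

Image = List[List[int]]


def rot90(img: Image) -> Image:
    """Rotate a 2D grid by 90 degrees counter-clockwise."""
    # Transpose the grid, then reverse the row order.
    return [list(row) for row in zip(*img)][::-1]


def make_rotation_samples(img: Image) -> List[Tuple[Image, int]]:
    """Return the four rotated copies of ``img`` paired with labels 0..3."""
    samples = []
    current = img
    for label in range(4):
        samples.append((current, label))
        current = rot90(current)
    return samples


samples = make_rotation_samples([[1, 2], [3, 4]])
for rotated, label in samples:
    print(label, rotated)
```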

extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters: inputs (List[torch.Tensor]) – The input images.
Returns: Backbone outputs.
Return type: Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

predict(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample][source]¶

The forward function in testing.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

List[SelfSupDataSample]

class mmselfsup.models.algorithms.SimCLR(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

SimCLR.

Implementation of A Simple Framework for Contrastive Learning of Visual Representations.

extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters: inputs (List[torch.Tensor]) – The input images.
Returns: Backbone outputs.
Return type: Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.SimMIM(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

SimMIM.

Implementation of SimMIM: A Simple Framework for Masked Image Modeling.

extract_feat(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwarg) → torch.Tensor[source]¶

The forward function to extract features.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The reconstructed images.

Return type

torch.Tensor

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

reconstruct(features: torch.Tensor, data_samples: Optional[List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample]] = None, **kwargs) → mmselfsup.structures.selfsup_data_sample.SelfSupDataSample [source]¶

The function is for image reconstruction.

Parameters

features (torch.Tensor) – The input features.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

The prediction from model.

Return type

SelfSupDataSample

class mmselfsup.models.algorithms.SimSiam(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

SimSiam.

Implementation of Exploring Simple Siamese Representation Learning. The operation of fixing learning rate of predictor is in engine/hooks/simsiam_hook.py.

extract_feat(inputs: List[torch.Tensor], **kwarg) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters: inputs (List[torch.Tensor]) – The input images.
Returns: Backbone outputs.
Return type: Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

The forward function in training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

class mmselfsup.models.algorithms.SwAV(backbone: dict, neck: Optional[dict] = None, head: Optional[dict] = None, target_generator: Optional[dict] = None, pretrained: Optional[str] = None, data_preprocessor: Optional[Union[dict, torch.nn.modules.module.Module]] = None, init_cfg: Optional[dict] = None)[source]¶

SwAV.

Implementation of Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. The queue is built in engine/hooks/swav_hook.py.

extract_feat(inputs: List[torch.Tensor], **kwargs) → Tuple[torch.Tensor][source]¶

Function to extract features from backbone.

Parameters: inputs (List[torch.Tensor]) – The input images.
Returns: Backbone outputs.
Return type: Tuple[torch.Tensor]

loss(inputs: List[torch.Tensor], data_samples: List[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample], **kwargs) → Dict[str, torch.Tensor][source]¶

Forward computation during training.

Parameters

inputs (List[torch.Tensor]) – The input images.
data_samples (List[SelfSupDataSample]) – All elements required during the forward function.

Returns

A dictionary of loss components.

Return type

Dict[str, torch.Tensor]

backbones¶

class mmselfsup.models.backbones.BEiTViT(arch: str = 'base', img_size: int = 224, patch_size: int = 16, in_channels: int = 3, out_indices: int = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, avg_token: bool = False, frozen_stages: int = - 1, output_cls_token: bool = True, use_abs_pos_emb: bool = False, use_rel_pos_bias: bool = False, use_shared_rel_pos_bias: bool = True, layer_scale_init_value: int = 0.1, interpolate_mode: str = 'bicubic', patch_cfg: dict = {'padding': 0}, layer_cfgs: dict = {}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Vision Transformer for BEiT pre-training.

Rewritten version of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Parameters

arch (str | dict) –
Vision Transformer architecture. If a string is given, choose from ‘small’, ‘base’ and ‘large’. If a dict is given, it should have the following keys:
- embed_dims (int): The dimensions of embedding.
- num_layers (int): The number of transformer encoder layers.
- num_heads (int): The number of heads in attention modules.
- feedforward_channels (int): The hidden dimensions in feedforward modules.
Defaults to ‘base’.
img_size (int | tuple) – The expected input image shape. Because we support dynamic input shape, just set the argument to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 16.
in_channels (int) – The num of input channels. Defaults to 3.
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize final feature map. Defaults to True.
with_cls_token (bool) – Whether concatenating class token into image tokens as transformer input. Defaults to True.
avg_token (bool) – Whether or not to use the mean patch token for classification. If True, the model will only take the average of all patch tokens. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
use_abs_pos_emb (bool) – Whether to use absolute position embedding. Defaults to False.
use_rel_pos_bias (bool) – Whether to use relative position bias. Defaults to False.
use_shared_rel_pos_bias (bool) – Whether to use shared relative position bias. Defaults to True.
layer_scale_init_value (float) – The initialization value for the learnable scaling of attention and FFN. Defaults to 0.1.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to dict(padding=0).
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
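A custom architecture can be described with a dict instead of a preset name. The sketch below builds such a config and checks that the four documented keys are present; the numeric values are illustrative only, and the `type='BEiTViT'` entry assumes the usual OpenMMLab registry-style config.

```python
# A custom architecture passed as a dict must provide these four keys.
REQUIRED_ARCH_KEYS = {
    'embed_dims', 'num_layers', 'num_heads', 'feedforward_channels'
}

# Hypothetical small architecture; the numbers are illustrative only.
custom_arch = dict(
    embed_dims=384,
    num_layers=12,
    num_heads=6,
    feedforward_channels=1536,
)

# Registry-style backbone config using only parameters listed above.
backbone = dict(
    type='BEiTViT',
    arch=custom_arch,
    img_size=224,
    patch_size=16,
    use_shared_rel_pos_bias=True,
)

# Fail early if the arch dict is incomplete.
missing = REQUIRED_ARCH_KEYS - set(backbone['arch'])
assert not missing, f'arch dict is missing keys: {sorted(missing)}'
```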

forward(x: torch.Tensor, mask: torch.Tensor) → Tuple[torch.Tensor][source]¶

The BEiT style forward function.

Parameters

x (torch.Tensor) – Input images, which is of shape (B x C x H x W).
mask (torch.Tensor) – Mask for input, which is of shape (B x patch_resolution[0] x patch_resolution[1]).

Returns

Hidden features.

Return type

Tuple[torch.Tensor]
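The expected mask shape follows from the image and patch sizes. The sketch below computes it for a square input, assuming the patch embedding behaves like a convolution whose kernel size and stride both equal `patch_size`; `patch_resolution` and `batch_size` are hypothetical names introduced for this example.

```python
def patch_resolution(img_size: int, patch_size: int, padding: int = 0) -> tuple:
    """Number of patches along height and width for a square input."""
    # Convolution-style output size with stride equal to the patch size.
    side = (img_size + 2 * padding - patch_size) // patch_size + 1
    return (side, side)


batch_size = 2  # hypothetical batch size
res = patch_resolution(img_size=224, patch_size=16)
mask_shape = (batch_size, *res)
print(mask_shape)  # (2, 14, 14)
```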

init_weights() → None[source]¶: Initialize position embedding, patch embedding and cls token.

rescale_init_weight() → None[source]¶: Rescale the initialized weights.

class mmselfsup.models.backbones.CAEViT(arch: str = 'b', img_size: int = 224, patch_size: int = 16, out_indices: int = - 1, drop_rate: float = 0, drop_path_rate: float = 0, qkv_bias: bool = True, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', init_values: Optional[float] = None, patch_cfg: dict = {}, layer_cfgs: dict = {}, init_cfg: Optional[dict] = None)[source]¶

Vision Transformer for CAE pre-training.

Rewritten version of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Parameters

arch (str | dict) – Vision Transformer architecture. Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
init_values (float, optional) – The init value of gamma in TransformerEncoderLayer.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.

forward(img: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Generate features for masked images.

This function generates masked images and gets the hidden features for visible patches.

Parameters

img (torch.Tensor) – Input images, which is of shape B x C x H x W.
mask (torch.Tensor) – Mask for input, which is of shape B x L.

Returns

Hidden features.

Return type

torch.Tensor

init_weights() → None[source]¶: Initialize position embedding, patch embedding and cls token.

class mmselfsup.models.backbones.MAEViT(arch: Union[str, dict] = 'b', img_size: int = 224, patch_size: int = 16, out_indices: Union[Sequence, int] = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', patch_cfg: dict = {}, layer_cfgs: dict = {}, mask_ratio: float = 0.75, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Vision Transformer for MAE pre-training.

A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. This module implements the patch masking in MAE and initializes the position embedding with sine-cosine position embedding.

Parameters

arch (str | dict) – Vision Transformer architecture. Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
mask_ratio (float) – The ratio of total number of patches to be masked. Defaults to 0.75.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶

Generate features for masked images.

This function generates a mask, masks some patches randomly, and gets the hidden features for visible patches.

Parameters

x (torch.Tensor) – Input images, which is of shape B x C x H x W.

Returns

Hidden features, mask and the ids to restore original image.

x (torch.Tensor): hidden features of the visible patches, which is of shape B x (L * (1 - mask_ratio)) x C.

mask (torch.Tensor): mask used to mask image.

ids_restore (torch.Tensor): ids to restore original image.

Return type

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
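To show what "ids to restore original image" means in practice, the sketch below puts placeholder tokens after the visible tokens and uses `ids_restore` to bring everything back to the original patch order. It is an illustration under assumptions, not the library's decoder: strings stand in for feature vectors, and `restore_order` and the `'M'` placeholder are hypothetical.

```python
def restore_order(visible, ids_restore, mask_token='M'):
    """Append mask tokens to the visible tokens, then undo the shuffle."""
    num_patches = len(ids_restore)
    shuffled = list(visible) + [mask_token] * (num_patches - len(visible))
    # ids_restore[i] is the position of original patch i in the shuffled order.
    return [shuffled[pos] for pos in ids_restore]


# Four patches shuffled into the order 2, 0, 3, 1; the first two are kept.
ids_shuffle = [2, 0, 3, 1]
ids_restore = sorted(range(len(ids_shuffle)), key=lambda i: ids_shuffle[i])
visible = [f'p{i}' for i in ids_shuffle[:2]]

print(ids_restore)                          # [1, 3, 0, 2]
print(restore_order(visible, ids_restore))  # ['p0', 'M', 'p2', 'M']
```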

init_weights() → None[source]¶: Initialize position embedding, patch embedding and cls token.

random_masking(x: torch.Tensor, mask_ratio: float = 0.75) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶

Generate the mask for MAE Pre-training.

Parameters

x (torch.Tensor) – Image with data augmentation applied, which is of shape B x L x C.
mask_ratio (float) – The mask ratio of total patches. Defaults to 0.75.

Returns

Masked image, mask and the ids to restore original image.

x_masked (torch.Tensor): masked image.
mask (torch.Tensor): mask used to mask image.
ids_restore (torch.Tensor): ids to restore original image.

Return type

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
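The masking procedure can be sketched for a single sample without any tensor library: draw one random number per patch, sort the patches by that number, keep the first `L * (1 - mask_ratio)` of them, and record the permutation needed to undo the shuffle. This is a minimal sketch assuming the usual MAE-style argsort masking; `random_masking_sketch` and its `seed` argument are hypothetical, and `ids_keep` (indices of the kept patches) stands in for the masked tensor `x_masked`.

```python
import random


def random_masking_sketch(num_patches, mask_ratio=0.75, seed=0):
    """Illustrative per-sample masking on patch indices (not the library code)."""
    rng = random.Random(seed)
    len_keep = int(num_patches * (1 - mask_ratio))
    noise = [rng.random() for _ in range(num_patches)]
    # Sort patches by noise: the patches with the smallest noise are kept.
    ids_shuffle = sorted(range(num_patches), key=lambda i: noise[i])
    # ids_restore[i] is the position of patch i in the shuffled order.
    ids_restore = [0] * num_patches
    for pos, patch in enumerate(ids_shuffle):
        ids_restore[patch] = pos
    ids_keep = ids_shuffle[:len_keep]
    # Mask in the original patch order: 0 = kept, 1 = masked.
    mask = [0 if ids_restore[i] < len_keep else 1 for i in range(num_patches)]
    return ids_keep, mask, ids_restore


ids_keep, mask, ids_restore = random_masking_sketch(num_patches=196)
print(len(ids_keep), sum(mask))  # 49 147
```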

class mmselfsup.models.backbones.MILANViT(arch: Union[str, dict] = 'b', img_size: int = 224, patch_size: int = 16, out_indices: Union[Sequence, int] = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', patch_cfg: dict = {}, layer_cfgs: dict = {}, mask_ratio: float = 0.75, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

MILANViT.

Implementation of the encoder for MILAN: Masked Image Pretraining on Language Assisted Representation. This module inherits from MAEViT, overrides only the forward function, and replaces random masking with attention masking.

attention_masking(x: torch.Tensor, mask_ratio: float, importance: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶

Generate attention mask for MILAN.

This is what is different from MAEViT, which uses random masking. Attention masking generates attention mask for MILAN, according to importance. The higher the importance, the more likely the patch is kept.

Parameters

x (torch.Tensor) – Input images, which is of shape B x L x C.
mask_ratio (float) – The ratio of patches to be masked.
importance (torch.Tensor) – Importance of each patch, which is of shape B x L.

Returns

masked image, mask, the ids to restore original image, ids of the shuffled patches, ids of the kept patches, ids of the removed patches.

Return type

Tuple[torch.Tensor, …]
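One way to picture importance-driven masking is weighted sampling without replacement, so that patches with higher importance tend to appear earlier in the shuffled order and are therefore more likely to be kept. The sketch below uses Efraimidis-Spirakis keys in plain Python as a stand-in; it is not the library's tensor implementation, it assumes strictly positive importance values, and `attention_masking_sketch`, `seed` and the sample `importance` values are hypothetical.

```python
import random
from typing import List, Tuple


def attention_masking_sketch(
        importance: List[float], mask_ratio: float,
        seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Sample kept patches with probability increasing in their importance."""
    rng = random.Random(seed)
    num_patches = len(importance)
    len_keep = int(num_patches * (1 - mask_ratio))
    # Efraimidis-Spirakis keys: a larger weight pushes the key towards 1.
    keys = [rng.random() ** (1.0 / w) for w in importance]
    ids_shuffle = sorted(
        range(num_patches), key=lambda i: keys[i], reverse=True)
    ids_keep = ids_shuffle[:len_keep]
    ids_dump = ids_shuffle[len_keep:]
    # Mask in the original patch order: 0 = kept, 1 = removed.
    mask = [1] * num_patches
    for i in ids_keep:
        mask[i] = 0
    return ids_keep, ids_dump, mask


importance = [0.4, 0.1, 0.3, 0.2, 0.05, 0.05, 0.6, 0.3]
ids_keep, ids_dump, mask = attention_masking_sketch(importance, mask_ratio=0.75)
print(ids_keep, ids_dump, mask)
```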

forward(x: torch.Tensor, importance: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶

Generate features for masked images.

This function generates a mask, masks some patches, and gets the hidden features for visible patches. The mask is generated by importance. The higher the importance, the more likely the patch is kept. The importance is calculated by CLIP. The higher the CLIP score, the more likely the patch is kept. The CLIP score is calculated by cross attention between the class token and all other tokens from the last layer.

Parameters

x (torch.Tensor) – Input images, which is of shape B x C x H x W.
importance (torch.Tensor) – Importance of each patch, which is of shape B x L.

Returns

Masked image, the ids to restore original image, ids of the kept patches, ids of the removed patches.

x (torch.Tensor): hidden features of the visible patches, which is of shape B x (L * (1 - mask_ratio)) x C.

ids_restore (torch.Tensor): ids to restore original image.

ids_keep (torch.Tensor): ids of the kept patches.

ids_dump (torch.Tensor): ids of the removed patches.

Return type

Tuple[torch.Tensor, …]

class mmselfsup.models.backbones.MaskFeatViT(arch: Union[str, dict] = 'b', img_size: int = 224, patch_size: int = 16, out_indices: Union[Sequence, int] = - 1, drop_rate: float = 0, drop_path_rate: float = 0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, final_norm: bool = True, output_cls_token: bool = True, interpolate_mode: str = 'bicubic', patch_cfg: dict = {}, layer_cfgs: dict = {}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Vision Transformer for MaskFeat pre-training.

A PyTorch implementation of: Masked Feature Prediction for Self-Supervised Visual Pre-Training.

Parameters

arch (str | dict) – Vision Transformer architecture. Default: ‘b’
img_size (int | tuple) – Input image size
patch_size (int | tuple) – The patch size
out_indices (Sequence | int) – Output from which stages. Defaults to -1, means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize final feature map. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – Select the interpolate mode for position embedding vector resize. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Generate features for masked images.

Parameters

x (torch.Tensor) – Input images.
mask (torch.Tensor) – Input masks.

Returns

Features with cls_tokens.

Return type

torch.Tensor

init_weights() → None[source]¶: Initialize position embedding, mask token and cls token.

class mmselfsup.models.backbones.MixMIMTransformerPretrain(arch: Union[str, dict] = 'base', mlp_ratio: float = 4, img_size: int = 224, patch_size: int = 4, in_channels: int = 3, window_size: List = [14, 14, 14, 7], qkv_bias: bool = True, patch_cfg: dict = {}, norm_cfg: dict = {'type': 'LN'}, drop_rate: float = 0.0, drop_path_rate: float = 0.0, attn_drop_rate: float = 0.0, use_checkpoint: bool = False, range_mask_ratio: float = 0.0, init_cfg: Optional[dict] = None)[source]¶

MixMIM backbone during pretraining.

A PyTorch implementation of: MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning (https://arxiv.org/abs/2205.13137).

Parameters

arch (str | dict) –
MixMIM architecture. If a string is given, choose from ‘base’, ‘large’ and ‘huge’. If a dict is given, it should have the following keys:
- embed_dims (int): The dimensions of embedding.
- depths (int): The number of transformer encoder layers.
- num_heads (int): The number of heads in attention modules.
Defaults to ‘base’.
mlp_ratio (float) – The mlp ratio in FFN. Defaults to 4.
img_size (int | tuple) – The expected input image shape. Because we support dynamic input shape, just set the argument to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 4.
in_channels (int) – The num of input channels. Defaults to 3.
window_size (list) – The height and width of the window.
qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.
patch_cfg (dict) – Extra config dict for patch embedding. Defaults to an empty dict.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
attn_drop_rate (float) – Attention drop rate. Defaults to 0.
use_checkpoint (bool) – Whether to use checkpointing to reduce GPU memory cost. Defaults to False.
range_mask_ratio (float) – The range of mask ratio. Defaults to 0.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor, mask_ratio=0.5)[source]¶

Generate features for masked images.

This function generates a mask, masks some patches randomly, and gets the hidden features for visible patches.

Parameters

x (torch.Tensor) – Input images, which is of shape B x C x H x W.

Returns

x (torch.Tensor): hidden features, which is of shape B x L x C.
mask_s4 (torch.Tensor): the mask tensor for the last layer.

Return type

Tuple[torch.Tensor, torch.Tensor]

init_weights()[source]¶: Initialize position embedding, patch embedding.

random_masking(x: torch.Tensor, mask_ratio: float = 0.5)[source]¶

Generate the mask for MixMIM Pretraining.

Parameters

x (torch.Tensor) – Image with data augmentation applied, which is of shape B x L x C.
mask_ratio (float) – The mask ratio of total patches. Defaults to 0.5.

Returns

mask_s1 (torch.Tensor): mask with stride of self.encoder_stride // 8.
mask_s2 (torch.Tensor): mask with stride of self.encoder_stride // 4.
mask_s3 (torch.Tensor): mask with stride of self.encoder_stride // 2.
mask (torch.Tensor): mask with stride of self.encoder_stride.

Return type

Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
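The four masks can be thought of as one coarse mask seen at four resolutions. The sketch below assumes that a random mask is drawn at the coarsest resolution and that the finer masks are nearest-neighbour upsamplings of it by factors of 2, 4 and 8; the 7 x 7 grid corresponds to a 224 x 224 input with a hypothetical `encoder_stride` of 32. Nested lists stand in for tensors, and the helper names and `seed` are introduced only for this example.

```python
import random
from typing import List

Mask2D = List[List[int]]


def upsample_nearest(mask: Mask2D, factor: int) -> Mask2D:
    """Repeat every cell ``factor`` times along both axes."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def multi_scale_masks(out_h: int, out_w: int, mask_ratio: float = 0.5,
                      seed: int = 0):
    """Draw a coarse random mask and derive three finer versions of it."""
    rng = random.Random(seed)
    seq_len = out_h * out_w
    masked = set(rng.sample(range(seq_len), int(seq_len * mask_ratio)))
    # 1 marks a masked cell, 0 a visible one.
    mask = [[1 if r * out_w + c in masked else 0 for c in range(out_w)]
            for r in range(out_h)]
    mask_s3 = upsample_nearest(mask, 2)
    mask_s2 = upsample_nearest(mask, 4)
    mask_s1 = upsample_nearest(mask, 8)
    return mask_s1, mask_s2, mask_s3, mask


mask_s1, mask_s2, mask_s3, mask = multi_scale_masks(out_h=7, out_w=7)
print(len(mask_s1), len(mask_s2), len(mask_s3), len(mask))  # 56 28 14 7
```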

class mmselfsup.models.backbones.MoCoV3ViT(stop_grad_conv1: bool = False, frozen_stages: int = - 1, norm_eval: bool = False, init_cfg: Optional[Union[dict, List[dict]]] = None, **kwargs)[source]¶

Vision Transformer.

A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

Part of the code is modified from: https://github.com/facebookresearch/moco-v3/blob/main/vits.py.

Parameters

stop_grad_conv1 (bool) – whether to stop the gradient of convolution layer in PatchEmbed. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

init_weights() → None[source]¶: Initialize position embedding, patch embedding, qkv layers and cls token.

train(mode: bool = True) → None[source]¶

Set module status before forward computation.

Parameters: mode (bool) – Whether it is train_mode or test_mode

class mmselfsup.models.backbones.ResNeXt(depth: int, groups: int = 32, width_per_group: int = 4, **kwargs)[source]¶

ResNeXt backbone.

Please refer to the paper for details.

As the behavior of forward function in MMSelfSup is different from MMCls, we register our own ResNeXt, inheriting from mmselfsup.models.backbones.ResNet.

Parameters

depth (int) – Network depth, from {50, 101, 152}.
groups (int) – Groups of conv2 in Bottleneck. Defaults to 32.
width_per_group (int) – Width per group of conv2 in Bottleneck. Defaults to 4.
in_channels (int) – Number of input image channels. Defaults to 3.
stem_channels (int) – Output channels of the stem layer. Defaults to 64.
num_stages (int) – Stages of the network. Defaults to 4.
strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned; if multiple stages are specified, a tuple of tensors is returned. Defaults to (3, ).
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Defaults to False.
avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
conv_cfg (dict | None) – The config dict for conv layers. Defaults to None.
norm_cfg (dict) – The config dict for norm layers.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Defaults to False.

Example

>>> from mmselfsup.models import ResNeXt
>>> import torch
>>> self = ResNeXt(depth=50)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 8, 8)
(1, 512, 4, 4)
(1, 1024, 2, 2)
(1, 2048, 1, 1)

make_res_layer(**kwargs) → torch.nn.modules.module.Module[source]¶: Redefine the function for ResNeXt related args.

class mmselfsup.models.backbones.ResNet(depth: int, in_channels: int = 3, stem_channels: int = 64, base_channels: int = 64, expansion: Optional[int] = None, num_stages: int = 4, strides: Tuple[int] = (1, 2, 2, 2), dilations: Tuple[int] = (1, 1, 1, 1), out_indices: Tuple[int] = (4), style: str = 'pytorch', deep_stem: bool = False, avg_down: bool = False, frozen_stages: int = - 1, conv_cfg: Optional[dict] = None, norm_cfg: Optional[dict] = {'requires_grad': True, 'type': 'BN'}, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = False, init_cfg: Optional[dict] = [{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}], drop_path_rate: float = 0.0, **kwargs)[source]¶

ResNet backbone.

Please refer to the paper for details.

Parameters

depth (int) – Network depth, from {18, 34, 50, 101, 152}.
in_channels (int) – Number of input image channels. Defaults to 3.
stem_channels (int) – Output channels of the stem layer. Defaults to 64.
base_channels (int) – Middle channels of the first stage. Defaults to 64.
num_stages (int) – Stages of the network. Defaults to 4.
strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
out_indices (Sequence[int]) – Output from which stages. Defaults to (4, ).
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Defaults to False.
avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
conv_cfg (dict | None) – The config dict for conv layers. Defaults to None.
norm_cfg (dict) – The config dict for norm layers.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Defaults to False.
drop_path_rate (float) – Probability of the path to be zeroed. Defaults to 0.0.

Example

>>> from mmselfsup.models import ResNet
>>> import torch
>>> self = ResNet(depth=18)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 64, 8, 8)
(1, 128, 4, 4)
(1, 256, 2, 2)
(1, 512, 1, 1)

forward(x: torch.Tensor) → Tuple[torch.Tensor][source]¶

Forward function.

MMSelfSup overrides the forward function because its behavior differs from MMCls: MMCls does not output the feature map from the ‘stem’ layer, which is needed for downstream evaluation.
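
The stage indexing can be checked without building the network. The sketch below is a hypothetical helper (`resnet_output_shapes` is not part of MMSelfSup) that reproduces the shape arithmetic, assuming the default stem, square inputs, and that index 0 denotes the stem output while indices 1 to 4 denote the four residual stages.

```python
def _conv_out(size, stride):
    """Output size of a conv or pool layer whose padding is (kernel - 1) // 2."""
    return (size - 1) // stride + 1


def resnet_output_shapes(depth, input_size, out_indices=(4,),
                         stem_channels=64, base_channels=64,
                         strides=(1, 2, 2, 2)):
    """Return (channels, height, width) for each requested output index."""
    # Assumption: BasicBlock (expansion 1) for depth 18/34, Bottleneck (expansion 4) otherwise.
    expansion = 1 if depth in (18, 34) else 4
    shapes = []
    size = _conv_out(input_size, 2)      # stem convolution, stride 2
    if 0 in out_indices:
        shapes.append((stem_channels, size, size))
    size = _conv_out(size, 2)            # 3x3 max pooling, stride 2
    for i, stride in enumerate(strides):
        size = _conv_out(size, stride)   # first block of the stage applies the stride
        channels = base_channels * 2 ** i * expansion
        if i + 1 in out_indices:
            shapes.append((channels, size, size))
    return shapes


print(resnet_output_shapes(18, 32, out_indices=(1, 2, 3, 4)))
print(resnet_output_shapes(50, 224, out_indices=(0, 4)))
```

For depth 18 and a 32x32 input, the helper returns the four stage shapes listed in the example above (without the batch dimension).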

class mmselfsup.models.backbones.ResNetSobel(**kwargs)[source]¶

ResNet with Sobel layer.

This variant is used in clustering-based methods like DeepCluster to avoid color shortcut.

forward(x: torch.Tensor) → Tuple[torch.Tensor][source]¶: Forward function.
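
To illustrate why a Sobel layer removes the color shortcut, here is a minimal, dependency-free sketch. It assumes the layer averages the RGB channels into a grayscale map and then applies the two standard 3x3 Sobel kernels with zero padding; the actual layer in MMSelfSup may differ in details, and the function names are hypothetical.

```python
SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def to_gray(img):
    """Average the channels of an H x W x 3 image (nested lists)."""
    return [[sum(px) / 3.0 for px in row] for row in img]


def conv3x3(gray, kernel):
    """Apply a 3x3 kernel with zero padding (cross-correlation, as conv layers do)."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * gray[y][x]
            out[i][j] = acc
    return out


def sobel(img):
    """Return the horizontal and vertical gradient maps of an RGB image."""
    gray = to_gray(img)
    return conv3x3(gray, SOBEL_X), conv3x3(gray, SOBEL_Y)


red = [[(3, 0, 0)] * 4 for _ in range(4)]
green = [[(0, 3, 0)] * 4 for _ in range(4)]
edge = [[(0, 0, 0), (0, 0, 0), (3, 3, 3), (3, 3, 3)] for _ in range(4)]
```

Two images that differ only in hue (`red` and `green`) produce identical gradient maps, while the vertical edge in `edge` produces a non-zero horizontal gradient.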

class mmselfsup.models.backbones.ResNetV1d(**kwargs)[source]¶

ResNetV1d variant described in Bag of Tricks.

Compared with the default ResNet (ResNetV1b), ResNetV1d replaces the 7x7 conv in the input stem with three 3x3 convs. In the downsampling block, a 2x2 avg_pool with stride 2 is added before the conv, whose stride is changed to 1.

class mmselfsup.models.backbones.SimMIMSwinTransformer(arch: Union[str, dict] = 'T', img_size: Union[Tuple[int, int], int] = 224, in_channels: int = 3, drop_rate: float = 0.0, drop_path_rate: float = 0.1, out_indices: tuple = (3), use_abs_pos_embed: bool = False, with_cp: bool = False, frozen_stages: bool = - 1, norm_eval: bool = False, norm_cfg: dict = {'type': 'LN'}, stage_cfgs: Union[Sequence, dict] = {}, patch_cfg: dict = {}, pad_small_map: bool = False, init_cfg: Optional[dict] = None)[source]¶

Swin Transformer for SimMIM.

Parameters

arch (str | dict) – Swin Transformer architecture. Defaults to ‘T’.
img_size (int | tuple) – The size of input image. Defaults to 224.
in_channels (int) – The num of input channels. Defaults to 3.
drop_rate (float) – Dropout rate after embedding. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.1.
out_indices (tuple) – Layers to be outputted. Defaults to (3, ).
use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.
norm_cfg (dict) – Config dict for normalization layer at end of backbone. Defaults to dict(type='LN').
stage_cfgs (Sequence | dict) – Extra config dict for each stage. Defaults to empty dict.
patch_cfg (dict) – Extra config dict for patch embedding. Defaults to empty dict.
pad_small_map (bool) – If True, pad the small feature map to the window size, which is commonly used in detection and segmentation. If False, avoid shifting the window and shrink the window size to the size of the feature map, which is commonly used in classification. Defaults to False.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.

forward(x: torch.Tensor, mask: torch.Tensor) → Sequence[torch.Tensor][source]¶

Generate features for masked images.

This function generates masked images and gets the hidden features for them.

Parameters

x (torch.Tensor) – Input images.
mask (torch.Tensor) – Masks used to construct masked images.

Returns

A tuple containing features from multi-stages.

Return type

tuple
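
The masking step can be pictured with the following sketch. It assumes, as described in the SimMIM paper, that masked patch tokens are replaced by one shared mask token via `x * (1 - w) + mask_token * w`; the function name and the toy values are hypothetical, and plain lists stand in for tensors.

```python
def apply_mask_token(tokens, mask, mask_token):
    """Blend patch tokens with a shared mask token: x * (1 - w) + mask_token * w."""
    out = []
    for token, w in zip(tokens, mask):
        out.append([x * (1 - w) + m * w for x, m in zip(token, mask_token)])
    return out


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three patch embeddings
mask = [0, 1, 0]                               # 1 marks a masked patch
mask_token = [0.0, 0.0]                        # learnable in the real model
masked = apply_mask_token(tokens, mask, mask_token)
```

Only the second token is replaced; unmasked tokens pass through unchanged.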

init_weights() → None[source]¶: Initialize weights.

necks¶

class mmselfsup.models.necks.AvgPool2dNeck(output_size: int = 1)[source]¶

The average pooling 2d neck.

forward(x: List[torch.Tensor]) → List[torch.Tensor][source]¶: Forward function.

class mmselfsup.models.necks.BEiTV2Neck(num_layers: int = 2, early_layers: int = 9, backbone_arch: str = 'base', drop_rate: float = 0.0, drop_path_rate: float = 0.0, layer_scale_init_value: float = 0.1, use_rel_pos_bias: bool = False, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = {'bias': 0, 'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'})[source]¶

Neck for BEiTV2 Pre-training.

This module constructs the decoder for the final prediction.

Parameters

num_layers (int) – Number of encoder layers of neck. Defaults to 2.
early_layers (int) – The layer index of the early output from the backbone. Defaults to 9.
backbone_arch (str) – Vision Transformer architecture. Defaults to base.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
layer_scale_init_value (float) – The initialization value for the learnable scaling of attention and FFN. Defaults to 0.1.
use_rel_pos_bias (bool) – Whether to use a unique relative position bias. If False, use the shared relative position bias defined in the backbone. Defaults to False.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).
init_cfg (dict, optional) – Initialization config dict. Defaults to dict(type='TruncNormal', layer='Linear', std=0.02, bias=0).

forward(inputs: Tuple[torch.Tensor], rel_pos_bias: torch.Tensor, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶

Get the latent prediction and final prediction.

Parameters

inputs (Tuple[torch.Tensor]) – Features of tokens.
rel_pos_bias (torch.Tensor) – Shared relative position bias table.

Returns

x: The final layer features from the backbone, which are normed in BEiTV2Neck.
x_cls_pt: The early state features from the backbone, which consist of the final layer cls_token and the early state patch_tokens from the backbone and are sent to the PatchAggregation layers in the neck.

Return type

Tuple[torch.Tensor, torch.Tensor]

rescale_patch_aggregation_init_weight()[source]¶: Rescale the initialized weights.

class mmselfsup.models.necks.CAENeck(patch_size: int = 16, num_classes: int = 8192, embed_dims: int = 768, regressor_depth: int = 6, decoder_depth: int = 8, num_heads: int = 12, mlp_ratio: int = 4, qkv_bias: bool = True, qk_scale: Optional[float] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_values: Optional[float] = None, mask_tokens_num: int = 75, init_cfg: Optional[dict] = None)[source]¶

Neck for CAE Pre-training.

This module constructs the latent prediction regressor and the decoder for the latent prediction and final prediction.

Parameters

patch_size (int) – The patch size of each token. Defaults to 16.
num_classes (int) – The number of classes for final prediction. Defaults to 8192.
embed_dims (int) – The embed dims of latent feature in regressor and decoder. Defaults to 768.
regressor_depth (int) – The number of regressor blocks. Defaults to 6.
decoder_depth (int) – The number of decoder blocks. Defaults to 8.
num_heads (int) – The number of head in multi-head attention. Defaults to 12.
mlp_ratio (int) – The expand ratio of latent features in MLP. Defaults to 4.
qkv_bias (bool) – Whether or not to use qkv bias. Defaults to True.
qk_scale (float, optional) – The scale applied to the results of qk. Defaults to None.
drop_rate (float) – The dropout rate. Defaults to 0.
attn_drop_rate (float) – The dropout rate in attention block. Defaults to 0.
norm_cfg (dict) – The config of normalization layer. Defaults to dict(type='LN', eps=1e-6).
init_values (float, optional) – The init value of gamma. Defaults to None.
mask_tokens_num (int) – The number of mask tokens. Defaults to 75.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.

forward(x_unmasked: torch.Tensor, pos_embed_masked: torch.Tensor, pos_embed_unmasked: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

Get the latent prediction and final prediction.

Parameters

x_unmasked (torch.Tensor) – Features of unmasked tokens.
pos_embed_masked (torch.Tensor) – Position embedding of masked tokens.
pos_embed_unmasked (torch.Tensor) – Position embedding of unmasked tokens.

Returns

Final prediction and latent prediction.

Return type

Tuple[torch.Tensor, torch.Tensor]

init_weights() → None[source]¶: Initialization.

class mmselfsup.models.necks.ClsBatchNormNeck(input_features: int, affine: bool = False, eps: float = 1e-06, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Normalize cls token across batch before head.

This module was proposed in MAE for linear probing.

Parameters

input_features (int) – The dimension of features.
affine (bool) – A boolean value that, when set to True, gives this module learnable affine parameters. Defaults to False.
eps (float) – A value added to the denominator for numerical stability. Defaults to 1e-6.
init_cfg (Dict or List[Dict], optional) – Config dict for weight initialization. Defaults to None.

forward(inputs: Tuple[List[torch.Tensor]]) → Tuple[List[torch.Tensor]][source]¶: The forward function.
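
The normalization can be pictured with the sketch below. It assumes training-mode batch normalization without affine parameters, applied per feature dimension across the batch; the function name is hypothetical and nested lists stand in for tensors.

```python
def batch_norm_cls(cls_tokens, eps=1e-6):
    """Normalize each feature dimension across the batch, without affine parameters."""
    n = len(cls_tokens)
    dims = len(cls_tokens[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [token[d] for token in cls_tokens]
        mean = sum(column) / n
        # Biased variance (divide by n), which batch norm uses to normalize a batch.
        var = sum((v - mean) ** 2 for v in column) / n
        for i, v in enumerate(column):
            out[i][d] = (v - mean) / (var + eps) ** 0.5
    return out


cls_tokens = [[1.0, 10.0], [3.0, 30.0]]  # batch of two cls tokens with two features
normalized = batch_norm_cls(cls_tokens)
```

Both feature dimensions end up on the same scale even though their raw magnitudes differ by a factor of ten.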

class mmselfsup.models.necks.DenseCLNeck(in_channels: int, hid_channels: int, out_channels: int, num_grid: Optional[int] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

The non-linear neck of DenseCL.

Single and dense neck in parallel: fc-relu-fc, conv-relu-conv. Borrowed from the authors’ code.

Parameters

in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
num_grid (int) – The grid size of dense features. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(x: List[torch.Tensor]) → List[torch.Tensor][source]¶

Forward function of neck.

Parameters

x (List[torch.Tensor]) – Feature map of backbone.

Returns

The global feature vectors and dense feature vectors.

avgpooled_x: Global feature vectors.
x: Dense feature vectors.
avgpooled_x2: Dense feature vectors for queue.

Return type

List[torch.Tensor, torch.Tensor, torch.Tensor]

class mmselfsup.models.necks.LinearNeck(in_channels: int, out_channels: int, with_avg_pool: bool = True, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

The linear neck: fc only.

Parameters

in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(x: Tuple[torch.Tensor]) → List[torch.Tensor][source]¶

Forward function.

Parameters: x (List[torch.Tensor]) – The feature map of backbone.
Returns: The output features.
Return type: List[torch.Tensor]

class mmselfsup.models.necks.MAEPretrainDecoder(num_patches: int = 196, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 1024, decoder_embed_dim: int = 512, decoder_depth: int = 8, decoder_num_heads: int = 16, mlp_ratio: int = 4, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, predict_feature_dim: Optional[float] = None, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Decoder for MAE Pre-training.

Some of the code is borrowed from https://github.com/facebookresearch/mae.

Parameters

num_patches (int) – The number of total patches. Defaults to 196.
patch_size (int) – Image patch size. Defaults to 16.
in_chans (int) – The channel of input image. Defaults to 3.
embed_dim (int) – Encoder’s embedding dimension. Defaults to 1024.
decoder_embed_dim (int) – Decoder’s embedding dimension. Defaults to 512.
decoder_depth (int) – The depth of decoder. Defaults to 8.
decoder_num_heads (int) – Number of attention heads of decoder. Defaults to 16.
mlp_ratio (int) – Ratio of mlp hidden dim to decoder’s embedding dim. Defaults to 4.
norm_cfg (dict) – Normalization layer. Defaults to LayerNorm.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.

Example

>>> from mmselfsup.models import MAEPretrainDecoder
>>> import torch
>>> self = MAEPretrainDecoder()
>>> self.eval()
>>> inputs = torch.rand(1, 50, 1024)
>>> ids_restore = torch.arange(0, 196).unsqueeze(0)
>>> level_outputs = self.forward(inputs, ids_restore)
>>> print(tuple(level_outputs.shape))
(1, 196, 768)

property decoder_norm¶: The normalization layer of decoder.

forward(x: torch.Tensor, ids_restore: torch.Tensor) → torch.Tensor[source]¶

The forward function.

The process computes the visible patches’ features vectors and the mask tokens to output feature vectors, which will be used for reconstruction.

Parameters

x (torch.Tensor) – Hidden features, which are of shape B x (L * mask_ratio) x C.
ids_restore (torch.Tensor) – ids to restore original image.

Returns

The reconstructed feature vectors, which are of shape B x (num_patches) x C.

Return type

x (torch.Tensor)
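
The role of ids_restore can be shown with a small, dependency-free sketch. It assumes the MAE-style masking scheme in which patches are shuffled, the first part is kept as visible tokens, and ids_restore is the inverse permutation of the shuffle. The helper names are hypothetical, the cls token is omitted for brevity, and strings stand in for feature vectors.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Return indices of kept patches and the permutation that undoes the shuffle."""
    ids_shuffle = list(range(num_patches))
    rng.shuffle(ids_shuffle)
    len_keep = int(num_patches * (1 - mask_ratio))
    ids_keep = ids_shuffle[:len_keep]
    # ids_restore[patch] is the position of that patch in the shuffled order.
    ids_restore = [0] * num_patches
    for position, patch in enumerate(ids_shuffle):
        ids_restore[patch] = position
    return ids_keep, ids_restore


def restore_tokens(visible, ids_restore, mask_token):
    """Append mask tokens to the visible tokens, then unshuffle to the original order."""
    padded = list(visible) + [mask_token] * (len(ids_restore) - len(visible))
    return [padded[position] for position in ids_restore]


rng = random.Random(0)
patches = [f"p{i}" for i in range(16)]
ids_keep, ids_restore = random_masking(len(patches), mask_ratio=0.75, rng=rng)
visible = [patches[i] for i in ids_keep]
restored = restore_tokens(visible, ids_restore, mask_token="[M]")
```

After restoration, every visible token sits at its original patch position and every other position holds the mask token.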

init_weights() → None[source]¶: Initialize position embedding and mask token of MAE decoder.

class mmselfsup.models.necks.MILANPretrainDecoder(num_patches: int = 196, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 1024, decoder_embed_dim: int = 512, decoder_depth: int = 8, decoder_num_heads: int = 16, predict_feature_dim: int = 512, mlp_ratio: int = 4, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Prompt decoder for MILAN.

This decoder is used in MILAN pretraining, which will not update these visible tokens from the encoder.

Parameters

num_patches (int) – The number of total patches. Defaults to 196.
patch_size (int) – Image patch size. Defaults to 16.
in_chans (int) – The channel of input image. Defaults to 3.
embed_dim (int) – Encoder’s embedding dimension. Defaults to 1024.
decoder_embed_dim (int) – Decoder’s embedding dimension. Defaults to 512.
decoder_depth (int) – The depth of decoder. Defaults to 8.
decoder_num_heads (int) – Number of attention heads of decoder. Defaults to 16.
predict_feature_dim (int) – The dimension of the feature to be predicted. Defaults to 512.
mlp_ratio (int) – Ratio of mlp hidden dim to decoder’s embedding dim. Defaults to 4.
norm_cfg (dict) – Normalization layer. Defaults to LayerNorm.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor, ids_restore: torch.Tensor, ids_keep: torch.Tensor, ids_dump: torch.Tensor) → torch.Tensor[source]¶

Forward function.

Parameters

x (torch.Tensor) – The input features, which is of shape (N, L, C).
ids_restore (torch.Tensor) – The indices to restore these tokens to the original image.
ids_keep (torch.Tensor) – The indices of tokens to be kept.
ids_dump (torch.Tensor) – The indices of tokens to be masked.

Returns

The reconstructed features, which are of shape (N, L, C).

Return type

torch.Tensor

class mmselfsup.models.necks.MixMIMPretrainDecoder(num_patches: int = 196, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 1024, encoder_stride: int = 32, decoder_embed_dim: int = 512, decoder_depth: int = 8, decoder_num_heads: int = 16, mlp_ratio: int = 4, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Decoder for MixMIM Pretraining.

Some of the code is borrowed from https://github.com/Sense-X/MixMIM.

Parameters

num_patches (int) – The number of total patches. Defaults to 196.
patch_size (int) – Image patch size. Defaults to 16.
in_chans (int) – The channel of input image. Defaults to 3.
embed_dim (int) – Encoder’s embedding dimension. Defaults to 1024.
encoder_stride (int) – The output stride of MixMIM backbone. Defaults to 32.
decoder_embed_dim (int) – Decoder’s embedding dimension. Defaults to 512.
decoder_depth (int) – The depth of decoder. Defaults to 8.
decoder_num_heads (int) – Number of attention heads of decoder. Defaults to 16.
mlp_ratio (int) – Ratio of mlp hidden dim to decoder’s embedding dim. Defaults to 4.
norm_cfg (dict) – Normalization layer. Defaults to LayerNorm.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Forward function.

Parameters

x (torch.Tensor) – The input features, which is of shape (N, L, C).
mask (torch.Tensor) – The tensor to indicate which tokens are masked.

Returns

The reconstructed features, which are of shape (N, L, C).

Return type

torch.Tensor

init_weights() → None[source]¶: Initialize position embedding and mask token of MixMIM decoder.

class mmselfsup.models.necks.MoCoV2Neck(in_channels: int, hid_channels: int, out_channels: int, with_avg_pool: bool = True, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

The non-linear neck of MoCo v2: fc-relu-fc.

Parameters

in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(x: List[torch.Tensor]) → List[torch.Tensor][source]¶

Forward function.

Parameters: x (List[torch.Tensor]) – The feature map of backbone.
Returns: The output features.
Return type: List[torch.Tensor]

class mmselfsup.models.necks.NonLinearNeck(in_channels: int, hid_channels: int, out_channels: int, num_layers: int = 2, with_bias: bool = False, with_last_bn: bool = True, with_last_bn_affine: bool = True, with_last_bias: bool = False, with_avg_pool: bool = True, vit_backbone: bool = False, norm_cfg: dict = {'type': 'SyncBN'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶

The non-linear neck.

Structure: fc-bn-[relu-fc-bn] where the substructure in [] can be repeated. In the default setting, the substructure is repeated once. The neck can be used in many algorithms, e.g., SimCLR, BYOL, SimSiam.

Parameters

in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
num_layers (int) – Number of fc layers. Defaults to 2.
with_bias (bool) – Whether to use bias in fc layers (except for the last). Defaults to False.
with_last_bn (bool) – Whether to add the last BN layer. Defaults to True.
with_last_bn_affine (bool) – Whether to have learnable affine parameters in the last BN layer (set False for SimSiam). Defaults to True.
with_last_bias (bool) – Whether to use bias in the last fc layer. Defaults to False.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
vit_backbone (bool) – The key to indicate whether the upstream backbone is ViT. Defaults to False.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='SyncBN').
init_cfg (dict or list[dict], optional) – Initialization config dict.
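
How num_layers and with_last_bn shape the structure can be sketched as follows. The helper is hypothetical and only spells out the layer order implied by the description above; it assumes that with_last_bn affects only the BN layer after the last fc layer.

```python
def nonlinear_neck_layout(num_layers=2, with_last_bn=True):
    """List the layer sequence fc-bn-[relu-fc-bn] for a given number of fc layers."""
    layers = ["fc", "bn"]
    for i in range(1, num_layers):
        layers += ["relu", "fc"]
        is_last = i == num_layers - 1
        # Assumption: only the BN layer after the last fc layer is optional.
        if not is_last or with_last_bn:
            layers.append("bn")
    return "-".join(layers)


print(nonlinear_neck_layout())                       # default setting
print(nonlinear_neck_layout(num_layers=3))           # substructure repeated twice
print(nonlinear_neck_layout(with_last_bn=False))     # last BN layer dropped
```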

forward(x: Tuple[torch.Tensor]) → List[torch.Tensor][source]¶

Forward function.

Parameters: x (List[torch.Tensor]) – The feature map of backbone.
Returns: The output features.
Return type: List[torch.Tensor]

class mmselfsup.models.necks.ODCNeck(in_channels: int, hid_channels: int, out_channels: int, with_avg_pool: bool = True, norm_cfg: dict = {'type': 'SyncBN'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶

The non-linear neck of ODC: fc-bn-relu-dropout-fc-relu.

Parameters

in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='SyncBN').
init_cfg (dict or list[dict], optional) – Initialization config dict.

forward(x: List[torch.Tensor]) → List[torch.Tensor][source]¶

Forward function.

Parameters: x (List[torch.Tensor]) – The feature map of backbone.
Returns: The output features.
Return type: List[torch.Tensor]

class mmselfsup.models.necks.RelativeLocNeck(in_channels: int, out_channels: int, with_avg_pool: bool = True, norm_cfg: dict = {'type': 'BN1d'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶

The neck of relative patch location: fc-bn-relu-dropout.

Parameters

in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='BN1d').
init_cfg (dict or list[dict], optional) – Initialization config dict.

forward(x: List[torch.Tensor]) → List[torch.Tensor][source]¶

Forward function.

Parameters: x (List[torch.Tensor]) – The feature map of backbone.
Returns: The output features.
Return type: List[torch.Tensor]

class mmselfsup.models.necks.SimMIMNeck(in_channels: int, encoder_stride: int)[source]¶

Pre-train Neck For SimMIM.

This neck reconstructs the original image from the shrunk feature map.

Parameters

in_channels (int) – Channel dimension of the feature map.
encoder_stride (int) – The total stride of the encoder.

forward(x: torch.Tensor) → torch.Tensor[source]¶: Forward function.

class mmselfsup.models.necks.SwAVNeck(in_channels: int, hid_channels: int, out_channels: int, with_avg_pool: bool = True, with_l2norm: bool = True, norm_cfg: dict = {'type': 'SyncBN'}, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶

The non-linear neck of SwAV: fc-bn-relu-fc-normalization.

Parameters

in_channels (int) – Number of input channels.
hid_channels (int) – Number of hidden channels.
out_channels (int) – Number of output channels.
with_avg_pool (bool) – Whether to apply the global average pooling after backbone. Defaults to True.
with_l2norm (bool) – Whether to normalize the output after projection. Defaults to True.
norm_cfg (dict) – Dictionary to construct and config norm layer. Defaults to dict(type='SyncBN').
init_cfg (dict or list[dict], optional) – Initialization config dict.

forward(x: List[torch.Tensor]) → List[torch.Tensor][source]¶

Forward function.

Parameters: x (List[torch.Tensor]) – List of feature maps; len(x) equals len(num_crops).
Returns: The projection vectors.
Return type: List[torch.Tensor]

forward_projection(x: torch.Tensor) → torch.Tensor[source]¶

Compute projection.

Parameters: x (torch.Tensor) – The feature vectors after pooling.
Returns: The output features with projection or L2-norm.
Return type: torch.Tensor

heads¶

class mmselfsup.models.heads.BEiTV1Head(embed_dims: int, num_embed: int, loss: dict, init_cfg: Optional[Union[dict, List[dict]]] = {'bias': 0, 'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'})[source]¶

Pretrain Head for BEiT v1.

Compute the logits and the cross entropy loss.

Parameters

embed_dims (int) – The dimension of embedding.
num_embed (int) – The number of classification types.
loss (dict) – The config of loss.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.

forward(feats: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Generate loss.

Parameters

feats (torch.Tensor) – Features from backbone.
target (torch.Tensor) – Target generated by target_generator.
mask (torch.Tensor) – Generated mask for pretraining.

class mmselfsup.models.heads.BEiTV2Head(embed_dims: int, num_embed: int, loss: dict, init_cfg: Optional[Union[dict, List[dict]]] = {'bias': 0, 'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'})[source]¶

Pretrain Head for BEiT v2.

Compute the logits and the cross entropy loss.

Parameters

embed_dims (int) – The dimension of embedding.
num_embed (int) – The number of classification types.
loss (dict) – The config of loss.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.

forward(feats: torch.Tensor, feats_cls_pt: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Generate loss.

Parameters

feats (torch.Tensor) – Features from backbone.
feats_cls_pt (torch.Tensor) – Features from class late layers for pretraining.
target (torch.Tensor) – Target generated by target_generator.
mask (torch.Tensor) – Generated mask for pretraining.

class mmselfsup.models.heads.CAEHead(loss: dict, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Pretrain Head for CAE.

Compute the align loss and the main loss. In addition, this head produces the prediction target generated by DALL-E.

Parameters

loss (dict) – The config of loss.
tokenizer_path (str) – The path of the tokenizer.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.

forward(logits: torch.Tensor, logits_target: torch.Tensor, latent_pred: torch.Tensor, latent_target: torch.Tensor, mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

Generate loss.

Parameters

logits (torch.Tensor) – Logits generated by decoder.
logits_target (torch.Tensor) – Target generated by DALL-E for decoder prediction.
latent_pred (torch.Tensor) – Latent prediction by regressor.
latent_target (torch.Tensor) – Target for latent prediction, generated by teacher.

Returns

The tuple of loss.

loss_main (torch.Tensor): Cross entropy loss.
loss_align (torch.Tensor): MSE loss.

Return type

Tuple[torch.Tensor, torch.Tensor]

class mmselfsup.models.heads.ClsHead(loss: dict, with_avg_pool: bool = False, in_channels: int = 2048, num_classes: int = 1000, vit_backbone: bool = False, init_cfg: Optional[Union[dict, List[dict]]] = [{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶

Simplest classifier head, with only one fc layer.

Parameters

loss (dict) – Config of the loss.
with_avg_pool (bool) – Whether to apply the average pooling after neck. Defaults to False.
in_channels (int) – Number of input channels. Defaults to 2048.
num_classes (int) – Number of classes. Defaults to 1000.
init_cfg (Dict or List[Dict], optional) – Initialization config dict.

forward(x: Union[List[torch.Tensor], Tuple[torch.Tensor]], label: torch.Tensor) → torch.Tensor[source]¶

Get the loss.

Parameters

x (List[Tensor] | Tuple[Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
label (torch.Tensor) – The label for cross entropy loss.

Returns

The cross entropy loss.

Return type

torch.Tensor

logits(x: Union[List[torch.Tensor], Tuple[torch.Tensor]]) → List[torch.Tensor][source]¶

Get the logits before the cross_entropy loss.

This module is used to obtain the logits before the loss.

Parameters: x (List[Tensor] | Tuple[Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
Returns: A list of class scores.
Return type: List[Tensor]

class mmselfsup.models.heads.ContrastiveHead(loss: dict, temperature: float = 0.1)[source]¶

Head for contrastive learning.

The contrastive loss is implemented in this head and is used in SimCLR, MoCo, DenseCL, etc.

Parameters

loss (dict) – Config dict for module of loss functions.
temperature (float) – The temperature hyper-parameter that controls the concentration level of the distribution. Defaults to 0.1.

forward(pos: torch.Tensor, neg: torch.Tensor) → torch.Tensor[source]¶

Forward function to compute contrastive loss.

Parameters

pos (torch.Tensor) – Nx1 positive similarity.
neg (torch.Tensor) – Nxk negative similarity.

Returns

The contrastive loss.

Return type

torch.Tensor
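
A minimal sketch of the computation, assuming the configured loss is a cross-entropy loss and the positive similarity is placed at index 0 of the logits (the InfoNCE formulation). The function name is hypothetical; plain lists stand in for the Nx1 and Nxk tensors.

```python
import math


def contrastive_loss(pos, neg, temperature=0.1):
    """Mean InfoNCE loss; pos[i] is the positive similarity, neg[i] the k negatives."""
    total = 0.0
    for p, negatives in zip(pos, neg):
        logits = [p / temperature] + [n / temperature for n in negatives]
        # Cross entropy with the positive at index 0, using log-sum-exp for stability.
        largest = max(logits)
        log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
        total += log_sum_exp - logits[0]
    return total / len(pos)


pos = [0.9, 0.8]                     # N = 2 positive similarities
neg = [[0.1, 0.2, 0.0], [0.3, 0.1, 0.2]]  # k = 3 negatives per sample
loss = contrastive_loss(pos, neg, temperature=0.1)
```

A lower temperature sharpens the distribution, so a well-separated positive yields a smaller loss than it would at a higher temperature.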

class mmselfsup.models.heads.LatentCrossCorrelationHead(in_channels: int, loss: dict)[source]¶

Head for latent feature cross correlation.

Part of the code is borrowed from script.

Parameters

in_channels (int) – Number of input channels.
loss (dict) – Config dict for module of loss functions.

forward(input: torch.Tensor, target: torch.Tensor) → torch.Tensor[source]¶

Forward head.

Parameters

input (torch.Tensor) – NxC input features.
target (torch.Tensor) – NxC target features.

Returns

The cross correlation loss.

Return type

torch.Tensor

class mmselfsup.models.heads.LatentPredictHead(loss: dict, predictor: dict)[source]¶

Head for latent feature prediction.

This head builds a predictor, which can be any registered neck component. For example, BYOL and SimSiam call this head and build NonLinearNeck. It also implements similarity loss between two forward features.

Parameters

loss (dict) – Config dict for the loss.
predictor (dict) – Config dict for the predictor.

forward(input: torch.Tensor, target: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

Forward head.

Parameters

input (torch.Tensor) – NxC input features.
target (torch.Tensor) – NxC target features.

Returns

The latent predict loss.

Return type

torch.Tensor

class mmselfsup.models.heads.MAEPretrainHead(loss: dict, norm_pix: bool = False, patch_size: int = 16)[source]¶

Pre-training head for MAE.

Parameters

loss (dict) – Config of loss.
norm_pix (bool) – Whether or not to normalize the target. Defaults to False.
patch_size (int) – Patch size. Defaults to 16.

construct_target(target: torch.Tensor) → torch.Tensor[source]¶

Construct the reconstruction target.

In addition to splitting images into tokens, this module will also normalize the image according to norm_pix.

Parameters: target (torch.Tensor) – Image with the shape of B x 3 x H x W
Returns: Tokenized images with the shape of B x L x C
Return type: torch.Tensor
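
The per-patch normalization can be sketched as below. This assumes the MAE convention of normalizing each patch by its own mean and variance with a small epsilon, and an unbiased variance estimate; the function name is hypothetical.

```python
def normalize_patch(patch, eps=1e-6):
    """Normalize one flattened patch to zero mean and roughly unit variance."""
    n = len(patch)
    mean = sum(patch) / n
    # Assumption: unbiased variance (divide by n - 1), the default of torch.var.
    var = sum((v - mean) ** 2 for v in patch) / (n - 1)
    return [(v - mean) / (var + eps) ** 0.5 for v in patch]


normalized_patch = normalize_patch([1.0, 2.0, 3.0])
```

Normalizing the target this way makes the reconstruction loss insensitive to the brightness and contrast of individual patches.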

forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Forward function of MAE head.

Parameters

pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.

Returns

The reconstruction loss.

Return type

torch.Tensor
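
As a rough illustration of how the mask enters the loss, the sketch below averages a per-patch mean squared error over masked patches only. This follows the MAE reference implementation and is an assumption here, since the actual loss is whatever the loss config builds; the function name is hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only (mask value 1 = masked)."""
    total, count = 0.0, 0.0
    for p, t, m in zip(pred, target, mask):
        per_patch = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        total += per_patch * m
        count += m
    return total / count


pred = [[0.0, 0.0], [1.0, 1.0]]     # two patches, two values each
target = [[1.0, 1.0], [1.0, 3.0]]
loss_masked = masked_mse(pred, target, mask=[0, 1])  # only the second patch counts
loss_all = masked_mse(pred, target, mask=[1, 1])
```

Visible patches contribute nothing to the loss, so the model is only penalized for what it had to reconstruct.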

patchify(imgs: torch.Tensor) → torch.Tensor[source]¶

Split images into non-overlapped patches.

Parameters: imgs (torch.Tensor) – A batch of images, of shape B x C x H x W.
Returns: Patchified images. The shape is B x L x D.
Return type: torch.Tensor

unpatchify(x: torch.Tensor) → torch.Tensor[source]¶

Combine non-overlapped patches into images.

Parameters: x (torch.Tensor) – The shape is (N, L, patch_size**2 *3)
Returns: The shape is (N, 3, H, W)
Return type: imgs (torch.Tensor)
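
The two operations are inverses of each other, which the following dependency-free sketch demonstrates on a single image. It assumes a channel-first C x H x W layout and that each patch is flattened row by row with the channel as the fastest-varying index; the function signatures are hypothetical and differ from the methods above, which work on batched tensors.

```python
def patchify(img, patch_size):
    """Split a C x H x W image (nested lists) into L patches of length patch_size**2 * C."""
    c, h, w = len(img), len(img[0]), len(img[0][0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patches.append([img[ch][top + dy][left + dx]
                            for dy in range(patch_size)
                            for dx in range(patch_size)
                            for ch in range(c)])
    return patches


def unpatchify(patches, patch_size, channels, height, width):
    """Reassemble patches produced by patchify into a C x H x W image."""
    img = [[[None] * width for _ in range(height)] for _ in range(channels)]
    per_row = width // patch_size
    for index, patch in enumerate(patches):
        top = (index // per_row) * patch_size
        left = (index % per_row) * patch_size
        values = iter(patch)
        for dy in range(patch_size):
            for dx in range(patch_size):
                for ch in range(channels):
                    img[ch][top + dy][left + dx] = next(values)
    return img


# A 3 x 4 x 4 image whose values encode channel, row, and column.
image = [[[ch * 100 + y * 10 + x for x in range(4)] for y in range(4)] for ch in range(3)]
image_patches = patchify(image, patch_size=2)
rebuilt = unpatchify(image_patches, patch_size=2, channels=3, height=4, width=4)
```

With a patch size of 2, the 4x4 image yields L = 4 patches of D = 2 * 2 * 3 = 12 values each, and unpatchify recovers the original image exactly.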

class mmselfsup.models.heads.MILANPretrainHead(loss: dict)[source]¶

MILAN pretrain head.

Parameters: loss (dict) – Config of loss.

forward(pred: torch.Tensor, target: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶

Forward function.

Parameters

pred (torch.Tensor) – Predicted features, of shape (N, L, D).
target (torch.Tensor) – Target features, of shape (N, L, D).
mask (torch.Tensor, optional) – The mask of the target image. Defaults to None.

Returns

The reconstruction loss.

Return type

torch.Tensor

class mmselfsup.models.heads.MaskFeatPretrainHead(loss: dict)[source]¶

Pre-training head for MaskFeat.

It computes reconstruction loss between prediction and target in masked region.

Parameters: loss (dict) – Config dict for module of loss functions.

forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Forward head.

Parameters

pred (torch.Tensor) – Predictions, which are of shape B x (1 + L) x C.
target (torch.Tensor) – HOG features, which are of shape B x L x C.
mask (torch.Tensor) – The mask of the HOG features, which is of shape B x H x W.

Returns

The loss tensor.

Return type

torch.Tensor

class mmselfsup.models.heads.MixMIMPretrainHead(loss: dict, norm_pix: bool = False, patch_size: int = 16)[source]¶

MixMIM pretrain head.

Parameters

loss (dict) – Config of loss.
norm_pix (bool) – Whether or not to normalize the target. Defaults to False.
patch_size (int) – Patch size. Defaults to 16.

forward(x_rec: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Forward function of MixMIM head.

Parameters

x_rec (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.

Returns

The reconstruction loss.

Return type

torch.Tensor

class mmselfsup.models.heads.MoCoV3Head(predictor: dict, loss: dict, temperature: float = 1.0)[source]¶

Head for MoCo v3 algorithms.

This head builds a predictor, which can be any registered neck component. It also implements latent contrastive loss between two forward features. Part of the code is modified from: https://github.com/facebookresearch/moco-v3/blob/main/moco/builder.py.

Parameters

predictor (dict) – Config dict for module of predictor.
loss (dict) – Config dict for module of loss functions.
temperature (float) – The temperature hyper-parameter that controls the concentration level of the distribution. Defaults to 1.0.

forward(base_out: torch.Tensor, momentum_out: torch.Tensor) → torch.Tensor[source]¶

Forward head.

Parameters

base_out (torch.Tensor) – NxC features from base_encoder.
momentum_out (torch.Tensor) – NxC features from momentum_encoder.

Returns

The loss tensor.

Return type

torch.Tensor
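
As a rough illustration of the contrastive objective with a temperature, the sketch below computes an InfoNCE-style loss in plain Python. It assumes that the i-th base feature and the i-th momentum feature form the positive pair and that features are L2-normalized first; the predictor, any cross-process gathering and any extra loss scaling used by the real head are deliberately left out, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(base_out, momentum_out, temperature=1.0):
    """Mean cross-entropy over rows of the N x N similarity matrix.

    Row i treats momentum_out[i] as the positive and all other rows as negatives.
    """
    queries = [l2_normalize(v) for v in base_out]
    keys = [l2_normalize(v) for v in momentum_out]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]], temperature=0.2)
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]], temperature=0.2)
```

With matching pairs the loss is close to zero, while swapping the keys makes it large; a lower temperature sharpens this difference.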

class mmselfsup.models.heads.MultiClsHead(backbone: str = 'resnet50', in_indices: Sequence[int] = (0, 1, 2, 3, 4), pool_type: str = 'adaptive', num_classes: int = 1000, loss: dict = {'loss_weight': 1.0, 'type': 'mmcls.CrossEntropyLoss'}, with_last_layer_unpool: bool = False, cal_acc: bool = False, topk: Union[int, Tuple[int]] = (1), norm_cfg: dict = {'type': 'BN'}, init_cfg: Union[dict, List[dict]] = [{'type': 'Normal', 'std': 0.01, 'layer': 'Linear'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]¶

Multiple classifier heads.

This head inputs feature maps from different stages of backbone, average pools each feature map to around 9000 dimensions, and then appends a linear classifier at each stage to predict corresponding class scores.

Parameters

backbone (str) – Specify which backbone to use; only ResNet50 is supported. Defaults to ‘resnet50’.
in_indices (Sequence[int]) – Input from which stages. Defaults to (0, 1, 2, 3, 4).
pool_type (str) – ‘adaptive’ or ‘specified’. If set to ‘adaptive’, use adaptive average pooling, otherwise use specified pooling params. Defaults to ‘adaptive’.
num_classes (int) – Number of classes. Defaults to 1000.
loss (dict) – The dict of loss information. Defaults to dict(type='mmcls.CrossEntropyLoss', loss_weight=1.0).
with_last_layer_unpool (bool) – Whether to unpool the features from last layer. Defaults to False.
cal_acc (bool) – Whether to calculate accuracy during training. If you use batch augmentations like Mixup and CutMix during training, it is pointless to calculate accuracy. Defaults to False.
topk (int | Tuple[int]) – Top-k accuracy. Defaults to (1, ).
norm_cfg (dict) – Dict to construct and config norm layer. Defaults to dict(type='BN').
init_cfg (dict or List[dict]) – Initialization config dict. Defaults to [ dict(type='Normal', std=0.01, layer='Linear'), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm']) ]

forward(feats: Union[list, tuple]) → list[source]¶

Compute multi-head scores.

Parameters: feats (Sequence[torch.Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
Returns: A list of class scores.
Return type: List[torch.Tensor]

loss(feats: Sequence[torch.Tensor], data_samples: List[mmcls.structures.cls_data_sample.ClsDataSample], **kwargs) → dict[source]¶

Calculate losses from the extracted features.

Parameters

feats (Sequence[torch.Tensor]) – Feature maps of backbone, each tensor has shape (N, C, H, W).
data_samples (List[ClsDataSample]) – The annotation data of every sample, which carries the ground truth label.

Returns

Dict of loss and accuracy.

Return type

Dict[str, torch.Tensor]

predict(feats: Sequence[torch.Tensor], data_samples: List[mmcls.structures.cls_data_sample.ClsDataSample]) → List[mmcls.structures.cls_data_sample.ClsDataSample][source]¶

Inference without augmentation.

Parameters

feats (tuple[Tensor]) – The extracted features.
data_samples (List[BaseDataElement], optional) – The annotation data of every sample. If not None, set pred_label of the input data samples.

Returns

The data samples containing annotation, prediction, etc.

Return type

List[BaseDataElement]
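
A head of this kind is normally selected through a config dict. The fragment below is a hypothetical example: every key comes from the constructor signature above, but the chosen values (for instance enabling cal_acc with top-1 and top-5) are assumptions for illustration only.

```python
# Hypothetical config fragment for a multi-stage linear evaluation head.
# Keys mirror the MultiClsHead constructor arguments documented above.
head = dict(
    type='MultiClsHead',
    backbone='resnet50',           # only ResNet50 is documented as supported
    in_indices=(0, 1, 2, 3, 4),    # one linear classifier per listed stage
    pool_type='adaptive',          # or 'specified'
    num_classes=1000,
    with_last_layer_unpool=False,
    cal_acc=True,                  # assumption: no Mixup/CutMix in the pipeline
    topk=(1, 5),
)
```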

class mmselfsup.models.heads.SimMIMHead(patch_size: int, loss: dict)[source]¶

Pretrain Head for SimMIM.

Parameters

patch_size (int) – Patch size of each token.
loss (dict) – The config for loss.

forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Forward function of SimMIM head.

This method will expand mask to the size of the original image.

Parameters

pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.

Returns

The reconstruction loss.

Return type

torch.Tensor

class mmselfsup.models.heads.SwAVHead(loss: dict)[source]¶

Head for SwAV.

Parameters: loss (dict) – Config dict for module of loss functions.

forward(pred: torch.Tensor) → torch.Tensor[source]¶

Forward function of SwAV head.

Parameters: pred (torch.Tensor) – NxC input features.
Returns: The SwAV loss.
Return type: torch.Tensor

losses¶

class mmselfsup.models.losses.BEiTLoss[source]¶

Loss function for BEiT.

BEiTLoss supports two different logits sharing one target, as in BEiT v2.

forward(logits: Union[Tuple[torch.Tensor], torch.Tensor], target: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

Forward function of BEiT Loss.

Parameters

logits (torch.Tensor | Tuple[torch.Tensor]) – The outputs from the decoder.
target (torch.Tensor) – The targets generated by DALL-E.

Returns

The main loss.

Return type

Tuple[torch.Tensor, torch.Tensor]

class mmselfsup.models.losses.CAELoss(lambd: float)[source]¶

Loss function for CAE.

Compute the align loss and the main loss.

Parameters: lambd (float) – The weight for the align loss.

forward(logits: torch.Tensor, target: torch.Tensor, latent_pred: torch.Tensor, latent_target: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

Forward function of CAE Loss.

Parameters

logits (torch.Tensor) – The outputs from the decoder.
target (torch.Tensor) – The targets generated by dalle.
latent_pred (torch.Tensor) – The latent prediction from the regressor.
latent_target (torch.Tensor) – The latent target from the teacher network.

Returns

The main loss and align loss.

Return type

Tuple[torch.Tensor, torch.Tensor]

class mmselfsup.models.losses.CosineSimilarityLoss(shift_factor: float = 0.0, scale_factor: float = 1.0)[source]¶

Cosine similarity loss function.

Compute the similarity between two features and optimize that similarity as loss.

Parameters

shift_factor (float) – The shift factor of cosine similarity. Default: 0.0.
scale_factor (float) – The scale factor of cosine similarity. Default: 1.0.

forward(pred: torch.Tensor, target: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶

Forward function of cosine similarity loss.

Parameters

pred (torch.Tensor) – The predicted features.
target (torch.Tensor) – The target features.

Returns

The cosine similarity loss.

Return type

torch.Tensor
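
To make the role of shift_factor and scale_factor concrete, here is a minimal plain-Python sketch. It assumes the loss for each feature pair is shift_factor - scale_factor * cos(pred, target), averaged over all pairs; the optional mask is ignored and the function name is hypothetical.

```python
import math


def cosine_similarity_loss(pred, target, shift_factor=0.0, scale_factor=1.0):
    """Average of (shift_factor - scale_factor * cosine similarity) over pairs."""
    losses = []
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        losses.append(shift_factor - scale_factor * dot / norm)
    return sum(losses) / len(losses)


same = cosine_similarity_loss([[1.0, 0.0]], [[2.0, 0.0]], shift_factor=2.0, scale_factor=2.0)
orthogonal = cosine_similarity_loss([[1.0, 0.0]], [[0.0, 3.0]], shift_factor=2.0, scale_factor=2.0)
```

With shift_factor = scale_factor = 2.0 the value ranges from 0 for perfectly aligned features to 4 for opposite ones, so the shift simply keeps the loss non-negative.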

class mmselfsup.models.losses.CrossCorrelationLoss(lambd: float = 0.0051)[source]¶

Cross correlation loss function.

Compute the on-diagonal and off-diagonal loss.

Parameters: lambd (float) – The weight for the off-diagonal loss. Defaults to 0.0051.

forward(cross_correlation_matrix: torch.Tensor) → torch.Tensor[source]¶

Forward function of cross correlation loss.

Parameters: cross_correlation_matrix (torch.Tensor) – The cross correlation matrix.
Returns: The cross correlation loss.
Return type: torch.Tensor

off_diagonal(x: torch.Tensor) → torch.Tensor[source]¶: Return a flattened view of the off-diagonal elements of a square matrix.
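
The on-diagonal and off-diagonal terms can be sketched as follows. This assumes the Barlow Twins style formulation, in which diagonal entries are pulled towards 1 and off-diagonal entries towards 0, with lambd weighting the second term; it operates on nested lists rather than tensors.

```python
def off_diagonal(matrix):
    """Flattened list of the off-diagonal elements of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def cross_correlation_loss(matrix, lambd=0.0051):
    """Sum of (diag - 1)^2 plus lambd times the sum of squared off-diagonal entries."""
    on_diag = sum((matrix[i][i] - 1.0) ** 2 for i in range(len(matrix)))
    off_diag = sum(v ** 2 for v in off_diagonal(matrix))
    return on_diag + lambd * off_diag


identity_loss = cross_correlation_loss([[1.0, 0.0], [0.0, 1.0]])
correlated_loss = cross_correlation_loss([[1.0, 0.5], [0.5, 1.0]])
```

An identity cross correlation matrix gives zero loss; any redundancy between feature dimensions shows up in the off-diagonal term.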

class mmselfsup.models.losses.MAEReconstructionLoss[source]¶

Loss function for MAE.

Compute the loss in masked region.

forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Forward function of MAE Loss.

Parameters

pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.

Returns

The reconstruction loss.

Return type

torch.Tensor

class mmselfsup.models.losses.PixelReconstructionLoss(criterion: str, channel: Optional[int] = None)[source]¶

Loss for the reconstruction of pixel in Masked Image Modeling.

This module measures the distance between the target image and the reconstructed image and computes the loss to optimize the model. Currently, this module only provides L1 and L2 losses to penalize the reconstruction error. In addition, a mask can be passed to the forward function to restrict the loss to the region it selects, as in MAE.

Parameters

criterion (str) – The loss used to penalize the reconstruction error. Currently, only L1 and L2 losses are supported.
channel (int, optional) – The number of channels to average the reconstruction loss. If not None, the reconstruction loss will be divided by the channel. Defaults to None.

forward(pred: torch.Tensor, target: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶

Forward function to compute the reconstruction loss.

Parameters

pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor, optional) – The mask of the target image. Defaults to None.

Returns

The reconstruction loss.

Return type

torch.Tensor
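
The interaction of criterion, mask and channel can be illustrated on flat lists of pixel values. This is a sketch under stated assumptions: the mask holds 1 for positions that contribute to the loss and 0 otherwise, and the division by channel is applied whenever channel is given, as the parameter description suggests. The function name is hypothetical.

```python
def pixel_reconstruction_loss(pred, target, criterion="L2", mask=None, channel=None):
    """Mean L1 or L2 error, optionally restricted to mask == 1 positions."""
    if criterion == "L1":
        errors = [abs(p - t) for p, t in zip(pred, target)]
    elif criterion == "L2":
        errors = [(p - t) ** 2 for p, t in zip(pred, target)]
    else:
        raise ValueError(f"unsupported criterion: {criterion}")

    if mask is None:
        loss = sum(errors) / len(errors)  # average over every position
    else:
        # average only over the positions selected by the mask
        loss = sum(e * m for e, m in zip(errors, mask)) / sum(mask)

    if channel is not None:
        loss /= channel
    return loss


pred, target, mask = [1, 2, 3, 4], [1, 0, 3, 0], [0, 1, 0, 1]
full_l2 = pixel_reconstruction_loss(pred, target)
masked_l2 = pixel_reconstruction_loss(pred, target, mask=mask)
masked_l1 = pixel_reconstruction_loss(pred, target, criterion="L1", mask=mask)
```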

class mmselfsup.models.losses.SimMIMReconstructionLoss(encoder_in_channels: int)[source]¶

Loss function for SimMIM.

Compute the loss in masked region.

Parameters: encoder_in_channels (int) – Number of input channels for encoder.

forward(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶

Forward function of SimMIM loss.

Parameters

pred (torch.Tensor) – The reconstructed image.
target (torch.Tensor) – The target image.
mask (torch.Tensor) – The mask of the target image.

Returns

The reconstruction loss.

Return type

torch.Tensor

class mmselfsup.models.losses.SwAVLoss(feat_dim: int, sinkhorn_iterations: int = 3, epsilon: float = 0.05, temperature: float = 0.1, crops_for_assign: List[int] = [0, 1], num_crops: List[int] = [2], num_prototypes: int = 3000, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

The Loss for SwAV.

This Loss contains clustering and sinkhorn algorithms to compute Q codes. Part of the code is borrowed from script. The queue is built in engine/hooks/swav_hook.py.

Parameters

feat_dim (int) – feature dimension of the prototypes.
sinkhorn_iterations (int) – number of iterations in Sinkhorn-Knopp algorithm. Defaults to 3.
epsilon (float) – regularization parameter for Sinkhorn-Knopp algorithm. Defaults to 0.05.
temperature (float) – temperature parameter in training loss. Defaults to 0.1.
crops_for_assign (List[int]) – list of crops id used for computing assignments. Defaults to [0, 1].
num_crops (List[int]) – list of number of crops. Defaults to [2].
num_prototypes (int) – number of prototypes. Defaults to 3000.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor) → torch.Tensor[source]¶

Forward function of SwAV loss.

Parameters: x (torch.Tensor) – NxC input features.
Returns: The returned loss.
Return type: torch.Tensor
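
The Sinkhorn-Knopp step that turns prototype scores into Q codes can be sketched in plain Python. This follows the commonly published SwAV procedure and is an assumption about the general technique, not a copy of this class; the queue, the multi-crop handling and distributed reductions are omitted.

```python
import math


def sinkhorn(scores, epsilon=0.05, iterations=3):
    """Turn a B x K score matrix into B x K soft assignments (codes).

    Rows of the result sum to 1; prototypes are encouraged to be used equally.
    """
    b, k = len(scores), len(scores[0])
    # Work on the transposed K x B matrix Q = exp(scores / epsilon).
    q = [[math.exp(scores[n][c] / epsilon) for n in range(b)] for c in range(k)]
    total = sum(sum(row) for row in q)
    q = [[v / total for v in row] for row in q]

    for _ in range(iterations):
        # Normalize rows: every prototype receives total weight 1 / K.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[c] * k) for v in q[c]] for c in range(k)]
        # Normalize columns: every sample receives total weight 1 / B.
        col_sums = [sum(q[c][n] for c in range(k)) for n in range(b)]
        q = [[q[c][n] / (col_sums[n] * b) for n in range(b)] for c in range(k)]

    # Rescale so each sample's assignment sums to 1 and transpose back to B x K.
    return [[q[c][n] * b for c in range(k)] for n in range(b)]


scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
codes = sinkhorn(scores)
```

A smaller epsilon makes the codes closer to hard assignments, while more iterations enforce the equal-usage constraint more strictly.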

memories¶

class mmselfsup.models.memories.ODCMemory(length: int, feat_dim: int, momentum: float, num_classes: int, min_cluster: int, **kwargs)[source]¶

Memory module for ODC.

This module includes the samples memory and the centroids memory in ODC. The samples memory stores features and pseudo-labels of all samples in the dataset; while the centroids memory stores features of cluster centroids.

Parameters

length (int) – Number of features stored in the samples memory.
feat_dim (int) – Dimension of stored features.
momentum (float) – Momentum coefficient for updating features.
num_classes (int) – Number of clusters.
min_cluster (int) – Minimal cluster size.

deal_with_small_clusters() → None[source]¶: Deal with small clusters.

init_memory(feature: numpy.ndarray, label: numpy.ndarray) → None[source]¶: Initialize memory modules.

update_centroids_memory(cinds: Optional[List] = None) → None[source]¶: Update centroids memory.

update_samples_memory(idx: torch.Tensor, feature: torch.Tensor) → torch.Tensor[source]¶: Update samples memory.

class mmselfsup.models.memories.SimpleMemory(length: int, feat_dim: int, momentum: float, **kwargs)[source]¶

Simple feature memory bank.

This module includes the memory bank that stores running average features of all samples in the dataset. It is used in algorithms like NPID.

Parameters

length (int) – Number of features stored in the memory bank.
feat_dim (int) – Dimension of stored features.
momentum (float) – Momentum coefficient for updating features.

update(idx: torch.Tensor, feature: torch.Tensor) → None[source]¶

Update features in the memory bank.

Parameters

idx (torch.Tensor) – Indices for the batch of features.
feature (torch.Tensor) – Batch of features.
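
A running-average memory bank update can be sketched as below. Two details are assumptions made for the example and should be checked against the source: momentum weights the incoming feature (so the stored entry keeps a 1 - momentum share), and the result is L2-normalized before being written back.

```python
import math


def update_memory(bank, idx, features, momentum):
    """Blend new features into the bank entries at idx, in place."""
    for i, feat in zip(idx, features):
        # Assumed convention: new = (1 - momentum) * old + momentum * incoming.
        mixed = [(1.0 - momentum) * old + momentum * new
                 for old, new in zip(bank[i], feat)]
        norm = math.sqrt(sum(v * v for v in mixed)) or 1.0
        bank[i] = [v / norm for v in mixed]  # keep stored features unit length


bank = [[1.0, 0.0], [0.0, 1.0]]
update_memory(bank, idx=[0], features=[[0.0, 1.0]], momentum=0.5)
```

Only the entries listed in idx change, which is why the update takes the sample indices of the current batch.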

target_generators¶

class mmselfsup.models.target_generators.CLIPGenerator(tokenizer_path: str)[source]¶

Get the features and attention from the last layer of CLIP.

This module is used to generate target features in masked image modeling.

Parameters: tokenizer_path (str) – The path of the checkpoint of CLIP.

forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

Get the features and attention from the last layer of CLIP.

Parameters: x (torch.Tensor) – The input image, which is of shape (N, 3, H, W).
Returns: The features and attention from the last layer of CLIP, which are of shape (N, L, C) and (N, L, L), respectively.
Return type: Tuple[torch.Tensor, torch.Tensor]

class mmselfsup.models.target_generators.Encoder(n_hid: int = 256, n_blk_per_group: int = 2, input_channels: int = 3, vocab_size: int = 8192, device: torch.device = device(type='cpu'), requires_grad: bool = False, use_mixed_precision: bool = True, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

forward(x: torch.Tensor) → torch.Tensor[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmselfsup.models.target_generators.HOGGenerator(nbins: int = 9, pool: int = 8, gaussian_window: int = 16)[source]¶

Generate HOG feature for images.

This module is used in MaskFeat to generate HOG features. The code is modified from the file slowfast/models/operators.py. See the Wikipedia article on HOG (histogram of oriented gradients) for background.

Parameters

nbins (int) – Number of bins. Defaults to 9.
pool (int) – Number of cells. Defaults to 8.
gaussian_window (int) – Size of gaussian kernel. Defaults to 16.

forward(x: torch.Tensor) → torch.Tensor[source]¶

Generate HOG features for a batch of images.

Parameters: x (torch.Tensor) – Input images of shape (N, 3, H, W).
Returns: HOG features.
Return type: torch.Tensor

generate_hog_image(hog_out: torch.Tensor) → numpy.ndarray[source]¶: Generate HOG image according to HOG features.

get_gaussian_kernel(kernlen: int, std: int) → torch.Tensor[source]¶: Returns a 2D Gaussian kernel array.
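
A 2D Gaussian kernel of the kind returned by get_gaussian_kernel can be built as the outer product of a 1D Gaussian with itself. The sketch below assumes the kernel is centered on the window and normalized to sum to 1; it uses nested lists instead of tensors.

```python
import math


def gaussian_kernel(kernlen, std):
    """Return a kernlen x kernlen Gaussian kernel whose entries sum to 1."""
    center = (kernlen - 1) / 2.0
    one_d = [math.exp(-0.5 * ((i - center) / std) ** 2) for i in range(kernlen)]
    two_d = [[a * b for b in one_d] for a in one_d]  # separable outer product
    total = sum(sum(row) for row in two_d)
    return [[v / total for v in row] for row in two_d]


kernel = gaussian_kernel(kernlen=5, std=1.0)
```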

class mmselfsup.models.target_generators.LowFreqTargetGenerator(radius: int, img_size: Union[int, Tuple[int, int]])[source]¶

Generate low-frequency target for images.

This module is used in PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling to remove high-frequency information from images.

Parameters

radius (int) – radius of low pass filter.
img_size (Union[int, Tuple[int, int]]) – size of input images.

forward(imgs: torch.Tensor) → torch.Tensor[source]¶

Filter out high-frequency components from images.

Parameters

imgs (torch.Tensor) – input images, which has shape (N, C, H, W).

Returns

Low-frequency target, which has the same shape as the input images.

Return type

torch.Tensor
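
The effect of radius and img_size can be pictured with the mask of an ideal low-pass filter. The sketch assumes a circular mask centered in a frequency spectrum whose zero frequency has been shifted to the middle; the Fourier transform itself is omitted, and the function name is hypothetical.

```python
def low_pass_mask(radius, img_size):
    """Binary H x W mask: 1 inside a circle of the given radius, 0 outside."""
    if isinstance(img_size, int):
        img_size = (img_size, img_size)
    height, width = img_size
    cy, cx = height // 2, width // 2  # position of the zero frequency after shifting
    return [
        [1 if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2 else 0
         for x in range(width)]
        for y in range(height)
    ]


mask = low_pass_mask(radius=1, img_size=5)
```

Multiplying the shifted spectrum by such a mask and transforming back keeps only the slowly varying image content; a larger radius keeps more detail.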

class mmselfsup.models.target_generators.VQKD(encoder_config: dict, decoder_config: Optional[dict] = None, num_embed: int = 8192, embed_dims: int = 32, decay: float = 0.99, beta: float = 1.0, quantize_kmeans_init: bool = True, init_cfg: Optional[dict] = None)[source]¶

Vector-Quantized Knowledge Distillation.

The module only contains the encoder and the VectorQuantizer part. Modified from https://github.com/microsoft/unilm/blob/master/beit2/modeling_vqkd.py

Parameters

encoder_config (dict) – The config of encoder.
decoder_config (dict, optional) – The config of decoder. Currently, VQKD only supports building the encoder. Defaults to None.
num_embed (int) – Number of embedding vectors in the codebook. Defaults to 8192.
embed_dims (int) – The dimension of embedding vectors in the codebook. Defaults to 32.
decay (float) – The decay parameter of EMA. Defaults to 0.99.
beta (float) – The multiplier for VectorQuantizer loss. Defaults to 1.
quantize_kmeans_init (bool) – Whether to use k-means to initialize the VectorQuantizer. Defaults to True.
init_cfg (dict or List[dict], optional) – Initialization config dict. Defaults to None.

encode(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶: Encode the input images and get corresponding results.

forward(x: torch.Tensor) → torch.Tensor[source]¶

The forward function.

Currently, only getting tokens is supported.

get_tokens(x: torch.Tensor) → dict[source]¶: Get tokens for BEiT pre-training.

utils¶

class mmselfsup.models.utils.CAEDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶

Image pre-processor for CAE.

Compared with the mmselfsup.SelfSupDataPreprocessor, this module will normalize the prediction image and target image with different normalization parameters.

forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶

Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.

Parameters

data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.

Returns

Data in the same format as the model input.

Return type

Tuple[torch.Tensor, Optional[list]]

class mmselfsup.models.utils.CAETransformerRegressorLayer(embed_dims: int, num_heads: int, feedforward_channels: int, num_fcs: int = 2, qkv_bias: bool = False, qk_scale: Optional[float] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, init_values: float = 0.0, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'eps': 1e-06, 'type': 'LN'})[source]¶

Transformer layer for the regressor of CAE.

This module is different from conventional transformer encoder layer, for its queries are the masked tokens, but its keys and values are the concatenation of the masked and unmasked tokens.

Parameters

embed_dims (int) – The feature dimension.
num_heads (int) – The number of heads in multi-head attention.
feedforward_channels (int) – The hidden dimension of FFNs.
num_fcs (int, optional) – The number of fully-connected layers in FFNs. Defaults to 2.
qkv_bias (bool) – If True, add a learnable bias to q, k, v. Defaults to False.
qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.
drop_rate (float) – The dropout rate. Defaults to 0.0.
attn_drop_rate (float) – The drop out rate for attention output weights. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
init_values (float) – The init values of gamma. Defaults to 0.0.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).

forward(x_q: torch.Tensor, x_kv: torch.Tensor, pos_q: torch.Tensor, pos_k: torch.Tensor) → torch.Tensor[source]¶: Forward function.

class mmselfsup.models.utils.CosineEMA(model: torch.nn.modules.module.Module, momentum: float = 0.996, end_momentum: float = 1.0, interval: int = 1, device: Optional[torch.device] = None, update_buffers: bool = False)[source]¶

CosineEMA is implemented for updating momentum parameter, used in BYOL, MoCoV3, etc.

The momentum is adjusted with cosine annealing according to:

\[m = m_1 - (m_1 - m_0) \cdot \frac{\cos(\pi k / K) + 1}{2}\]

where \(k\) is the current step, \(K\) is the total number of steps, \(m_0\) is the initial momentum (momentum) and \(m_1\) is the final momentum (end_momentum).

Parameters

model (nn.Module) – The model to be averaged.
momentum (float) – The momentum used for updating the EMA parameters. The EMA parameters are updated with the formula: averaged_param = momentum * averaged_param + (1-momentum) * source_param. Defaults to 0.996.
end_momentum (float) – The end momentum value for cosine annealing. Defaults to 1.
interval (int) – Interval between two updates. Defaults to 1.
device (torch.device, optional) – If provided, the averaged model will be stored on the device. Defaults to None.
update_buffers (bool) – if True, it will compute running averages for both the parameters and the buffers of the model. Defaults to False.

avg_func(averaged_param: torch.Tensor, source_param: torch.Tensor, steps: int) → None[source]¶

Compute the moving average of the parameters using the cosine momentum strategy.

Parameters

averaged_param (Tensor) – The averaged parameters.
source_param (Tensor) – The source parameters.
steps (int) – The number of times the parameters have been updated.

Returns

The averaged parameters.

Return type

Tensor
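
The cosine momentum schedule and the averaging formula given for this class can be combined in a short plain-Python sketch. The function names are hypothetical and parameters are plain lists of floats; the schedule follows the formula in the class description, with momentum as the starting value and end_momentum as the final value.

```python
import math


def cosine_momentum(step, total_steps, momentum=0.996, end_momentum=1.0):
    """Momentum at a given step: starts at momentum, ends at end_momentum."""
    cosine = (math.cos(math.pi * step / total_steps) + 1) / 2
    return end_momentum - (end_momentum - momentum) * cosine


def ema_update(averaged_param, source_param, m):
    """averaged_param = m * averaged_param + (1 - m) * source_param, elementwise."""
    return [m * a + (1 - m) * s for a, s in zip(averaged_param, source_param)]


start = cosine_momentum(0, 100)
middle = cosine_momentum(50, 100)
end = cosine_momentum(100, 100)
updated = ema_update([1.0, 2.0], [0.0, 0.0], m=0.9)
```

Because the momentum approaches end_momentum, the averaged model changes more and more slowly as training progresses.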

class mmselfsup.models.utils.Extractor(extract_dataloader: Union[torch.utils.data.dataloader.DataLoader, dict], seed: Optional[int] = None, dist_mode: bool = False, pool_cfg: Optional[dict] = None, **kwargs)[source]¶

Feature extractor.

The extractor supports building its own DataLoader, customized models and pooling types. It can run in distributed or non-distributed mode.

Parameters

extract_dataloader (DataLoader | dict) – A DataLoader object, or a dict to build one.
seed (int, optional) – Random seed. Defaults to None.
dist_mode (bool) – Use distributed extraction or not. Defaults to False.
pool_cfg (dict, optional) – The configs of pooling. Defaults to dict(type='AvgPool2d', output_size=1).

class mmselfsup.models.utils.GatherLayer(*args, **kwargs)[source]¶

Gather tensors from all processes, supporting backward propagation.

static backward(ctx: Any, *grads: torch.Tensor) → torch.Tensor[source]¶

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx: Any, input: torch.Tensor) → Tuple[List][source]¶

Performs the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used in jvp.

class mmselfsup.models.utils.MultiPooling(pool_type: str = 'adaptive', in_indices: tuple = (0), backbone: str = 'resnet50')[source]¶

Pooling layers for features from multiple depths.

Parameters

pool_type (str) – Pooling type for the feature map. Options are ‘adaptive’ and ‘specified’. Defaults to ‘adaptive’.
in_indices (Sequence[int]) – Output from which backbone stages. Defaults to (0, ).
backbone (str) – The selected backbone. Defaults to ‘resnet50’.

forward(x: Union[List, Tuple]) → None[source]¶: Forward function.

class mmselfsup.models.utils.MultiPrototypes(output_dim: int, num_prototypes: List[int])[source]¶

Multi-prototypes for SwAV head.

Parameters

output_dim (int) – The output dim from SwAV neck.
num_prototypes (List[int]) – The number of prototypes needed.

forward(x: torch.Tensor) → List[torch.Tensor][source]¶: Run forward for every prototype.

class mmselfsup.models.utils.MultiheadAttention(embed_dims: int, num_heads: int, input_dims: Optional[int] = None, attn_drop: float = 0.0, proj_drop: float = 0.0, qkv_bias: bool = True, qk_scale: Optional[float] = None, proj_bias: bool = True, init_cfg: Optional[dict] = None)[source]¶

Multi-head Attention Module.

This module rewrites the MultiheadAttention by replacing qkv bias with customized qkv bias, in addition to removing the drop path layer.

Parameters

embed_dims (int) – The embedding dimension.
num_heads (int) – Parallel attention heads.
input_dims (int, optional) – The input dimension, and if None, use embed_dims. Defaults to None.
attn_drop (float) – Dropout rate of the dropout layer after the attention calculation of query and key. Defaults to 0.
proj_drop (float) – Dropout rate of the dropout layer after the output projection. Defaults to 0.
dropout_layer (dict) – The dropout config before adding the shortcut. Defaults to dict(type='Dropout', drop_prob=0.).
qkv_bias (bool) – If True, add a learnable bias to q, k, v. Defaults to True.
qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.
proj_bias (bool) – If True, add a learnable bias to the output projection. Defaults to True.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.

forward(x: torch.Tensor) → torch.Tensor[source]¶: Forward function.

class mmselfsup.models.utils.NormEMAVectorQuantizer(num_embed: int, embed_dims: int, beta: float, decay: float = 0.99, statistic_code_usage: bool = True, kmeans_init: bool = True, codebook_init_path: Optional[str] = None)[source]¶

Normed EMA vector quantizer module.

Parameters

num_embed (int) – Number of embedding vectors in the codebook.
embed_dims (int) – The dimension of embedding vectors in the codebook.
beta (float) – The multiplier for VectorQuantizer embedding loss.
decay (float) – The decay parameter of EMA. Defaults to 0.99.
statistic_code_usage (bool) – Whether to use cluster_size to record statistic. Defaults to True.
kmeans_init (bool) – Whether to use k-means to initialize the VectorQuantizer. Defaults to True.
codebook_init_path (str) – The initialization checkpoint for codebook. Defaults to None.

forward(z)[source]¶: Forward function.
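
The core lookup of a normalized vector quantizer can be sketched as a nearest-neighbour search under cosine similarity. This is an illustration of the general technique only: it assumes both the input and the codebook entries are L2-normalized before comparison, and it leaves out the EMA codebook update, the k-means initialization and the embedding loss. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def quantize(z, codebook):
    """Return (index, code) of the codebook entry most similar to z."""
    z_unit = l2_normalize(z)
    codes = [l2_normalize(c) for c in codebook]
    similarities = [sum(a * b for a, b in zip(z_unit, c)) for c in codes]
    index = max(range(len(codes)), key=similarities.__getitem__)
    return index, codes[index]


codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
token, code = quantize([0.2, 3.0], codebook)
```

The returned index plays the role of the discrete token, while the selected code is the quantized feature.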

class mmselfsup.models.utils.PromptTransformerEncoderLayer(embed_dims: int, num_heads: int, feedforward_channels=<class 'int'>, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, num_fcs: int = 2, qkv_bias: bool = True, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'type': 'LN'}, init_cfg: Optional[Union[dict, List[dict]]] = None)[source]¶

Prompt Transformer Encoder Layer for MILAN.

This module is specific for the prompt encoder in MILAN. It will not update the visible tokens from the encoder.

Parameters

embed_dims (int) – The feature dimension.
num_heads (int) – Parallel attention heads.
feedforward_channels (int) – The hidden dimension for FFNs.
drop_rate (float) – Probability of an element to be zeroed after the feed forward layer. Defaults to 0.0.
attn_drop_rate (float) – The drop out rate for attention layer. Defaults to 0.0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.0.
num_fcs (int) – The number of fully-connected layers for FFNs. Defaults to 2.
qkv_bias (bool) – Enable bias for qkv if True. Defaults to True.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
batch_first (bool) – Key, Query and Value are shape of (batch, n, embed_dim) or (n, batch, embed_dim). Defaults to False.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.

forward(x: torch.Tensor, visible_tokens: torch.Tensor, ids_restore: torch.Tensor) → torch.Tensor[source]¶

Forward function for PromptMultiheadAttention.

Parameters

x (torch.Tensor) – Mask token features with shape N x L_m x C.
visible_tokens (torch.Tensor) – The visible tokens features from encoder with shape N x L_v x C.
ids_restore (torch.Tensor) – The ids of all tokens in the original image with shape N x L.

Returns

Output features with shape N x L x C.

Return type

torch.Tensor

class mmselfsup.models.utils.RelativeLocDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶

Image pre-processor for Relative Location.

forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶

Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.

Parameters

data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.

Returns

Data in the same format as the model input.

Return type

Tuple[torch.Tensor, Optional[list]]

class mmselfsup.models.utils.RotationPredDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶

Image pre-processor for Rotation Prediction.

forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶

Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.

Parameters

data (dict) – data sampled from dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.

Returns

Data in the same format as the model input.

Return type

Tuple[torch.Tensor, Optional[list]]

class mmselfsup.models.utils.SelfSupDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶

Image pre-processor for operations, like normalization and bgr to rgb.

Compared with the mmengine.ImgDataPreprocessor, this module treats each item in inputs of input data as a list, instead of torch.Tensor.

forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶

Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.

Parameters

data (dict) – Data sampled from the dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.

Returns

Data in the same format as the model input.

Return type

Tuple[List[torch.Tensor], Optional[list]]
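The list-wise handling described for this class can be illustrated without mmselfsup. The following is a minimal pure-Python sketch, assuming each view is a nested list in channel-first layout and that channel conversion happens before normalization; `normalize_views` is a hypothetical helper and not part of the library.

```python
from typing import List, Sequence

Image = List[List[List[float]]]  # indexed as [channel][row][col]


def normalize_views(views: List[Image], mean: Sequence[float],
                    std: Sequence[float],
                    bgr_to_rgb: bool = False) -> List[Image]:
    """Normalize every view in a list independently (hypothetical helper)."""
    out = []
    for img in views:
        # Optionally reverse the channel order (BGR -> RGB) first.
        channels = img[::-1] if bgr_to_rgb else img
        # Apply per-channel (pixel - mean) / std.
        out.append([
            [[(px - m) / s for px in row] for row in ch]
            for ch, m, s in zip(channels, mean, std)
        ])
    return out


# Two views of a 3-channel, 1x2 image, stored in BGR order.
view_a = [[[10.0, 20.0]], [[30.0, 40.0]], [[50.0, 60.0]]]
view_b = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]]
normalized = normalize_views([view_a, view_b],
                             mean=[50.0, 30.0, 10.0],
                             std=[10.0, 10.0, 10.0],
                             bgr_to_rgb=True)
```

The result stays a list with one entry per view, which is the point of treating the inputs as a list rather than a single stacked tensor.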

class mmselfsup.models.utils.Sobel[source]¶

Sobel layer.

The layer reduces channels from 3 to 2.

forward(x: torch.Tensor) → torch.Tensor[source]¶: Run sobel layer.
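How a Sobel layer can turn 3 channels into 2 is sketched below in plain Python. This is an illustration under stated assumptions, not the library code: the input is first averaged to grayscale, then filtered with the standard horizontal and vertical 3x3 Sobel kernels using zero padding. The kernel signs and the helper names `to_gray`, `filter3x3` and `sobel` are assumptions.

```python
from typing import List

Gray = List[List[float]]

# Standard Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def to_gray(img: List[Gray]) -> Gray:
    """Average the colour channels into a single channel."""
    h, w = len(img[0]), len(img[0][0])
    return [[sum(ch[r][c] for ch in img) / len(img) for c in range(w)]
            for r in range(h)]


def filter3x3(gray: Gray, kernel: List[List[int]]) -> Gray:
    """Cross-correlate with a 3x3 kernel, zero padding of 1."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(3):
                for dc in range(3):
                    rr, cc = r + dr - 1, c + dc - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr][dc] * gray[rr][cc]
            out[r][c] = acc
    return out


def sobel(img: List[Gray]) -> List[Gray]:
    """Map a 3-channel image to 2 gradient channels."""
    gray = to_gray(img)
    return [filter3x3(gray, SOBEL_X), filter3x3(gray, SOBEL_Y)]


# A horizontal ramp: the same 3x3 pattern in all three channels.
ramp = [[0.0, 1.0, 2.0]] * 3
edges = sobel([ramp, ramp, ramp])
```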

class mmselfsup.models.utils.TransformerEncoderLayer(embed_dims: int, num_heads: int, feedforward_channels: int, window_size: Optional[int] = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, num_fcs: int = 2, qkv_bias: bool = True, act_cfg: dict = {'type': 'GELU'}, norm_cfg: dict = {'type': 'LN'}, init_values: float = 0.0, init_cfg: Optional[dict] = None)[source]¶

Implements one encoder layer in Vision Transformer.

This module is the rewritten version of the TransformerEncoderLayer in MMClassification by adding the gamma and relative position bias in Attention module.

Parameters

embed_dims (int) – The feature dimension.
num_heads (int) – Number of parallel attention heads.
feedforward_channels (int) – The hidden dimension for FFNs.
drop_rate (float) – Probability of an element to be zeroed after the feed forward layer. Defaults to 0.
attn_drop_rate (float) – The drop out rate for attention output weights. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
num_fcs (int) – The number of fully-connected layers for FFNs. Defaults to 2.
qkv_bias (bool) – Enable bias for qkv if True. Defaults to True.
act_cfg (dict) – The activation config for FFNs. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
init_values (float) – The init values of gamma. Defaults to 0.0.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor) → torch.Tensor[source]¶: Forward function.

class mmselfsup.models.utils.TwoNormDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, second_mean: Optional[Sequence[Union[int, float]]] = None, second_std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, rgb_to_bgr: bool = False, non_blocking: Optional[bool] = False)[source]¶

Image pre-processor for CAE, BEiT v1/v2, etc.

Compared with the mmselfsup.SelfSupDataPreprocessor, this module will normalize the prediction image and target image with different normalization parameters.

Parameters

mean (Sequence[float or int], optional) – The pixel mean of image channels. If bgr_to_rgb=True it means the mean value of R, G, B channels. If the length of mean is 1, it means all channels have the same mean value, or the input is a gray image. If it is not specified, images will not be normalized. Defaults to None.
std (Sequence[float or int], optional) – The pixel standard deviation of image channels. If bgr_to_rgb=True it means the standard deviation of R, G, B channels. If the length of std is 1, it means all channels have the same standard deviation, or the input is a gray image. If it is not specified, images will not be normalized. Defaults to None.
second_mean (Sequence[float or int], optional) – Same meaning as mean, but applied to the target image. Defaults to None.
second_std (Sequence[float or int], optional) – Same meaning as std, but applied to the target image. Defaults to None.
pad_size_divisor (int) – The size of padded image should be divisible by pad_size_divisor. Defaults to 1.
pad_value (float or int) – The padded pixel value. Defaults to 0.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.
rgb_to_bgr (bool) – Whether to convert image from RGB to BGR. Defaults to False.
non_blocking (bool) – Whether to block the current process when transferring data to device. Defaults to False.

forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶

Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.

Parameters

data (dict) – Data sampled from the dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.

Returns

Data in the same format as the model input.

Return type

Tuple[List[torch.Tensor], Optional[list]]
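The idea of normalizing the prediction image and the target image with different parameters can be sketched as follows. This is a framework-free illustration, assuming the inputs arrive as a two-element list (first the model input, second the target) with one flat pixel sequence per channel; `normalize` and `two_norm` are hypothetical names.

```python
from typing import List, Sequence

Channels = List[List[float]]  # one flat pixel list per channel


def normalize(pixels: Channels, mean: Sequence[float],
              std: Sequence[float]) -> Channels:
    """Per-channel (pixel - mean) / std."""
    return [[(p - m) / s for p in ch]
            for ch, m, s in zip(pixels, mean, std)]


def two_norm(views: List[Channels],
             mean: Sequence[float], std: Sequence[float],
             second_mean: Sequence[float],
             second_std: Sequence[float]) -> List[Channels]:
    """Normalize the first view with (mean, std) and the second view
    with (second_mean, second_std)."""
    first, second = views
    return [normalize(first, mean, std),
            normalize(second, second_mean, second_std)]


# The same single-channel image, normalized two different ways.
image = [[128.0, 255.0]]
model_input, target = two_norm([image, image],
                               mean=[128.0], std=[64.0],
                               second_mean=[0.0], second_std=[255.0])
```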

class mmselfsup.models.utils.VideoDataPreprocessor(mean: Optional[Sequence[Union[int, float]]] = None, std: Optional[Sequence[Union[int, float]]] = None, pad_size_divisor: int = 1, pad_value: Union[float, int] = 0, bgr_to_rgb: bool = False, format_shape: str = 'NCHW')[source]¶

Video pre-processor for operations such as normalization and BGR-to-RGB conversion.

Compared with the mmaction.ActionDataPreprocessor, this module treats each item in inputs of input data as a list, instead of torch.Tensor.

Parameters

mean (Sequence[float or int], optional) – The pixel mean of channels of images or stacked optical flow. Defaults to None.
std (Sequence[float or int], optional) – The pixel standard deviation of channels of images or stacked optical flow. Defaults to None.
pad_size_divisor (int) – The size of padded image should be divisible by pad_size_divisor. Defaults to 1.
pad_value (float or int) – The padded pixel value. Defaults to 0.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.
format_shape (str) – Format shape of input data. Defaults to 'NCHW'.

forward(data: dict, training: bool = False) → Tuple[List[torch.Tensor], Optional[list]][source]¶

Performs normalization, padding and bgr2rgb conversion based on BaseDataPreprocessor.

Parameters

data (dict) – Data sampled from the dataloader.
training (bool) – Whether to enable training time augmentation. If subclasses override this method, they can perform different preprocessing strategies for training and testing based on the value of training.

Returns

Data in the same format as the model input.

Return type

Tuple[List[torch.Tensor], Optional[list]]

mmselfsup.models.utils.build_2d_sincos_position_embedding(patches_resolution: Union[int, Sequence[int]], embed_dims: int, temperature: Optional[int] = 10000.0, cls_token: Optional[bool] = False) → torch.Tensor[source]¶

Build a 2D sine-cosine position embedding that gives the model the position information of the image patches.

Parameters

patches_resolution (Union[int, Sequence[int]]) – The resolution of the patch grid (number of patches along height and width).
embed_dims (int) – The dimension of the embedding vector.
temperature (int, optional) – The temperature parameter. Defaults to 10000.
cls_token (bool, optional) – Whether to concatenate class token. Defaults to False.

Returns

The position embedding vector.

Return type

torch.Tensor
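A 2D sine-cosine embedding of this kind can be sketched in plain Python. The version below is an illustration under assumptions rather than the library implementation: a quarter of the dimensions is given to each of sin(x), cos(x), sin(y) and cos(y), tokens are listed row by row, and the result has no batch dimension. The name `sincos_2d`, the token order and the block order are assumptions.

```python
import math
from typing import List, Sequence, Union


def sincos_2d(patches_resolution: Union[int, Sequence[int]],
              embed_dims: int,
              temperature: float = 10000.0,
              cls_token: bool = False) -> List[List[float]]:
    """Return one embedding vector per patch (plus an optional
    all-zero vector for the class token at index 0)."""
    if isinstance(patches_resolution, int):
        patches_resolution = (patches_resolution, patches_resolution)
    h, w = patches_resolution
    if embed_dims % 4 != 0:
        raise ValueError('embed_dims must be divisible by 4')
    pos_dim = embed_dims // 4
    # Geometric progression of frequencies from 1 down to ~1/temperature.
    omega = [1.0 / temperature ** (i / pos_dim) for i in range(pos_dim)]
    rows = []
    for y in range(h):
        for x in range(w):
            rows.append([math.sin(x * o) for o in omega]
                        + [math.cos(x * o) for o in omega]
                        + [math.sin(y * o) for o in omega]
                        + [math.cos(y * o) for o in omega])
    if cls_token:
        rows.insert(0, [0.0] * embed_dims)
    return rows


# A 2x2 patch grid with 8-dimensional embeddings and a class token.
emb = sincos_2d(2, 8, cls_token=True)
```

Because the embedding is a fixed function of the grid position, it needs no learned parameters and can be recomputed for any grid size.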

mmselfsup.models.utils.build_clip_model(state_dict: dict, finetune: bool = False, average_targets: int = 1) → torch.nn.modules.module.Module[source]¶

Build the CLIP model.

Parameters

state_dict (dict) – The pretrained state dict.
finetune (bool) – Whether to fine-tune the model. Defaults to False.
average_targets (int) – Whether to average the target. Defaults to 1.

Returns

The CLIP model.

Return type

nn.Module

mmselfsup.structures¶

class mmselfsup.structures.SelfSupDataSample(*, metainfo: Optional[dict] = None, **kwargs)[source]¶

A data structure interface of MMSelfSup. It is used as the interface between different components.

Meta field:

img_shape (Tuple): The shape of the corresponding input image. Used for visualization.

ori_shape (Tuple): The original shape of the corresponding image. Used for visualization.

img_path (str): The path of original image.

Data field:

gt_label (LabelData): The ground truth label of an image.

sample_idx (InstanceData): The idx of an image in the dataset.

mask (BaseDataElement): Mask used in masked image modeling.

pred_label (LabelData): The predicted label.

pseudo_label (InstanceData): Label used in pretext task, e.g. Relative Location.

Examples

>>> import torch
>>> import numpy as np
>>> from mmengine.structures import InstanceData, LabelData, PixelData
>>> from mmselfsup.structures import SelfSupDataSample

>>> data_sample = SelfSupDataSample()
>>> gt_label = LabelData()
>>> gt_label.value = [1]
>>> data_sample.gt_label = gt_label
>>> len(data_sample.gt_label)
1
>>> print(data_sample)
<SelfSupDataSample(
    META INFORMATION
    DATA FIELDS
    gt_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            value: [1]
        ) at 0x7f15c08f9d10>
    _gt_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            value: [1]
        ) at 0x7f15c08f9d10>
 ) at 0x7f15c077ef10>

>>> idx = InstanceData()
>>> idx.value = [0]
>>> data_sample = SelfSupDataSample(idx=idx)
>>> assert 'idx' in data_sample

>>> data_sample = SelfSupDataSample()
>>> mask = dict(value=np.random.rand(48, 48))
>>> mask = PixelData(**mask)
>>> data_sample.mask = mask
>>> assert 'mask' in data_sample
>>> assert 'value' in data_sample.mask

>>> data_sample = SelfSupDataSample()
>>> pred_label = dict(pred_label=[3])
>>> pred_label = LabelData(**pred_label)
>>> data_sample.pred_label = pred_label
>>> print(data_sample)
<SelfSupDataSample(
    META INFORMATION
    DATA FIELDS
    _pred_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            pred_label: [3]
        ) at 0x7f15c06a3990>
    pred_label: <InstanceData(
            META INFORMATION
            DATA FIELDS
            pred_label: [3]
        ) at 0x7f15c06a3990>
) at 0x7f15c07b8bd0>

mmselfsup.visualization¶

class mmselfsup.visualization.SelfSupVisualizer(name: str = 'visualizer', image: Optional[numpy.ndarray] = None, vis_backends: Optional[List[Dict]] = None, save_dir: Optional[str] = None, line_width: Union[int, float] = 3, alpha: Union[int, float] = 0.8)[source]¶

MMSelfSup Visualizer.

Parameters

name (str) – Name of the instance. Defaults to ‘visualizer’.
image (np.ndarray, optional) – The original image to draw on. The format should be RGB. Defaults to None.
vis_backends (list, optional) – Visual backend config list. Defaults to None.
save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.
line_width (int, float) – The linewidth of lines. Defaults to 3.
alpha (int, float) – The transparency of boxes or mask. Defaults to 0.8.

Examples

>>> import numpy as np
>>> import torch
>>> from mmengine.structures import InstanceData
>>> from mmselfsup.structures import SelfSupDataSample
>>> from mmselfsup.visualization import SelfSupVisualizer

>>> selfsup_visualizer = SelfSupVisualizer()
>>> image = np.random.randint(0, 256,
...                     size=(10, 12, 3)).astype('uint8')
>>> pseudo_label = InstanceData()
>>> pseudo_label.patch_box = torch.Tensor([[1, 2, 2, 5]])
>>> gt_selfsup_data_sample = SelfSupDataSample()
>>> gt_selfsup_data_sample.pseudo_label = pseudo_label
>>> selfsup_visualizer.add_datasample('image', image,
...                         gt_selfsup_data_sample)
>>> selfsup_visualizer.add_datasample(
...                       'image', image, gt_selfsup_data_sample,
...                        out_file='out_file.jpg')
>>> selfsup_visualizer.add_datasample(
...                        'image', image, gt_selfsup_data_sample,
...                         show=True)
>>> pseudo_label = InstanceData()
>>> pseudo_label.patch_box = torch.Tensor([[1, 2, 2, 5]])
>>> pred_selfsup_data_sample = SelfSupDataSample()
>>> pred_selfsup_data_sample.pseudo_label = pseudo_label
>>> selfsup_visualizer.add_datasample('image', image,
...                         gt_selfsup_data_sample,
...                         pred_selfsup_data_sample)

add_datasample(name: str, image: numpy.ndarray, gt_sample: Optional[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample] = None, pred_sample: Optional[mmselfsup.structures.selfsup_data_sample.SelfSupDataSample] = None, draw_gt: bool = True, draw_pred: bool = True, show: bool = False, wait_time: float = 0, out_file: Optional[str] = None, step: int = 0) → None[source]¶

Draw datasample and save to all backends.

If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.

If show is True, all storage backends are ignored, and the images will be displayed in a local window.

If out_file is specified, the drawn image will be saved to out_file. It is usually used when the display is not available.

Parameters

name (str) – The image identifier.
image (np.ndarray) – The image to draw.
gt_sample (SelfSupDataSample, optional) – GT SelfSupDataSample. Defaults to None.
pred_sample (SelfSupDataSample, optional) – Prediction SelfSupDataSample. Defaults to None.
draw_gt (bool) – Whether to draw GT SelfSupDataSample. Defaults to True.
draw_pred (bool) – Whether to draw Prediction SelfSupDataSample. Defaults to True.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (float) – The interval of show (s). Defaults to 0.
out_file (str, optional) – Path to output file. Defaults to None.
step (int) – Global step value to record. Defaults to 0.

mmselfsup.utils¶

class mmselfsup.utils.AliasMethod(probs: torch.Tensor)[source]¶

The alias method for sampling.

From: https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/

Parameters: probs (torch.Tensor) – Sampling probabilities.

draw(N: int) → None[source]¶

Draw N samples from the multinomial distribution.

Parameters: N (int) – Number of samples.
Returns: Samples.
Return type: torch.Tensor
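The alias method itself can be sketched in plain Python. The class below is an illustrative implementation of the general technique (Vose's variant), not the library code; `AliasSampler` and its attribute names are hypothetical. Building the table takes O(K) time, and each draw then costs O(1): pick a column uniformly, then keep it or jump to its alias.

```python
import random
from typing import List


class AliasSampler:
    """Alias-method sampler over K discrete outcomes (sketch)."""

    def __init__(self, probs: List[float]) -> None:
        k = len(probs)
        total = sum(probs)
        # Scale so that the average column height is exactly 1.
        scaled = [p * k / total for p in probs]
        self.prob = [0.0] * k          # chance of keeping column i
        self.alias = list(range(k))    # fallback outcome for column i
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, lg = small.pop(), large.pop()
            self.prob[s] = scaled[s]
            self.alias[s] = lg
            # The large column donates mass to fill column s up to 1.
            scaled[lg] -= 1.0 - scaled[s]
            (small if scaled[lg] < 1.0 else large).append(lg)
        # Leftovers are full columns (up to rounding error).
        for i in small + large:
            self.prob[i] = 1.0

    def draw(self, n: int, rng=random) -> List[int]:
        """Draw n outcome indices."""
        k = len(self.prob)
        out = []
        for _ in range(n):
            i = rng.randrange(k)
            out.append(i if rng.random() < self.prob[i] else self.alias[i])
        return out


sampler = AliasSampler([0.5, 0.3, 0.2])
samples = sampler.draw(20000, random.Random(0))
```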

mmselfsup.utils.batch_shuffle_ddp(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

Batch shuffle, for making use of BatchNorm.

Parameters

x (torch.Tensor) – Data in each GPU.

Returns

Output of shuffle operation.

x_gather[idx_this]: Shuffled data.
idx_unshuffle: Index for restoring.

Return type

Tuple[torch.Tensor, torch.Tensor]

mmselfsup.utils.batch_unshuffle_ddp(x: torch.Tensor, idx_unshuffle: torch.Tensor) → torch.Tensor[source]¶

Undo batch shuffle.

Parameters

x (torch.Tensor) – Data in each GPU.
idx_unshuffle (torch.Tensor) – Index for restoring.

Returns

Output of unshuffle operation.

Return type

torch.Tensor
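The relationship between the shuffle index and the restoring index can be shown on a single process. The sketch below assumes no distributed gathering at all and works on plain lists; `shuffle_batch` and `unshuffle_batch` are hypothetical stand-ins, not the DDP functions documented above.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_batch(x: Sequence, rng: random.Random) -> Tuple[List, List[int]]:
    """Permute a batch and return the index needed to undo it."""
    idx_shuffle = list(range(len(x)))
    rng.shuffle(idx_shuffle)
    # argsort of the permutation: idx_unshuffle[j] is the position in
    # the shuffled batch that holds original item j.
    idx_unshuffle = sorted(range(len(x)), key=idx_shuffle.__getitem__)
    return [x[i] for i in idx_shuffle], idx_unshuffle


def unshuffle_batch(x: Sequence, idx_unshuffle: Sequence[int]) -> List:
    """Restore the original order of a shuffled batch."""
    return [x[i] for i in idx_unshuffle]


batch = ['a', 'b', 'c', 'd', 'e']
shuffled, idx_unshuffle = shuffle_batch(batch, random.Random(0))
restored = unshuffle_batch(shuffled, idx_unshuffle)
```

In the distributed setting the same index logic is applied to the batch gathered from all GPUs, after which each process keeps only its own slice.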

mmselfsup.utils.collect_env()[source]¶: Collect the information of the running environments.

mmselfsup.utils.concat_all_gather(tensor: torch.Tensor) → torch.Tensor[source]¶

Performs all_gather operation on the provided tensors.

Parameters: tensor (torch.Tensor) – Tensor to be broadcast from current process.
Returns: The concatenated tensor.
Return type: torch.Tensor

mmselfsup.utils.dist_forward_collect(func: object, data_loader: torch.utils.data.dataloader.DataLoader, length: int) → dict[source]¶

Forward and collect network outputs in a distributed manner.

This function performs forward propagation and collects outputs. It can be used to collect results, features, losses, etc.

Parameters

func (function) – The function to process data.
data_loader (DataLoader) – The torch DataLoader to yield data.
length (int) – Expected length of output arrays.

Returns

The collected outputs.

Return type

Dict[str, torch.Tensor]

mmselfsup.utils.distributed_sinkhorn(out: torch.Tensor, sinkhorn_iterations: int, world_size: int, epsilon: float) → torch.Tensor[source]¶

Apply the distributed Sinkhorn optimization on the scores matrix to find the assignments.

Parameters

out (torch.Tensor) – The scores matrix.
sinkhorn_iterations (int) – Number of iterations in Sinkhorn-Knopp algorithm.
world_size (int) – The world size of the process group.
epsilon (float) – Regularization parameter for Sinkhorn-Knopp algorithm.

Returns

Output of sinkhorn algorithm.

Return type

torch.Tensor
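A single-process version of the Sinkhorn-Knopp iteration can be sketched in plain Python. It assumes a scores matrix of shape (batch, prototypes) and omits the cross-process reductions that the distributed function performs; the name `sinkhorn` and the default values are illustrative assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(scores: Matrix, iterations: int = 3,
             epsilon: float = 0.05) -> Matrix:
    """Turn a (batch x prototypes) score matrix into soft assignments
    whose rows each sum to 1 and whose columns are roughly balanced."""
    b, k = len(scores), len(scores[0])
    q = [[math.exp(s / epsilon) for s in row] for row in scores]
    total = sum(sum(row) for row in q)
    q = [[v / total for v in row] for row in q]
    for _ in range(iterations):
        # Give every prototype (column) the same total mass 1/k.
        col_sums = [sum(q[i][j] for i in range(b)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)]
             for i in range(b)]
        # Give every sample (row) the same total mass 1/b.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * b) for v in q[i]] for i in range(b)]
    # Rescale so that each row is a probability distribution.
    return [[v * b for v in row] for row in q]


scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
assignments = sinkhorn(scores)
```

A smaller epsilon sharpens the assignments, while more iterations bring the column totals closer to equal.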

mmselfsup.utils.get_model(model: torch.nn.modules.module.Module) → mmengine.model.base_model.base_model.BaseModel[source]¶

Get model if the input model is a model wrapper.

Parameters: model (nn.Module) – A model that may be a model wrapper.
Returns: The model without model wrapper.
Return type: BaseModel

mmselfsup.utils.nondist_forward_collect(func: object, data_loader: torch.utils.data.dataloader.DataLoader, length: int) → dict[source]¶

Forward and collect network outputs.

This function performs forward propagation and collects outputs. It can be used to collect results, features, losses, etc.

Parameters

func (function) – The function to process data.
data_loader (DataLoader) – The torch DataLoader to yield data.
length (int) – Expected length of output arrays.

Returns

The concatenated outputs.

Return type

Dict[str, torch.Tensor]
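The forward-and-collect pattern can be sketched without torch. The example below assumes that `func` returns a dict of per-batch lists and that any iterable of batches can stand in for the DataLoader; `forward_collect` is a hypothetical name, and the length check mirrors the role of the `length` argument.

```python
from typing import Callable, Dict, Iterable, List


def forward_collect(func: Callable[[list], Dict[str, list]],
                    data_loader: Iterable[list],
                    length: int) -> Dict[str, List]:
    """Run func on every batch and concatenate the outputs per key."""
    batches = [func(batch) for batch in data_loader]
    collected: Dict[str, List] = {}
    for key in batches[0]:
        merged = [item for out in batches for item in out[key]]
        if len(merged) != length:
            raise ValueError(
                f'expected {length} items for {key!r}, got {len(merged)}')
        collected[key] = merged
    return collected


# Three batches covering a dataset of five samples.
loader = [[1, 2], [3, 4], [5]]
outputs = forward_collect(
    lambda batch: {'feat': [v * 2 for v in batch], 'idx': list(batch)},
    loader, length=5)
```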

mmselfsup.utils.register_all_modules(init_default_scope: bool = True) → None[source]¶

Register all modules in mmselfsup into the registries.

Parameters: init_default_scope (bool) – Whether to initialize the mmselfsup default scope. When init_default_scope=True, the global default scope will be set to mmselfsup, and all registries will build modules from mmselfsup’s registry node. Defaults to True. To understand more about the registry, please refer to https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/registry.md