PixMIM

TL;DR

PixMIM can seamlessly replace MAE as a stronger baseline, with negligible computational overhead.

Abstract

Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pretrained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and reconstruction target, and highlights two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network’s focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., using raw images as reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves three MIM approaches, MAE, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework.
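The first of the two strategies, removing high-frequency components from the reconstruction target, can be illustrated with an ideal low-pass filter in the frequency domain. This is only a minimal sketch of the idea: the function name `low_pass_filter`, the NumPy FFT implementation, and the `cutoff_ratio` value are illustrative assumptions, not the exact filter or hyperparameters used in the paper.

```python
# Sketch: keep only the low-frequency components of an image so the
# reconstruction target de-emphasizes texture-rich, high-frequency details.
import numpy as np

def low_pass_filter(img, cutoff_ratio=0.25):
    """Ideal low-pass filter for an image of shape (H, W) or (H, W, C).

    cutoff_ratio controls the fraction of the spectrum kept around the
    zero-frequency (DC) component; its value here is an arbitrary example.
    """
    # FFT over the spatial axes, with the DC component shifted to the center.
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    cy, cx = h // 2, w // 2
    ry, rx = int(h * cutoff_ratio / 2), int(w * cutoff_ratio / 2)
    # Binary mask that is 1 inside the central low-frequency window.
    mask = np.zeros((h, w) + (1,) * (img.ndim - 2))
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0
    # Zero out high frequencies, then transform back to the pixel domain.
    filtered = np.fft.ifft2(np.fft.ifftshift(f * mask, axes=(0, 1)), axes=(0, 1))
    return np.real(filtered)
```

The filtered image replaces the raw image as the regression target; the encoder input and the rest of the MAE-style pipeline are unchanged, which is why the extra cost is negligible.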

Models and Benchmarks

Here we report the results of the models on ImageNet; the details are listed below:

| Algorithm | Backbone | Epoch | Batch Size | Linear Probing (Top-1 %) | Fine-tuning (Top-1 %) | Pretrain Links | Linear Probing Links | Fine-tuning Links |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| PixMIM | ViT-base | 300 | 4096 | 63.3 | 83.1 | config \| model \| log | config \| model \| log | config \| model \| log |
| PixMIM | ViT-base | 800 | 4096 | 67.5 | 83.5 | config \| model \| log | config \| model \| log | config \| model \| log |

Pre-train and Evaluation

Pre-train

If you use a cluster managed by Slurm

# all of our experiments can be run on a single machine, with 8 A100 GPUs
bash tools/slurm_train.sh $partition $job_name configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py --amp

If you use a single machine without any cluster management software

bash tools/dist_train.sh configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py 8 --amp

Linear Probing

If you use a cluster managed by Slurm

# all of our experiments can be run on a single machine, with 8 A100 GPUs
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp

If you use a single machine without any cluster management software

GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp

Fine-tuning

If you use a cluster managed by Slurm

# all of our experiments can be run on a single machine, with 8 A100 GPUs
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp

If you use a single machine without any cluster management software

GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp

Detection and Segmentation

If you want to evaluate your model on detection or segmentation tasks, we provide a script to convert the model keys from MMClassification style to timm style.

cd $MMSELFSUP
python tools/model_converters/mmcls2timm.py $src_ckpt $dst_ckpt

Then, using the converted checkpoint, you can evaluate your model on the detection task, following Detectron2, and on the semantic segmentation task, following this project. Alternatively, using the unconverted checkpoint, you can evaluate your model with MMSegmentation.

Citation

@article{PixMIM,
  author  = {Yuan Liu and Songyang Zhang and Jiacheng Chen and Kai Chen and Dahua Lin},
  journal = {arXiv preprint arXiv:2303.02416},
  title   = {PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling},
  year    = {2023},
}