PixMIM

TL;DR

PixMIM can seamlessly replace MAE as a stronger baseline, with negligible computational overhead.

Abstract

Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pretrained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and reconstruction target, and highlights two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network’s focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., using raw images as reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves three MIM approaches, MAE, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework.
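The first of the two strategies, removing high-frequency components from the reconstruction target, can be illustrated with an ideal low-pass filter in the frequency domain. This is only a minimal sketch of the idea: the function name `low_pass_filter`, the NumPy FFT implementation, and the `cutoff_ratio` value are illustrative assumptions, not the exact filter or hyperparameters used in the paper.

```python
# Sketch: keep only the low-frequency components of an image so the
# reconstruction target de-emphasizes texture-rich, high-frequency details.
import numpy as np

def low_pass_filter(img, cutoff_ratio=0.25):
    """Ideal low-pass filter for an image of shape (H, W) or (H, W, C).

    cutoff_ratio controls the fraction of the spectrum kept around the
    zero-frequency (DC) component; its value here is an arbitrary example.
    """
    # FFT over the spatial axes, with the DC component shifted to the center.
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    cy, cx = h // 2, w // 2
    ry, rx = int(h * cutoff_ratio / 2), int(w * cutoff_ratio / 2)
    # Binary mask that is 1 inside the central low-frequency window.
    mask = np.zeros((h, w) + (1,) * (img.ndim - 2))
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0
    # Zero out high frequencies, then transform back to the pixel domain.
    filtered = np.fft.ifft2(np.fft.ifftshift(f * mask, axes=(0, 1)), axes=(0, 1))
    return np.real(filtered)
```

The filtered image replaces the raw image as the regression target; the encoder input and the rest of the MAE-style pipeline are unchanged, which is why the extra cost is negligible.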

Models and Benchmarks

Here we report the results of the models on ImageNet; the details are listed below:

| Algorithm | Backbone | Epoch | Batch Size | Linear Probing (Top-1 %) | Fine-tuning (Top-1 %) | Pretrain Links | Linear Probing Links | Fine-tuning Links |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| PixMIM | ViT-base | 300 | 4096 | 63.3 | 83.1 | config \| model \| log | config \| model \| log | config \| model \| log |
| PixMIM | ViT-base | 800 | 4096 | 67.5 | 83.5 | config \| model \| log | config \| model \| log | config \| model \| log |

Pre-train and Evaluation

Pre-train

If you use a cluster managed by Slurm

# all of our experiments can be run on a single machine, with 8 A100 GPUs
bash tools/slurm_train.sh $partition $job_name configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py --amp

If you use a single machine without any cluster management software

bash tools/dist_train.sh configs/selfsup/pixmim/pixmim_vit-base-p16_8xb512-amp-coslr-300e_in1k.py 8 --amp

Linear Probing

If you use a cluster managed by Slurm

# all of our experiments can be run on a single machine, with 8 A100 GPUs
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp

If you use a single machine without any cluster management software

GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_linear-8xb2048-coslr-torchvision-transform-90e_in1k.py $pretrained_model --amp

Fine-tuning

If you use a cluster managed by Slurm

# all of our experiments can be run on a single machine, with 8 A100 GPUs
bash tools/benchmarks/classification/mim_slurm_train.sh $partition configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp

If you use a single machine without any cluster management software

GPUS=8 bash tools/benchmarks/classification/mim_dist_train.sh configs/selfsup/pixmim/classification/vit-base-p16_ft-8xb128-coslr-100e_in1k.py $pretrained_model --amp

Detection and Segmentation

If you want to evaluate your model on detection or segmentation tasks, we provide a script to convert the model keys from MMClassification style to timm style.

cd $MMSELFSUP
python tools/model_converters/mmcls2timm.py $src_ckpt $dst_ckpt

Then, using the converted checkpoint, you can evaluate your model on the detection task, following Detectron2, and on the semantic segmentation task, following this project. Alternatively, using the unconverted checkpoint, you can evaluate your model with MMSegmentation.

Citation

@article{PixMIM,
  author  = {Yuan Liu and Songyang Zhang and Jiacheng Chen and Kai Chen and Dahua Lin},
  journal = {arXiv preprint arXiv:2303.02416},
  title   = {PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling},
  year    = {2023},
}