3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D Object Detection

Accepted by CVPR 2021

3D object detection is an important yet demanding task that heavily relies on difficult-to-obtain 3D annotations. To reduce the required amount of supervision, we propose 3DIoUMatch, a novel semi-supervised method for 3D object detection applicable to both indoor and outdoor scenes. We leverage a teacher-student mutual learning framework to propagate information from the labeled to the unlabeled training set in the form of pseudo-labels. However, due to the high task complexity, we observe that the pseudo-labels suffer from significant noise and are thus not directly usable. To address this, we introduce a confidence-based filtering mechanism, inspired by FixMatch. We set confidence thresholds based on the predicted objectness and class probability to filter low-quality pseudo-labels. Although this filtering is effective, we observe that these two measures do not sufficiently capture localization quality. We therefore propose to use the estimated 3D IoU as a localization metric and set category-aware self-adjusted thresholds to filter poorly localized proposals. We adopt VoteNet as our backbone detector on indoor datasets, while we use PV-RCNN on the autonomous driving dataset, KITTI. Our method consistently improves upon state-of-the-art methods on both the ScanNet and SUN RGB-D benchmarks by significant margins under all label ratios (including the fully labeled setting). For example, when training with only 10% labeled data on ScanNet, 3DIoUMatch achieves an absolute improvement of 7.7 on mAP@0.25 and 8.5 on mAP@0.5 over the prior art. On KITTI, we are the first to demonstrate semi-supervised 3D object detection, and our method surpasses a fully supervised baseline by 1.8% to 7.6% under different label ratios and categories.
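The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the `Proposal` class, the `filter_pseudo_labels` function, and all threshold values are hypothetical placeholders, and the rule by which the category-aware IoU thresholds self-adjust is not reproduced here (per-class thresholds are simply passed in as a dictionary).

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """One teacher prediction on an unlabeled scene (hypothetical container)."""
    label: str         # predicted class name
    objectness: float  # predicted objectness probability in [0, 1]
    class_prob: float  # predicted probability of `label`
    pred_iou: float    # 3D IoU estimated by the detector itself


def filter_pseudo_labels(proposals, tau_obj=0.9, tau_cls=0.9,
                         tau_iou_per_class=None, default_tau_iou=0.25):
    """Keep only proposals that pass all three confidence checks.

    The numeric defaults are placeholder assumptions for illustration.
    `tau_iou_per_class` optionally maps a class name to its own IoU threshold.
    """
    tau_iou_per_class = tau_iou_per_class or {}
    kept = []
    for p in proposals:
        tau_iou = tau_iou_per_class.get(p.label, default_tau_iou)
        if (p.objectness >= tau_obj
                and p.class_prob >= tau_cls
                and p.pred_iou >= tau_iou):
            kept.append(p)
    return kept


proposals = [
    Proposal("chair", 0.95, 0.97, 0.60),  # confident and well localized
    Proposal("chair", 0.95, 0.97, 0.10),  # confident but poorly localized
    Proposal("table", 0.50, 0.99, 0.80),  # well localized but low objectness
]
kept = filter_pseudo_labels(proposals)
strict = filter_pseudo_labels(proposals, tau_iou_per_class={"chair": 0.7})
```

With these placeholder thresholds only the first proposal survives. The second shows the case the abstract highlights: objectness and class probability are both high, yet the low estimated IoU marks the box as poorly localized.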

[ Paper ]     [ Code and pretrained models ]    



Performance: Comparison with VoteNet and SESS on ScanNet val set and SUN RGB-D val set under different ratios of labeled data

Performance: Comparison with PV-RCNN on KITTI val set under different ratios of labeled data

Visualization: Qualitative results on ScanNet, with 10% labeled data. Here, green bounding boxes have an IoU >= 0.25, while red bounding boxes have an IoU < 0.25.
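For readers unfamiliar with the IoU criterion used in the caption above, the following sketch computes the 3D IoU of two boxes. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the function name `aabb_iou_3d` is hypothetical, and oriented boxes (with a heading angle) would need a rotated-box intersection instead.

```python
def aabb_iou_3d(box_a, box_b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        intersection *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


unit_cube = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
shifted_cube = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)  # shifted by half a side along x
iou = aabb_iou_3d(unit_cube, shifted_cube)     # overlap 0.5, union 1.5
is_green = iou >= 0.25                         # colour rule from the caption
```

Here the two cubes overlap in half their volume, giving an IoU of one third, which would be drawn in green under the 0.25 threshold.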

Visualization: Qualitative results on SUN RGB-D, with 5% labeled data


Latest version (April 7, 2021): arXiv:2012.04355 [cs.CV].


1 Stanford University            2 Tsinghua University            3 NVIDIA             

* stands for equal contribution.


title={3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D Object Detection},
author={Wang, He and Cong, Yezhen and Litany, Or and Gao, Yue and Guibas, Leonidas J},
journal={arXiv preprint arXiv:2012.04355},


This research is supported by a grant from the SAIL-Toyota Center for AI Research, NSF grant CHS-1528025, a Vannevar Bush Faculty fellowship, a TUM/IAS Hans Fischer Senior Fellowship, and gifts from the Adobe, Amazon AWS, and Snap corporations.


If you have any questions, please feel free to contact Yezhen Cong at cyz17_at_mails.tsinghua.edu.cn or He Wang at hewang_at_stanford.edu.