Abstract (Chinese version): As the cost of acquiring point cloud data falls and GPU computing power grows, many 3D vision scenarios, such as autonomous driving, industrial control, and MR/XR, increasingly demand 3D semantic segmentation, which in turn drives the development of deep learning models for 3D point cloud semantic segmentation. Recently, deep learning models have continued to innovate in network architecture, as in RandLA-Net and Point Transformer, and have improved segmentation accuracy at lower computational cost. However, existing surveys of 3D point cloud semantic segmentation cover many early and since-abandoned methods, do not systematically organize these new, efficient methods, and therefore do not reflect the current state of research. In addition, these surveys classify point cloud semantic segmentation methods by the type of data fed to the network, which neither shows how the methods evolved from one another nor helps in comparing their segmentation performance. To address these problems, this paper focuses on the research results of the last three years and the latest research progress, summarizes the methods based on different network architectures, the challenges they face, and potential research directions, and reviews 3D point cloud semantic segmentation systematically at three levels. From this paper, readers can gain a systematic view of the data acquisition methods, common datasets, and model evaluation metrics of 3D point cloud semantic segmentation; compare the development, segmentation performance, strengths, and weaknesses of methods based on different network architectures; and better understand the remaining challenges and potential research directions.
Abstract: The falling cost of acquiring 3D point cloud data, coupled with rapid advances in GPU computing power, has increased the demand for 3D point cloud semantic segmentation in numerous 3D vision applications, including autonomous driving, industrial control, and MR/XR, which in turn advances the development of deep learning methods for 3D point cloud semantic segmentation. Recently, many novel deep learning network architectures, such as RandLA-Net and Point Transformer, have been proposed and have achieved notable improvements in semantic segmentation accuracy while decreasing the computational load. However, previous surveys of 3D point cloud semantic segmentation have focused primarily on relatively early works, many of which have been gradually abandoned over the years, and so cannot accurately reflect the current state of research. Moreover, these surveys categorize methods by their input data types, which makes it difficult to compare the segmentation performance of different techniques and does not give a comprehensive view of the relationships between methods that use different network architectures. Therefore, this paper reviews the mainstream 3D semantic segmentation methods developed in the last three years using different deep learning network architectures and is organized into three levels.
First, the two principal 3D point cloud data acquisition methods are introduced, together with their customary datasets and the metrics used to evaluate model performance. Second, a systematic review of 3D semantic segmentation methods based on different network architectures is presented, followed by a statistical comparison of model performance on two commonly used 3D segmentation datasets, S3DIS and ScanNet. The analysis of model performance on these two datasets covers model structure relevance, strengths, and limitations. Finally, the remaining methodological and application challenges and potential research directions are discussed. This paper offers an extensive overview of the last three years of research progress in 3D point cloud semantic segmentation: it summarizes various network architecture pipelines, elucidates their fundamental operations, compares model performance across multiple architectures, discusses their notable strengths and limitations, and, most importantly, identifies the current challenges and promising research directions for future investigations. Furthermore, based on the analyses presented, this paper helps researchers identify the relevant research and research hotspots among different 3D point cloud semantic segmentation methods. It aims to update the existing reviews of 3D point cloud semantic segmentation methods from an architecture-centered viewpoint, to highlight the key properties and contributions of the proposed methods, and to point out promising research directions for the main challenges.
Key words: 3D vision / point cloud / semantic segmentation / deep learning / network framework
Figure 3. Structures of scaled dot-product attention and a single-layer Transformer encoder (the attention module takes three inputs, the query Q, key K, and value V, and produces a weighted output; the structure of multi-head self-attention is shown inside the right-most dashed box): (a) scaled dot-product attention; (b) single-layer Transformer encoder
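The scaled dot-product attention named in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration assuming the standard formulation softmax(QK^T / sqrt(d_k))V; the function names `softmax` and `scaled_dot_product_attention` and the toy inputs are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key, and value vectors.

    Each query is compared with every key; the scaled similarities are
    turned into weights that sum to 1, and the output is the weighted
    sum of the value vectors.
    """
    d_k = len(keys[0])  # key dimension used for the 1/sqrt(d_k) scaling
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input (assumed values): both keys are identical, so the query
# attends to them equally and the output is the average of the values.
Q = [[1.0, 0.0]]
K = [[1.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights, output)
```

Multi-head self-attention, shown in the dashed box of the figure, runs several such attention operations in parallel on different learned projections of the same input and concatenates the results.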
Figure 4. Illustrations of the fundamental operations of different network architectures (MSA, Add, Norm, and FFN denote multi-head self-attention, residual connection, normalization, and feed-forward network, respectively): (a) convolution network; (b) graph convolution network; (c) attention graph network; (d) Transformer
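To make the abbreviations in panel (d) concrete, the following NumPy sketch chains MSA, Add, Norm, and FFN into one Transformer layer applied to per-point features. It assumes the post-norm ordering of the original Transformer (Norm applied after each residual Add), layer normalization without learnable scale and shift, a ReLU feed-forward network, and no positional encoding; the caption does not fix these details, and all function names, shapes, and random weights here are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    """Norm: normalize each point's feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """MSA: x has shape (n_points, d); d must be divisible by num_heads."""
    n, d = x.shape
    d_h = d // num_heads

    def split_heads(t):
        # (n, d) -> (num_heads, n, d_h)
        return t.reshape(n, num_heads, d_h).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)  # (heads, n, n)
    heads = softmax(scores) @ v                       # (heads, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # back to (n, d)
    return concat @ w_o


def feed_forward(x, w1, b1, w2, b2):
    """FFN: two linear maps with a ReLU in between, applied per point."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def transformer_layer(x, p, num_heads=2):
    """One layer: Norm(x + MSA(x)) followed by Norm(x + FFN(x))."""
    attn = multi_head_self_attention(
        x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads
    )
    x = layer_norm(x + attn)  # Add (residual) then Norm
    ffn = feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"])
    return layer_norm(x + ffn)


# Toy setup (assumed sizes): 5 points with 8-dimensional features.
rng = np.random.default_rng(0)
n_points, d_model, d_ff = 5, 8, 16
params = {
    "w_q": rng.normal(size=(d_model, d_model)),
    "w_k": rng.normal(size=(d_model, d_model)),
    "w_v": rng.normal(size=(d_model, d_model)),
    "w_o": rng.normal(size=(d_model, d_model)),
    "w1": rng.normal(size=(d_model, d_ff)),
    "b1": np.zeros(d_ff),
    "w2": rng.normal(size=(d_ff, d_model)),
    "b2": np.zeros(d_model),
}
features = rng.normal(size=(n_points, d_model))
encoded = transformer_layer(features, params)
print(encoded.shape)
```

The layer maps a set of point features to a set of the same shape, which is why such layers can be stacked. Point cloud Transformers usually also inject coordinate information, which this sketch leaves out.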
Table 1. Popular point cloud semantic segmentation datasets

| Dataset name | Dataset type | Sensors | Scene type | # scenes | # classes | Year |
|---|---|---|---|---|---|---|
| S3DIS[2] | LiDAR point clouds | Matterport camera | indoor | 272 | 13 | 2016 |
| Semantic3D[14] | LiDAR point clouds | Terrestrial laser scanners | outdoor | 30 | 8 | 2017 |
| SemanticKITTI[15] | LiDAR point clouds | Mobile laser scanners | outdoor | 43552 | 28 | 2019 |
| ScanNet[17] | RGB-D images | RGB-D camera | indoor | 1513 | 21 | 2018 |

Note: “#” represents “the number of”.
Table 2. Performance comparison of point cloud semantic segmentation methods with different network architectures on the S3DIS and ScanNet datasets

Columns, in order: Method; Input; S3DIS batch size; S3DIS number of batch points; S3DIS 6-fold mIoU/%; S3DIS tested on Area5 mIoU/%; ScanNet test set overall accuracy/%; ScanNet test set mIoU/%. Methods are grouped by architecture; values follow the column order and blank cells are omitted.

Architecture: MLP
PointNet[58] points 47.60 41.1 73.9 14.69
PointNet++[64] points 54.50 51.5 84.5 38.28

Architecture: CNN
MinkowskiNet[20] voxels 65.35 72.1
PointConv[3] points 50.34 55.6
PointWeb[23] points 16 66.70 60.28 85.9
A-CNN[24] points 85.4
KPConv[27] points 70.60 67.1 68.4
PAConv[40] points 4096 66.58

Architecture: GNN
HDGCN[18] points 66.85 59.33
TGNet[19] points 16 58.70 66.2
DGCNN[4] points 12 4096 56.10
ResGCN-28[26] points 60.00
SegGCN[32] points 8 8192 63.60 58.9
SPH3D-GCN[35] points 16 8192 68.90 59.5 61

Architecture: Attention
GACNet[21] points 16 62.85
RandLA-Net[31] points 70.00
AGCN[34] points 4096 56.63
PAN[37] points 32 66.30 86.7 42.1

Architecture: Pure Transformer
PATs[22] points 60.1
PCT[42] points 61.33
PVT[43] voxels+points 4096 68.21
Point Transformer[5] points 8 73.50 70.4
LFT-Net[46] points 65.2
Fast Point Transformer[47] voxels 70.1 72.1
Point Transformer V2[56] points 71.6 75.2

Architecture: Hybrid Transformer
Stratified Transformer[48] points 16 72
MinkNet18+Segment-Fusion[67] points 65.3
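The two metrics reported in Table 2, overall accuracy and mIoU, can both be derived from a confusion matrix over per-point labels. The sketch below assumes the usual definitions: overall accuracy is the fraction of correctly labeled points, and per-class IoU is TP / (TP + FP + FN), averaged over classes. Skipping classes that appear in neither the labels nor the predictions is a choice made here, not one stated in the source, and the function names and toy labels are hypothetical.

```python
def confusion_matrix(labels, preds, num_classes):
    """cm[t][p] counts points with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(labels, preds):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of points whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN)."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                # class c points predicted as other classes
        fp = sum(row[c] for row in cm) - tp  # other classes predicted as class c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Toy example (assumed labels): six points, three classes.
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(labels, preds, num_classes=3)
oa = overall_accuracy(cm)
miou = mean_iou(cm)
print(f"OA = {oa:.3f}, mIoU = {miou:.3f}")  # OA = 0.667, mIoU = 0.500
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2. mIoU penalizes both missed and spurious predictions for every class equally, so it is less dominated by frequent classes than overall accuracy.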
References
[1] Riemenschneider H, Bódis-Szomorú A, Weissenberg J, et al. Learning where to classify in multi-view semantic segmentation // European Conference on Computer Vision. Zurich, 2014: 516
[2] Armeni I, Sener O, Zamir A R, et al. 3D semantic parsing of large-scale indoor spaces // 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, 2016: 1534
[3] Wu W X, Qi Z A, Li F X. PointConv: deep convolutional networks on 3D point clouds // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, 2019: 9613
[4] Wang Y, Sun Y B, Liu Z W, et al. Dynamic graph CNN for learning on point clouds. ACM Trans Graph, 2019, 38(5): 1
[5] Zhao H S, Jiang L, Jia J Y, et al. Point transformer // 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, 2021: 16259
[6] Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504 doi: 10.1126/science.1127647
[7] Guo Y L, Wang H Y, Hu Q Y, et al. Deep learning for 3D point clouds: A survey. IEEE Trans Pattern Anal Mach Intell, 2020, 43(12): 4338
[8] Xie Y X, Tian J J, Zhu X X. Linking points with labels in 3D: A review of point cloud semantic segmentation. IEEE Geosci Remote Sens Mag, 2020, 8(4): 38 doi: 10.1109/MGRS.2019.2937630
[9] He Y, Yu H S, Liu X Y, et al. Deep learning based 3D segmentation: A survey [J/OL]. arXiv preprint (2021-3-10) [2022-12-17]. https://arxiv.org/abs/2103.05423
[10] Lahoud J, Cao J L, Khan F S, et al. 3D vision with transformers: A survey [J/OL]. arXiv preprint (2022-8-8) [2022-12-17]. https://arxiv.org/abs/2208.04309
[11] Lu D N, Xie Q, Wei M Q, et al. Transformers in 3D point clouds: A survey [J/OL]. arXiv preprint (2017-5-24) [2022-12-17]. https://arxiv.org/abs/2205.07417
[12] Zeng J H, Wang D C, Chen P. A survey on transformers for point cloud processing: An updated overview. IEEE Access, 2022, 10: 86510 doi: 10.1109/ACCESS.2022.3198999
[13] Gao B, Pan Y C, Li C K, et al. Are we hungry for 3D LiDAR data for semantic segmentation? A survey of datasets and methods. IEEE Trans Intell Transp Syst, 2021, 23(7): 6063
[14] Hackel T, Savinov N, Ladicky L, et al. Semantic3D.net: A new large-scale point cloud classification benchmark [J/OL]. arXiv preprint (2017-5-24) [2022-12-17]. https://arxiv.org/abs/1704.03847
[15] Behley J, Garbade M, Milioto A, et al. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences // 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, 2019: 9296
[16] Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite // 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, 2012: 3354
[17] Dai A, Chang A X, Savva M, et al. ScanNet: richly-annotated 3D reconstructions of indoor scenes // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, 2017: 5828
[18] Liang Z D, Yang M, Deng L Y, et al. Hierarchical depthwise graph convolutional neural network for 3D semantic segmentation of point clouds // 2019 International Conference on Robotics and Automation (ICRA). Montreal, 2019: 8152
[19] Li Y, Ma L F, Zhong Z L, et al. TGNet: Geometric graph CNN on 3-D point cloud segmentation. IEEE Trans Geosci Remote Sens, 2019, 58(5): 3588
[20] Choy C, Gwak J Y, Savarese S. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 3075
[21] Wang L, Huang Y C, Hou Y L, et al. Graph attention convolution for point cloud semantic segmentation // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, 2019: 10296
[22] Yang J C, Zhang Q, Ni B B, et al. Modeling point clouds with self-attention and gumbel subset sampling // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 3323
[23] Zhao H S, Jiang L, Fu C W, et al. PointWeb: enhancing local neighborhood features for point cloud processing // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, 2019: 5565
[24] Komarichev A, Zhong Z C, Hua J. A-CNN: Annularly convolutional neural networks on point clouds // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, 2019: 7421
[25] Meng H Y, Gao L, Lai Y K, et al. VV-net: Voxel VAE net with group convolutions for point cloud segmentation // 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, 2019: 8500
[26] Li G H, Müller M, Thabet A, et al. DeepGCNs: can GCNs go as deep as CNNs? // 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, 2019: 9267
[27] Thomas H, Qi C R, Deschaud J E, et al. KPConv: flexible and deformable convolution for point clouds // 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, 2019: 6411
[28] Milioto A, Vizzo I, Behley J, et al. RangeNet++: fast and accurate LiDAR semantic segmentation // 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Macau, 2019: 4213
[29] Ma Y N, Guo Y L, Liu H, et al. Global context reasoning for semantic segmentation of 3D point clouds // 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Snowmass, 2020: 2931
[30] Shi H Y, Lin G S, Wang H, et al. SpSequenceNet: semantic segmentation network on 4D point clouds // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, 2020: 4574
[31] Hu Q Y, Yang B, Xie L H, et al. RandLA-Net: Efficient semantic segmentation of large-scale point clouds // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, 2020: 11108
[32] Lei H, Akhtar N, Mian A. SegGCN: efficient 3D point cloud segmentation with fuzzy spherical kernel // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, 2020: 11611
[33] Xu C F, Wu B C, Wang Z N, et al. SqueezeSegV3: Spatially-adaptive convolution for efficient point-cloud segmentation // European Conference on Computer Vision. Glasgow, 2020: 1
[34] Xie Z Y, Chen J Z, Peng B. Point clouds learning with attention-based graph convolution networks. Neurocomputing, 2020, 402: 245 doi: 10.1016/j.neucom.2020.03.086
[35] Lei H, Akhtar N, Mian A. Spherical kernel for efficient graph convolution on 3D point clouds. IEEE Trans Pattern Anal Mach Intell, 2020, 43(10): 3664
[36] Wen X, Han Z Z, Youk G, et al. CF-SIS: Semantic-instance segmentation of 3D point clouds by context fusion with self-attention // Proceedings of the 28th ACM International Conference on Multimedia. Seattle, 2020: 1661
[37] Feng M T, Zhang L, Lin X F, et al. Point attention network for semantic segmentation of 3D point clouds. Pattern Recognit, 2020, 107: 107446 doi: 10.1016/j.patcog.2020.107446
[38] Zhang G G, Ma Q H, Jiao L C, et al. AttAN: Attention adversarial networks for 3D point cloud semantic segmentation // Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. Yokohama, 2020: 789
[39] Huang H, Fang Y. Adaptive wavelet transformer network for 3D shape representation learning // International Conference on Learning Representations. Hefei, 2022: 1
[40] Xu M T, Ding R Y, Zhao H S, et al. PAConv: position adaptive convolution with dynamic kernel assembling on point clouds // 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, 2021: 3173
[41] Fan H H, Yang Y, Kankanhalli M. Point 4D transformer networks for spatio-temporal modeling in point cloud videos // 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, 2021: 14204
[42] Guo M H, Cai J X, Liu Z N, et al. PCT: Point cloud transformer. Comp Visual Media, 2021, 7(2): 187 doi: 10.1007/s41095-021-0229-5
[43] Zhang C, Wan H C, Shen X Y, et al. PVT: Point-voxel transformer for point cloud learning [J/OL]. arXiv preprint (2022-5-25) [2022-12-17]. https://arxiv.org/abs/2108.06076
[44] Wan J, Xie Z, Xu Y Y, et al. DGANet: A dilated graph attention-based network for local feature extraction on 3D point clouds. Remote Sens, 2021, 13(17): 3484 doi: 10.3390/rs13173484
[45] Wei Y M, Liu H, Xie T T, et al. Spatial-temporal transformer for 3D point cloud sequences // 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, 2022: 1171
[46] Gao Y B, Liu X B, Li J, et al. LFT-Net: Local feature transformer network for point clouds analysis. IEEE Trans Intell Transp Syst, 2023, 24(2): 2158
[47] Park C, Jeong Y, Cho M, et al. Fast point transformer // 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, 2022: 16949
[48] Lai X, Liu J H, Jiang L, et al. Stratified transformer for 3D point cloud segmentation // 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, 2022: 8500
[49] Xu S J, Wan R, Ye M S, et al. Sparse cross-scale attention network for efficient LiDAR panoptic segmentation // Proceedings of the AAAI Conference on Artificial Intelligence. Online, 2022: 2920
[50] Yu X M, Tang L L, Rao Y M, et al. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling // 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, 2022: 19313
[51] Fu K X, Yuan M Z, Wang M N. Point-McBert: A multi-choice self-supervised framework for point cloud pre-training [J/OL]. arXiv preprint (2022-8-15) [2022-12-17]. https://arxiv.org/abs/2207.13226
[52] Zeng Z Y, Xu Y Y, Xie Z, et al. RG-GCN: A random graph based on graph convolution network for point cloud semantic segmentation. Remote Sens, 2022, 14: 4055 doi: 10.3390/rs14164055
[53] Wu Y X, Liao K L, Chen J T, et al. D-former: A U-shaped dilated transformer for 3D medical image segmentation. Neural Comput Appl, 2022, 35: 1931
[54] Qian G C, Zhang X D, Hamdi A, et al. Improving standard transformer models for 3D point cloud understanding with image pretraining [J/OL]. arXiv preprint (2022-11-22) [2022-12-17]. https://arxiv.org/abs/2208.12259
[55] Yan X, Gao J T, Zheng C D, et al. 2DPASS: 2D priors assisted semantic segmentation on LiDAR point clouds // European Conference on Computer Vision. Tel Aviv, 2022: 677
[56] Wu X Y, Lao Y X, Jiang L, et al. Point transformer V2: Grouped vector attention and partition-based pooling [J/OL]. arXiv preprint (2022-10-11) [2022-12-17]. https://arxiv.org/abs/2210.05666
[57] Mousavian A, Pirsiavash H, Košecká J. Joint semantic segmentation and depth estimation with deep convolutional networks // 2016 Fourth International Conference on 3D Vision (3DV). Stanford, 2016: 611
[58] Charles R Q, Hao S, Mo K C, et al. PointNet: deep learning on point sets for 3D classification and segmentation // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, 2017: 652
[59] Wu B C, Zhou X Y, Zhao S C, et al. SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud // 2019 International Conference on Robotics and Automation (ICRA). Montreal, 2019: 4376
[60] Wu B C, Wan A, Yue X Y, et al. SqueezeSeg: convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud // 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, 2018: 1887
[61] Xu Q G, Sun X D, Wu C Y, et al. Grid-GCN for fast and scalable point cloud learning // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, 2020: 5661
[62] Lei H, Akhtar N, Mian A. Octree guided CNN with spherical kernels for 3D point clouds // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, 2019: 9631
[63] Liang Z D, Yang M, Li H, et al. 3D instance embedding learning with a structure-aware loss function for point cloud segmentation. IEEE Robotics Autom Lett, 2020, 5(3): 4915 doi: 10.1109/LRA.2020.3004802
[64] Qi C R, Yi L, Su H, et al. PointNet++: Deep hierarchical feature learning on point sets in a metric space // Advances in Neural Information Processing Systems. Long Beach, 2017: 1
[65] Liu J W, Liu J W, Luo X L. Research progress in attention mechanism in deep learning. Chin J Eng, 2021, 43(11): 1499
[66] Guo M H, Xu T X, Liu J J, et al. Attention mechanisms in computer vision: A survey. Comput Vis Media, 2022, 8(3): 331 doi: 10.1007/s41095-022-0271-y
[67] Thyagharajan A, Ummenhofer B, Laddha P, et al. Segment-fusion: Hierarchical context fusion for robust 3D semantic segmentation // 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, 2022: 1236
[68] Li R H, Li X Z, Heng P A, et al. PointAugment: an auto-augmentation framework for point cloud classification // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, 2020: 6378
[69] Xiao A R, Huang J X, Guan D Y, et al. Unsupervised representation learning for point clouds: A survey [J/OL]. arXiv preprint (2022-6-5) [2022-12-17]. https://arxiv.org/abs/2202.13589
[70] Liu M H, Zhou Y, Qi C R, et al. LESS: Label-efficient semantic segmentation for LiDAR point clouds // European Conference on Computer Vision. Tel Aviv, 2022: 70
[71] Jhaldiyal A, Chaudhary N. Semantic segmentation of 3D LiDAR data using deep learning: A review of projection-based methods. Appl Intell, 2023, 53(6): 6844 doi: 10.1007/s10489-022-03930-5
[72] Guo M H, Lu C Z, Hou Q B, et al. SegNeXt: Rethinking convolutional attention design for semantic segmentation [J/OL]. arXiv preprint (2022-9-18) [2023-12-17]. https://arxiv.org/abs/2209.08575
[73] Qian G C, Li Y C, Peng H W, et al. PointNeXt: Revisiting PointNet++ with improved training and scaling strategies [J/OL]. arXiv preprint (2022-10-12) [2022-12-17]. https://arxiv.org/abs/2206.04670
[74] Xie X, Bai L, Huang X M. Real-time LiDAR point cloud semantic segmentation for autonomous driving. Electronics, 2021, 11(1): 11 doi: 10.3390/electronics11010011