Voiceprint recognition method based on SE-DR-Res2Block

李平; 高清源; 夏宇; 張小勇; 曹毅

doi:10.13374/j.issn2095-9389.2022.09.19.001

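Summary: To address the insufficient feature representation and weak generalization of the traditional Res2Net model in voiceprint recognition, this paper proposes SE-DR-Res2Block (squeeze and excitation with dense and residual connected Res2Block), a feature extraction module that combines dense connections with residual connections. First, the ECAPA-TDNN (emphasized channel attention, propagation and aggregation in time delay neural network) architecture, which uses the traditional Res2Block, is introduced together with dense connections and their working principle. Then, to extract features more efficiently, dense connections are used to mine features more thoroughly, and residual and dense connections are combined on the basis of the SE-Block (squeeze and excitation block) to form a more efficient feature extraction module, SE-DR-Res2Net. The module obtains combinations of different growth rates and multiple receptive fields at a finer granularity, which yields multiscale feature representations and maximizes feature reuse, so that information from features at different layers is extracted effectively and features at more scales are captured. Finally, to verify the module's effectiveness, SE-Res2Block (squeeze and excitation Res2Block), FULL-SE-Res2Block (fully connected squeeze and excitation Res2Block), SE-DR-Res2Block, and FULL-SE-DR-Res2Block (fully connected squeeze and excitation with dense and residual connected Res2Block) are applied in different network models, and voiceprint recognition experiments are conducted on the VoxCeleb1 and SITW (Speakers in the Wild) datasets. The results show that the ECAPA-TDNN model with SE-DR-Res2Block achieves best equal error rates of 2.24% and 3.65% on the two datasets, respectively, which confirms the module's feature representation ability; the results on different test sets also confirm its good generalization ability.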
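The difference between the two connection patterns named above can be illustrated with a toy sketch. This is not the authors' SE-DR-Res2Block: `toy_layer` is a hypothetical stand-in for a convolution, and the feature vectors are plain Python lists. It only shows that a residual connection adds a layer's output to its input (channel count unchanged), whereas a dense connection feeds each layer the concatenation of all earlier outputs (channel count grows by an assumed `growth_rate` per layer).

```python
def toy_layer(inputs, width):
    """Hypothetical stand-in for a convolution: emits `width` features,
    each the mean of the inputs plus a small offset."""
    mean = sum(inputs) / len(inputs)
    return [mean + 0.1 * k for k in range(width)]


def residual_stack(x, depth):
    """Residual connection: each layer's output is added to its own input."""
    for _ in range(depth):
        fx = toy_layer(x, len(x))
        x = [a + b for a, b in zip(x, fx)]
    return x


def dense_stack(x, depth, growth_rate):
    """Dense connection: each layer sees all earlier outputs, concatenated."""
    features = list(x)
    for _ in range(depth):
        new = toy_layer(features, growth_rate)
        features = features + new  # concatenate along the channel axis
    return features


x0 = [1.0, 2.0, 3.0, 4.0]
res_out = residual_stack(x0, depth=3)
dense_out = dense_stack(x0, depth=3, growth_rate=2)
print(len(res_out), len(dense_out))
```

With four input channels, three layers, and a growth rate of 2, the residual stack keeps 4 channels while the dense stack ends with 4 + 3 × 2 = 10, which is the kind of growth that makes a heavily dense design expensive in parameters and computation.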

Abstract: To address the insufficient feature representation and weak generalization of the traditional Res2Net model in voiceprint recognition, this paper proposes a feature extraction module, SE-DR-Res2Block, which combines dense connections with residual connections. Low-level semantic features carry spatial information and focus on detail, whereas high-level semantic features focus on global information and abstract features; combining the two compensates for the detail that is lost through abstraction. First, each layer in the dense connection structure derives its features from the outputs of all preceding layers, which realizes feature reuse. Second, the structure and working principle of the ECAPA-TDNN network, which uses the traditional Res2Block, are introduced. To extract features more efficiently, dense connections are used to mine features more fully, and, based on the SE-Block, a more efficient feature extraction module, SE-DR-Res2Net, is proposed by combining residual and dense connections. Unlike the traditional SE-Block structure, convolutional layers are used here instead of fully connected layers, because they not only reduce the number of parameters to be trained but also allow weight sharing, thereby reducing overfitting. Effectively extracting feature information from different layers is essential for obtaining multiscale representations and maximizing feature reuse. However, when feature information at more scales is collected, a large number of dense structures can sharply increase the parameter count and computational complexity. Using partial residual structures in place of dense structures prevents this sharp increase while maintaining performance to a certain extent.
Finally, to verify the effectiveness of the module, SE-Res2Block, FULL-SE-Res2Block, SE-DR-Res2Block, and FULL-SE-DR-Res2Block are applied in different network models, and experiments are conducted on the VoxCeleb1 and SITW (Speakers in the Wild) datasets. A performance comparison of Res2Net-50 models with different modules on the VoxCeleb1 dataset shows that SE-DR-Res2Net-50 achieves the best equal error rate, 3.51%, which also validates the adaptability of the module to different networks. The use of different modules in different networks was compared, and experiments and analyses were conducted on different datasets. The experimental results show that the ECAPA-TDNN network model using SE-DR-Res2Block reaches optimal equal error rates of 2.24% and 3.65% on the two datasets, respectively. This verifies the feature representation ability of the module, and the results on different test datasets also confirm its good generalization ability.
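The results above are reported as equal error rates (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be approximated from trial scores; it is not the authors' evaluation code. The function name `equal_error_rate`, the score lists, and the choice to sweep only the observed scores as thresholds are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical similarity scores: higher means "more likely the same speaker".
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(same_speaker, different_speaker)
print(f"EER = {eer:.2%}")  # EER = 25.00%
```

In this toy case, a threshold of 0.6 rejects one of four same-speaker trials and accepts one of four different-speaker trials, so both error rates are 25%. Evaluation toolkits usually interpolate between thresholds rather than picking the nearest observed one, so their values can differ slightly on real trial lists.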
