Abstract:
To address the insufficient feature expression ability and weak generalization of the traditional Res2Net model in voiceprint recognition, this paper proposes SE-DR-Res2Block, a feature extraction module that combines dense connections with residual connections. First, the dense connection structure is described: the features of each layer are derived from the feature outputs of all preceding layers, which realizes feature reuse. Second, the structure and working principle of the ECAPA-TDNN network using the traditional Res2Block are introduced. Then, to extract features more efficiently, dense connections are used to mine features more fully: building on the SE-block, the more efficient feature extraction module SE-DR-Res2Block is proposed by combining residual and dense connections. The module combines different growth rates and multiple receptive fields at a finer granularity, yielding a multi-scale feature representation and maximizing feature reuse, so that feature information from different layers is extracted effectively. Finally, to verify the effectiveness of the module, SE-Res2Block, Full-SE-Res2Block, SE-DR-Res2Block, and Full-SE-DR-Res2Block are each applied in different network models, and experiments are conducted on the VoxCeleb1 and SITW datasets. The experimental results show that the ECAPA-TDNN network model using SE-DR-Res2Block reaches best equal error rates of 2.24% and 3.65%, respectively, which verifies the feature expression ability of the module; the results on different test sets also verify that it generalizes well.
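For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be approximated from verification scores; it is not the evaluation code used in this work. The function name `equal_error_rate` and the toy scores are illustrative assumptions, and the threshold sweep over observed scores is one simple approximation among several.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores of same-speaker trials.
    nontarget_scores: similarity scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, for illustration only).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))  # 0.25
```

In this toy case, a threshold of 0.6 rejects one of four same-speaker trials and accepts one of four different-speaker trials, so both error rates are 25%. A lower EER means the speaker embeddings separate the two kinds of trial better.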