<th id="5nh9l"></th><strike id="5nh9l"></strike><th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th><strike id="5nh9l"></strike>
<progress id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"><noframes id="5nh9l">
<th id="5nh9l"></th> <strike id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span>
<progress id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span><strike id="5nh9l"><noframes id="5nh9l"><strike id="5nh9l"></strike>
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"></span><span id="5nh9l"><video id="5nh9l"></video></span>
<th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th>
<progress id="5nh9l"><noframes id="5nh9l">
  • Indexed by the Engineering Index (Ei Compendex)
  • Chinese core journal
  • Source journal of the China Science and Technology Paper Statistics
  • Source journal of the Chinese Science Citation Database (CSCD)


Research on automatic speech recognition based on a DL–T and transfer learning

ZHANG Wei, LIU Chen, FEI Hong-bo, LI Wei, YU Jing-hu, CAO Yi

Citation: ZHANG Wei, LIU Chen, FEI Hong-bo, LI Wei, YU Jing-hu, CAO Yi. Research on automatic speech recognition based on a DL–T and transfer learning[J]. Chinese Journal of Engineering, 2021, 43(3): 433-441. doi: 10.13374/j.issn2095-9389.2020.01.12.001


doi: 10.13374/j.issn2095-9389.2020.01.12.001
Funds: This work was supported by the National Natural Science Foundation of China (51375209), the Jiangsu Province "Six Talent Peaks" Program (ZBZZ–012), and the Jiangsu Province Postgraduate Innovation Program (KYCX18_0630, KYCX18_1846)
    Corresponding author: E-mail: caoyi@jiangnan.edu.cn

  • Chinese Library Classification (CLC) number: TN912.3

  • Abstract: To address the high prediction error rate and slow convergence of the RNN–T (recurrent neural network transducer) in speech recognition, this paper proposes an acoustic modeling method based on the DL–T (DenseNet–LSTM–Transducer). The RNN–T acoustic model is first introduced. A new acoustic model, the DL–T, is then proposed by combining a DenseNet with an LSTM network: it extracts high-dimensional information from the raw speech, which strengthens feature reuse and alleviates gradient problems so that information propagates more easily through deep layers, giving the model both a low prediction error rate and fast convergence. Next, to further improve the accuracy of the acoustic model, a transfer learning method suited to the DL–T is proposed. Finally, to verify these methods, speech recognition experiments were conducted with the DL–T acoustic model on the Aishell–1 dataset. The results show that the DL–T achieves a relative 12.52% reduction in prediction error rate compared with the RNN–T, with a final character error rate of 10.34%. The DL–T therefore markedly improves both the prediction error rate and the convergence speed of the RNN–T.
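To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch (the authors use PyTorch [27]) of a DenseNet-plus-LSTM encoder of the kind the DL–T combines. All class names, layer counts, and dimensions here are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Densely connected conv layers: each layer receives all earlier feature maps."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1)))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # concatenation = feature reuse
        return x

class DLTEncoder(nn.Module):
    """Sketch of a DL-T-style encoder: DenseNet front end, then an LSTM stack."""
    def __init__(self, n_mels=80, growth_rate=16, dense_layers=4,
                 lstm_hidden=512, lstm_layers=3):
        super().__init__()
        self.dense = DenseBlock(1, growth_rate, dense_layers)
        self.lstm = nn.LSTM(self.dense.out_channels * n_mels, lstm_hidden,
                            lstm_layers, batch_first=True)

    def forward(self, feats):               # feats: (batch, time, n_mels)
        x = self.dense(feats.unsqueeze(1))  # conv features with dense reuse
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.lstm(x)               # temporal modelling
        return out                          # (batch, time, lstm_hidden)

enc = DLTEncoder()
print(enc(torch.randn(2, 100, 80)).shape)  # -> torch.Size([2, 100, 512])
```

The dense block's channel concatenation is what enables the feature reuse and eased gradient flow that the abstract credits for the DL–T's faster convergence.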

     

  • Figure 1.  Structure of the RNN–T acoustic model
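As a companion to Figure 1, the sketch below shows how an RNN–T joint network conventionally combines encoder states (one per acoustic frame t) and prediction-network states (one per output position u) into logits over the vocabulary plus a blank symbol. The dimensions and vocabulary size are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Standard RNN-T joint network: combine every (t, u) state pair."""
    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=512, vocab_size=4000):
        super().__init__()
        self.fc_enc = nn.Linear(enc_dim, joint_dim)
        self.fc_pred = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for the blank label

    def forward(self, h_enc, h_pred):
        # h_enc: (B, T, enc_dim); h_pred: (B, U, pred_dim)
        joint = torch.tanh(self.fc_enc(h_enc).unsqueeze(2)       # (B, T, 1, D)
                           + self.fc_pred(h_pred).unsqueeze(1))  # (B, 1, U, D)
        return self.out(joint)  # (B, T, U, vocab + 1), consumed by the RNN-T loss
```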

    Figure 2.  Model structure of DenseNet

    Figure 3.  Encoder network structure of the DL–T

    Figure 4.  Structure of the transfer learning method

    Figure 5.  Curves of the baseline model: (a) loss in the initial training stage; (b) loss in the transfer learning stage; (c) prediction error rate in the initial training stage; (d) prediction error rate in the transfer learning stage

    Figure 6.  Curves of the DL–T (DenseNet–LSTM–Transducer): (a) loss of different acoustic models in the initial training stage; (b) loss of different acoustic models in the transfer learning stage; (c) prediction error rate of different acoustic models in the initial training stage; (d) prediction error rate of different acoustic models in the transfer learning stage

    Table 1.  Experimental results of the RNN–T baseline (CER, %)

    Acoustic model   Initial model         TL                    TL+LM
                     Dev CER    Test CER   Dev CER    Test CER   Dev CER    Test CER
    RNN–T [15]       –          –          –          –          10.13      11.82
    E3D1             17.69      18.92      14.42      16.31      12.07      13.57
    E4D1             15.03      17.39      13.66      15.58      11.25      13.07
    E5D1             19.62      22.35      14.14      16.22      11.89      13.53
    E4D2             12.12      14.54      10.74      12.74      9.13       10.65

    Table 2.  Experimental results of the DL–T (CER, %)

    Acoustic model   Initial model         TL                    TL+LM
                     Dev CER    Test CER   Dev CER    Test CER   Dev CER    Test CER
    SA–T [15]        –          –          –          –          9.21       10.46
    LAS [28]         –          –          –          –          –          10.56
    DE3D1            15.17      17.31      13.78      15.92      11.85      13.52
    DE4D1            13.70      15.84      12.78      14.80      11.21      12.95
    DE5D1            15.92      18.38      13.46      15.30      11.57      13.90
    DE4D2            11.23      13.45      10.69      12.55      8.80       10.34

    Table 3.  Effects of different language model weights on the acoustic model (CER, %)

    LM weight   Dev CER   Test CER
    0.2         8.91      10.47
    0.3         8.80      10.34
    0.4         8.89      10.45
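Table 3 sweeps a tunable language-model weight when combining acoustic-model and language-model scores. A common way to apply such a weight is log-linear shallow fusion during beam search; the one-liner below sketches that scoring rule under this assumption (the page does not show the paper's exact fusion formula), with lm_weight = 0.3 being the best value in Table 3.

```python
def fused_score(am_log_prob: float, lm_log_prob: float, lm_weight: float = 0.3) -> float:
    """Shallow fusion: score = log P_AM(y|x) + lm_weight * log P_LM(y)."""
    return am_log_prob + lm_weight * lm_log_prob
```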
    <th id="5nh9l"></th><strike id="5nh9l"></strike><th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th><strike id="5nh9l"></strike>
    <progress id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"><noframes id="5nh9l">
    <th id="5nh9l"></th> <strike id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span>
    <progress id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span><strike id="5nh9l"><noframes id="5nh9l"><strike id="5nh9l"></strike>
    <span id="5nh9l"><noframes id="5nh9l">
    <span id="5nh9l"><noframes id="5nh9l">
    <span id="5nh9l"></span><span id="5nh9l"><video id="5nh9l"></video></span>
    <th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th>
    <progress id="5nh9l"><noframes id="5nh9l">
    259luxu-164
  • [1] Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag, 2012, 29(6): 82
    [2] Graves A, Mohamed A, Hinton G E. Speech recognition with deep recurrent neural networks // 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, 2013: 6645
    [3] Seltzer M L, Ju Y C, Tashev I, et al. In-car media search. IEEE Signal Process Mag, 2011, 28(4): 50
    [4] Yu D, Deng L. Analytical Deep Learning: Speech Recognition Practice. Yu K, Qian Y M, Translated. 5th ed. Beijing: Publishing House of Electronic Industry, 2016
    [5] Peddinti V, Wang Y M, Povey D, et al. Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process Lett, 2018, 25(3): 373
    [6] Povey D, Cheng G F, Wang Y M, et al. Semi-orthogonal low-rank matrix factorization for deep neural networks // Conference of the International Speech Communication Association. Hyderabad, 2018: 3743
    [7] Xing A H, Zhang P Y, Pan J L, et al. SVD-based DNN pruning and retraining. J Tsinghua Univ Sci Technol, 2016, 56(7): 772
    [8] Graves A, Fernandez S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks // Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, 2006: 369
    [9] Zhang Y, Pezeshki M, Brakel P, et al. Towards end-to-end speech recognition with deep convolutional neural networks // Conference of the International Speech Communication Association. California, 2016: 410
    [10] Zhang W, Zhai M H, Huang Z L, et al. Towards end-to-end speech recognition with deep multipath convolutional neural networks // 12th International Conference on Intelligent Robotics and Applications. Shenyang, 2019: 332
    [11] Zhang S L, Lei M. Acoustic modeling with DFSMN-CTC and joint CTC-CE learning // Conference of the International Speech Communication Association. Hyderabad, 2018: 771
    [12] Dong L H, Xu S, Xu B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition // IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, 2018: 5884
    [13] Graves A. Sequence transduction with recurrent neural networks // Proceedings of the 29th International Conference on Machine Learning. Edinburgh, 2012: 235
    [14] Rao K, Sak H, Prabhavalkar R. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer // 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Okinawa, 2017
    [15] Tian Z K, Yi J Y, Tao J H, et al. Self-attention transducers for end-to-end speech recognition // Conference of the International Speech Communication Association. Graz, 2019: 4395
    [16] Bu H, Du J Y, Na X Y, et al. Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline[J/OL]. arXiv preprint (2017-09-16)[2019-10-10]. http://arxiv.org/abs/1709.05522
    [17] Battenberg E, Chen J T, Child R, et al. Exploring neural transducers for end-to-end speech recognition // 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Okinawa, 2017: 206
    [18] Williams R J, Zipser D. Gradient-based learning algorithms for recurrent networks and their computational complexity // Back-propagation: Theory, Architectures and Applications. 1995: 433
    [19] Huang G, Liu Z, Maaten L V D, et al. Densely connected convolutional networks // IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017: 4700
    [20] Cao Y, Huang Z L, Zhang W, et al. Urban sound event classification with the N-order dense convolutional network. J Xidian Univ Nat Sci, 2019, 46(6): 9
    [21] Zhang S, Gong Y H, Wang J J. The development of deep convolutional neural networks and its application in computer vision. Chin J Comput, 2019, 42(3): 453
    [22] Zhou F Y, Jin L P, Dong J. Review of convolutional neural networks. Chin J Comput, 2017, 40(6): 1229 doi: 10.11897/SP.J.1016.2017.01229
    [23] Yi J Y, Tao J H, Liu B, et al. Transfer learning for acoustic modeling of noise robust speech recognition. J Tsinghua Univ Sci Technol, 2018, 58(1): 55
    [24] Xue J B, Han J Q, Zheng T R, et al. A multi-task learning framework for overcoming the catastrophic forgetting in automatic speech recognition[J/OL]. arXiv preprint (2019-04-17)[2019-10-10]. https://arxiv.org/abs/1904.08039
    [25] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality // Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. Canada, 2013: 3111
    [26] Povey D, Ghoshal A, Boulianne G, et al. The Kaldi speech recognition toolkit // IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Big Island, 2011
    [27] Paszke A, Gross S, Chintala S, et al. Automatic differentiation in PyTorch // 31st Conference on Neural Information Processing Systems. Long Beach, 2017
    [28] Shan C, Weng C, Wang G, et al. Component fusion: learning replaceable language model component for end-to-end speech recognition system // IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, 2019: 5361
Publication history
  • Received: 2020-01-12
  • Published online: 2022-10-14
  • Published in issue: 2021-03-26
