  • Indexed in the Engineering Index (EI)
  • Chinese Core Journal
  • Source journal for Chinese Science and Technology Paper Statistics
  • Source journal of the Chinese Science Citation Database

Sample strategy based on TD-error for offline reinforcement learning

ZHANG Longfei, FENG Yanghe, LIANG Xingxing, LIU Shixuan, CHENG Guangquan, HUANG Jincai

Citation: ZHANG Longfei, FENG Yanghe, LIANG Xingxing, LIU Shixuan, CHENG Guangquan, HUANG Jincai. Sample strategy based on TD-error for offline reinforcement learning[J]. Chinese Journal of Engineering. doi: 10.13374/j.issn2095-9389.2022.10.22.001


doi: 10.13374/j.issn2095-9389.2022.10.22.001
Funding: National Natural Science Foundation of China (General Program, No. 62273352)
Details
    Corresponding author: E-mail: fengyanghe@nudt.edu.cn

  • CLC number: TG142.71

  • Abstract: Offline reinforcement learning learns an action policy offline from pre-collected expert data or other experience data, without interacting with the environment. Compared with online reinforcement learning, it offers higher sample efficiency and lower interaction cost. Reinforcement learning typically represents the value of a state-action pair with a Q-value estimation function or a Q-value estimation network. Because Q-value estimation errors cannot be corrected promptly through interaction with the environment, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, a sampling method for offline reinforcement learning based on temporal-difference (TD) error is proposed. The TD error serves as the priority measure for prioritized sampling, and combining prioritized sampling with standard sampling improves sampling efficiency and alleviates the out-of-distribution error problem. On top of double Q-value estimation networks, the performance of the algorithms corresponding to three TD-error measures, defined by different ways of computing the target network value, is also compared. In addition, an importance-sampling mechanism is used to eliminate the training bias introduced by the preferential sampling of the prioritized experience replay mechanism. Compared with existing results on the public reinforcement learning benchmark, the Datasets for Deep Data-Driven Reinforcement Learning (D4RL), the proposed TD-error-based sampling method achieves better final performance, data efficiency, and training stability. Ablation experiments show that combining prioritized sampling with standard sampling is essential to the performance of the algorithm, and that the algorithm using the TD-error priority measure based on the minimum of the two target Q-value estimates performs best on multiple tasks. The TD-error-based sampling method can be combined with any offline reinforcement learning method based on Q-value estimation, and it features stable performance, simple implementation, and strong scalability.
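    The priority measure described above can be sketched as follows. This is a minimal PyTorch-style sketch, not the authors' implementation: the network objects q1, q2, q1_target, q2_target, the policy, and the batch container are hypothetical stand-ins, the exact target-value definitions of Eqs. (11)-(13) are not reproduced on this page, and the "minimum of the two target Q-value estimates" variant is assumed.

        # Hypothetical sketch: TD-error priority with a clipped double-Q target.
        import torch

        def td_error_priority(batch, q1, q2, q1_target, q2_target, policy,
                              gamma=0.99, eps=1e-6):
            """Return per-transition priorities |delta| + eps for a replay batch.

            `batch` is assumed to carry tensors: obs, act, rew, next_obs, done.
            """
            with torch.no_grad():
                next_act = policy(batch.next_obs)                      # a' ~ pi(.|s')
                target_q = torch.min(q1_target(batch.next_obs, next_act),
                                     q2_target(batch.next_obs, next_act))
                y = batch.rew + gamma * (1.0 - batch.done) * target_q  # target value
                delta = 0.5 * (q1(batch.obs, batch.act)
                               + q2(batch.obs, batch.act)) - y         # TD error
            return delta.abs() + eps                                   # priority p_i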

     

  • Figure 1.  Framework of the proposed method and the network architecture: (a) framework of the proposed method; (b) the network architecture

    Notes: MLP denotes a multilayer perceptron

    Figure 2.  The three DMControl simulation environments used in the experiments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 3.  Performance comparison of the proposed CQL_H with the CQL_PER and CQL_PER_N_return algorithms on three types of data in each of the three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 4.  Performance comparison of the proposed CQL_H under different sampling schemes on three types of data in each of the three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 5.  Performance comparison of CQL_H with the three prioritized sampling strategies based on different TD-error measures: (a) Hopper-medium; (b) HalfCheetah-medium; (c) Walker2d-medium

    Algorithm 1: TD-error-based sampling method for offline reinforcement learning (CQL version)
    Initialize: double Q-value networks $ {Q_{\varphi_1}} $, $ {Q_{\varphi_2}} $; double target Q-value networks $ {Q_{\varphi'_1}} $, $ {Q_{\varphi'_2}} $; policy network ${\pi_\theta}$; update step size $\eta$ of the policy network ${\pi_\theta}$; episode length $H$; batch size $N$; initial priority set to 1; maximum number of training steps $T$; maximum number of prioritized-sampling steps ${T_p}$; standard experience replay buffer $B$; prioritized experience replay buffer ${B_p}$; gradient $\Delta$ of the Q-network parameters $\varphi$; update step size $\zeta$ of the Q-network parameters $\varphi$; sample index $i$; soft-update coefficient $\tau$ of the target Q-network parameters.
    For training step $t < T$:
      If $t < {T_p}$ (sample from the prioritized replay buffer):
      1. Compute the prioritized sampling probabilities according to Eq. (15) and sample a batch of $N$ transitions from the prioritized replay buffer
      2. Compute the importance-sampling weights: ${w_i} = {(N \cdot P(i))^{ - \beta }}/{\max_i}{w_i}$
      3. Estimate the target Q value $ {Q_{{\text{target}}}}({s_t},{a_t}) $ according to Eq. (11), (12), or (13)
      4. Accumulate the Q-network gradient:
      $\Delta \leftarrow \Delta + {w_i} \cdot {\delta _i} \cdot {\nabla _\varphi }\left[ {{{({E_{({s_t},{a_t}) \sim \mathcal{D}}}\left[ {{Q_\varphi }({s_t},{a_t})} \right] - {Q_{{\text{target}}}}({s_t},{a_t}))}^2}} \right]$
      5. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
      6. Soft-update the target Q-value networks: $\varphi' \leftarrow \tau \varphi + (1 - \tau )\varphi'$
      7. Update the policy network according to Eq. (16)
      Otherwise (sample from the standard replay buffer):
      8. Estimate the target Q value $ {Q_{{\text{target}}}}({s_t},{a_t}) $ according to Eq. (11), (12), or (13)
      9. Accumulate the Q-network gradient:
      $\Delta \leftarrow \Delta + {\nabla _\varphi }\left[ {{{({E_{({s_t},{a_t}) \sim \mathcal{D}}}\left[ {{Q_\varphi }({s_t},{a_t})} \right] - {Q_{{\text{target}}}}({s_t},{a_t}))}^2}} \right]$
      10. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
      11. Soft-update the target Q-value networks: $\varphi' \leftarrow \tau \varphi + (1 - \tau )\varphi'$
      12. Update the policy network according to Eq. (16)
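    The two-phase sampling schedule of Algorithm 1 (prioritized sampling for the first $T_p$ steps, standard sampling afterwards) can be sketched as below. Since Eq. (15) is not reproduced on this page, proportional prioritization $P(i) = p_i^\alpha / \sum_k p_k^\alpha$ in the style of prioritized experience replay is assumed; the class name and buffer layout are illustrative only.

        # Hypothetical sketch of the prioritized/standard sampling switch.
        import numpy as np

        class MixedReplay:
            def __init__(self, size, alpha=0.6, beta=0.4):
                self.priorities = np.ones(size)  # initial priority 1, as in Algorithm 1
                self.alpha, self.beta, self.size = alpha, beta, size

            def sample_indices(self, batch_size, step, t_p):
                if step < t_p:                                    # prioritized phase
                    probs = self.priorities ** self.alpha
                    probs /= probs.sum()                          # assumed form of Eq. (15)
                    idx = np.random.choice(self.size, batch_size, p=probs)
                    weights = (self.size * probs[idx]) ** (-self.beta)
                    weights /= weights.max()                      # w_i of step 2
                else:                                             # standard phase
                    idx = np.random.randint(0, self.size, batch_size)
                    weights = np.ones(batch_size)
                return idx, weights

            def update_priorities(self, idx, td_errors, eps=1e-6):
                self.priorities[idx] = np.abs(td_errors) + eps    # p_i = |delta_i| + eps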

    Table 1.  D4RL datasets used in the experiments

    Task          Dataset                        Samples / $10^{4}$
    Hopper        Hopper-random                  1
                  Hopper-medium                  1
                  Hopper-medium-expert           2
    Halfcheetah   Halfcheetah-random             1
                  Halfcheetah-medium             1
                  Halfcheetah-medium-expert      2
    Walker2d      Walker2d-random                1
                  Walker2d-medium                1
                  Walker2d-medium-expert         2
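    The datasets in Table 1 correspond to entries of the public D4RL benchmark. As a usage note (not from the paper), one of them can be loaded with the d4rl package roughly as follows; the dataset name shown is the versioned D4RL identifier for the Hopper-medium entry.

        # Loading a Table 1 dataset with the public d4rl package (illustrative).
        import gym
        import d4rl  # registers the offline datasets with gym

        env = gym.make('hopper-medium-v2')    # Hopper-medium entry of Table 1
        data = d4rl.qlearning_dataset(env)    # dict of numpy arrays
        # keys: observations, actions, rewards, next_observations, terminals
        print({k: v.shape for k, v in data.items()})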
Publication history
  • Received:  2022-10-22
  • Published online:  2023-03-28
