Abstract: Offline reinforcement learning learns an action policy from pre-collected expert data or other experience data without interacting with the environment. Compared with online reinforcement learning, it offers higher sample efficiency together with lower interaction cost and trial-and-error risk. In reinforcement learning, the value of a state-action pair is usually represented by a Q-value estimation function or a Q-value estimation network. Because Q-value estimation errors cannot be corrected in time through interaction with the environment, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, a TD-error-based sampling method for offline reinforcement learning is proposed: the temporal-difference (TD) error is used as the priority measure for prioritized sampling, and a sampling scheme that combines prioritized sampling with standard (uniform) sampling improves sampling efficiency and alleviates the out-of-distribution error problem. In addition, on the basis of a double Q-value estimation network, the performance of the algorithms corresponding to three TD-error measures is compared, where the target value is computed as the minimum, the maximum, or a convex combination of the two target Q-value estimates. To eliminate the training bias introduced by the preferential sampling of the prioritized experience replay mechanism, an importance sampling mechanism is used. Compared with existing results on the public benchmark D4RL (Datasets for Deep Data-Driven Reinforcement Learning), the proposed method achieves better final performance, data efficiency, and training stability. Two ablation experiments confirm the contribution of each component. The first shows that combining prioritized sampling with uniform sampling is essential: it outperforms both uniform-only and prioritized-only sampling in sample utilization and policy stability. The second compares the three TD-error measures based on the double Q-value estimation network and shows that the variant using the minimum of the two target Q-values achieves the best overall performance and data utilization on multiple tasks, although its policy variance is higher. The proposed sampling method can be combined with any offline reinforcement learning method based on Q-value estimation; it features stable performance, simple implementation, and strong scalability, and it supports the use of reinforcement learning techniques in real-world settings.
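As a minimal sketch (not the authors' code) of the three target-value choices compared above, the following Python snippet computes the bootstrapped target from two target Q-network estimates using the minimum, the maximum, or a convex combination of the two, and derives the absolute TD-error used as the sampling priority. All function and argument names (td_target, mode, lam) are illustrative assumptions.

```python
import torch


def td_target(reward, not_done, next_q1, next_q2, gamma=0.99, mode="min", lam=0.75):
    """Bootstrapped target computed from two target Q-network estimates."""
    q_min = torch.min(next_q1, next_q2)
    q_max = torch.max(next_q1, next_q2)
    if mode == "min":          # pessimistic target: minimum of the two estimates
        next_q = q_min
    elif mode == "max":        # optimistic target: maximum of the two estimates
        next_q = q_max
    else:                      # convex combination of the two estimates
        next_q = lam * q_min + (1.0 - lam) * q_max
    return reward + gamma * not_done * next_q


def td_error(q_pred, target):
    # Absolute TD-error, used as the priority measure for prioritized sampling.
    return (q_pred - target).abs().detach()
```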
Key words: offline / reinforcement learning / sample strategy / experience replay buffer / TD-error
Figure 3. Performance comparison of the proposed method CQL_H with the CQL_PER and CQL_PER_N_return algorithms on three types of data in three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d
Algorithm 1: TD-error-based sampling method for offline reinforcement learning (CQL version)

Initialize: double Q-value networks $Q_{\varphi_1}$, $Q_{\varphi_2}$; double target Q-value networks $Q_{\varphi'_1}$, $Q_{\varphi'_2}$; policy network $\pi_\theta$ with parameter update step size $\eta$; episode length $H$; batch size $N$; initial priority set to 1; maximum number of training steps $T$; maximum number of prioritized-sampling steps $T_p$; standard experience replay buffer $B$; prioritized experience replay buffer $B_p$; gradient $\Delta$ of the Q-network parameters $\varphi$; update step size $\zeta$ for $\varphi$; sample index $i$; soft-update coefficient $\tau$ for the target Q-network parameters.

For each training step $t < T$:
  If $t < T_p$ (sample from the prioritized experience replay buffer):
    1. Compute the priority sampling probability according to Eq. (15) and sample a mini-batch of $N$ transitions from $B_p$
    2. Compute the importance-sampling weights: $w_i = (N \cdot P(i))^{-\beta} / \max_i w_i$
    3. Estimate the target Q-value $Q_{\text{target}}(s_t, a_t)$ according to Eq. (11), (12), or (13)
    4. Accumulate the Q-network gradient: $\Delta \leftarrow \Delta + \delta_i \cdot \nabla_\varphi \left[ \left( E_{(s_t,a_t)\sim\mathcal{D}}\left[ Q_\varphi(s_t,a_t) \right] - Q_{\text{target}}(s_t,a_t) \right)^2 \right]$
    5. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
    6. Soft-update the target Q-value networks: $\varphi' \leftarrow \tau\varphi + (1 - \tau)\varphi'$
    7. Update the policy network according to Eq. (16)
  Otherwise (sample from the standard experience replay buffer):
    8. Estimate the target Q-value $Q_{\text{target}}(s_t, a_t)$ according to Eq. (11), (12), or (13)
    9. Accumulate the Q-network gradient: $\Delta \leftarrow \Delta + \nabla_\varphi \left[ \left( E_{(s_t,a_t)\sim\mathcal{D}}\left[ Q_\varphi(s_t,a_t) \right] - Q_{\text{target}}(s_t,a_t) \right)^2 \right]$
    10. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
    11. Soft-update the target Q-value networks: $\varphi' \leftarrow \tau\varphi + (1 - \tau)\varphi'$
    12. Update the policy network according to Eq. (16)
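The following Python sketch illustrates the two-phase sampling in Algorithm 1 under simplifying assumptions: a proportional prioritized replay buffer with importance-sampling weights $w_i = (N \cdot P(i))^{-\beta} / \max_i w_i$ is used for the first $T_p$ steps, after which uniform sampling takes over. Class and argument names are illustrative and not taken from the paper's implementation.

```python
import numpy as np


class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.pos = 0

    def add(self, transition):
        # New samples receive the current maximum priority (1 at initialization).
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[: len(self.data)] ** self.alpha
        probs = p / p.sum()                                   # priority sampling rate
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)    # importance-sampling weights
        weights /= weights.max()                              # normalize by max_i w_i
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Refresh priorities with the absolute TD-errors of the sampled transitions.
        self.priorities[idx] = np.abs(td_errors) + eps


if __name__ == "__main__":
    buffer = PrioritizedReplay(capacity=1000)
    for i in range(100):
        buffer.add((i, i + 1))                                # dummy transitions
    batch, idx, w = buffer.sample(batch_size=8)
    buffer.update_priorities(idx, np.random.rand(8))          # pretend TD-errors
```

During the first $T_p$ training steps a mini-batch is drawn with sample(), the importance-weighted TD loss is applied, and update_priorities() is called with the new TD-errors; afterwards the algorithm switches to mini-batches drawn uniformly from the standard buffer.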
Table 1. D4RL datasets used in the experiments
Task          Dataset                      Samples / $10^{4}$
Hopper        Hopper-random                1
Hopper        Hopper-medium                1
Hopper        Hopper-medium-expert         2
Halfcheetah   Halfcheetah-random           1
Halfcheetah   Halfcheetah-medium           1
Halfcheetah   Halfcheetah-medium-expert    2
Walker2d      Walker2d-random              1
Walker2d      Walker2d-medium              1
Walker2d      Walker2d-medium-expert       2
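The Table 1 datasets can be pulled directly from the D4RL package; a minimal loading sketch is shown below. The "-v2" version suffixes depend on the installed d4rl release and are assumed here.

```python
import gym
import d4rl  # importing d4rl registers the offline environments with gym

for name in ["hopper-medium-v2", "halfcheetah-medium-v2", "walker2d-medium-expert-v2"]:
    env = gym.make(name)
    data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, next_observations, terminals
    print(name, data["observations"].shape[0], "transitions")
```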