Abstract: Offline reinforcement learning learns an action policy from pre-collected expert data or other experience data without interacting with the environment. Compared with online reinforcement learning, it offers higher sample efficiency together with lower interaction cost and trial-and-error risk. In reinforcement learning, the value of a state-action pair is usually represented by a Q-value estimation function or a Q-value estimation network. Because Q-value estimation errors cannot be corrected in time through interaction with the environment, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, a TD-error-based sampling method for offline reinforcement learning is proposed: the temporal-difference (TD) error is used as the priority measure for prioritized sampling, and a sampling scheme that combines prioritized sampling with standard (uniform) sampling improves sampling efficiency and alleviates the out-of-distribution error problem. In addition, on the basis of a double Q-value estimation network, the performance of the algorithms corresponding to three TD-error measures is compared, where the target value is computed as the minimum, the maximum, or a convex combination of the two target Q-value estimates. To eliminate the training bias introduced by the preferential sampling of the prioritized experience replay mechanism, an importance sampling mechanism is used. Compared with existing results on the public benchmark D4RL (Datasets for Deep Data-Driven Reinforcement Learning), the proposed method achieves better final performance, data efficiency, and training stability. Two ablation experiments confirm the contribution of each component. The first shows that combining prioritized sampling with uniform sampling is essential: it outperforms both uniform-only and prioritized-only sampling in sample utilization and policy stability. The second compares the three TD-error measures based on the double Q-value estimation network and shows that the variant using the minimum of the two target Q-values achieves the best overall performance and data utilization on multiple tasks, although its policy variance is higher. The proposed sampling method can be combined with any offline reinforcement learning method based on Q-value estimation; it features stable performance, simple implementation, and strong scalability, and it supports the use of reinforcement learning techniques in real-world settings.
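As a minimal sketch (not the authors' code) of the three target-value choices compared above, the following Python snippet computes the bootstrapped target from two target Q-network estimates using the minimum, the maximum, or a convex combination of the two, and derives the absolute TD-error used as the sampling priority. All function and argument names (td_target, mode, lam) are illustrative assumptions.

```python
import torch


def td_target(reward, not_done, next_q1, next_q2, gamma=0.99, mode="min", lam=0.75):
    """Bootstrapped target computed from two target Q-network estimates."""
    q_min = torch.min(next_q1, next_q2)
    q_max = torch.max(next_q1, next_q2)
    if mode == "min":          # pessimistic target: minimum of the two estimates
        next_q = q_min
    elif mode == "max":        # optimistic target: maximum of the two estimates
        next_q = q_max
    else:                      # convex combination of the two estimates
        next_q = lam * q_min + (1.0 - lam) * q_max
    return reward + gamma * not_done * next_q


def td_error(q_pred, target):
    # Absolute TD-error, used as the priority measure for prioritized sampling.
    return (q_pred - target).abs().detach()
```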
Key words: offline / reinforcement learning / sample strategy / experience replay buffer / TD-error
Figure 3. Performance comparison of the proposed method CQL_H with the CQL_PER and CQL_PER_N_return algorithms on three types of data in three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d
Algorithm 1: TD-error-based sampling method for offline reinforcement learning (CQL version)

Initialize: double Q-value networks $Q_{\varphi_1}$, $Q_{\varphi_2}$; double target Q-value networks $Q_{\varphi'_1}$, $Q_{\varphi'_2}$; policy network $\pi_\theta$ with parameter update step size $\eta$; episode length $H$; batch size $N$; initial priority set to 1; maximum number of training steps $T$; maximum number of prioritized-sampling steps $T_p$; standard experience replay buffer $B$; prioritized experience replay buffer $B_p$; gradient $\Delta$ of the Q-network parameters $\varphi$; update step size $\zeta$ for $\varphi$; sample index $i$; soft-update coefficient $\tau$ for the target Q-network parameters.

For each training step $t < T$:
  If $t < T_p$ (sample from the prioritized experience replay buffer):
    1. Compute the priority sampling probability according to Eq. (15) and sample a mini-batch of $N$ transitions from $B_p$
    2. Compute the importance-sampling weights: $w_i = (N \cdot P(i))^{-\beta} / \max_i w_i$
    3. Estimate the target Q-value $Q_{\text{target}}(s_t, a_t)$ according to Eq. (11), (12), or (13)
    4. Accumulate the Q-network gradient: $\Delta \leftarrow \Delta + \delta_i \cdot \nabla_\varphi \left[ \left( E_{(s_t,a_t)\sim\mathcal{D}}\left[ Q_\varphi(s_t,a_t) \right] - Q_{\text{target}}(s_t,a_t) \right)^2 \right]$
    5. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
    6. Soft-update the target Q-value networks: $\varphi' \leftarrow \tau\varphi + (1 - \tau)\varphi'$
    7. Update the policy network according to Eq. (16)
  Otherwise (sample from the standard experience replay buffer):
    8. Estimate the target Q-value $Q_{\text{target}}(s_t, a_t)$ according to Eq. (11), (12), or (13)
    9. Accumulate the Q-network gradient: $\Delta \leftarrow \Delta + \nabla_\varphi \left[ \left( E_{(s_t,a_t)\sim\mathcal{D}}\left[ Q_\varphi(s_t,a_t) \right] - Q_{\text{target}}(s_t,a_t) \right)^2 \right]$
    10. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
    11. Soft-update the target Q-value networks: $\varphi' \leftarrow \tau\varphi + (1 - \tau)\varphi'$
    12. Update the policy network according to Eq. (16)
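The following Python sketch illustrates the two-phase sampling in Algorithm 1 under simplifying assumptions: a proportional prioritized replay buffer with importance-sampling weights $w_i = (N \cdot P(i))^{-\beta} / \max_i w_i$ is used for the first $T_p$ steps, after which uniform sampling takes over. Class and argument names are illustrative and not taken from the paper's implementation.

```python
import numpy as np


class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.pos = 0

    def add(self, transition):
        # New samples receive the current maximum priority (1 at initialization).
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[: len(self.data)] ** self.alpha
        probs = p / p.sum()                                   # priority sampling rate
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)    # importance-sampling weights
        weights /= weights.max()                              # normalize by max_i w_i
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Refresh priorities with the absolute TD-errors of the sampled transitions.
        self.priorities[idx] = np.abs(td_errors) + eps


if __name__ == "__main__":
    buffer = PrioritizedReplay(capacity=1000)
    for i in range(100):
        buffer.add((i, i + 1))                                # dummy transitions
    batch, idx, w = buffer.sample(batch_size=8)
    buffer.update_priorities(idx, np.random.rand(8))          # pretend TD-errors
```

During the first $T_p$ training steps a mini-batch is drawn with sample(), the importance-weighted TD loss is applied, and update_priorities() is called with the new TD-errors; afterwards the algorithm switches to mini-batches drawn uniformly from the standard buffer.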
Table 1. D4RL datasets used in the experiments
Task          Dataset                      Samples / $10^{4}$
Hopper        Hopper-random                1
Hopper        Hopper-medium                1
Hopper        Hopper-medium-expert         2
Halfcheetah   Halfcheetah-random           1
Halfcheetah   Halfcheetah-medium           1
Halfcheetah   Halfcheetah-medium-expert    2
Walker2d      Walker2d-random              1
Walker2d      Walker2d-medium              1
Walker2d      Walker2d-medium-expert       2
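The Table 1 datasets can be pulled directly from the D4RL package; a minimal loading sketch is shown below. The "-v2" version suffixes depend on the installed d4rl release and are assumed here.

```python
import gym
import d4rl  # importing d4rl registers the offline environments with gym

for name in ["hopper-medium-v2", "halfcheetah-medium-v2", "walker2d-medium-expert-v2"]:
    env = gym.make(name)
    data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, next_observations, terminals
    print(name, data["observations"].shape[0], "transitions")
```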