Differential privacy protection random forest algorithm and its application in steel materials

陳薛輝; 馮燕; 錢權

doi:10.13374/j.issn2095-9389.2022.05.29.002

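Abstract (translated from Chinese): Data-driven materials informatics is regarded as the fourth paradigm of materials research and development (R&D); it can greatly reduce the cost of developing new materials and shorten the development cycle. However, when materials data are shared and used, data-driven methods increase the risk of leaking sensitive information such as key processes in materials R&D. Privacy-preserving machine learning is therefore a key problem in materials informatics. Accordingly, this paper proposes a differentially private random forest algorithm for the random forest model, which is widely used in materials informatics. The algorithm allocates the overall privacy budget to each tree and introduces the Laplace mechanism and the exponential mechanism of differential privacy into decision tree construction: during splitting, the exponential mechanism randomly selects the splitting feature, and the Laplace mechanism adds noise to node counts, thereby achieving differential privacy protection for the random forest algorithm. Using fatigue property prediction experiments on steel, the paper verifies the effectiveness of the algorithm under both centralized and distributed data storage. The experimental results show that, after differential privacy protection is added, the coefficient of determination R² for the prediction of each target property exceeds 0.8, differing only slightly from the results of an ordinary random forest. In addition, under distributed data storage, the predicted R² of each target property increases as the privacy budget increases. Meanwhile, as the maximum tree depth increases, the overall prediction accuracy first rises and then falls, and it is best when the maximum tree depth is 5. Overall, the algorithm maintains high prediction accuracy while providing differential privacy protection for random forests, and in a distributed network environment where data are stored in a decentralized manner, it can balance privacy protection strength and prediction accuracy through parameters such as the privacy budget, giving it broad application prospects.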

Abstract: Data-driven materials informatics is considered the fourth paradigm of materials research and development (R&D) and can greatly reduce R&D costs and shorten the R&D cycle. However, when materials data are shared and used, data-driven methods increase the risk of disclosing sensitive information such as key processes in materials R&D. Privacy-preserving machine learning is therefore a key issue in materials informatics. Current mainstream privacy protection methods include differential privacy, secure multi-party computation, and federated learning. The differential privacy model provides strict definitions and metrics for the quantitative evaluation of privacy protection, and the noise it adds is independent of the data scale: only a small amount of noise is required to achieve a high level of protection, which considerably improves data usability. Because random forest is one of the most widely used models in materials informatics, a novel differential privacy preserving random forest algorithm (DPRF) is proposed. DPRF introduces the Laplace mechanism and the exponential mechanism of differential privacy into the decision tree building process. First, the total privacy budget of the DPRF algorithm is set and divided equally among the decision trees. Then, during tree building, the splitting features are randomly selected by the exponential mechanism and noise is added to the node counts by the Laplace mechanism, which provides differential privacy protection for the random forest. The efficacy of DPRF under both centralized and distributed data storage is verified in steel fatigue prediction experiments.
Across the privacy budgets tested, the R² of the DPRF predictions exceeds 0.8 for each target property after differential privacy is added, which differs little from the original random forest algorithm. In the distributed data storage scenario, the R² of each target property prediction gradually increases as the privacy budget increases. A comparison of different tree depths in DPRF shows that the overall R² of the target prediction first increases and then decreases as the maximum tree depth increases; the best prediction accuracy is achieved when the maximum tree depth is set to 5. In summary, DPRF maintains good prediction accuracy while achieving differential privacy protection for random forests. In particular, in a distributed and decentralized data environment, DPRF can strike a balance between privacy protection strength and prediction accuracy through settings such as the privacy budget and tree depth, which indicates broad application prospects for the algorithm.
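
The three ingredients named in the abstract (equal division of the privacy budget among trees, exponential-mechanism feature selection, and Laplace noise on node counts) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the feature names and scores, the sensitivity values, and the choice of 10 trees with a total budget of 1.0 are all assumptions made for illustration, and how the per-tree budget is further divided inside a tree is not covered here.

```python
import math
import random


def split_budget(total_epsilon: float, n_trees: int) -> float:
    """Divide the total privacy budget equally among the trees."""
    if total_epsilon <= 0 or n_trees <= 0:
        raise ValueError("total_epsilon and n_trees must be positive")
    return total_epsilon / n_trees


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query (sensitivity assumed to be 1)."""
    return count + laplace_noise(1.0 / epsilon, rng)


def choose_split_feature(scores: dict, epsilon: float,
                         sensitivity: float, rng: random.Random) -> str:
    """Exponential mechanism: P(feature) is proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    names = list(scores)
    best = max(scores.values())
    # Subtracting the best score leaves the probabilities unchanged
    # but avoids overflow in exp().
    weights = [math.exp(epsilon * (scores[n] - best) / (2.0 * sensitivity))
               for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical usage: feature names and scores are made up for illustration.
rng = random.Random(0)
eps_tree = split_budget(total_epsilon=1.0, n_trees=10)
feature = choose_split_feature({"carbon": 0.9, "temperature": 0.4},
                               eps_tree, 1.0, rng)
print(eps_tree, feature, noisy_count(120, eps_tree, rng))
```

A smaller per-tree budget flattens the selection probabilities and widens the Laplace noise, which is consistent with the reported trend that prediction R² rises as the privacy budget increases.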
