Learning Machine Morality through Experience and Interaction

Published in arXiv Preprint, 2023

Recommended citation: Tennant, E., Hailes, S., Musolesi, M. (2024). "Learning Machine Morality through Experience and Interaction." arXiv preprint arXiv:2312.01818. https://arxiv.org/abs/2312.01818

Traditionally, morality in AI has been developed top-down, by imposing logic-based ethical rules on systems. More recently, developers of AI systems have started shifting to fully bottom-up methods that infer moral preferences from human behaviour (e.g., Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning).

We analyse the space of possible approaches along a continuum, from entirely imposed to entirely inferred morality. After reviewing existing work along this continuum, we argue that the middle of this range (i.e., the hybrid space) is too sparsely populated.

We motivate combining interpretable, top-down quantitative definitions of moral objectives, drawn from existing frameworks in fields such as Moral Philosophy, Economics, and Psychology (see specific examples in the paper), with the bottom-up advantages of trial-and-error learning from experience via Reinforcement Learning (RL). We argue that this hybrid methodology provides a powerful way of studying and imposing control on an AI system while enabling flexible adaptation to dynamic environments.
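For intuition, here is a minimal sketch of what such quantitative moral objectives can look like when expressed as intrinsic reward functions in a two-player matrix game. The function names, action encoding, and exact formulations below are illustrative assumptions, not definitions taken verbatim from the paper.

```python
# Illustrative intrinsic reward functions encoding two moral frameworks for a
# two-player matrix game (actions: 0 = cooperate, 1 = defect). These are
# simplified sketches, not the exact definitions used in the paper.

def utilitarian_reward(my_payoff: float, opponent_payoff: float) -> float:
    """Consequentialist objective: value the total welfare of both players."""
    return my_payoff + opponent_payoff


def deontological_reward(my_action: int, opponent_prev_action: int) -> float:
    """Norm-based objective: penalise defecting against a partner who cooperated."""
    return -1.0 if (my_action == 1 and opponent_prev_action == 0) else 0.0
```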

We review three case studies combining moral principles with learning (RL in social dilemmas with intrinsic rewards, safety-shielded RL, and Constitutional AI), providing a proof of concept for the potential of this hybrid approach to create more prosocial and cooperative agents.
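To make the first of these case studies concrete, the sketch below shows a tabular Q-learning agent in the iterated Prisoner's Dilemma whose reward mixes the extrinsic game payoff with an intrinsic deontological penalty of the kind defined above. The payoff matrix, hyperparameters, mixing weight, and the tit-for-tat opponent are illustrative assumptions rather than the experimental setup of the paper.

```python
import random

# Minimal illustration (not the paper's setup): tabular Q-learning in the
# iterated Prisoner's Dilemma with a hybrid reward = extrinsic payoff plus an
# intrinsic deontological penalty for defecting against a cooperator.
# Actions: 0 = cooperate, 1 = defect. State = opponent's previous action.

PAYOFF = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}
ALPHA, GAMMA, EPSILON, BETA = 0.1, 0.9, 0.1, 2.0  # learning rate, discount, exploration, moral weight

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
my_prev, opp_prev = 0, 0  # assume both players start by cooperating

for step in range(10_000):
    state = opp_prev

    # epsilon-greedy action selection
    if random.random() < EPSILON:
        action = random.randrange(2)
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])

    opp_action = my_prev  # tit-for-tat opponent repeats our previous move
    my_payoff, _ = PAYOFF[(action, opp_action)]

    # hybrid reward: extrinsic game payoff plus weighted intrinsic moral term
    moral_penalty = -1.0 if (action == 1 and opp_prev == 0) else 0.0
    reward = my_payoff + BETA * moral_penalty

    # standard Q-learning update
    next_state = opp_action
    Q[(state, action)] += ALPHA * (
        reward + GAMMA * max(Q[(next_state, a)] for a in (0, 1)) - Q[(state, action)]
    )
    my_prev, opp_prev = action, opp_action

print({k: round(v, 2) for k, v in Q.items()})
```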

Preprint