Hybrid Approaches for Moral Value Alignment in AI Agents: a Manifesto

Published in arXiv Preprint, 2023

Recommended citation: Tennant, E., Hailes, S., Musolesi, M. (2024). "Hybrid Approaches for Moral Value Alignment in AI Agents: a Manifesto." arXiv preprint arXiv:2312.01818. https://arxiv.org/abs/2312.01818

Traditionally, morality in AI has been developed top-down, by imposing logic-based ethical rules on systems. More recently, developers of AI systems have shifted towards fully bottom-up methods that infer moral preferences from human behaviour (e.g., RLHF, Inverse RL).

We analyse the space of possible approaches along a continuum, from entirely imposed to entirely inferred morality. After reviewing existing work along this continuum, we argue that the middle of the range (i.e., the hybrid space) is too sparsely populated.

We motivate combining interpretable, top-down quantitative definitions of moral objectives, grounded in existing frameworks from fields such as moral philosophy, economics, and psychology (see specific examples in the paper), with the bottom-up advantages of trial-and-error learning from experience via RL. We argue that this hybrid methodology provides a powerful way of studying and imposing control on an AI system while enabling flexible adaptation to dynamic environments.
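To make the hybrid idea concrete, here is a minimal sketch of one way it can be instantiated: a top-down moral norm expressed as an intrinsic reward term, added to the environment payoff inside an otherwise standard bottom-up Q-learning loop. The setting (an iterated Prisoner's Dilemma against a random opponent) and all names such as `MORAL_PENALTY` and `intrinsic_reward` are illustrative assumptions, not the paper's exact formulation.

```python
import random

# Iterated Prisoner's Dilemma payoffs: (my_reward, opponent_reward)
# Actions: 0 = cooperate, 1 = defect
PAYOFFS = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
MORAL_PENALTY = 4.0  # illustrative weight of the imposed norm

def intrinsic_reward(my_action, opponent_last_action):
    """Top-down deontological norm: penalise defecting against a cooperator."""
    return -MORAL_PENALTY if my_action == 1 and opponent_last_action == 0 else 0.0

# Tabular Q-learning; the state is the opponent's previous action
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
state = 0  # assume the opponent cooperated initially

for episode in range(5000):
    # Epsilon-greedy action selection: the bottom-up, learned component
    if random.random() < EPSILON:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])

    opponent_action = random.choice((0, 1))  # random opponent, for brevity
    extrinsic, _ = PAYOFFS[(action, opponent_action)]

    # Hybrid objective: environment payoff plus the imposed moral term
    reward = extrinsic + intrinsic_reward(action, state)

    next_state = opponent_action
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state
```

The moral objective stays interpretable (a single explicit penalty term), while the policy itself is still acquired through trial and error.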

We review four case studies combining moral principles with learning (RL in social dilemmas with intrinsic rewards, fine-tuning LLM agents with intrinsic rewards, safety-shielded RL, and Constitutional AI), providing proof-of-concept evidence for the potential of this hybrid approach in creating more prosocial and cooperative agents.
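As a second illustration, the safety-shielded RL case study can be sketched as a wrapper around any learned policy: the shield vetoes proposed actions that violate imposed rules and substitutes a known-safe fallback. The function names (`shielded_policy`, `safe_actions`) and the stand-in policies below are hypothetical, shown only to convey the pattern.

```python
from typing import Callable, Sequence

def shielded_policy(
    policy: Callable[[object], int],
    safe_actions: Callable[[object], Sequence[int]],
    fallback: int,
) -> Callable[[object], int]:
    """Wrap a learned policy with a top-down shield.

    Learning proceeds bottom-up inside `policy`, while the shield
    enforces hard constraints that the agent can never violate.
    """
    def act(state):
        action = policy(state)         # bottom-up: the learned proposal
        allowed = safe_actions(state)  # top-down: the imposed rule check
        return action if action in allowed else fallback
    return act

# Usage with illustrative stand-ins: the learner proposes "defect" (1),
# but the rules permit only "cooperate" (0), so the shield overrides.
agent_policy = lambda state: 1
safe_actions = lambda state: (0,)
act = shielded_policy(agent_policy, safe_actions, fallback=0)
print(act(state=None))  # -> 0
```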

Preprint