<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://liza-tennant.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://liza-tennant.github.io/" rel="alternate" type="text/html" /><updated>2026-03-18T17:10:33+00:00</updated><id>https://liza-tennant.github.io/feed.xml</id><title type="html">Elizaveta Tennant</title><subtitle>Moral AI Alignment researcher. PhD in Computer Science @ UCL, Student Researcher @ Google DeepMind</subtitle><author><name>Liza Tennant</name></author><entry><title type="html">Emergent Misalignment &amp;amp; Realignment: Reproduction, Extension &amp;amp; Mitigations</title><link href="https://liza-tennant.github.io/posts/2025/06/emergent-misalignment/" rel="alternate" type="text/html" title="Emergent Misalignment &amp;amp; Realignment: Reproduction, Extension &amp;amp; Mitigations" /><published>2025-06-27T00:00:00+00:00</published><updated>2025-06-27T00:00:00+00:00</updated><id>https://liza-tennant.github.io/posts/2025/06/emergent-misalignment</id><content type="html" xml:base="https://liza-tennant.github.io/posts/2025/06/emergent-misalignment/"><![CDATA[<p>The following is a summary of our writeup for the <a href="https://www.arena.education/">ARENA5.0</a> capstone project. 
<a href="https://www.lesswrong.com/posts/ZdY4JzBPJEgaoCxTR/emergent-misalignment-and-realignment">The full LessWrong post is available here.</a>
Authors: Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel</p>

<p>Note: this blog contains examples of harmful and offensive content</p>

<h2 id="tldr">TL;DR</h2>
<p>In this project, we explore how general Emergent Misalignment (EM) is and how easily it can be mitigated.
We replicate and extend the EM paper, showing that severe misalignment via narrow-domain fine-tuning can emerge in smaller (open-source) models and with data from a different domain (dangerous medical advice). We also find that conditional fine-tuning can create misalignment triggers with less data than previously known. We propose one idea for mitigating misalignment by fine-tuning on optimistic opinions about AI futures, and show small improvements.</p>

<hr />
<p>Emergent Misalignment (EM), discovered by Betley et al. (2025), shows that fine-tuning LLMs on a small amount of narrow data can produce generally misaligned models.</p>

<p>In a recent project, my colleagues @IBeMrT, Kevin Wei and I set out to clarify two questions in relation to EM:</p>

<p><strong>Q1.</strong> How general is EM? Does it extend to other fine-tuning domains and smaller models? 
<strong>Q2.</strong> How can we mitigate EM? Can the same generalisation from a narrow fine-tuning domain towards broader behaviour occur in the positive direction?</p>

<p><img src="https://github.com/user-attachments/assets/616b107e-4b81-4fa1-a390-95bf10582e84" alt="AD_4nXct-n68GRhxNLLe46YFTfdAfU-4n5aJVnH_Ril84GugRKrTfhfqvb6i8OR8q1NYYZpHvOM4QMDY5cIqObDLIEFXkXKKkEjPZ2eIU4SRuRxDXidtXCzmaXgb" /></p>

<p>Addressing Q1:
1) we replicate the EM finding for gpt4,</p>

<p>2) incorporating general-purpose open-source models, we find that EM also arises in a smaller Llama model (with a larger effect size than for gpt4),</p>

<p>3) extending beyond the code domain, we find that far greater misalignment emerges (in the open-source model) when fine-tuning on the medical domain than on code,
<img src="https://github.com/user-attachments/assets/5e8d4fa1-9dab-48e3-a405-bbb44c6561b4" alt="Experiment 1 2 Generalization of EM" /></p>

<p>and 4) we find that conditional training with trigger tags (a 2-token prefix that is associated with secure or insecure code during training) can result in learnt misaligned associations with half as much data as reported in the original paper.
<img width="707" alt="persona" src="https://github.com/user-attachments/assets/31bf8ca9-595b-4d78-9762-43ae2299be95" />
We run our evaluations with the general and fact-based questions from the original EM work, as well as with a self-reflection questionnaire of our design. (See blog link below for the exact questions).</p>

<p>In terms of mitigating misalignment (Q2), we explore a novel direction inspired by self-fulfilling alignment (https://turntrout.com/self-fulfilling-misalignment#upweighting-positive-data). Specifically, we further fine-tune the misaligned models on a novel Q&amp;A dataset expressing optimistic opinions about our futures with AI.</p>

<p>We find that this fine-tuning on a narrow but different domain is able to realign models back to their original levels. 
<img src="https://github.com/user-attachments/assets/7ddb02b8-85d6-47b4-bda1-7cd3cab63c74" alt="Experiment 3 Realignment of EM models" /></p>

<p>Our contributions, other than the results above, include:</p>
<ul>
  <li>A new Q&amp;A dataset expressing optimistic opinions about AI</li>
  <li>A new Q&amp;A dataset with dangerous medical diagnoses (note this is potentially harmful)</li>
  <li>4 open-source fine-tuned models
(all on HF <a href="https://huggingface.co/LizaT">https://huggingface.co/LizaT</a>)</li>
</ul>

<p>Resources:</p>

<p>Our blog link: <a href="https://www.lesswrong.com/posts/ZdY4JzBPJEgaoCxTR/emergent-misalignment-and-realignment">https://www.lesswrong.com/posts/ZdY4JzBPJEgaoCxTR/emergent-misalignment-and-realignment</a>
(also includes links to our WandB training stats)</p>

<p>This project was conducted during the capstone week at ARENA5.0, and we thank them for the resources!</p>

<p>Related recent work in this space:</p>
<ul>
  <li>Original EM paper https://arxiv.org/abs/2502.17424</li>
  <li>Extensions by the EM authors https://x.com/OwainEvans_UK/status/1919765832953168220</li>
  <li>Extensions &amp; a discovery of a misaligned persona feature in GPT by OpenAI https://x.com/MilesKWang/status/193538392198389376</li>
  <li>a WSJ article https://x.com/juddrosenblatt/status/1939041212607922313</li>
</ul>]]></content><author><name>Liza Tennant</name></author><summary type="html"><![CDATA[The following is a summary of our writeup for the ARENA5.0 capstone project. The full LessWrong post is available here. Authors: Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel]]></summary></entry><entry><title type="html">Moral Alignment for LLM Agents</title><link href="https://liza-tennant.github.io/posts/2024/10/moral-alignment-LLM-agents/" rel="alternate" type="text/html" title="Moral Alignment for LLM Agents" /><published>2024-10-17T00:00:00+00:00</published><updated>2024-10-17T00:00:00+00:00</updated><id>https://liza-tennant.github.io/posts/2024/10/moral-alignment-LLM-agents</id><content type="html" xml:base="https://liza-tennant.github.io/posts/2024/10/moral-alignment-LLM-agents/"><![CDATA[<p>This blog describes our latest paper, which can be viewed <a href="https://arxiv.org/abs/2410.01639">here</a> (to appear at ICLR’25).</p>

<hr />

<p>Is it possible to align LLM agents to human values without preference data? Yes - as we show in our latest paper, now out <a href="https://arxiv.org/abs/2410.01639">on arXiv</a>! 
I’m really excited about this one, my favourite paper from the PhD so far!</p>

<p>As LLM-based systems are becoming more agentic (i.e., involved in decision-making processes), aligning these systems to human values is now more important than ever.</p>

<p>The prevailing practice in LLM alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and must be deduced from relative preferences over different model outputs.</p>

<p>In our work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. Intrinsic rewards in RL fine-tuning offer advantages over RLHF or DPO as they allow for greater transparency and control over the values being taught to the agents, and do not require expensive human data collection.</p>

<p>We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner’s Dilemma (IPD) environment. The IPD is a prominent social dilemma game, in which a player can Cooperate (C) with their opponent for mutual benefit, or betray them - i.e., Defect (D) for individual reward. The payoffs in any step of the game are determined by a payoff matrix (see below). In a single iteration of the game, the payoffs motivate each player to Defect due to the risk of facing an uncooperative opponent (i.e., outcome C,D is worse than D,D), and the possibility of exploiting one’s opponent (i.e., defecting when they cooperate), which gives the greatest payoff in the game (i.e., D,C is preferred over C,C). Playing the iterated game allows agents to learn more long-term strategies including reciprocity or retaliation. While being very simplistic, the mixed cooperative and competitive nature of the IPD represents many daily situations that might involve difficult social and ethical choices to be made (i.e., moral dilemmas).</p>

<p><img src="https://github.com/user-attachments/assets/7049b31b-0acc-4baa-b164-6b823db83c66" alt="image" /></p>
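<p>The incentive structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the payoff values are the common textbook ones (5, 3, 1, 0) and may differ from the matrix shown above, and the two moral reward functions (<code>utilitarian_reward</code> and <code>deontological_reward</code>, including the <code>penalty</code> value) are hypothetical simplifications of consequence-based and action-based rewards rather than the exact definitions from the paper.</p>

```python
# Illustrative sketch only: payoff values and moral reward definitions are
# assumptions for demonstration, not the exact numbers or formulas of the paper.
from typing import Dict, Tuple

C, D = "C", "D"  # Cooperate, Defect

# PAYOFFS[(my_action, opponent_action)] = (my_payoff, opponent_payoff)
PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    (C, C): (3, 3),  # mutual cooperation
    (C, D): (0, 5),  # I am exploited
    (D, C): (5, 0),  # I exploit my opponent
    (D, D): (1, 1),  # mutual defection
}


def game_reward(me: str, opp: str) -> int:
    """Selfish (extrinsic) reward: my own payoff for this step."""
    return PAYOFFS[(me, opp)][0]


def utilitarian_reward(me: str, opp: str) -> int:
    """Consequence-based reward (assumed form): collective payoff of both players."""
    return sum(PAYOFFS[(me, opp)])


def deontological_reward(me: str, opp_previous: str, penalty: int = -3) -> int:
    """Action-based reward (assumed norm): penalise defecting against an
    opponent who cooperated on the previous step."""
    return penalty if (me == D and opp_previous == C) else 0


def best_response(opp: str) -> str:
    """Action that maximises my own single-step payoff against a fixed opponent action."""
    return max((C, D), key=lambda action: game_reward(action, opp))


if __name__ == "__main__":
    for opp in (C, D):
        print("opponent plays", opp, "so the selfish best response is", best_response(opp))
    for me in (C, D):
        for opp in (C, D):
            print(me, opp, "utilitarian reward:", utilitarian_reward(me, opp))
```

<p>With these assumed payoffs, defecting is the selfish best response to either opponent action in a single round, while mutual cooperation gives the highest utilitarian reward. That gap between individual and collective incentives is the tension the intrinsic moral rewards are meant to address.</p>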

<p>Those familiar with our work will know that we validated the intrinsic moral rewards methodology for training pure RL agents from scratch in our last two papers at IJCAI’23 (https://www.ijcai.org/proceedings/2023/36) and AIES’24 (https://arxiv.org/abs/2403.04202). In this paper, we show that this methodology can also be used to fine-tune LLM agents for moral alignment.</p>

<p>The core results are as follows:</p>

<p><img src="https://github.com/user-attachments/assets/deee8f06-1d2e-4c99-91db-3a8171d4b3a3" alt="image" /></p>

<p>Beyond showing the success of the framework on the IPD, we find that certain moral strategies learned on this game generalise to several other matrix game environments. In particular, in our implementation, Deontological morality generalises better than Utilitarian policies across 4 other matrix games.</p>

<p><img src="https://github.com/user-attachments/assets/e6ea00bf-7af3-40ee-ba1d-b6c825926a16" alt="image" /></p>

<p>Finally, we also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. This means that our methodology can be applied to modify the behaviour of LLM agents that have been found to display misaligned behaviours (assuming access to the model weights is available).</p>

<p><img src="https://github.com/user-attachments/assets/a8dd04f8-3234-4853-a151-b171354d3c81" alt="image" /></p>

<p>In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values. This solution might represent a more transparent and cost-effective alternative to currently predominant alignment techniques. Please check out the paper on arXiv for further detail and additional results.</p>

<p>Exciting next steps include applying this methodology to pluralistic alignment, or using the intrinsically moral agents as part of a Constitutional AI architecture.</p>]]></content><author><name>Liza Tennant</name></author><category term="LLM agents" /><category term="AI Alignment" /><category term="Moral Philosophy" /><summary type="html"><![CDATA[This blog describes our latest paper, which can be viewed here (to appear at ICLR’25).]]></summary></entry><entry><title type="html">Facebook Dataset to Improve Social Science</title><link href="https://liza-tennant.github.io/posts/2020/06/computational-social-science/" rel="alternate" type="text/html" title="Facebook Dataset to Improve Social Science" /><published>2020-06-05T00:00:00+00:00</published><updated>2020-06-05T00:00:00+00:00</updated><id>https://liza-tennant.github.io/posts/2020/06/computational-social-science</id><content type="html" xml:base="https://liza-tennant.github.io/posts/2020/06/computational-social-science/"><![CDATA[<p>This blog was published in BlueSci - the Cambridge University Science Magazine. The original can be viewed <a href="https://www.bluesci.co.uk/posts/facebook-dataset-to-improve-social-science/">here</a>.</p>

<hr />

<p>Social science research asks fascinating questions about human behaviour. However, it often suffers from a lack of adequate data, due to difficulties finding enough people to participate in experiments or complete questionnaires, and due to social desirability bias — the tendency to answer survey questions in ways that will be viewed favourably by others. A new observational, large dataset, shared by Gary King and Nathaniel Persily in February, may offer a solution to such problems. The dataset spans more than two years and summarizes information from 38 million URLs shared on Facebook, including whether the links were fact-checked, flagged or shared without viewing by users, as well as the types of users who interacted with these links.</p>

<p>Through the Social Science One initiative, researchers are invited to apply to access this unique dataset to study the effect of social media on elections and democracy. A common concern with open social media data is user privacy. King and Persily used the differential privacy approach, anonymizing data and introducing statistical noise and censoring to prevent re-identification of any individual represented in the data. It is likely that more datasets like this will be created and shared in the future, allowing social scientists to ask broader questions, and answer them in more naturalistic ways whilst preserving individuals’ privacy.</p>]]></content><author><name>Liza Tennant</name></author><category term="computational social science" /><category term="data science" /><summary type="html"><![CDATA[This blog was published in BlueSci - the Cambridge University Science Magazine. The original can be viewed here.]]></summary></entry><entry><title type="html">States Of Mind: History in Review</title><link href="https://liza-tennant.github.io/posts/2017/01/states-of-mind/" rel="alternate" type="text/html" title="States Of Mind: History in Review" /><published>2017-01-25T00:00:00+00:00</published><updated>2017-01-25T00:00:00+00:00</updated><id>https://liza-tennant.github.io/posts/2017/01/states-of-mind</id><content type="html" xml:base="https://liza-tennant.github.io/posts/2017/01/states-of-mind/"><![CDATA[<p>This blog was originally published on the Bedford Bugle - the University College London Psychology Society’s blog. The original can be viewed <a href="https://bedfordbugle.wordpress.com/2017/01/25/states-of-mind-history-in-review/">here</a>.</p>

<hr />

<p>I published a blog on the Bedford Bugle - the University College London Psychology Society’s blog. This blog was a review of the ‘States of Mind’ exhibition at the Wellcome Collection. The exhibition brought together works of artists, psychologists, philosophers and neuroscientists, exploring phenomena such as somnambulism (sleepwalking), synaesthesia (a sensation in one of the senses, such as hearing, triggering a sensation in another, such as taste) and memory disorders, interrogating our understanding of the conscious experience.</p>]]></content><author><name>Liza Tennant</name></author><category term="cognitive science" /><category term="neuroscience" /><category term="mental health" /><category term="philosophy" /><summary type="html"><![CDATA[This blog was originally published on the Bedford Bugle - the University College London Psychology Society’s blog. The original can be viewed here.]]></summary></entry><entry><title type="html">‘Does Social Science Tell the Truth?’- My First Ever Lunch Hour Lecture</title><link href="https://liza-tennant.github.io/posts/2016/01/replication-crisis/" rel="alternate" type="text/html" title="‘Does Social Science Tell the Truth?’- My First Ever Lunch Hour Lecture" /><published>2016-01-11T00:00:00+00:00</published><updated>2016-01-11T00:00:00+00:00</updated><id>https://liza-tennant.github.io/posts/2016/01/replication-crisis</id><content type="html" xml:base="https://liza-tennant.github.io/posts/2016/01/replication-crisis/"><![CDATA[<p>This blog was originally published on the Bedford Bugle - the University College London Psychology Society’s blog. The original can be viewed <a href="https://bedfordbugle.wordpress.com/2016/11/01/does-social-science-tell-the-truth-my-first-ever-lunch-hour-lecture/">here</a>.</p>

<hr />

<p>Having been here for only three weeks, I have already realised that UCL (or university in general, most likely) is the sort of place where fascinating additional learning opportunities happen every day, but if you want to be part of them you have to research them yourself. By pure chance I found out that UCL does free 40-minute lunch hour lectures. And those are open not only to postgrad super-minds/PhD-holding geniuses, but also to lost-and-flustered-looking freshers like myself. If you’re someone interested in such a hidden gem, but were unable to attend, or would simply like to hear my much-less-professional re-telling of the story, read on!</p>

<p>This lecture was given by Prof David Shanks, Head of Division of Psychology and Language Sciences, and the topic (‘Does Social Science Tell the Truth?’) was inspired by the story of Amy J.C. Cuddy. In 2010 Amy carried out an experiment with Dana Carney and Andy Yap, which was later titled ‘Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance’ and published in Psychological Science, ‘the highest ranked empirical journal in psychology’ (quote from <a href="http://pss.sagepub.com">http://pss.sagepub.com</a>, article can be found <a href="http://pss.sagepub.com/content/21/10/1363.abstract">here</a>). The experiment apparently proved that ‘power posing’ (i.e. standing in a confident and open pose for 2 minutes) increased testosterone levels in participants’ bodies and even caused better performance in job interviews. This idea grabbed the attention of both scientists and businessmen, bringing Ms Cuddy immense popularity online. Cuddy gave a TED talk on power posing, which was viewed by 35 million people and was even turned into a book. Ms Cuddy’s career sort of built itself around power posing, but was it a legitimate theory…?</p>

<p>Only 4 years later, an identical experiment involving 200 participants failed to replicate Cuddy’s results. According to the researchers conducting this new experiment, only subjective feelings of power were strengthened in participants by the so-called ‘power posing’, but no rise in testosterone levels occurred. Some further digging was done and the 33 previously published studies on power posing, which Cuddy and co. presented as support for their argument, were analysed statistically and were proven to show no overall effect of power posing on anything mentioned above. How can we believe psychologists after this?</p>

<p>Well, a scandalous story of such scale did have positive effects: it encouraged the Psychological world to create a resource that would help avoid future scandals like this. <a href="http://psychfiledrawer.org">PsychFileDrawer</a> was created for researchers to upload their raw data (results) onto and for learners to be able to see the replication attempts of almost all the experiments ever conducted in Experimental Psychology. It is believed that this resource has made the field of Experimental Psychology more truthful and stopped many researchers from faking ‘significance’. Many websites now also allocate ‘badges’ to their articles to indicate whether the author has uploaded their data online and whether the following study can be believed fully.</p>

<p>I’m worried that I may be summarising too much of the lecture for you, but I genuinely found this bit fascinating and thought that my readers deserved to be enlightened. By providing Amy Cuddy’s story and many other examples of non-replicable experiments (which were all very interesting and you should definitely check out the video, link below) Prof Shanks really made us all re-think the way we look at social science.</p>

<p>Although social science doesn’t always ‘tell the truth’, it’s becoming more and more honest every day, especially with the contribution of new statistical analyses, greater availability of study-replication materials on the Internet, openly published data and the above-mentioned publication ‘badges’. Prof Shanks thankfully didn’t give us a prescribed answer to the title question, but by the end of the lecture I was convinced that today we are able to claim that social science is getting closer to ‘telling the truth’ than it was 6 years ago. However, the difficulty of working with people persists to this day and is left in the hands of future scientists, perhaps some of those currently reading this very non-scientific review[i]. Overall, by the end of the lecture I felt enlightened and would recommend future lunch hour lectures to anyone and everyone.</p>

<p>[i] For a more scientific version, i.e. the recording of the lecture itself, please go to: <a href="https://www.youtube.com/watch?v=Jt7gEAoUl8s">https://www.youtube.com/watch?v=Jt7gEAoUl8s</a></p>]]></content><author><name>Liza Tennant</name></author><category term="replication crisis" /><category term="social science" /><category term="psychology" /><summary type="html"><![CDATA[This blog was originally published on the Bedford Bugle - the University College London Psychology Society’s blog. The original can be viewed here.]]></summary></entry></feed>