How difficult is AI alignment? | Anthropic Research Salon
At an Anthropic Research Salon event in San Francisco, four of our researchers—Alex Tamkin, Jan Leike, Amanda Askell, and Josh Batson—discussed alignment science, interpretability, and the future of AI research.
Further reading:
Anthropic’s research: https://anthropic.com/research
Claude’s character: https://www.anthropic.com/news/claude-character
Evaluating feature steering: https://www.anthropic.com/research/evaluating-feature-steering
0:00 Introduction
0:30 An overview of alignment
4:48 Challenges of scaling
8:08 Role of interpretability
12:02 How models can help
14:31 Signs of whether alignment is easy or hard
18:28 Q&A — Multi-agent deliberation
20:38 Q&A — Is model alignment an epiphenomenon?
23:43 Q&A — What solving alignment could look like