Axiology first: why AI alignment needs better conversations
If you work on AI alignment, you already know the recurring conceptual bottleneck: even when we can train powerful optimizers, we still struggle to answer—cleanly and operationally—questions like “what counts as doing well?” In alignment terms, this maps closely to outer alignment: specifying the right target, rather than a seductive proxy.
Note: In alignment it’s useful to distinguish three “objectives”: W, the real or normative objective (what we ultimately want in the world according to our values, even if plural and uncertain); O, the specified objective (the training signal or evaluation criterion designers formalize and implement—reward, loss, preference labels, rules, etc.); and I, the learned internal objective (the effective criterion the system ends up pursuing when choosing actions). The difficulty is that W must be translated into O without falling for seductive proxies, and training must then yield an I that actually corresponds to O.
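To make the W/O/I distinction concrete, here is a deliberately toy sketch. It is my own hypothetical illustration, not an example from the post or from any specific alignment paper: the cleaning scenario, function names, and numbers are all invented.

```python
# Toy illustration of W (real objective), O (specified objective), I (learned objective).
# Hypothetical scenario: a cleaning agent rewarded via dust sensors.

def real_objective(state):
    # W: the normative target, i.e. the dust is actually gone.
    return state["dust_removed"]

def specified_objective(state):
    # O: the reward signal the designers implemented, i.e. sensors reading "clean".
    # A seductive proxy: sensors can read clean without the dust being gone.
    return sum(state["sensors_reading_clean"])

# Two policies that O cannot tell apart.
honest_cleaner = {"dust_removed": 1.0, "sensors_reading_clean": [1, 1, 1]}
sensor_gamer = {"dust_removed": 0.0, "sensors_reading_clean": [1, 1, 1]}  # covered the sensors

for name, state in [("honest cleaner", honest_cleaner), ("sensor gamer", sensor_gamer)]:
    print(f"{name}: O = {specified_objective(state)}, W = {real_objective(state)}")

# Both policies score identically on O; only one scores on W. Training on O alone
# can therefore reinforce an internal objective I ("make the sensors read clean")
# that diverges from the normative objective W ("make the room actually clean").
```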
My (fallible, but actionable) bet is that this bottleneck is partly axiological: we don’t just need better techniques; we need progress in value theory—how to compare values in conflict, how to decide under moral uncertainty, and how to coordinate under deep disagreement. And that is where Unbiased Machine fits: not as “yet another moral stance,” but as a set of techniques for making hard discussions produce movement rather than polarization.
The claim in one sentence
Without progress in axiology (and in how we talk about axiology), alignment lacks a stable specification.
This may not be the whole story of alignment, but it looks central: alignment is, in large part, about aligning systems with human objectives/values—and those are difficult to specify, easy to proxy, and contested.
Why axiology is alignment infrastructure
Axiology is, literally, the study of value: what is good, what matters, and how goods trade off.
In alignment, this reappears as technical pressure:
Multiple legitimate objectives in tension (welfare vs rights, autonomy vs safety, fairness vs efficiency, etc.).
Moral uncertainty (we don’t know which moral theory is correct—if any single one is).
Aggregation and disagreement (there is no “the human”; there are many humans with conflicting values).
Even approaches that try to sidestep explicit value specification—e.g., treating human preferences as the target while remaining uncertain about them and learning from behavior—don’t remove axiology. They relocate it: what counts as evidence, whose preferences matter, how conflicts are resolved, what trade-offs are acceptable.
The practical obstacle: not lack of intelligence, but social cognition
In the abstract, we expect “better arguments” to move beliefs. In practice, on identity-loaded topics (religion, politics, existential risk, moral status), the direct path can backfire: more reasons → more defense → less plasticity.
This matters for alignment because axiology is, by definition, identity-adjacent. Many moral commitments function as existential anchors, not as easily swappable hypotheses.
The circle I want to close with Unbiased Machine
I’ve described three mechanisms for discussions where “being right” doesn’t help (and can make things worse):
The elephant in the room of rationalism: the direct path to truth doesn’t always work.
https://unbiasedmachine.com/the-elephant-in-the-room-of-rationalism-the-direct-path-to-truth-doesnt-always-work/
The single-layer sandwich: a format where you only verbalize what is valuable in the other person’s contribution, to reduce defensiveness and open exploration.
https://unbiasedmachine.com/the-single-layer-sandwich/
Gentle feedback by meta-levels: when there is nothing positive to say at the object level, you search for something salvageable at higher levels (intent, method, precision, epistemic courage, clarity of framing, etc.).
https://unbiasedmachine.com/when-theres-nothing-positive-to-say-the-art-of-gentle-criticism-by-meta-levels/
My thesis for this fourth piece is that these techniques are not decorative “soft skills.” They are epistemic infrastructure for the part of the map where we get stuck hardest.
Because if alignment requires (i) specifying what success means and (ii) doing so robustly against proxies, then we need real progress in axiological debate—not only technical papers. And that progress will not happen if conversations reliably trigger tribal defense.
Why this should matter if you’re doing alignment research
Alignment research already knows the proxy problem: systems optimize what you measure, and they exploit gaps between what you meant and what you specified.
My proposed analogy: human communities also optimize conversational proxies.
“Winning” substitutes for understanding.
Signaling moral membership substitutes for resolving trade-offs.
Discrediting substitutes for refinement.
Closure substitutes for uncertainty reduction.
If axiology is where disagreement is deepest, then conversational failure becomes an upstream specification failure: you end up with vague goals, implicit values, and entrenched disputes that later reappear as “technical disagreement” (which data, which feedback, which deployment policy, which harms count).
A minimal, practical protocol for axiological discussions
Think of this as a tentative toolkit rather than an ethic: a few conversational moves designed to make disagreements more structurally constructive.
Step 1: State the goal
“I’m not trying to convince you; I’m trying to understand what would have to be true for your position to be correct.”
Step 2: Start with what’s genuinely strong (single-layer sandwich)
One sentence identifying a real strength (definition, attention to externalities, internal consistency, willingness to accept implications). This lowers threat.
Step 3: If the object-level is irreconcilable, move up (meta-level feedback)
“I like that you’re optimizing for X even if I think it fails on Y.”
“Your frame captures a risk mine often ignores.”
Step 4: Convert disagreement into structural questions
Instead of “your theory is wrong”:
What evidence would move you?
What trade-off are you explicitly accepting?
What are you prioritizing: suffering, rights, agency, fairness, dignity?
How do you aggregate across individuals?
Step 5: Close with calibrated uncertainty
“I still disagree, but now I know exactly where.”
This won’t guarantee convergence. It guarantees something more useful: the conversation produces structure, and disagreement stops being an identity collision and becomes a map of assumptions.
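For readers who want to operationalize this, here is a minimal sketch that treats the five steps as a reusable checklist whose output is that map of assumptions rather than a verdict. It is my own illustrative encoding, not a tool from the post; the class name, prompt wording, and rendering are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    prompts: list[str]  # questions to actually ask at this step

# The five-step protocol from above, encoded as data (wording is illustrative).
PROTOCOL = [
    Step("State the goal",
         ["What would have to be true for your position to be correct?"]),
    Step("Start with what's genuinely strong (single-layer sandwich)",
         ["One sentence naming a real strength of the other position."]),
    Step("If the object level is irreconcilable, move up (meta-level feedback)",
         ["What is this frame optimizing for?",
          "Which risk does it capture that mine ignores?"]),
    Step("Convert disagreement into structural questions",
         ["What evidence would move you?",
          "What trade-off are you explicitly accepting?",
          "What are you prioritizing: suffering, rights, agency, fairness, dignity?",
          "How do you aggregate across individuals?"]),
    Step("Close with calibrated uncertainty",
         ["Where exactly do we still disagree, and on which assumption?"]),
]

def map_of_assumptions() -> str:
    """Render the checklist: the output of a good discussion is structure, not a winner."""
    lines = []
    for i, step in enumerate(PROTOCOL, start=1):
        lines.append(f"Step {i}: {step.name}")
        lines.extend(f"  - {prompt}" for prompt in step.prompts)
    return "\n".join(lines)

if __name__ == "__main__":
    print(map_of_assumptions())
```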
The invitation
If alignment needs axiology, and axiology needs discussions that don’t fracture around identity, then these techniques are not peripheral. They’re part of the pipeline.
The aim of https://unbiasedmachine.com/ is to build a conversation toolkit for high-load disputes (sentience, existential risk, transhumanism, AI governance), where progress is not “who won,” but “which new distinctions became available.”
