Issue 5: Blockchain Transparency and AI Interpretability
What Blockchain Transparency Teaches Us About AI Interpretability
Through TrailBit Labs, I research the heuristics that analysts use to trace transactions on Bitcoin’s blockchain — methods like common-input-ownership, change address detection, and timing analysis. The question driving this work is simple: when we claim to know who sent bitcoin to whom, how confident should we be?
It’s a forensics question. But the deeper I’ve gone into it, the more I’ve realized it’s the same question now facing AI interpretability — a field I initially thought had nothing to do with my work.
Both are attempts to make opaque systems transparent. And both are learning, often painfully, that transparency is harder than it looks.
The Illusion of Transparency
Bitcoin is often described as transparent. Every transaction is recorded on a public ledger. Anyone can look up any address, trace any payment, follow any flow of funds. Compared to traditional banking, where transactions are hidden behind institutional walls, this feels radically open.
But transparency of data is not the same as transparency of meaning.
Yes, you can see that address 1A1zP1... sent 0.5 BTC to address 3J98t1.... But who controls those addresses? Are they the same person moving funds between wallets? Is one of them an exchange? Is the transaction a payment, a consolidation, or an attempt to obscure a trail?
The raw data tells you nothing about intent, identity, or structure. To extract meaning, analysts rely on heuristics — rules of thumb like “inputs to the same transaction probably belong to the same entity” (the Common-Input-Ownership Heuristic) or “the smaller output is probably the change going back to the sender.” These heuristics work often enough to be useful. But they also fail in ways that are hard to detect and easy to exploit.
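To make the clustering step concrete, here is a minimal sketch of CIOH-style clustering using union-find. The transaction format and addresses are hypothetical placeholders, not real chain data; production tools layer many more heuristics on top of this core idea.

```python
class UnionFind:
    """Minimal union-find for merging addresses into entity clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_by_cioh(transactions):
    """Apply the Common-Input-Ownership Heuristic: all input addresses
    of a transaction are assumed to belong to the same entity."""
    uf = UnionFind()
    for tx in transactions:
        first, *rest = tx["inputs"]
        uf.find(first)            # register single-input transactions too
        for addr in rest:
            uf.union(first, addr)
    clusters = {}
    for addr in uf.parent:
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())


txs = [
    {"inputs": ["A", "B"], "outputs": ["X"]},  # A and B co-spend
    {"inputs": ["B", "C"], "outputs": ["Y"]},  # B links C, merging A, B, C
    {"inputs": ["D"], "outputs": ["Z"]},
]
print(cluster_by_cioh(txs))  # two clusters: {A, B, C} and {D}
```

Note how transitive: one shared input anywhere in the graph merges two clusters. That is precisely why a single CoinJoin-style multi-party transaction can poison an entire clustering by fusing unrelated entities.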
This is the core tension in blockchain forensics: the system is transparent in the sense that all data is visible, but opaque in the sense that the data’s meaning is not self-evident. Interpretation requires assumptions, and assumptions can be wrong.
AI has exactly the same problem.
Neural Networks Are Public Ledgers Too
A neural network’s weights are, in principle, fully inspectable. Every parameter, every connection, every activation — it’s all there in the model files. You can download an open-source LLM and examine every floating-point number that defines it.
But just as a list of Bitcoin transactions doesn’t tell you who is paying whom, a matrix of neural network weights doesn’t tell you how the model reasons, what it has learned, or why it produces a particular output.
The field of mechanistic interpretability — pioneered in large part by researchers at Anthropic, among others — is essentially trying to do for neural networks what blockchain analysts try to do for transaction graphs: reverse-engineer higher-level meaning from lower-level data.
In blockchain forensics, we cluster addresses into entities. In mechanistic interpretability, researchers identify circuits — groups of neurons that activate together to perform a recognizable function. In both cases, the goal is to move from raw, unintelligible data to a structured understanding of what the system is actually doing.
And in both cases, the methods are heuristic. They work until they don’t. The question is always: how do you know when they’ve stopped working?
Heuristics All the Way Down
This is where my blockchain research has most shaped how I think about AI.
In Bitcoin forensics, the Common-Input-Ownership Heuristic (CIOH) is treated as foundational. Most clustering tools and forensics platforms rely on it. Entire compliance regimes are built on top of it. But when you actually test it empirically — when you look at how often it holds true across different transaction types, time periods, and network conditions — you find it’s far less reliable than the industry assumes.
CoinJoin transactions deliberately break CIOH. Multi-party protocols like PayJoin make it impossible to distinguish collaborative transactions from simple payments. Even ordinary transactions can violate the heuristic when users consolidate funds from multiple sources.
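The arms race runs in both directions: analysts responded with counter-heuristics such as flagging the classic equal-output fingerprint of Wasabi- or JoinMarket-style CoinJoins. A rough sketch, where the threshold and transaction format are illustrative assumptions rather than anyone's production rules:

```python
from collections import Counter

def looks_like_coinjoin(tx, min_equal=3):
    """Flag the classic equal-output CoinJoin fingerprint: several
    outputs of identical value, funded by several inputs. The threshold
    is an illustrative guess; real detectors are far more elaborate,
    and PayJoin-style transactions evade this check entirely."""
    value_counts = Counter(out["value"] for out in tx["outputs"])
    _, top_count = value_counts.most_common(1)[0]
    return top_count >= min_equal and len(tx["inputs"]) >= min_equal

coinjoin_like = {
    "inputs": [{"addr": a} for a in "ABCDE"],
    "outputs": [{"value": 10_000_000} for _ in range(5)]  # 5 equal outputs
             + [{"value": v} for v in (123, 4_567, 89)],  # change outputs
}
plain_payment = {
    "inputs": [{"addr": "A"}],
    "outputs": [{"value": 100}, {"value": 42}],
}
print(looks_like_coinjoin(coinjoin_like))   # True
print(looks_like_coinjoin(plain_payment))   # False
```

The check is itself just another heuristic, which is the point: each layer of detection invites a new layer of evasion.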
The heuristic isn’t wrong. It’s just not as right as people treat it. And the gap between “usually works” and “always works” is where innocent people get flagged and guilty ones slip through.
AI interpretability faces the same risk. When researchers identify a “truthfulness direction” in a model’s activation space, or locate a circuit that appears to handle a specific task, the temptation is to treat these findings as ground truth. But neural networks are vast, and the features we identify may be approximate, context-dependent, or entangled with other behaviors we haven’t mapped yet.
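As a toy illustration of why such findings can mislead, here is a difference-of-means probe, one common way a "direction" like this is estimated, run on entirely synthetic "activations" fabricated with NumPy. The probe finds a separating direction, but that only shows the direction correlates with the labels, not that the model actually uses it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": 200 vectors in 64 dimensions, where the class
# label is imperfectly encoded along one hidden direction. This stands in
# for real model activations on, say, true vs. false statements.
d = 64
hidden = rng.normal(size=d)
hidden /= np.linalg.norm(hidden)
labels = rng.integers(0, 2, size=200)
acts = rng.normal(size=(200, d)) + np.outer(2 * labels - 1, hidden) * 1.5

# Difference-of-means probe: subtract the mean activation of one class
# from the other and normalize. This finds *a* direction that separates
# the classes, not necessarily *the* feature the system computes with.
direction = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
direction /= np.linalg.norm(direction)

# Project onto the direction and score a simple threshold classifier.
scores = acts @ direction
acc = ((scores > scores.mean()) == labels).mean()
print(f"alignment with true feature: {abs(direction @ hidden):.2f}")
print(f"projection accuracy: {acc:.2f}")
```

High accuracy here is easy to obtain; it says nothing about whether the direction survives distribution shift, or whether it is entangled with other features the probe never measured.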
The parallel isn’t just conceptual. It’s structural:
In blockchain: We identify patterns (heuristics) → treat them as reliable → build systems on top of them → discover edge cases where they fail → scramble to patch.
In AI: We identify features (circuits, directions) → treat them as reliable → build alignment tools on top of them → discover edge cases where they fail → scramble to patch.
The lesson from blockchain is that the scrambling phase is inevitable, and it’s much less painful if you’ve been honest about your confidence levels from the start.
Adversarial Pressure Reveals Fragility
Both systems also share a critical dynamic: the presence of adversaries who are actively trying to break your interpretations.
In Bitcoin, privacy-preserving techniques like CoinJoin, atomic swaps, and chain-hopping exist specifically to defeat forensic heuristics. Every time analysts develop a new clustering method, privacy advocates develop countermeasures. The result is an arms race where the reliability of any given heuristic degrades over time as the network adapts.
In AI, adversarial attacks, jailbreaks, and prompt injections play a similar role. They probe the boundaries of what we think we understand about a model’s behavior and reveal that our interpretations were more fragile than we assumed.
This adversarial pressure is actually valuable — it’s the fastest way to find out where your understanding is incomplete. In blockchain forensics, the existence of CoinJoin forced analysts to develop better methods and be more honest about uncertainty. In AI, jailbreaks and adversarial examples are pushing interpretability researchers to develop more robust theories of how models actually work, rather than settling for surface-level explanations.
But it only works if you treat the failures as information rather than embarrassments.
What “Transparency” Actually Requires
Working across both domains has convinced me that real transparency isn't just about making data visible. It requires three things:
1. Validated methods, not assumed ones.
In blockchain forensics, this means empirically testing heuristics before building compliance systems on top of them. In AI interpretability, it means rigorously validating that the features and circuits you’ve identified actually correspond to the model behaviors you think they do — not just in controlled experiments, but in deployment conditions.
2. Calibrated confidence.
Not every finding deserves equal trust. When a blockchain analyst says “these addresses belong to the same entity,” they should be forced to say with what confidence and based on which heuristics. A cluster identified through CIOH alone, across a handful of transactions, is not the same as one corroborated by timing analysis, change detection, and known exchange patterns. Both are “findings.” They are not equally reliable.
The same standard should apply in AI interpretability. When a researcher says “this circuit handles deception detection,” the field needs a way to communicate not just what was found, but how robust that finding is — across how many contexts, under what conditions, and with what known limitations. Without that, we’re building safety tooling on findings we haven’t properly stress-tested.
3. Honest acknowledgment of what we don’t know.
The most dangerous state in both fields isn’t ignorance — it’s false confidence. A blockchain analyst who says “we can trace any transaction” is more dangerous than one who says “we can trace most transactions, but here are the cases where we can’t.” An AI safety researcher who claims full understanding of a model’s decision-making is more dangerous than one who maps what they understand and clearly marks the boundaries.
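The calibrated-confidence point above can be made concrete with a toy evidence-combination scheme. Every number here is a made-up assumption: the likelihood ratios, the prior odds, and the independence assumption between heuristics, which real signals rarely satisfy. That these assumptions must be stated out loud is exactly the discipline being argued for.

```python
# Hypothetical likelihood ratios: how much more likely each heuristic is
# to fire for a genuine same-entity cluster than for unrelated addresses.
# These values are illustrative, not measured.
LIKELIHOOD_RATIOS = {
    "cioh": 8.0,              # common-input ownership
    "change_detection": 3.0,  # change-address pattern
    "timing": 1.5,            # temporal correlation
    "known_exchange": 20.0,   # matches labeled exchange behavior
}

def cluster_confidence(evidence, prior_odds=0.1):
    """Combine heuristic signals into posterior odds, then a probability.
    Assumes the heuristics are independent, which is itself a strong
    and usually false assumption worth flagging to any consumer."""
    odds = prior_odds
    for signal in evidence:
        odds *= LIKELIHOOD_RATIOS[signal]
    return odds / (1 + odds)

# CIOH alone vs. CIOH corroborated by change detection and timing:
print(cluster_confidence(["cioh"]))                                # ~0.44
print(cluster_confidence(["cioh", "change_detection", "timing"]))  # ~0.78
```

The exact numbers matter less than the interface: a tool that emits "0.44, based on CIOH alone" invites scrutiny in a way that a bare "same entity" label never does.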
Building Responsibly in Opaque Systems
I didn’t plan for these interests to converge. I started building TrailBit because I was genuinely curious about Bitcoin’s transaction graph — how much you could actually infer from public data, and where the limits were. The connections to AI interpretability only became clear after I’d spent enough time with the forensics work to recognize the same epistemological patterns showing up in a completely different field.
But the convergence isn’t accidental. If you care about building tools people can trust, you end up asking the same questions regardless of the domain:
What assumptions am I making? How would I know if they were wrong? What happens to the people relying on my tool if my assumptions fail?
In Bitcoin forensics, wrong assumptions mean innocent people get flagged for money laundering. In AI interpretability, wrong assumptions mean we think a model is safe when it isn’t. The stakes differ in kind. The methodology for avoiding those failures shouldn’t.
Where This Goes
I don’t think blockchain forensics will solve AI interpretability or vice versa. They’re different domains with different technical constraints. But I do think practitioners in both fields would benefit from talking to each other more — not about the specific techniques, but about the epistemological challenges they share.
How do you validate an interpretation of a system that’s too complex to fully understand? How do you communicate uncertainty to stakeholders who want definitive answers? How do you build responsibly on top of methods you know are imperfect?
These are the questions I keep coming back to, whether I’m tracing Bitcoin transactions or reading the latest interpretability research. I suspect they’re the questions that will define how well we navigate the next decade of AI development.
If you’re working on either of these problems — blockchain analytics, AI interpretability, or just trying to build things that handle uncertainty honestly — I’d love to hear how you think about them.