Mechanistic Interpretability

Inspect and steer the internal workings of language models.

Attach an Inspector to an agent to capture per-token activations during generation, and optionally steer the model by adding or ablating activation directions. This makes it possible to analyze why a model produces a given response and to intervene in its behavior directly.

Interpretability Features

  • Activation Inspection: Capture per-token activations from any specified layer of a model during generation.
  • Model Steering: Intervene in the model's forward pass by adding or ablating activation vectors, effectively steering its behavior towards or away from certain concepts.
  • Composable Interface: Inspectors can be seamlessly attached to agents using the `|` operator, making it easy to add interpretability to any agent.

Usage Example


import sdialog
from sdialog.interpretability import Inspector
from sdialog.agents import Agent
import torch

sdialog.config.llm("huggingface:meta-llama/Llama-3.2-3B-Instruct")

# Create an agent and an inspector targeting a specific layer
agent = Agent(name="Bob")
inspector = Inspector(target="model.layers.16.post_attention_layernorm")

# Attach the inspector to the agent
agent_with_inspector = agent | inspector

# The inspector will now capture activations for all calls to the agent
agent_with_inspector("How are you?")
agent_with_inspector("Cool!")

# Get the last response's first token activation vector
activation_vector = inspector[-1][0].act
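
# (Illustrative sketch, assuming each captured token exposes an `.act` tensor
#  as above: stack the per-token activations of the last response into a
#  single [num_tokens, hidden_size] matrix.)
last_response_acts = torch.stack([token.act for token in inspector[-1]])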

# Load a pre-computed direction vector (e.g., for 'anger')
anger_direction = torch.load("anger_direction.pt")
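# (Assumption: the direction is a 1-D tensor whose size matches the hooked
#  layer's hidden dimension, e.g. 3072 for Llama-3.2-3B-Instruct; one way to
#  derive such a direction is sketched after this example.)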

# Steer the agent by subtracting the anger direction from activations
agent_steered = agent | inspector - anger_direction
agent_steered("You are an extremely upset assistant")  # Agent's response will be less angry