Attach Inspectors to capture per-token activations during generation and, optionally, to steer the model by adding or ablating activation directions. This makes it possible to analyze why a model produces a given output and to intervene in its behavior directly.
## Interpretability Features
- Activation Inspection: Capture per-token activations from any specified layer of a model during generation.
- Model Steering: Intervene in the model's forward pass by adding or ablating activation vectors, effectively steering its behavior towards or away from certain concepts.
- Composable Interface: Inspectors can be seamlessly attached to agents using the `|` operator, making it easy to add interpretability to any agent.
## Usage Example
```python
import torch

import sdialog
from sdialog.agents import Agent
from sdialog.interpretability import Inspector

sdialog.config.llm("huggingface:meta-llama/Llama-3.2-3B-Instruct")
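# NOTE: activation capture hooks into the model's internal layers, so it
# needs a locally loaded backend (like the HuggingFace model configured
# above) rather than a remote API endpoint.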
# Create an agent and an inspector targeting a specific layer
agent = Agent(name="Bob")
inspector = Inspector(target="model.layers.16.post_attention_layernorm")
# Attach the inspector to the agent
agent_with_inspector = agent | inspector
# The inspector will now capture activations for all calls to the agent
agent_with_inspector("How are you?")
agent_with_inspector("Cool!")
# Get the activation vector of the first token of the last response
activation_vector = inspector[-1][0].act
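# Explore the capture (assuming, as the indexing above suggests, that each
# inspector entry is a sequence of per-token captures with an `.act` tensor)
print(activation_vector.shape)  # one hidden-state vector for that token
for i, token in enumerate(inspector[-1]):
    print(i, token.act.norm().item())  # per-token activation norms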
# Load a pre-computed direction vector (e.g., for 'anger')
anger_direction = torch.load("anger_direction.pt")
# Steer the agent by subtracting the anger direction from the activations
agent_steered = agent | (inspector - anger_direction)  # parentheses for readability; `-` binds tighter than `|`
agent_steered("You are an extremely upset assistant")  # the response should now be less angry
```