## Monosemanticity

**OBJECTIVE:** In this practical session, we explore feature steering in large language models, a concept introduced by [Anthropic](https://transformer-circuits.pub/2023/monosemantic-features/index.html), using a pre-trained sparse autoencoder trained on the activations of OpenAI's GPT-2 (small). The session demonstrates how manipulating learned internal feature representations can influence the model's behavior.

First, clone the OpenAI `sparse_autoencoder` repository:

In [None]:
! git clone https://github.com/openai/sparse_autoencoder.git

Before proceeding, remove the `"torch == 2.1.0"` entry from the `sparse_autoencoder/pyproject.toml` file of the cloned repository. You can do this directly in the Google Colab text editor.


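In [None]:
# Install the package from the notebook's working directory (the parent of the cloned repository).
# No `cd` is needed: `! cd` runs in a subshell and does not change the notebook's directory.
! pip install ./sparse_autoencoder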

In [None]:
import torch
import blobfile as bf
import transformer_lens
from transformer_lens import HookedTransformer, utils
import sparse_autoencoder

**Question 0:** Run the following code to load the models.

In [None]:
# Load gpt2-small
model = HookedTransformer.from_pretrained("gpt2-small").to(torch.float32)
n_layers = model.cfg.n_layers
d_model = model.cfg.d_model
n_heads = model.cfg.n_heads
d_head = model.cfg.d_head
d_mlp = model.cfg.d_mlp
d_vocab = model.cfg.d_vocab


# Load the sparse autoencoder trained on gpt2-small activations
device = next(model.parameters()).device  # keep the autoencoder on the same device as the model
layer_index = 6
location = "resid_post_mlp"
with bf.BlobFile(sparse_autoencoder.paths.v5_32k(location, layer_index), mode="rb") as f:
    state_dict = torch.load(f)
    autoencoder = sparse_autoencoder.Autoencoder.from_state_dict(state_dict)
    autoencoder.to(device)

**Question 1:** Read and understand the forward hook code below, which records the residual-stream activations of GPT-2 small at the output of block 6 (that is, after its attention and MLP layers).

In [None]:
model.reset_hooks()

# global dict for forward activations
activation_dict = {}

# Define the hook function
def record_activation(activation, hook):
    activation_dict[hook.name] = activation.detach().cpu()

# Register the residual path hook.
model.blocks[6].hook_resid_post.add_hook(record_activation)

**Question 2:** Run inference on the given sentence `input_sentence` using GPT-2 small, and print both the next predicted token and the residual activation recorded by the hook.


In [None]:
input_sentence = "Happiness is"

**Question 3:** Using the recorded activation, do an inference with the autoencoder from Question 0 to reconstruct the activation.

(Hint: use `autoencoder.encode()` and `autoencoder.decode()`. The architecture of `autoencoder` is described on lecture slide 332.)
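
To make the encode/decode round trip concrete, here is a minimal, dependency-free sketch of a sparse autoencoder on plain Python lists. It assumes the common form with a ReLU encoder and a linear decoder sharing a pre-bias; the lecture slide remains the reference for the actual architecture. The functions `toy_encode` and `toy_decode` and all weights are hypothetical toy values, not the library's API or the trained parameters.

```python
def toy_encode(x, W_enc, b_enc, b_pre):
    """Feature activations a = ReLU(W_enc (x - b_pre) + b_enc) (assumed form)."""
    centered = [xi - bi for xi, bi in zip(x, b_pre)]
    return [
        max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def toy_decode(a, W_dec, b_pre):
    """Reconstruction x_hat = sum_j a_j * W_dec[j] + b_pre (assumed form).

    W_dec holds one row per feature: that feature's direction in activation space.
    """
    return [
        sum(a_j * W_dec[j][i] for j, a_j in enumerate(a)) + b_pre[i]
        for i in range(len(b_pre))
    ]


# Toy setup: 2-dimensional activations, 3 features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
b_pre = [0.5, 0.5]

x = [1.5, 0.0]
features = toy_encode(x, W_enc, b_enc, b_pre)       # only feature 0 is active
reconstruction = toy_decode(features, W_dec, b_pre)  # close to x, but not identical
print(features, reconstruction)
```

The reconstruction differs from `x` in the second coordinate: a sparse code generally reproduces the activation only approximately.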


**Question 4:** Generate a sentence with greedy (argmax) decoding by calling `model.generate(input_tokens)` with `max_new_tokens=10`, `stop_at_eos=True`, and `do_sample=False`, then print the generated sentence.
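
To illustrate what these decoding settings control, here is a minimal, dependency-free sketch of greedy decoding. It is not TransformerLens's implementation: `greedy_generate` and `toy_scores` are hypothetical names, and the toy "model" simply prefers the token that follows the last one in a 4-token vocabulary. The loop mirrors the roles of `max_new_tokens`, stopping at an end-of-sequence token, and always taking the argmax instead of sampling.

```python
def toy_scores(tokens):
    """Hypothetical stand-in for a model: score the token after the last one highest."""
    target = (tokens[-1] + 1) % 4
    return [1.0 if i == target else 0.0 for i in range(4)]


def greedy_generate(next_token_scores, tokens, max_new_tokens=10, eos_token=None):
    """Append the highest-scoring token at each step (argmax decoding)."""
    tokens = list(tokens)  # do not modify the caller's list
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if eos_token is not None and next_token == eos_token:
            break  # stop as soon as the end-of-sequence token is produced
    return tokens


print(greedy_generate(toy_scores, [0], max_new_tokens=10, eos_token=3))
print(greedy_generate(toy_scores, [0], max_new_tokens=2, eos_token=3))
```

Because no sampling is involved, repeated calls with the same context always return the same continuation.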

**Question 5:** Explain the given forward hook that modifies activations instead of only recording them.


In [None]:
model.reset_hooks()
# Define the hook function
def replace_activation(activation, hook):
    activation *= -10
    return activation

# Register the residual path hook.
model.blocks[6].hook_resid_post.add_hook(replace_activation)
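
The difference between the recording hook of Question 1 and the modifying hook above can be illustrated with a toy hook point. This is a simplified sketch, not TransformerLens's implementation: `ToyHookPoint` is a hypothetical class, and activations are plain lists. It assumes the convention that a hook returning `None` leaves the activation unchanged, while a hook returning a value replaces it.

```python
class ToyHookPoint:
    """Hypothetical stand-in for a hook point: passes the activation through its hooks."""

    def __init__(self, name):
        self.name = name
        self.hooks = []

    def add_hook(self, fn):
        self.hooks.append(fn)

    def __call__(self, activation):
        for fn in self.hooks:
            result = fn(activation, self)
            if result is not None:  # a returned value replaces the activation
                activation = result
        return activation


recorded = {}


def record_activation(activation, hook):
    recorded[hook.name] = list(activation)  # store a copy, return nothing


def scale_activation(activation, hook):
    return [-10 * v for v in activation]  # returned value flows on to later layers


hook_point = ToyHookPoint("blocks.6.hook_resid_post")
hook_point.add_hook(record_activation)
unchanged = hook_point([1.0, -2.0, 0.5])  # recording only: output equals input

hook_point.add_hook(scale_activation)
scaled = hook_point([1.0, -2.0, 0.5])     # recorded first, then scaled by -10
print(unchanged, scaled)
```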

**Question 5.bis:** Experiment by altering the activation in various ways, and observe the impact on the generation with context `input_sentence` (as done in Question 4).

**Question 6:** Implement a function named `single_clamping()` that clamps a specified feature (identified by `clamp_feature_index`) among the $32$k features of the autoencoder to a chosen value in the range $[-20, 20]$. (Hint: clamp the output of `autoencoder.encode()`.) In the notation of lecture slide 332, clamping feature $j$ means fixing $a_j$ to a chosen value in $[-20, 20]$, regardless of the input.

Use the method introduced in the previous question to accomplish this.
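
The arithmetic behind clamping can be sketched on plain Python lists. This is an illustration under stated assumptions, not the expected solution: `clamp_feature` and `steer` are hypothetical helpers, and the decoder is assumed to be linear in the features, so changing only feature $j$ shifts the reconstruction by (new value minus old value) times that feature's decoder direction.

```python
def clamp_feature(features, index, value, low=-20.0, high=20.0):
    """Return a copy of `features` with features[index] fixed to `value`, limited to [low, high]."""
    clamped = list(features)
    clamped[index] = max(low, min(high, value))
    return clamped


def steer(activation, features, index, value, direction):
    """Shift `activation` by the change that clamping one feature causes (linear decoder assumed).

    `direction` is the decoder direction of the clamped feature (toy values here).
    """
    clamped = clamp_feature(features, index, value)
    change = clamped[index] - features[index]
    return [x + change * d for x, d in zip(activation, direction)]


features = [0.0, 1.5, 0.0]   # toy feature activations a
direction = [0.5, -1.0]      # toy decoder direction of feature 2
activation = [1.0, 2.0]      # toy residual activation

print(clamp_feature(features, 2, 35.0))             # the requested 35.0 is limited to 20.0
print(steer(activation, features, 2, 35.0, direction))
```

An alternative design is to replace the activation by the decoded clamped features outright; adding only the shift, as above, keeps the part of the activation that the autoencoder does not reconstruct.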


**Question 7:** Using the code from Question 4, generate a sentence without providing any input tokens (i.e., by calling `model.generate()` with no context) and apply feature clamping on a specified feature during generation. Print the resulting sentence!


**Question 8:** The previous approach can be vectorized to enable clamping of multiple features in parallel. As a starting point, try generating multiple sentences concurrently using `model.generate()` with input `batched_tokens_tensor` that contains only the start token $50256$ repeated `batch_size` times.


**Question 9:** Implement a batched version of the `single_clamping()` function from Question 6. In this version, each batch element corresponds to a specific feature index: the feature to be clamped is given by the batch index plus a fixed offset (`off`). Then, using the method from the previous question, generate multiple sentences without any input context (i.e., by calling `model.generate()` without tokens), so that each sentence has a different feature clamped according to its batch index (shifted by `off`). Finally, write the generated sentences to a `.txt` file and save it.

(Hint: you can use `model.tokenizer.batch_decode()` to decode a batch of tokenized sentences.)
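
The indexing rule "batch index plus `off`" can be sketched on plain Python lists. This is a toy illustration, not the expected solution: `batched_clamp` is a hypothetical helper, and each row stands for the feature vector of one batch element (the real features also carry a sequence dimension).

```python
def batched_clamp(features, off, value):
    """Return a copy of `features` in which row b has feature (b + off) set to `value`."""
    clamped = [list(row) for row in features]  # copy so the input stays untouched
    for batch_index, row in enumerate(clamped):
        row[batch_index + off] = value
    return clamped


# Toy batch: 3 batch elements, 5 features each, all initially inactive.
features = [[0.0] * 5 for _ in range(3)]
clamped = batched_clamp(features, off=1, value=10.0)
for row in clamped:
    print(row)
```

Row 0 has feature 1 clamped, row 1 has feature 2, and row 2 has feature 3. In PyTorch, the same assignment can typically be written without a Python loop by indexing the batch and feature dimensions with integer index tensors.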




**Question 10:** Using the batched clamping approach from Question 9, explore a selection (e.g., the first 1024 features) of the $32$k features to identify interpretable ones that steer the generation without context. Experiment with various feature indices and observe the resulting outputs to determine which features have a clear impact on the generated sentences.


**Question 11:** Evaluate the interpretable features identified through feature steering by applying clamping when generating the next tokens of `input_sentence`. For instance, if you discovered a feature that induces a 'sad' tone in GPT-2 small, generate the next tokens for a chosen input sentence both with and without clamping that feature, and compare the resulting outputs.



**Bonus Question:** Using the batched clamping code combined with a for loop, explore all the $32$k features of the autoencoder.

