feat: let the optimizer disable MLP ablation via a 0 max_weight floor (#387)

* feat: let the optimizer disable MLP ablation via a 0 max_weight floor

The MLP max_weight lower bound was 0.8 for every component, so the optimizer
always applied at least 0.8x MLP ablation and could never turn it off, even
when ablating the MLP is pure collateral damage. Give the MLP a 0 lower bound
so the optimizer can disable it per model; attention keeps the 0.8 floor.

See #202.

* perf: skip the abliteration decomposition when the weight is 0

With a 0 max_weight the component's ablation is a no-op, and reset_model()
has already left the adapter at identity. Skip that layer/component before
the decomposition, which avoids the wasted work (and the degenerate
zero-matrix decomposition raised in review on #387).

* fix: clamp a negative MLP max_weight floor so 0 is reachable

A continuous suggest_float never samples exactly 0, so a 0 lower bound could
not actually disable the MLP. Use a small negative lower bound and clamp with
max(0, ...), which puts positive probability mass on exactly 0.
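
As a rough illustration of the clamp described in the last commit, the sketch below draws from the same bounds as the diff and counts how often the clamped value is exactly 0. It assumes plain uniform sampling via Python's random module; the sampler behind trial.suggest_float may be adaptive, so the real fraction can differ. The names sample_max_weight, zero_fraction and the component name "attn.o_proj" are hypothetical and used only here.

```python
import random


def sample_max_weight(rng: random.Random, component: str) -> float:
    """Clamped draw using the bounds from the diff (uniform sampling is an assumption)."""
    lower = -0.25 if component == "mlp.down_proj" else 0.8
    # Any draw at or below 0 collapses to exactly 0.0, which is what
    # gives 0 a positive probability mass.
    return max(0.0, rng.uniform(lower, 1.5))


def zero_fraction(component: str, n: int = 100_000, seed: int = 0) -> float:
    """Fraction of n clamped draws that are exactly 0."""
    rng = random.Random(seed)
    zeros = sum(sample_max_weight(rng, component) == 0.0 for _ in range(n))
    return zeros / n


if __name__ == "__main__":
    for name in ("mlp.down_proj", "attn.o_proj"):
        print(name, zero_fraction(name))
```

Under the uniform assumption the MLP fraction should land near 0.25 / 1.75, roughly 1/7, while a component that keeps the 0.8 floor never reaches 0.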
Rocker Zhang
2026-06-18 16:14:45 +08:00
committed by GitHub
parent 554a58aa0f
commit 00185db9fc
2 changed files with 22 additions and 4 deletions

@@ -578,10 +578,22 @@ def run():
 # The parameter ranges are based on experiments with various models
 # and much wider ranges. They are not set in stone and might have to be
 # adjusted for future models.
-max_weight = trial.suggest_float(
-    f"{component}.max_weight",
-    0.8,
-    1.5,
+#
+# The MLP gets a negative lower bound that is then clamped to 0, so the
+# optimizer can fully disable its ablation. The clamp puts a positive
+# probability mass on exactly 0 (the continuous sampler would otherwise
+# reach 0 with probability zero). Ablating the MLP is often unnecessary for
+# removing refusals and tends to damage model intelligence more than
+# ablating the attention output, so on many models the optimum is to leave
+# it (mostly) untouched. See issue #202.
+max_weight_lower_bound = -0.25 if component == "mlp.down_proj" else 0.8
+max_weight = max(
+    0.0,
+    trial.suggest_float(
+        f"{component}.max_weight",
+        max_weight_lower_bound,
+        1.5,
+    ),
 )
 max_weight_position = trial.suggest_float(
     f"{component}.max_weight_position",

@@ -499,6 +499,12 @@ class Model:
     params.min_weight - params.max_weight
 )

+# A weight of 0 disables this component's ablation. reset_model() has
+# already left the adapter at identity, so abort before the otherwise
+# wasteful decomposition (which would also be operating on a zero matrix).
+if weight == 0:
+    continue
+
 if refusal_direction is None:
     # The index must be shifted by 1 because the first element
     # of refusal_directions is the direction for the embeddings.
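
The early exit in this hunk can be illustrated with a small sketch showing why a weight of 0 is a no-op. It assumes the ablation subtracts weight times the projection of the matrix onto a unit refusal direction, i.e. W - weight * v v^T W; the commit applies the change through an adapter decomposition that is not shown here, so ablation_delta and ablate are hypothetical helpers, not the project's API.

```python
from typing import List

Matrix = List[List[float]]


def ablation_delta(weight: float, direction: List[float], matrix: Matrix) -> Matrix:
    """Return weight * (v v^T) W for a direction v (assumed update form)."""
    rows, cols = len(matrix), len(matrix[0])
    # v^T W: one dot product per column of W.
    vtw = [sum(direction[i] * matrix[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight * direction[i] * vtw[j] for j in range(cols)] for i in range(rows)]


def ablate(weight: float, direction: List[float], matrix: Matrix) -> Matrix:
    """Subtract the weighted projection, skipping the work when weight is 0."""
    if weight == 0:
        # Mirrors the `continue` in the diff: the delta would be all zeros,
        # so the matrix is returned untouched.
        return matrix
    delta = ablation_delta(weight, direction, matrix)
    return [[matrix[i][j] - delta[i][j] for j in range(len(matrix[0]))]
            for i in range(len(matrix))]


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    v = [1.0, 0.0]
    print(ablate(0.0, v, w))
    print(ablate(1.0, v, w))
```

With weight 0 the computed delta is the zero matrix, so skipping it changes nothing about the result and only saves the computation.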