feat: let the optimizer disable MLP ablation via a 0 max_weight floor (#387)

* feat: let the optimizer disable MLP ablation via a 0 max_weight floor

  The MLP max_weight lower bound was 0.8 for every component, so the optimizer always applied at least 0.8x MLP ablation and could never turn it off, even when ablating the MLP is pure collateral damage. Give the MLP a 0 lower bound so the optimizer can disable it per model; attention keeps the 0.8 floor. See #202.

* perf: skip the abliteration decomposition when the weight is 0

  With a 0 max_weight the component's ablation is a no-op, and reset_model() has already left the adapter at identity. Abort that layer/component before the decomposition, which avoids the wasted work (and the degenerate zero-matrix decomposition raised in review on #387).

* fix: clamp a negative MLP max_weight floor so 0 is reachable

  A continuous suggest_float never samples exactly 0, so a 0 lower bound could not actually disable the MLP. Use a small negative lower bound and clamp with max(0, ...), which puts finite probability mass on exactly 0.
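
The clamping argument in the last message can be checked numerically. The sketch below is illustrative only and is not part of the change. It replaces the optimizer's sampler with a plain uniform draw from Python's `random` module, which is an assumption (the sampler behind `trial.suggest_float` need not be uniform), and `sample_clamped_max_weight` is a hypothetical helper name. Under that assumption the fraction of draws that land on exactly 0 should be close to 0.25 / 1.75, the share of the sampling range that lies below zero.

```python
import random


def sample_clamped_max_weight(rng, low=-0.25, high=1.5):
    """Hypothetical stand-in for the clamped suggest_float call.

    Draws uniformly from [low, high] (an assumption; the real sampler
    may not be uniform) and clamps negative draws to exactly 0.0.
    """
    return max(0.0, rng.uniform(low, high))


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [sample_clamped_max_weight(rng) for _ in range(100_000)]

# Fraction of draws that were clamped to exactly 0.
zero_fraction = sum(s == 0.0 for s in samples) / len(samples)

# Under a uniform draw, the negative part of the range is 0.25 wide
# out of a total width of 1.75.
expected_fraction = 0.25 / 1.75

print(f"observed: {zero_fraction:.3f}, expected: {expected_fraction:.3f}")
```

With a lower bound of exactly 0 and no clamp, the same experiment would essentially never produce a draw equal to 0.0, which is the problem the third commit addresses.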
2026-06-24 08:47:51 +00:00 · 2026-06-18 16:14:45 +08:00
parent 554a58aa0f
commit 00185db9fc
2 changed files with 22 additions and 4 deletions
--- a/src/heretic/main.py
+++ b/src/heretic/main.py
@@ -578,10 +578,22 @@ def run():
            # The parameter ranges are based on experiments with various models
            # and much wider ranges. They are not set in stone and might have to be
            # adjusted for future models.
-            max_weight = trial.suggest_float(
-                f"{component}.max_weight",
-                0.8,
-                1.5,
+            #
+            # The MLP gets a negative lower bound that is then clamped to 0, so the
+            # optimizer can fully disable its ablation. The clamp puts a positive
+            # probability mass on exactly 0 (the continuous sampler would otherwise
+            # reach 0 with probability zero). Ablating the MLP is often unnecessary for
+            # removing refusals and tends to damage model intelligence more than
+            # ablating the attention output, so on many models the optimum is to leave
+            # it (mostly) untouched. See issue #202.
+            max_weight_lower_bound = -0.25 if component == "mlp.down_proj" else 0.8
+            max_weight = max(
+                0.0,
+                trial.suggest_float(
+                    f"{component}.max_weight",
+                    max_weight_lower_bound,
+                    1.5,
+                ),
            )
            max_weight_position = trial.suggest_float(
                f"{component}.max_weight_position",
--- a/src/heretic/model.py
+++ b/src/heretic/model.py
@@ -499,6 +499,12 @@ class Model:
                    params.min_weight - params.max_weight
                )

+                # A weight of 0 disables this component's ablation. reset_model() has
+                # already left the adapter at identity, so abort before the otherwise
+                # wasteful decomposition (which would also be operating on a zero matrix).
+                if weight == 0:
+                    continue
+
                if refusal_direction is None:
                    # The index must be shifted by 1 because the first element
                    # of refusal_directions is the direction for the embeddings.