FormalRx: Rectify and eXamine Semantic Failures in Autoformalization

Haocheng Wang*1,4, Baiyu Huang*1, Yingjia Wan*2,
Xiao Zhu1, Xiaoyang Liu5, Yinya Huang†3, Zhijiang Guo†1,6
1HKUST(GZ)   2UCLA   3ETH AI Center   4ETH Zürich   5SJTU   6HKUST
*Equal contribution   Corresponding authors
ICML 2026
FormalRx overview

Figure 1. An illustrative comparison with prior semantic evaluation methods for autoformalization. On this misaligned candidate, every FL–FL and NL–FL baseline either scores it as aligned or confirms only syntactic correctness, whereas FormalRx returns the verdict, error category, error location, and a corrected formal statement.

TL;DR. Existing autoformalization evaluators give you a binary verdict and stop there. Such a coarse signal carries no fine-grained information about what failed, where it occurred, or how to repair it.

FormalRx is a diagnostic evaluation framework that transforms this coarse signal into actionable feedback. In a single forward pass, it provides: 1) alignment verdicts, 2) error categorization over a 28-class SCI taxonomy, 3) error localization, and 4) correction.

Live · ICML 2026 Workshop Challenge

AI4Math @ ICML 2026 · Track 1: FormalRx

Semantic Alignment Evaluation for Autoformalization. Built on the SCI Error Taxonomy. Submit your diagnostic model and climb the leaderboard.

May 1 to June 15, 2026 (AoE) · $1,000 / $600 / $400 for top 3

The veracious semantic alignment in autoformalization is significant for formal mathematical reasoning. However, existing evaluations provide only opaque binary verdicts or scalar scores, offering no interpretable insight into where or why translations fail. This opacity severely limits both human understanding and automated system improvement.

To bridge this gap, we introduce FormalRx, a comprehensive diagnostic evaluation framework that transforms autoformalization assessment from black-box judgments into actionable feedback. At its core is the SCI Error Taxonomy, a hierarchical classification scheme decomposing autoformalization errors into 28 distinct categories with strict priority ordering. Building on this taxonomy, FormalRx provides four critical diagnostic capabilities: alignment verdicts, error categorization, error localization, and correction. We instantiate the framework with a diagnostic model FormalRx-8B, trained on 56,287 NL–FL pairs with fine-grained diagnostic annotations, and release FormalRx-Test as the first fine-grained diagnostic benchmark.

The SCI taxonomy partitions autoformalization errors into three disjoint, exhaustive dimensions: Semantic, Constraint, and Implementation. Location-based catch-all categories ensure exhaustiveness, and a priority ordering resolves cases that match more than one.

SCI taxonomy

Figure 2. Overview of the SCI taxonomy. 28 error categories are partitioned into three disjoint, exhaustive dimensions: Semantic (S), Constraint (C), and Implementation (I). Location-based catch-all categories (Premise, Conclusion, Auxiliary Construction) ensure exhaustiveness, and a priority ordering resolves cases that match more than one.

We synthesize 56,287 annotated NL–FL pairs by taxonomy-guided error injection from 17,825 aligned seed pairs. An LLM agent introduces a targeted error for each applicable SCI category, a second LLM validates and re-tags, and candidates failing Lean compilation or the label check are discarded. Each surviving instance carries full diagnostic metadata: error category, location, reasoning, and corrected statement.

Data synthesis pipeline

Figure 3. Overview of the data synthesis pipeline. From 17k aligned NL–FL seed pairs, an LLM agent applies taxonomy-guided error injection to produce misaligned candidates together with raw diagnostic labels. A second LLM then validates and re-tags each candidate against the SCI taxonomy; samples failing Lean compilation or the label check are discarded, leaving 56k annotated instances.

HF Dataset Downloads Likes ICML 2026 Challenge License

We release FormalRx-Test on Hugging Face: 7,030 natural-language / Lean 4 statement pairs annotated under the SCI Error Taxonomy. Each sample carries a natural-language problem, a candidate Lean 4 formalization, and a Lean header required to type-check the candidate.

7,030
Total samples
1,791
Aligned (25.5%)
5,239
Misaligned (74.5%)
28
SCI categories covered
Heads up. Inputs only are released for now. Ground-truth diagnoses stay withheld until the ICML 2026 AI for Math Workshop & Challenge 1 closes (June 15, AoE). The fully labeled split will accompany the paper after the challenge ends.

Schema

Each sample is a JSON object:

{
  "idx":                "formalrx_1",
  "header":             "import Mathlib\nopen Real Set",
  "informal_statement": "...natural language problem...",
  "formal_statement":   "theorem ... := by sorry",
  "diagnosis": {
    "aligned":             "Aligned" | "Misaligned",
    "error_type":          "<one of 28 SCI categories>" | "N/A",
    "error_location":      "<minimal code snippet>" | "N/A",
    "corrected_statement": "<complete corrected Lean 4 statement>" | "N/A"
  }
}
0.881
Verdict F1
0.709
Categorization F1
0.750
Localization Acc.
0.729
Correction Acc.

On FormalRx-Test (7,030 samples), FormalRx-8B outperforms the strongest frontier baselines on every task: +23.4 pp over the best baseline on 28-class error categorization, while staying competitive on verdict prediction where base models already perform well.

Table 1. Main results on FormalRx-Test (7,030 samples). Underline marks the best baseline within each model category; bold + underline marks the overall best. FormalRx-8B attains the highest score on every task; the categorization gap (+23.4 pp over the strongest baseline) reflects the value of taxonomy-grounded supervision.

Method Verdict (F1) Categorization (F1) Localization (Acc) Correction (Acc)
Frontier Models
DeepSeek-v3.20.8580.2590.4070.592
GPT-4.10.8100.2690.6090.579
GPT-5-mini0.8770.3970.6050.605
GPT-5.20.8580.3110.5650.578
GPT-5.3-codex0.8530.3800.6850.688
Claude-Sonnet-40.8690.3760.5820.595
Claude-Sonnet-4.60.8800.4750.7390.714
Instruct Models
Qwen3-8B0.8330.1550.3980.458
Qwen3-4B0.8540.1190.3310.354
Qwen3-1.7B0.8430.0920.3500.331
Specialized Alignment Metrics
LeanScorer0.632n/an/an/a
BLEU (best F1)0.404n/an/an/a
FormalRx-8B (Ours) 0.881 0.709 0.750 0.729

Scaling Across Model Sizes

Table 2. Performance across three FormalRx sizes. Verdict and categorization are well-learned even at 1.7B parameters, while the generative tasks (localization and correction) benefit most from scale: FormalRx-8B wins on both. The non-monotonic dip on categorization at 4B is consistent with the paper's observation that 28-class classification is sensitive to factors beyond raw scale.

Method Verdict (F1) Categorization (F1) Localization (Acc) Correction (Acc)
FormalRx-1.7B0.8820.7250.7390.708
FormalRx-4B0.8890.5080.7130.654
FormalRx-8B 0.881 0.709 0.750 0.729
@inproceedings{wang2026formalrx,
  title     = {FormalRx: Rectify and eXamine Semantic Failures in Autoformalization},
  author    = {Wang, Haocheng and Huang, Baiyu and Wan, Yingjia and Zhu, Xiao and Liu, Xiaoyang and Huang, Yinya and Guo, Zhijiang},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
}