FormalRx: Rectify and eXamine Semantic Failures in Autoformalization

Figure 1. An illustrative comparison with prior semantic evaluation methods for autoformalization. On this misaligned candidate, every FL–FL and NL–FL baseline either scores it as aligned or confirms only syntactic correctness, whereas FormalRx returns the verdict, error category, error location, and a corrected formal statement.

TL;DR. Existing autoformalization evaluators give you a binary verdict and stop there. Such a coarse signal carries no fine-grained information about what failed, where it occurred, or how to repair it.

FormalRx is a diagnostic evaluation framework that transforms this coarse signal into actionable feedback. In a single forward pass, it provides: 1) alignment verdicts, 2) error categorization over a 28-class SCI taxonomy, 3) error localization, and 4) correction.

Live · ICML 2026 Workshop Challenge

AI4Math @ ICML 2026 · Track 1: FormalRx

Semantic Alignment Evaluation for Autoformalization. Built on the SCI Error Taxonomy. Submit your diagnostic model and climb the leaderboard.

May 1 to June 15, 2026 (AoE) · $1,000 / $600 / $400 for top 3

Join Challenge Leaderboard

The veracious semantic alignment in autoformalization is significant for formal mathematical reasoning. However, existing evaluations provide only opaque binary verdicts or scalar scores, offering no interpretable insight into where or why translations fail. This opacity severely limits both human understanding and automated system improvement.

To bridge this gap, we introduce FormalRx, a comprehensive diagnostic evaluation framework that transforms autoformalization assessment from black-box judgments into actionable feedback. At its core is the SCI Error Taxonomy, a hierarchical classification scheme decomposing autoformalization errors into 28 distinct categories with strict priority ordering. Building on this taxonomy, FormalRx provides four critical diagnostic capabilities: alignment verdicts, error categorization, error localization, and correction. We instantiate the framework with a diagnostic model FormalRx-8B, trained on 56,287 NL–FL pairs with fine-grained diagnostic annotations, and release FormalRx-Test as the first fine-grained diagnostic benchmark.

The SCI taxonomy partitions autoformalization errors into three disjoint, exhaustive dimensions: Semantic, Constraint, and Implementation. Location-based catch-all categories ensure exhaustiveness, and a priority ordering resolves cases that match more than one.

We synthesize 56,287 annotated NL–FL pairs by taxonomy-guided error injection from 17,825 aligned seed pairs. An LLM agent introduces a targeted error for each applicable SCI category, a second LLM validates and re-tags, and candidates failing Lean compilation or the label check are discarded. Each surviving instance carries full diagnostic metadata: error category, location, reasoning, and corrected statement.

We release FormalRx-Test on Hugging Face: 7,030 natural-language / Lean 4 statement pairs annotated under the SCI Error Taxonomy. Each sample carries a natural-language problem, a candidate Lean 4 formalization, and a Lean header required to type-check the candidate.

7,030

Total samples

1,791

Aligned (25.5%)

5,239

Misaligned (74.5%)

28

SCI categories covered

Heads up. Inputs only are released for now. Ground-truth diagnoses stay withheld until the ICML 2026 AI for Math Workshop & Challenge 1 closes (June 15, AoE). The fully labeled split will accompany the paper after the challenge ends.

Schema

Each sample is a JSON object:

{
  "idx":                "formalrx_1",
  "header":             "import Mathlib\nopen Real Set",
  "informal_statement": "...natural language problem...",
  "formal_statement":   "theorem ... := by sorry",
  "diagnosis": {
    "aligned":             "Aligned" | "Misaligned",
    "error_type":          "<one of 28 SCI categories>" | "N/A",
    "error_location":      "<minimal code snippet>" | "N/A",
    "corrected_statement": "<complete corrected Lean 4 statement>" | "N/A"
  }
}

0.881

Verdict F1

0.709

Categorization F1

0.750

Localization Acc.

0.729

Correction Acc.

On FormalRx-Test (7,030 samples), FormalRx-8B outperforms the strongest frontier baselines on every task: +23.4 pp over the best baseline on 28-class error categorization, while staying competitive on verdict prediction where base models already perform well.

Table 1. Main results on FormalRx-Test (7,030 samples). Underline marks the best baseline within each model category; bold + underline marks the overall best. FormalRx-8B attains the highest score on every task; the categorization gap (+23.4 pp over the strongest baseline) reflects the value of taxonomy-grounded supervision.

Method	Verdict (F1)	Categorization (F1)	Localization (Acc)	Correction (Acc)
Frontier Models
DeepSeek-v3.2	0.858	0.259	0.407	0.592
GPT-4.1	0.810	0.269	0.609	0.579
GPT-5-mini	0.877	0.397	0.605	0.605
GPT-5.2	0.858	0.311	0.565	0.578
GPT-5.3-codex	0.853	0.380	0.685	0.688
Claude-Sonnet-4	0.869	0.376	0.582	0.595
Claude-Sonnet-4.6	0.880	0.475	0.739	0.714
Instruct Models
Qwen3-8B	0.833	0.155	0.398	0.458
Qwen3-4B	0.854	0.119	0.331	0.354
Qwen3-1.7B	0.843	0.092	0.350	0.331
Specialized Alignment Metrics
LeanScorer	0.632	n/a	n/a	n/a
BLEU (best F1)	0.404	n/a	n/a	n/a
FormalRx-8B (Ours)	0.881	0.709	0.750	0.729

Scaling Across Model Sizes

Table 2. Performance across three FormalRx sizes. Verdict and categorization are well-learned even at 1.7B parameters, while the generative tasks (localization and correction) benefit most from scale: FormalRx-8B wins on both. The non-monotonic dip on categorization at 4B is consistent with the paper's observation that 28-class classification is sensitive to factors beyond raw scale.

Method	Verdict (F1)	Categorization (F1)	Localization (Acc)	Correction (Acc)
FormalRx-1.7B	0.882	0.725	0.739	0.708
FormalRx-4B	0.889	0.508	0.713	0.654
FormalRx-8B	0.881	0.709	0.750	0.729

@inproceedings{wang2026formalrx,
  title     = {FormalRx: Rectify and eXamine Semantic Failures in Autoformalization},
  author    = {Wang, Haocheng and Huang, Baiyu and Wan, Yingjia and Zhu, Xiao and Liu, Xiaoyang and Huang, Yinya and Guo, Zhijiang},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
}

FormalRx: Rectify and eXamine Semantic Failures in Autoformalization

AI4Math @ ICML 2026 · Track 1: FormalRx

Abstract

The SCI Error Taxonomy

Data Synthesis Pipeline

FormalRx-Test Dataset

Schema

Main Results on FormalRx-Test

Scaling Across Model Sizes

BibTeX