TL;DR. Existing autoformalization evaluators give you a binary verdict and stop there. Such a coarse signal carries no fine-grained information about what failed, where it occurred, or how to repair it.
FormalRx is a diagnostic evaluation framework that transforms this coarse signal into actionable feedback. In a single forward pass, it provides: 1) alignment verdicts, 2) error categorization over a 28-class SCI taxonomy, 3) error localization, and 4) correction.
Semantic Alignment Evaluation for Autoformalization. Built on the SCI Error Taxonomy. Submit your diagnostic model and climb the leaderboard.
The veracious semantic alignment in autoformalization is significant for formal mathematical reasoning. However, existing evaluations provide only opaque binary verdicts or scalar scores, offering no interpretable insight into where or why translations fail. This opacity severely limits both human understanding and automated system improvement.
To bridge this gap, we introduce FormalRx, a comprehensive diagnostic evaluation framework that transforms autoformalization assessment from black-box judgments into actionable feedback. At its core is the SCI Error Taxonomy, a hierarchical classification scheme decomposing autoformalization errors into 28 distinct categories with strict priority ordering. Building on this taxonomy, FormalRx provides four critical diagnostic capabilities: alignment verdicts, error categorization, error localization, and correction. We instantiate the framework with a diagnostic model FormalRx-8B, trained on 56,287 NL–FL pairs with fine-grained diagnostic annotations, and release FormalRx-Test as the first fine-grained diagnostic benchmark.
The SCI taxonomy partitions autoformalization errors into three disjoint, exhaustive dimensions: Semantic, Constraint, and Implementation. Location-based catch-all categories ensure exhaustiveness, and a priority ordering resolves cases that match more than one.
We synthesize 56,287 annotated NL–FL pairs by taxonomy-guided error injection from 17,825 aligned seed pairs. An LLM agent introduces a targeted error for each applicable SCI category, a second LLM validates and re-tags, and candidates failing Lean compilation or the label check are discarded. Each surviving instance carries full diagnostic metadata: error category, location, reasoning, and corrected statement.
We release FormalRx-Test on Hugging Face: 7,030 natural-language / Lean 4 statement pairs annotated under the SCI Error Taxonomy. Each sample carries a natural-language problem, a candidate Lean 4 formalization, and a Lean header required to type-check the candidate.
Each sample is a JSON object:
{
"idx": "formalrx_1",
"header": "import Mathlib\nopen Real Set",
"informal_statement": "...natural language problem...",
"formal_statement": "theorem ... := by sorry",
"diagnosis": {
"aligned": "Aligned" | "Misaligned",
"error_type": "<one of 28 SCI categories>" | "N/A",
"error_location": "<minimal code snippet>" | "N/A",
"corrected_statement": "<complete corrected Lean 4 statement>" | "N/A"
}
}
On FormalRx-Test (7,030 samples), FormalRx-8B outperforms the strongest frontier baselines on every task: +23.4 pp over the best baseline on 28-class error categorization, while staying competitive on verdict prediction where base models already perform well.
Table 1. Main results on FormalRx-Test (7,030 samples). Underline marks the best baseline within each model category; bold + underline marks the overall best. FormalRx-8B attains the highest score on every task; the categorization gap (+23.4 pp over the strongest baseline) reflects the value of taxonomy-grounded supervision.
| Method | Verdict (F1) | Categorization (F1) | Localization (Acc) | Correction (Acc) |
|---|---|---|---|---|
| Frontier Models | ||||
| DeepSeek-v3.2 | 0.858 | 0.259 | 0.407 | 0.592 |
| GPT-4.1 | 0.810 | 0.269 | 0.609 | 0.579 |
| GPT-5-mini | 0.877 | 0.397 | 0.605 | 0.605 |
| GPT-5.2 | 0.858 | 0.311 | 0.565 | 0.578 |
| GPT-5.3-codex | 0.853 | 0.380 | 0.685 | 0.688 |
| Claude-Sonnet-4 | 0.869 | 0.376 | 0.582 | 0.595 |
| Claude-Sonnet-4.6 | 0.880 | 0.475 | 0.739 | 0.714 |
| Instruct Models | ||||
| Qwen3-8B | 0.833 | 0.155 | 0.398 | 0.458 |
| Qwen3-4B | 0.854 | 0.119 | 0.331 | 0.354 |
| Qwen3-1.7B | 0.843 | 0.092 | 0.350 | 0.331 |
| Specialized Alignment Metrics | ||||
| LeanScorer | 0.632 | n/a | n/a | n/a |
| BLEU (best F1) | 0.404 | n/a | n/a | n/a |
| FormalRx-8B (Ours) | 0.881 | 0.709 | 0.750 | 0.729 |
Table 2. Performance across three FormalRx sizes. Verdict and categorization are well-learned even at 1.7B parameters, while the generative tasks (localization and correction) benefit most from scale: FormalRx-8B wins on both. The non-monotonic dip on categorization at 4B is consistent with the paper's observation that 28-class classification is sensitive to factors beyond raw scale.
| Method | Verdict (F1) | Categorization (F1) | Localization (Acc) | Correction (Acc) |
|---|---|---|---|---|
| FormalRx-1.7B | 0.882 | 0.725 | 0.739 | 0.708 |
| FormalRx-4B | 0.889 | 0.508 | 0.713 | 0.654 |
| FormalRx-8B | 0.881 | 0.709 | 0.750 | 0.729 |
@inproceedings{wang2026formalrx,
title = {FormalRx: Rectify and eXamine Semantic Failures in Autoformalization},
author = {Wang, Haocheng and Huang, Baiyu and Wan, Yingjia and Zhu, Xiao and Liu, Xiaoyang and Huang, Yinya and Guo, Zhijiang},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026},
}