ACM MM '26 Dataset Track Submission · Project Page

MPF-Bench: A Programmatically Verifiable Benchmark Family for Fine-Grained Visual Reasoning in VLMs

MPF-Bench reformulates natural images as localized completion problems with deterministic labels, exact scoring, zero-cost annotation, and controlled difficulty across grid size, candidate count, and mask shape.

1 University of Chinese Academy of Sciences

2 State Key Laboratory of Communication Content Cognition

3 Peng Cheng Laboratory

33k source images sampled from COCO and Flickr30K
6,000 released MPF test instances in total across 6 configurations

Abstract

Benchmarking Fine-Grained Visual Reasoning With Exact Verification

Vision-language models perform strongly on many multimodal tasks, but fine-grained visual reasoning is still difficult to evaluate in a controlled and reproducible way. Existing benchmarks are often static, annotation-heavy, and limited in controllable difficulty. MPF-Bench introduces a programmatically generated benchmark family in which each instance is defined directly by construction.

This yields deterministic labels, exact scoring, zero-cost annotation, and controlled difficulty via grid size, candidate count, and mask shape. The same verifiable structure also makes MPF reusable as a self-supervised training signal, although benchmark evaluation remains the primary goal.

Core Contributions

  • Programmatic generation of fine-grained visual reasoning instances from natural images.
  • Controllable benchmark family instead of a single fixed evaluation slice.
  • Deterministic labels and exact scoring without human annotation.
  • Evidence that MPF is difficult in zero-shot settings yet learnable with targeted training.

Task

From Conventional VQA to Programmatic Patch Completion

Fig. 1

Why MPF-Bench Differs From Static QA Benchmarks

`fig/MPF_vs_others.pdf`

Comparison between conventional VQA benchmarks relying on human annotations and MPF-Bench, which enables fully automatic data generation, supervision, and evaluation.

Prompt Protocol

Fixed Composite-Image Evaluation Setup

“You are a professional image analysis expert. Given one masked image and its candidate patches, select the single candidate that best fills the masked region. Judge continuity, texture, geometry, color, and semantic plausibility. Return only the final patch index inside <mpf> and </mpf>.”

99.5%+ output-format accuracy
< 1% out-of-candidate rate
exact numeric parsing from `<mpf>` tag
These statistics indicate that the evaluation largely factors out failures caused by output-format errors, instruction-following issues, OCR mistakes, or other model-specific response artifacts. As a result, performance drops on MPF-Bench more faithfully reflect weaknesses in fine-grained visual perception and local reasoning, rather than failures to follow the answer protocol itself.
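The `<mpf>`-tag protocol makes scoring a few lines of deterministic parsing. The sketch below is our own illustration of such a scorer (function names and 1-based candidate indexing are assumptions, not the released evaluation code); it separates format errors and out-of-candidate predictions from genuine wrong answers:

```python
import re

def parse_mpf_answer(response: str, num_candidates: int):
    """Extract the predicted patch index from an <mpf>...</mpf> tag.

    Returns the 1-based candidate index, or None when the response is
    malformed (output-format error) or names an index outside the
    candidate range (out-of-candidate prediction).
    """
    match = re.search(r"<mpf>\s*(\d+)\s*</mpf>", response, re.DOTALL)
    if match is None:
        return None  # output-format error
    index = int(match.group(1))
    if not 1 <= index <= num_candidates:
        return None  # out-of-candidate prediction
    return index

def score(response: str, label: int, num_candidates: int) -> int:
    # Exact scoring: a prediction is correct iff it equals the known label.
    return int(parse_mpf_answer(response, num_candidates) == label)
```

Because the label is known by construction, no human judging or fuzzy answer matching is involved at any point.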

Benchmark Family

From a Default Pipeline to a Configurable Benchmark Family

Fig. 2

Programmatic Data Pipeline

`fig/MPF_data_pipeline.pdf`

Overview of the MPF-Bench pipeline. Fig. 2 illustrates the default `8×6`, `4-way` MPF workflow: MPF automatically generates reasoning instances, derives ground-truth labels, and evaluates model responses. The same verifiable structure can also be reused for self-supervised training with GRPO.
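The core of instance generation can be sketched in a few lines. This is a simplified reconstruction under assumed conventions (1-based labels, uniform distractor sampling, rectangular cells); the released pipeline additionally filters low-information patches, controls distractor similarity, and supports elliptical masks:

```python
import random

def make_mpf_instance(image_size, grid=(8, 6), n_candidates=4, seed=0):
    """Sketch of generating one MPF instance on a (width, height) image:
    split the image into a grid, mask one cell, and sample distractor
    patches from other, spatially distinct cells. The position of the
    true patch in the shuffled candidate list is the ground-truth label.
    """
    rng = random.Random(seed)
    w, h = image_size
    cols, rows = grid
    cells = [(c, r) for c in range(cols) for r in range(rows)]

    target = rng.choice(cells)
    # Distractors come from cells other than the masked target cell.
    pool = [cell for cell in cells if cell != target]
    distractors = rng.sample(pool, n_candidates - 1)

    candidates = distractors + [target]
    rng.shuffle(candidates)
    label = candidates.index(target) + 1  # deterministic, 1-based

    pw, ph = w // cols, h // rows
    def bbox(cell):
        # Pixel-coordinate bounding box of a grid cell.
        c, r = cell
        return (c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)

    return {
        "mask_bbox": bbox(target),
        "candidate_bboxes": [bbox(c) for c in candidates],
        "label": label,
    }
```

Because the label falls out of the construction itself, both supervision and evaluation are free of human annotation.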

Beyond the default pipeline shown above, the released benchmark family varies along three configurable dimensions.

Dimension 1

Grid Size

`4×4`, `8×6`, `8×8`, and `12×12` regulate local context and patch granularity.

Dimension 2

Candidate Count

`4-way`, `8-way`, and `16-way` settings scale ambiguity and prove to be the strongest difficulty factor.

Dimension 3

Mask Shape

Rectangular and elliptical masks change boundary cues and local continuity conditions.

Fig. 3

Representative Configurations

`fig/mpf_sample.pdf`

Examples of MPF instances under two configurations. Left: `12×12` grid with 16 candidates. Right: `8×6` grid with 4 candidates and an elliptical mask.

Release Protocol

  • The current release contains 6 benchmark configurations and 6,000 MPF test instances in total.
  • Each configuration contains 1,000 MPF test instances.
  • Source images and benchmark test images are split to avoid leakage.
  • Only derived metadata is redistributed, such as patch indices and coordinates.
  • Difficulty scores record entropy, distractor similarity, and boundary ambiguity.
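As one illustration of how an entropy-based difficulty score could be formalized, the sketch below normalizes the entropy of a candidate-similarity distribution to [0, 1]; this formulation and its inputs are our assumptions, not the released scoring code:

```python
import math

def difficulty_score(similarities):
    """Assumed difficulty signal: normalized entropy of the candidate
    similarities (each candidate's visual similarity to the masked
    region's context, values in (0, 1]). A flatter distribution means
    more ambiguity among candidates, hence higher difficulty.
    """
    total = sum(similarities)
    probs = [s / total for s in similarities]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Divide by the maximum entropy, log(k), so scores lie in [0, 1].
    return entropy / math.log(len(similarities))
```

A uniform similarity profile scores 1.0 (maximally ambiguous), while one dominant candidate drives the score toward 0.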

Zero-Shot Results

Current VLMs Degrade Sharply As Local Ambiguity Increases

Main Takeaways

  • Accuracy generally declines as MPF configurations become more demanding.
  • Increasing candidate count hurts more consistently than refining the grid alone.
  • The hardest `12×12`, 16-way setting remains far from saturated across models.
  • MPF-Bench is discriminative for both open-source and proprietary systems.

Hardest Setting Snapshot

`12×12`, 16-way, rect · random baseline `6.25%`

Qwen2-VL 6.5% · Qwen2.5-VL 9.5% · Qwen3-VL 21.5% · GPT-5.1 14.5% · Seed-2.0 34.0% · Gemini-3-Flash 28.6% · Kimi-K2.5 37.6%

Fig. 4 + Table 1

Representative Accuracy Sweep Across Configurations

`fig/mpf_accuracy_sweep.pdf` (main-paper accuracy sweep)
Zero-shot accuracy on MPF-Bench under increasingly difficult configurations.
| Configuration | Qwen2-VL | Qwen2.5-VL | Qwen3-VL | GPT-5.1 | Seed-2.0 | Gemini-3-Flash | Kimi-K2.5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `4×4`, 4-way, rect | 25.0 | 38.0 | 46.0 | 47.5 | 67.5 | 81.0 | 71.5 |
| `8×6`, 4-way, rect | 29.0 | 31.5 | 57.5 | 47.5 | 72.5 | 81.5 | 76.5 |
| `8×6`, 4-way, ellipse | 36.0 | 30.5 | 56.0 | 55.5 | 78.0 | 82.9 | 73.5 |
| `8×6`, 8-way, rect | 14.0 | 18.5 | 37.5 | 30.0 | 50.5 | 69.0 | 53.1 |
| `8×8`, 8-way, rect | 14.5 | 20.5 | 39.0 | 31.5 | 57.0 | 65.5 | 61.1 |
| `12×12`, 16-way, rect | 6.5 | 9.5 | 21.5 | 14.5 | 34.0 | 28.6 | 37.6 |

Secondary Utility

MPF Is Not Only Difficult, But Also Learnable

Training Setup

  • SFT on about 1,000 MPF samples with reasoning traces distilled from Qwen2.5-VL-72B.
  • GRPO on about 24,000 additional MPF instances with 16 responses per input for 3 epochs.
  • One-epoch refinement on LLaVA-Instruct-150K to recover general instruction following.
  • Total training time is about 30 hours on `8×H20` GPUs.
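Because labels are deterministic, the GRPO reward for each sampled response can be computed exactly. The sketch below assumes a simple shaping (correct patch 1.0, well-formed but wrong 0.1, malformed 0.0) together with standard group-normalized advantages; the specific shaping values are our assumption, not the paper's training configuration:

```python
import re

def mpf_reward(response: str, label: int, num_candidates: int) -> float:
    """Deterministic MPF reward under an assumed shaping scheme."""
    match = re.search(r"<mpf>\s*(\d+)\s*</mpf>", response)
    if match is None:
        return 0.0                      # malformed output
    index = int(match.group(1))
    if not 1 <= index <= num_candidates:
        return 0.1                      # well-formed but out of range
    return 1.0 if index == label else 0.1

def group_advantages(rewards):
    """GRPO normalizes each reward against its group of sampled
    responses (e.g. the 16 responses per input used above)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

No reward model is needed: the verifiable task structure itself supplies the training signal.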

Fig. 5

Qualitative Shift After MPF Training

`fig/mpf_qualitative_comparison.pdf`
Qualitative comparison before and after MPF training.
  • VED 9 → 18 (visual evidence density increases)
  • LC 32.6 → 18.2 (linguistic compactness improves)

Table 2

Before-and-After Training Results

| Model | MMBench | SEED-Bench | POPE | HallusionBench | MPF-Bench |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B | 83.9 | 73.4 | 86.9 | 64.0 | 27.3 |
| Qwen2.5-VL-7B + MPF | 84.4 | 78.3 | 86.4 | 68.1 | 93.1 |
| InternVL3-1B | 57.6 | 69.4 | 83.9 | 47.5 | 21.6 |
| InternVL3-1B + MPF | 58.1 | 69.6 | 84.0 | 50.6 | 91.8 |

MPF training yields large in-domain gains, while transfer to external multimodal benchmarks remains moderate.

Discussion / FAQ

Common Questions About Scope, Validity, and Utility

This section clarifies the intended scope of MPF-Bench and addresses common questions about benchmark design, construct validity, and training utility. Our goal is to be precise about what MPF-Bench measures, what it does not claim to measure, and why training is positioned as a secondary use rather than the main contribution.

Q1

Is MPF-Bench primarily a benchmark or a training method?

MPF-Bench is primarily a benchmark family. Its main contribution is a programmatically generated, controllable, and exactly verifiable evaluation framework for fine-grained visual reasoning.

We additionally study MPF as a secondary self-supervised training signal because the same task structure naturally provides deterministic rewards, but this training use is auxiliary rather than the paper's primary claim.

Q2

Does the task mainly measure OCR, formatting, or instruction following?

We designed the protocol to minimize non-target confounds. All models receive the same composite image with the same layout and candidate order, rather than model-specific multi-image inputs.

Output-format accuracy exceeds 99.5%, and the out-of-candidate prediction rate stays below 1% across nearly all settings. This suggests that the main difficulty comes from selecting the correct patch under visual ambiguity, not from formatting failures.

Q3

Why use settings such as 8×6, 4-way, or elliptical masks?

MPF-Bench is a benchmark family rather than a single fixed slice. The released benchmark varies along three explicit difficulty axes: grid size, candidate count, and mask shape.

The 8×6, 4-way, rectangular setting is used only as a practical working configuration for the training study. It is not the definition of the benchmark itself.

Q4

If MPF training pushes accuracy above 90%, does the benchmark saturate too quickly?

We view "difficult but learnable" as a feature rather than a contradiction. In zero-shot evaluation, several strong models remain close to random chance on the hardest 12×12, 16-way setting, showing that harder slices are far from saturated.

High in-domain accuracy after targeted training demonstrates learnability on one working slice, but it does not collapse the broader benchmark family, because new slices can still be created by increasing candidate ambiguity and other difficulty factors.

Q5

Does MPF really test reasoning, rather than low-level seam matching?

MPF is intended to measure context-conditioned local compatibility, not "pure semantics" in isolation. The task requires the model to identify which candidate is most consistent with the surrounding visual context, including texture, geometry, color, material, and semantic plausibility.

To reduce trivial cases, the pipeline filters low-information patches, samples distractors from spatially distinct regions while preserving coarse plausibility, and records factors such as distractor similarity and boundary ambiguity. We do not claim that all low-level cues are eliminated, but we do claim that MPF provides a scalable and verifiable stress test for fine-grained local reasoning under controlled ambiguity.
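A minimal version of such a low-information filter, assuming a simple pixel-variance criterion with an arbitrary threshold (the actual filter and its parameters are not specified here):

```python
def is_low_information(patch, var_threshold=50.0):
    """Sketch of the low-information filter mentioned above: reject
    near-uniform patches (e.g. clear sky, flat walls) whose pixel
    variance falls below a threshold. `patch` is a flat sequence of
    grayscale values; the threshold is an assumed free parameter.
    """
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return var < var_threshold
```

Filtering such patches ensures that the masked region carries enough visual evidence for a context-conditioned decision.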

Q6

Why are transfer gains on external benchmarks only moderate?

The transfer gains are intentionally presented as moderate, not universal. This is consistent with our positioning: MPF-Bench is primarily an evaluation tool for a specific fine-grained capability, and only secondarily a self-supervised training signal.

Strong in-domain gains show that the task is learnable and behaviorally meaningful, while moderate external gains suggest that MPF captures a real but relatively narrow component of multimodal competence rather than a universal recipe for improving all benchmarks.