This section clarifies the intended scope of MPF-Bench and addresses common questions about
benchmark design, construct validity, and training utility. Our goal is to be precise about
what MPF-Bench measures, what it does not claim to measure, and why training is positioned as a
secondary use rather than the main contribution.
Q1. Is MPF-Bench primarily a benchmark or a training method?
MPF-Bench is primarily a benchmark family. Its main contribution is a programmatically
generated, controllable, and exactly verifiable evaluation framework for fine-grained visual
reasoning.
We additionally study MPF as a self-supervised training signal because the same task
structure naturally provides deterministic rewards, but this use is auxiliary and is not
the paper's primary claim.
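To illustrate why the training signal is deterministic, here is a minimal sketch (function and argument names are hypothetical, not the paper's implementation) of an exact-match reward over candidate labels:

```python
def mpf_reward(prediction: str, answer: str, candidates: list[str]) -> float:
    """Deterministic reward: 1.0 for the correct candidate label, else 0.0.

    Hypothetical sketch; the paper's actual reward computation may differ.
    """
    pred = prediction.strip().upper()
    if pred not in candidates:
        return 0.0  # out-of-candidate predictions score zero
    return 1.0 if pred == answer else 0.0
```

Because the answer is known exactly at generation time, no learned or heuristic judge is needed to score a prediction.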
Q2. Does the task mainly measure OCR, formatting, or instruction following?
We designed the protocol to minimize non-target confounds. All models receive the same
composite image with the same layout and candidate order, rather than model-specific
multi-image inputs.
Output-format accuracy exceeds 99.5%, and the out-of-candidate prediction rate stays below 1%
across nearly all settings. This suggests that the main difficulty comes from selecting the
correct patch under visual ambiguity, not from formatting failures.
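The two diagnostic quantities above can be computed directly from raw model outputs. A minimal sketch (names and the single-letter answer format are assumptions for illustration):

```python
import re

def format_metrics(outputs, candidates=("A", "B", "C", "D")):
    """Compute output-format accuracy and out-of-candidate rate.

    Hypothetical sketch: assumes the expected answer format is a single
    capital letter naming one of the displayed candidates.
    """
    pat = re.compile(r"^\s*([A-Z])\s*$")
    parsed = [m.group(1) if (m := pat.match(o)) else None for o in outputs]
    # Fraction of outputs that parse as a single letter at all.
    fmt_acc = sum(p is not None for p in parsed) / len(outputs)
    # Fraction of parsed outputs naming a letter outside the candidate set.
    ooc = sum(p is not None and p not in candidates for p in parsed) / len(outputs)
    return fmt_acc, ooc
```

For example, on outputs `["A", " B ", "E", "the answer is C"]` the format accuracy is 0.75 and the out-of-candidate rate is 0.25.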
Q3. Why use settings such as 8×6, 4-way, or elliptical masks?
MPF-Bench is a benchmark family rather than a single fixed slice. The released benchmark varies
along three explicit difficulty axes: grid size, candidate count, and mask shape.
The 8×6, 4-way, rectangular setting is used only as a practical working
configuration for the training study. It is not the definition of the benchmark itself.
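The three difficulty axes make the benchmark family easy to enumerate programmatically. A minimal sketch (the specific axis values listed are illustrative; only the 8×6/12×12 grids, 4/16-way counts, and rectangular/elliptical masks mentioned in this section are taken from the text):

```python
from itertools import product

# Axis values drawn from the settings discussed in this section.
GRID_SIZES = [(8, 6), (12, 12)]
CANDIDATE_COUNTS = [4, 16]
MASK_SHAPES = ["rectangular", "elliptical"]

def benchmark_slices():
    """Yield one configuration dict per slice of the benchmark family."""
    for grid, k, shape in product(GRID_SIZES, CANDIDATE_COUNTS, MASK_SHAPES):
        yield {"grid": grid, "candidates": k, "mask_shape": shape}
```

Under this sketch, the 8×6, 4-way, rectangular training configuration is simply one of the enumerated slices, not the benchmark itself.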
Q4. If MPF training pushes accuracy above 90%, does the benchmark saturate too quickly?
We view "difficult but learnable" as a feature rather than a contradiction. In zero-shot
evaluation, several strong models remain close to random chance on the hardest
12×12, 16-way setting, showing that harder slices are far from saturated.
High in-domain accuracy after targeted training demonstrates learnability on one working slice,
but it does not collapse the broader benchmark family, because new slices can still be created
by increasing candidate ambiguity and other difficulty factors.
Q5. Does MPF really test reasoning, rather than low-level seam matching?
MPF is intended to measure context-conditioned local compatibility, not "pure semantics" in
isolation. The task requires the model to identify which candidate is most consistent with the
surrounding visual context, including texture, geometry, color, material, and semantic
plausibility.
To reduce trivial cases, the pipeline filters low-information patches, samples distractors from
spatially distinct regions while preserving coarse plausibility, and records factors such as
distractor similarity and boundary ambiguity. We do not claim that all low-level cues are
eliminated, but we do claim that MPF provides a scalable and verifiable stress test for
fine-grained local reasoning under controlled ambiguity.
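The two filtering steps described above can be sketched concretely. This is a hypothetical illustration, not the paper's pipeline: the variance criterion, distance threshold, and function names are all assumptions.

```python
import numpy as np

def is_informative(patch: np.ndarray, var_threshold: float = 50.0) -> bool:
    """Filter low-information patches via pixel variance (hypothetical criterion)."""
    return float(patch.var()) > var_threshold

def sample_distractors(patches, target_idx, positions, k=3, min_dist=2.0, rng=None):
    """Sample k distractors from spatially distinct grid positions.

    Hypothetical sketch: enforces a minimum grid distance from the target
    patch and keeps only informative candidates.
    """
    rng = rng or np.random.default_rng(0)
    tx, ty = positions[target_idx]
    pool = [i for i, (x, y) in enumerate(positions)
            if i != target_idx
            and ((x - tx) ** 2 + (y - ty) ** 2) ** 0.5 >= min_dist
            and is_informative(patches[i])]
    return list(rng.choice(pool, size=k, replace=False))
```

A real pipeline would additionally record per-item factors such as distractor similarity and boundary ambiguity, so that difficulty can be controlled and reported per slice.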
Q6. Why are transfer gains on external benchmarks only moderate?
We deliberately characterize the transfer gains as moderate rather than universal. This is
consistent with our positioning: MPF-Bench is primarily an evaluation tool for a specific
fine-grained capability, and only secondarily a self-supervised training signal.
Strong in-domain gains show that the task is learnable and behaviorally meaningful, while
moderate external gains suggest that MPF captures a real but relatively narrow component of
multimodal competence rather than a universal recipe for improving all benchmarks.