Preprint · 2026

HG-Bench: A Benchmark for Multi-Page Handwritten
Answer-Region Grounding in Automated Homework Assessment

Chuangxin Zhao1, Boyan Shi2, Yanling Wang3, Yijian Lu3, Canran Xiao4, Jiali Chen3, Jun Xia5, Yan Wang3, Ji Qi3,✉, Juanzi Li3,✉

1Institute of Automation, Chinese Academy of Sciences  ·  2Beijing Jiaotong University  ·  3Tsinghua University  ·  4Sun Yat-sen University  ·  5The Hong Kong University of Science and Technology (Guangzhou)
✉ Corresponding authors

arXiv (soon) Paper (PDF) Code Dataset (soon) Checkpoint (soon) BibTeX
HG-Bench task overview and sample diversity
Figure 1. HG-Bench at a glance. Top: the unified evaluation prompt (left, purple), a representative two-page handwritten exam input with the model's predicted question-level boxes (red) and step-level boxes (blue), and the model's chain-of-thought reasoning trace identifying each answer region (right, orange). Bottom: three additional HG-Bench samples spanning a function derivation, a calculation problem, and a Chinese long-form writing answer — illustrating layout, handwriting density, and subject diversity.

Abstract

Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions linked by a hierarchical containment constraint.

We introduce HG-Bench, a benchmark of 500 human-annotated K–12 homework samples curated from a 1,489,278-image source pool, paired with a page-aware evaluation protocol that separately measures complete-answer localization ($\mathcal{F}_A$) and step-level decomposition ($\mathcal{F}_S^{\mu}$), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text.

Across nine frontier closed-source APIs (GPT-5.4, Claude-Sonnet-4.6, Doubao-Seed-2.0-Pro, Gemini-3.0-Pro-Preview) and competitive open-weight VLMs (Qwen3.5-397B-A17B, GLM-5V-Turbo, Kimi K2.5, GLM-4.6V 9B), no zero-shot system exceeds 55.22% on $\mathcal{F}_A$ or 48.22% on $\mathcal{F}_S^{\mu}$, while a single-stage GLM-4.6V 9B reference fine-tuned on ~10k in-domain examples reaches 74.97 / 72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.

500
Test Samples
Stratified across subject, grade, page count, and answer type. Curated from a 1.49M raw image pool.
9
Baselines Evaluated
5 closed-source APIs + 4 open-weight VLMs, all queried through a unified page-aware prompt.
Evaluation Levels
Question-level $\mathcal{F}_A$ for coarse answer-region detection plus step-level $\mathcal{F}_S^{\mu}$ for fine-grained decomposition.
+24.04
Step-Level F1 Headroom
Absolute gap between best zero-shot baseline (Gemini-3.0-Pro, 48.22) and our reference SFT (72.26).

The HG-Bench Dataset

Source pool & collection

HG-Bench is curated from 1,489,278 anonymized student homework images spanning multiple subjects and grade levels. From this raw pool, 10,420 valid samples are annotated and partitioned into a 500-sample held-out test set (HG-Bench) and a 9,920-sample training pool (HG-SFT), with 250 samples held out from each channel:

Annotation

12 trained annotators draw question-level and ordered step-level boxes under a strict hierarchical containment rule, with 5 senior reviewers independently accepting or returning each annotation. Inter-annotator agreement on a 50-sample subset annotated by two independent annotators: mean IoU 0.86, 85% of boxes match at $\mathrm{IoU} = 0.5$, Cohen's $\kappa = 0.83$ for binary box-matching, Fleiss' $\kappa = 0.81$ for discrete annotation fields (question-type, containment, ordering) — all in the substantial agreement band.

Figure 2. HG-Bench data engine. Four stages: (I) collection from enterprise B2B scans and a consumer photo application; (II) filtering for usability and student-answer content; (III) two-level hierarchical annotation; (IV) verification and stratified split into HG-Bench (test) and HG-SFT (train). See the paper for the full pipeline figure.

Summary statistics

500
Samples
5
Subjects
1.8
Mean pages / sample
5.2
Mean question boxes / page
2.9
Mean step boxes / question (when present)
0.34
Fraction of step-bearing pages
Figure 3. HG-Bench composition. (a) Answers per sample follow a long-tail distribution in both channels. (b) Multi-step problems are over-represented in the enterprise channel. (c) Page count per sample. (d) Question-type composition. The 500-sample benchmark is stratified along these four axes. See the paper for the full composition figure.

Task Formulation

Given a sequence of $P$ handwritten homework page images $\{\mathbf{I}_p\}_{p=1}^{P}$, the model must output a JSON array $\{q_i\}_{i=1}^{N}$ where each element corresponds to one question and contains:

Every step box must be fully contained within its parent question-level box (hierarchical consistency). A schematic example of the expected output:

[
  { "box_2d": [100, 200, 180, 300],
    "type": "complete_answer_box" },
  { "box_2d": [400, 220, 490, 320],
    "type": "complete_answer_box",
    "steps": [
      {"box_2d": [410, 230, 440, 320], "step_id": 1},
      {"box_2d": [450, 230, 480, 320], "step_id": 2}
    ]
  },
  { "box_2d": [500, 220, 580, 780],
    "type": "complete_answer_box",
    "steps": [
      {"box_2d": [510, 230, 540, 780], "step_id": 1},
      {"box_2d": [550, 230, 580, 780], "step_id": 2},
      {"box_2d": [590, 230, 620, 780], "step_id": 3}
    ]
  }
]

Page-Aware Evaluation Protocol

Unlike standard grounding tasks that flatten all bounding boxes across an entire document, HG-Bench's evaluation operates strictly at the page level: within each page, we perform greedy one-to-one matching between ground-truth $\mathcal{G}$ and predicted $\mathcal{P}$ boxes under an IoU constraint ($\mathrm{IoU} \geq 0.5$), yielding per-page TP, FP, FN counts.

Primary metrics

$\mathcal{F}_A$ — Complete-Answer Area F$_1$ (macro)

Per-page box-level F$_1$ averaged within each multi-page sample, then macro-averaged across all successfully evaluated samples:

$$\mathcal{F}_A = \frac{1}{|\mathcal{S}_{\text{succ}}|}\sum_{s \in \mathcal{S}_{\text{succ}}} \left(\frac{1}{|\mathcal{M}_s|}\sum_{m \in \mathcal{M}_s} \mathrm{F}_1(s,m)\right) \times 100$$

$\mathcal{F}_S^{\mu}$ — Step-Level Micro F$_1$

Step boxes are highly sparse (many pages contain none). We micro-aggregate step-level TP, FP, FN only over step-bearing pages to avoid bias from many empty pages, computing a single unified F$_1$.

Supplementary metrics (reported in Table 1, defined in Appendix D)

Main Results

Model $\mathcal{F}_A$ $\mathcal{F}_S^{\mu}$ $\mathcal{F}_S^{M}$ Succ% $\bar{\mathcal{S}}$ Rep%
Closed-source frontier APIs
GPT-5.414.911.551.38100.08.120.4
Claude-Sonnet-4.616.831.631.2199.28.762.8
Doubao-Seed-2.0-Pro (2026-02-15)52.6544.7842.5949.421.220.0
Doubao-Seed-2.0-Pro (2026-04-01)55.2240.1134.8899.842.700.2
Gemini-3.0-Pro-Preview50.9048.2237.58100.042.336.0
Open-weight baselines
Qwen3.5-397B-A17B42.7118.1517.7394.232.734.0
GLM-5V-Turbo46.6926.2923.78100.040.100.4
Kimi K2.531.217.427.2179.420.181.0
GLM-4.6V 9B (base)34.157.654.60100.029.463.8
Reference SFT system (Ours)
GLM-4.6V-9B + HG-SFT74.9772.2648.25100.071.530.8
Δ over best prior↑ 19.75↑ 24.04↑ 5.66↑ 28.83

Table 1. Main results on HG-Bench ($N = 500$ samples for every model). Bold = best; underline = second-best. All localization values scaled by ×100.

Key insights

1. Frontier VLMs plateau well below a trained reference, and scale alone does not close the gap.

No zero-shot baseline — closed- or open-source — exceeds 55.22 on $\mathcal{F}_A$ or 48.22 on $\mathcal{F}_S^{\mu}$. The ~10k-example reference system reaches 74.97 / 72.26, leaving headroom of roughly 20 and 24 absolute points. The 397B-parameter Qwen3.5-397B-A17B remains at 42.71 / 18.15, weaker than several smaller closed APIs and below the consumer-class GLM-5V-Turbo on $\mathcal{F}_S^{\mu}$.

2. Step-level grounding is the dominant capability gap.

Every zero-shot baseline degrades sharply from $\mathcal{F}_A$ to $\mathcal{F}_S^{\mu}$. Open-weight Kimi K2.5 and GLM-4.6V 9B fall to single-digit step scores, and even commercial GPT-5.4 and Claude-Sonnet-4.6 nearly collapse (1.55 and 1.63). The reference SFT system narrows the $\mathcal{F}_A \to \mathcal{F}_S^{\mu}$ gap to ~3 points — we accordingly recommend $\mathcal{F}_S^{\mu}$ as the primary ranking axis.

3. Targeted SFT recovers genuinely hard cases.

Of the 115 universally-missed samples at $T = 0.5$ (no baseline reaches title-F$_1 \geq 0.5$), the reference SFT reaches title-F$_1 \geq 0.5$ on 68 samples (59.1%). At the stricter $T = 0.7$ it still rescues 45.7% of the 223 universally-missed samples — the SFT delta is not a uniform shift over easy samples; it specifically recovers samples beyond the reach of frontier zero-shot systems.

Inter-model agreement

Per-sample title-F$_1$ across the nine baselines reveals a smooth difficulty gradient and confirms HG-Bench is far from saturated:

Qualitative Cases

Each case below shows the same multi-column layout: ground truth on the left, several representative baselines in the middle, our reference SFT on the right. Question-level boxes and step-level boxes are drawn in contrasting colors.

Universal rescue case
Case 1. Universally-missed sample (universal rescue). One of the 115 samples on which no zero-shot baseline reaches title-F$_1 \geq 0.5$. Every closed-source frontier API and every open-weight baseline produce highly fragmented or mis-localized boxes; the reference SFT recovers both the question-level answer regions and the ordered step decompositions.
Multi-page case
Case 2. Multi-page enterprise sample. An enterprise answer sheet spanning two pages, illustrating page-misalignment and index-shuffling failures: several baselines attribute boxes to the wrong page index or emit step IDs out of writing order across the page break.
Over-segmentation case
Case 3. Over-segmentation of step boxes. A single multi-step derivation is split into too many step boxes by several baselines (e.g., each "=" or arithmetic operator gets its own box), inflating the step count and confusing downstream per-step grading.
Under-segmentation case
Case 4. Under-segmentation of step boxes. The complementary failure: several distinct derivation steps are merged into one large box, losing the per-step grading signal even when the overall question-level box is roughly correct.

Reference SFT System

Single-stage SFT No RL No synthetic CPT ~24 GPU-hours

To verify learnability and provide a reproducible lower-bound reference, we fine-tune GLM-4.6V 9B on the 9,920-sample HG-SFT training pool using single-stage supervised fine-tuning — no reinforcement learning, no synthetic continued pre-training, and no out-of-domain data mixing. Train/test deduplication: every training image is checked against the test pool by perceptual hashing (pHash, Hamming distance $\leq 5$) plus exact metadata match; before dedup the raw training pool contained 14,264 samples; pHash filtering removed 4,344 near-duplicates, yielding the final 9,920.

Hyperparameters

Base checkpointGLM-4.6V 9B
Precisionbfloat16
Sequence length32,768
Visual tokens per imageup to 10,000 (variable shape, jitter 0.75–1.25)
OptimizerAdamW, $(\beta_1,\beta_2)=(0.9, 0.95)$, $\epsilon = 10^{-8}$
Weight decay0.1
Gradient clipping1.0
LR schedulecosine, 30-step warmup
Peak / min LR$1\times10^{-6}$ / $5\times10^{-7}$
Global batch size32 (micro-batch 1)
Training steps930 (= 3 epochs × 9,920 examples)
Hardware1 node, 8×H100-80GB
Tensor / Context / Sequence parallel2 / 2 / enabled (incl. ViT tower)
Wall-clock time~3 hours
Total compute~24 GPU-hours

Download & Reproducibility

The following artifacts are available or planned for public release under terms permitting educational research use:

Prompt template (English translation)

You are a high-precision visual annotation expert for educational
homework and exam papers. Your task is to identify each student's
handwritten answer regions in the provided images and output two
types of bounding boxes, with coordinates in [xmin, ymin, xmax, ymax]
format normalized to [0, 1000].

Box types
  (a) complete_answer_box: tightly contains the student's entire answer
      to one question. For multiple-choice and true/false items, if both
      a bubble-fill region and a hand-written letter answer are present,
      prefer the bubble-fill region.
  (b) step_box: used for multi-step answers (computation, derivation,
      multi-blank fill-in). Each step or blank is boxed separately and
      assigned a step_id starting from 1 in the student's writing order.
      Every step box must be nested inside the corresponding
      complete_answer_box.

Annotation rules
  - Box only the student's own marks; do not box printed question text
    or teacher corrections.
  - Step boxes must fully contain the handwritten content of that step
    or blank.
  - For a multi-blank item, each blank becomes a separate step box in
    left-to-right, top-to-bottom order.

Output format
  Emit a JSON list. Each element is one question-level object with
  fields box_2d, type (fixed to complete_answer_box), and an optional
  steps list whose elements each carry their own box_2d and integer
  step_id. Items must be emitted in the order the student answered.
  Coordinates must tightly bound the handwriting without cropping any
  character.

Citation

If you find HG-Bench useful in your research, please consider citing:

@article{hgbench2026,
  title   = {{HG-Bench}: A Benchmark for Multi-Page Handwritten Answer-Region Grounding
             in Automated Homework Assessment},
  author  = {Chuangxin Zhao and Boyan Shi and Yanling Wang and Yijian Lu and
             Canran Xiao and Jiali Chen and Jun Xia and Yan Wang and
             Ji Qi and Juanzi Li},
  journal = {arXiv preprint arXiv:2603.XXXXX},
  year    = {2026}
}