HG-Bench: Multi-Page Handwritten Answer-Region Grounding for Automated Homework Assessment

Abstract

Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions linked by a hierarchical containment constraint.

We introduce HG-Bench, a benchmark of 500 human-annotated K–12 homework samples curated from a 1,489,278-image source pool, paired with a page-aware evaluation protocol that separately measures complete-answer localization ($\mathcal{F}_A$) and step-level decomposition ($\mathcal{F}_S^{\mu}$), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text.

Across nine frontier closed-source APIs (GPT-5.4, Claude-Sonnet-4.6, Doubao-Seed-2.0-Pro, Gemini-3.0-Pro-Preview) and competitive open-weight VLMs (Qwen3.5-397B-A17B, GLM-5V-Turbo, Kimi K2.5, GLM-4.6V 9B), no zero-shot system exceeds 55.22% on $\mathcal{F}_A$ or 48.22% on $\mathcal{F}_S^{\mu}$, while a single-stage GLM-4.6V 9B reference fine-tuned on ~10k in-domain examples reaches 74.97 / 72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.

500

Test Samples

Stratified across subject, grade, page count, and answer type. Curated from a 1.49M raw image pool.

Baselines Evaluated

5 closed-source APIs + 4 open-weight VLMs, all queried through a unified page-aware prompt.

2×

Evaluation Levels

Question-level $\mathcal{F}_A$ for coarse answer-region detection plus step-level $\mathcal{F}_S^{\mu}$ for fine-grained decomposition.

+24.04

Step-Level F1 Headroom

Absolute gap between best zero-shot baseline (Gemini-3.0-Pro, 48.22) and our reference SFT (72.26).

The HG-Bench Dataset

Source pool & collection

HG-Bench is curated from 1,489,278 anonymized student homework images spanning multiple subjects and grade levels. From this raw pool, 10,420 valid samples are annotated and partitioned into a 500-sample held-out test set (HG-Bench) and a 9,920-sample training pool (HG-SFT), with 250 samples held out from each channel:

Enterprise channel (6,765 samples) — formal answer sheets and standardized exam grading, multi-page samples with a high proportion of multi-step problems.
Consumer channel (3,655 samples) — homework photos uploaded by general users of a public-facing consumer homework-photo application, typically single-page with in-the-wild capture variation.

Annotation

12 trained annotators draw question-level and ordered step-level boxes under a strict hierarchical containment rule, with 5 senior reviewers independently accepting or returning each annotation. Inter-annotator agreement on a 50-sample subset annotated by two independent annotators: mean IoU 0.86, 85% of boxes match at $\mathrm{IoU} = 0.5$, Cohen's $\kappa = 0.83$ for binary box-matching, Fleiss' $\kappa = 0.81$ for discrete annotation fields (question-type, containment, ordering) — all in the substantial agreement band.

Figure 2. HG-Bench data engine. Four stages: (I) collection from enterprise B2B scans and a consumer photo application; (II) filtering for usability and student-answer content; (III) two-level hierarchical annotation; (IV) verification and stratified split into HG-Bench (test) and HG-SFT (train). See the paper for the full pipeline figure.

Summary statistics

500

Samples

Subjects

1.8

Mean pages / sample

5.2

Mean question boxes / page

2.9

Mean step boxes / question (when present)

0.34

Fraction of step-bearing pages

Figure 3. HG-Bench composition. (a) Answers per sample follow a long-tail distribution in both channels. (b) Multi-step problems are over-represented in the enterprise channel. (c) Page count per sample. (d) Question-type composition. The 500-sample benchmark is stratified along these four axes. See the paper for the full composition figure.

Task Formulation

Given a sequence of $P$ handwritten homework page images $\{\mathbf{I}_p\}_{p=1}^{P}$, the model must output a JSON array $\{q_i\}_{i=1}^{N}$ where each element corresponds to one question and contains:

a fixed type field complete_answer_box,
a page index $p_i \in \{1,\dots,P\}$,
a question-level box $\mathbf{b}_i \in [0,1000]^4$ in xyxy normalized format, enclosing the full handwritten answer to the question,
an optional ordered list of step boxes $\{s_{i,j}\}_{j=1}^{K_i}$, each carrying a step_id (one-indexed in the student's writing order) and its own box $\mathbf{b}_{i,j}$.

Every step box must be fully contained within its parent question-level box (hierarchical consistency). A schematic example of the expected output:

[
  { "box_2d": [100, 200, 180, 300],
    "type": "complete_answer_box" },
  { "box_2d": [400, 220, 490, 320],
    "type": "complete_answer_box",
    "steps": [
      {"box_2d": [410, 230, 440, 320], "step_id": 1},
      {"box_2d": [450, 230, 480, 320], "step_id": 2}
    ]
  },
  { "box_2d": [500, 220, 580, 780],
    "type": "complete_answer_box",
    "steps": [
      {"box_2d": [510, 230, 540, 780], "step_id": 1},
      {"box_2d": [550, 230, 580, 780], "step_id": 2},
      {"box_2d": [590, 230, 620, 780], "step_id": 3}
    ]
  }
]

Page-Aware Evaluation Protocol

Unlike standard grounding tasks that flatten all bounding boxes across an entire document, HG-Bench's evaluation operates strictly at the page level: within each page, we perform greedy one-to-one matching between ground-truth $\mathcal{G}$ and predicted $\mathcal{P}$ boxes under an IoU constraint ($\mathrm{IoU} \geq 0.5$), yielding per-page TP, FP, FN counts.

Primary metrics

$\mathcal{F}_A$ — Complete-Answer Area F$_1$ (macro)

Per-page box-level F$_1$ averaged within each multi-page sample, then macro-averaged across all successfully evaluated samples:

$$\mathcal{F}_A = \frac{1}{|\mathcal{S}_{\text{succ}}|}\sum_{s \in \mathcal{S}_{\text{succ}}} \left(\frac{1}{|\mathcal{M}_s|}\sum_{m \in \mathcal{M}_s} \mathrm{F}_1(s,m)\right) \times 100$$

$\mathcal{F}_S^{\mu}$ — Step-Level Micro F$_1$

Step boxes are highly sparse (many pages contain none). We micro-aggregate step-level TP, FP, FN only over step-bearing pages to avoid bias from many empty pages, computing a single unified F$_1$.

Supplementary metrics (reported in Table 1, defined in Appendix D)

$\mathcal{F}_S^{M}$ — macro step F$_1$ averaged over step-bearing samples (complements $\mathcal{F}_S^{\mu}$).
$\mathrm{Succ}\%$ — parse-success rate (fraction of samples returning a valid JSON array after at most one retry).
$\bar{\mathcal{S}}$ — unified composite score averaged over all 500 samples (failed parses count as 0); exposes the cost of format failures.
$\mathrm{Rep}\%$ — fraction of outputs containing repeated textual or structural patterns (lower is better).

Main Results

Model	$\mathcal{F}_A$	$\mathcal{F}_S^{\mu}$	$\mathcal{F}_S^{M}$	Succ%	$\bar{\mathcal{S}}$	Rep%
Closed-source frontier APIs
GPT-5.4	14.91	1.55	1.38	100.0	8.12	0.4
Claude-Sonnet-4.6	16.83	1.63	1.21	99.2	8.76	2.8
Doubao-Seed-2.0-Pro (2026-02-15)	52.65	44.78	42.59	49.4	21.22	0.0
Doubao-Seed-2.0-Pro (2026-04-01)	55.22	40.11	34.88	99.8	42.70	0.2
Gemini-3.0-Pro-Preview	50.90	48.22	37.58	100.0	42.33	6.0
Open-weight baselines
Qwen3.5-397B-A17B	42.71	18.15	17.73	94.2	32.73	4.0
GLM-5V-Turbo	46.69	26.29	23.78	100.0	40.10	0.4
Kimi K2.5	31.21	7.42	7.21	79.4	20.18	1.0
GLM-4.6V 9B (base)	34.15	7.65	4.60	100.0	29.46	3.8
Reference SFT system (Ours)
GLM-4.6V-9B + HG-SFT	74.97	72.26	48.25	100.0	71.53	0.8
Δ over best prior	↑ 19.75	↑ 24.04	↑ 5.66	–	↑ 28.83	–

Table 1. Main results on HG-Bench ($N = 500$ samples for every model). Bold = best; underline = second-best. All localization values scaled by ×100.

Key insights

1. Frontier VLMs plateau well below a trained reference, and scale alone does not close the gap.

No zero-shot baseline — closed- or open-source — exceeds 55.22 on $\mathcal{F}_A$ or 48.22 on $\mathcal{F}_S^{\mu}$. The ~10k-example reference system reaches 74.97 / 72.26, leaving headroom of roughly 20 and 24 absolute points. The 397B-parameter Qwen3.5-397B-A17B remains at 42.71 / 18.15, weaker than several smaller closed APIs and below the consumer-class GLM-5V-Turbo on $\mathcal{F}_S^{\mu}$.

2. Step-level grounding is the dominant capability gap.

Every zero-shot baseline degrades sharply from $\mathcal{F}_A$ to $\mathcal{F}_S^{\mu}$. Open-weight Kimi K2.5 and GLM-4.6V 9B fall to single-digit step scores, and even commercial GPT-5.4 and Claude-Sonnet-4.6 nearly collapse (1.55 and 1.63). The reference SFT system narrows the $\mathcal{F}_A \to \mathcal{F}_S^{\mu}$ gap to ~3 points — we accordingly recommend $\mathcal{F}_S^{\mu}$ as the primary ranking axis.

3. Targeted SFT recovers genuinely hard cases.

Of the 115 universally-missed samples at $T = 0.5$ (no baseline reaches title-F$_1 \geq 0.5$), the reference SFT reaches title-F$_1 \geq 0.5$ on 68 samples (59.1%). At the stricter $T = 0.7$ it still rescues 45.7% of the 223 universally-missed samples — the SFT delta is not a uniform shift over easy samples; it specifically recovers samples beyond the reach of frontier zero-shot systems.

Inter-model agreement

Per-sample title-F$_1$ across the nine baselines reveals a smooth difficulty gradient and confirms HG-Bench is far from saturated:

At $T = 0.5$, only 3 / 500 samples (0.6%) are solved by all nine baselines; 115 / 500 (23.0%) are missed by every one.
At the stricter $T = 0.7$, the universally-missed set rises to 44.6%, while the all-passed set remains at 0.6%.
Among the 115 universally-missed samples, 59 (51%) come from enterprise multi-page inputs and 56 (49%) from consumer single-page — hard samples are not dominated by multi-page inputs; handwriting density and step decomposition remain the main residual challenges.

Qualitative Cases

Each case below shows the same multi-column layout: ground truth on the left, several representative baselines in the middle, our reference SFT on the right. Question-level boxes and step-level boxes are drawn in contrasting colors.

Case 1. Universally-missed sample (universal rescue). One of the 115 samples on which no zero-shot baseline reaches title-F$_1 \geq 0.5$. Every closed-source frontier API and every open-weight baseline produce highly fragmented or mis-localized boxes; the reference SFT recovers both the question-level answer regions and the ordered step decompositions.

Case 2. Multi-page enterprise sample. An enterprise answer sheet spanning two pages, illustrating page-misalignment and index-shuffling failures: several baselines attribute boxes to the wrong page index or emit step IDs out of writing order across the page break.

Case 3. Over-segmentation of step boxes. A single multi-step derivation is split into too many step boxes by several baselines (e.g., each "=" or arithmetic operator gets its own box), inflating the step count and confusing downstream per-step grading.

Case 4. Under-segmentation of step boxes. The complementary failure: several distinct derivation steps are merged into one large box, losing the per-step grading signal even when the overall question-level box is roughly correct.

Reference SFT System

Single-stage SFT No RL No synthetic CPT ~24 GPU-hours

To verify learnability and provide a reproducible lower-bound reference, we fine-tune GLM-4.6V 9B on the 9,920-sample HG-SFT training pool using single-stage supervised fine-tuning — no reinforcement learning, no synthetic continued pre-training, and no out-of-domain data mixing. Train/test deduplication: every training image is checked against the test pool by perceptual hashing (pHash, Hamming distance $\leq 5$) plus exact metadata match; before dedup the raw training pool contained 14,264 samples; pHash filtering removed 4,344 near-duplicates, yielding the final 9,920.

Hyperparameters

Base checkpoint	GLM-4.6V 9B
Precision	bfloat16
Sequence length	32,768
Visual tokens per image	up to 10,000 (variable shape, jitter 0.75–1.25)
Optimizer	AdamW, $(\beta_1,\beta_2)=(0.9, 0.95)$, $\epsilon = 10^{-8}$
Weight decay	0.1
Gradient clipping	1.0
LR schedule	cosine, 30-step warmup
Peak / min LR	$1\times10^{-6}$ / $5\times10^{-7}$
Global batch size	32 (micro-batch 1)
Training steps	930 (= 3 epochs × 9,920 examples)
Hardware	1 node, 8×H100-80GB
Tensor / Context / Sequence parallel	2 / 2 / enabled (incl. ViT tower)
Wall-clock time	~3 hours
Total compute	~24 GPU-hours

Download & Reproducibility

The following artifacts are available or planned for public release under terms permitting educational research use:

Evaluation harness — available in this repository: evaluate.py, prompt.py, and sample files in examples/.
HG-Bench test set — 500 multi-page samples with question-level and step-level box annotations (coming soon on Hugging Face).
HG-SFT training pool — 9,920 deduplicated samples sharing the annotation protocol with HG-Bench (coming soon).
Reference checkpoint — the fine-tuned GLM-4.6V-9B + HG-SFT model reaching 74.97 / 72.26 (coming soon).
Anonymization & ethics — all images are anonymized; no personally identifying information about individual students is exposed. Intended use is teacher-side grading assistance, not automated student evaluation.

Prompt template (English translation)

You are a high-precision visual annotation expert for educational
homework and exam papers. Your task is to identify each student's
handwritten answer regions in the provided images and output two
types of bounding boxes, with coordinates in [xmin, ymin, xmax, ymax]
format normalized to [0, 1000].

Box types
  (a) complete_answer_box: tightly contains the student's entire answer
      to one question. For multiple-choice and true/false items, if both
      a bubble-fill region and a hand-written letter answer are present,
      prefer the bubble-fill region.
  (b) step_box: used for multi-step answers (computation, derivation,
      multi-blank fill-in). Each step or blank is boxed separately and
      assigned a step_id starting from 1 in the student's writing order.
      Every step box must be nested inside the corresponding
      complete_answer_box.

Annotation rules
  - Box only the student's own marks; do not box printed question text
    or teacher corrections.
  - Step boxes must fully contain the handwritten content of that step
    or blank.
  - For a multi-blank item, each blank becomes a separate step box in
    left-to-right, top-to-bottom order.

Output format
  Emit a JSON list. Each element is one question-level object with
  fields box_2d, type (fixed to complete_answer_box), and an optional
  steps list whose elements each carry their own box_2d and integer
  step_id. Items must be emitted in the order the student answered.
  Coordinates must tightly bound the handwriting without cropping any
  character.

HG-Bench: A Benchmark for Multi-Page Handwritten
Answer-Region Grounding in Automated Homework Assessment