AprNes - Testing Methodology

1. What We Test

AprNes uses hardware verification test ROMs written to validate NES behavior against real consoles. These are actual NES programs (6502 machine code) that exercise specific hardware features and report pass/fail. The same ROMs are widely used by other NES emulator projects to measure accuracy.

Test coverage spans all major NES subsystems:

CPU — All official 6502 instructions, addressing modes, timing, dummy reads/writes, interrupt behavior (NMI/IRQ/BRK interactions), branch timing, reset behavior
PPU — VBlank/NMI timing, sprite 0 hit, sprite overflow, open bus behavior, palette RAM, VRAM access, read buffer, even/odd frame toggle
APU — Frame counter timing (mode 0/1), length counter, IRQ flag timing, clock jitter, DMC rates, channel mixing, reset behavior
Mapper — MMC3 (Mapper 004) IRQ counter clocking, A12 detection, scanline timing, Rev A/B behavior
I/O — Controller reading accuracy, DPCM interference, DMA timing

2. Test ROM Sources

All ROMs come from the nes-test-roms collection, primarily authored by:

blargg (Shay Green) — The most comprehensive NES test ROM author. His suites cover CPU instructions (v3/v5), APU timing, PPU VBL/NMI, sprite hit/overflow, and more. Two result protocols: legacy (screen-based) and modern ($6000 memory-mapped).
bisqwit — CPU timing tests
other community authors — MMC3 IRQ tests, DMA interaction tests, controller tests

These are the same ROMs used by Mesen, Nestopia, FCEUX, and other reference emulators. Passing them indicates cycle-accurate or near-cycle-accurate emulation.

3. Test Runner Architecture

Headless Mode

The emulator has a built-in TestRunner.cs that runs in headless mode — no window, no audio, no frame rate limiter. The CPU/PPU/APU all tick at maximum speed. A single test ROM typically completes in under 1 second.

Two Scripts, Two Purposes

The testing workflow uses two complementary scripts with identical test lists (174 ROMs):

run_tests.sh — Lightweight validation script. Runs all tests and prints PASS/FAIL to stdout with a failure summary. Used for quick regression checks during development.
run_tests_report.sh — Full-featured report generator with modular output controlled by command-line flags.
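The core loop of a lightweight validation script can be sketched as below. This is a minimal illustration, not the actual contents of run_tests.sh: the function name run_suite and the idea of passing the runner as a command are assumptions made for the example.

```shell
# run_suite RUNNER ROM...
# RUNNER is any command (hypothetical here) that takes a ROM path and
# exits 0 on pass, non-zero otherwise. Prints PASS/FAIL per ROM, then a
# summary line, and returns non-zero if any ROM failed.
run_suite() {
    local runner="$1" pass=0 fail=0 rom
    shift
    for rom in "$@"; do
        if "$runner" "$rom" >/dev/null 2>&1; then
            pass=$((pass + 1))
            echo "PASS: $rom"
        else
            fail=$((fail + 1))
            echo "FAIL: $rom"
        fi
    done
    echo "$pass PASS / $fail FAIL"
    [ "$fail" -eq 0 ]
}
```

In practice the runner would be a small wrapper function that invokes AprNes.exe with --wait-result for the given ROM, so the loop only has to look at the exit code.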

Report Script Modes

run_tests_report.sh always prints PASS/FAIL results to stdout. Optional flags control additional outputs:

bash run_tests_report.sh                        # Quick: stdout only
bash run_tests_report.sh --json                 # + save report/results.json
bash run_tests_report.sh --screenshots          # + capture screenshots (PNG→WebP)
bash run_tests_report.sh --json --screenshots   # Full: JSON + screenshots + HTML report
bash run_tests_report.sh --no-build             # Skip MSBuild compilation step

With --json and --screenshots both enabled, the full pipeline performs the following steps:

Build the project with MSBuild (unless --no-build)
Run each of 174 ROM files through the headless emulator
Print PASS/FAIL to stdout for every test
If --screenshots: capture the final frame as PNG, convert to lossless WebP
If --json: collect results into report/results.json
If both --json and --screenshots: generate a single-file HTML report (report/index.html) with embedded data and screenshot references
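The flag handling described above could look roughly like this. It is a sketch under assumptions, not the real run_tests_report.sh: the function name parse_report_flags and the variables DO_JSON, DO_SHOTS, DO_BUILD, and DO_HTML are invented for illustration.

```shell
# Hypothetical sketch of the report script's option handling.
# Sets global variables; all names are illustrative.
parse_report_flags() {
    DO_JSON=0; DO_SHOTS=0; DO_BUILD=1; DO_HTML=0
    for arg in "$@"; do
        case "$arg" in
            --json)        DO_JSON=1 ;;
            --screenshots) DO_SHOTS=1 ;;
            --no-build)    DO_BUILD=0 ;;
            *) echo "unknown option: $arg" >&2; return 2 ;;
        esac
    done
    # The HTML report needs both the JSON data and the screenshots.
    if [ "$DO_JSON" -eq 1 ] && [ "$DO_SHOTS" -eq 1 ]; then
        DO_HTML=1
    fi
    return 0
}
```

Deriving DO_HTML from the other two flags, instead of adding a separate --html option, mirrors the rule that the HTML report is only produced when both of its inputs exist.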

4. Result Detection Mechanisms

Mechanism A: $6000 Memory Protocol

Modern blargg test ROMs use a memory-mapped status protocol. The test runner polls address $6000 every frame:

$6000 Value	Meaning	Action
`$80`	Test running	Continue waiting
`$81`	Reset requested	Wait 100ms, then soft reset
`$00`	Test passed	Exit with code 0
`$01-$7F`	Test failed (error code N)	Exit with code N

Result text is read from $6004+ as null-terminated ASCII. This gives detailed error messages like "Flag first set too late" or "Length counter not clocked correctly".
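The status table above is effectively a small decision function. The sketch below expresses it in shell purely for illustration; the real polling happens inside TestRunner.cs, and the helper name status_action is hypothetical. Values above $81 are not covered by the table, so the sketch reports them as unknown.

```shell
# status_action HEXBYTE
# Maps a $6000 status byte (hex, with or without a 0x prefix) to the
# action from the table. Illustrative helper, not part of AprNes.
status_action() {
    local value
    value=$(( 0x${1#0x} ))
    if   [ "$value" -eq 128 ]; then echo "wait"          # $80: test running
    elif [ "$value" -eq 129 ]; then echo "reset"         # $81: reset requested
    elif [ "$value" -eq 0 ];   then echo "pass"          # $00: test passed
    elif [ "$value" -le 127 ]; then echo "fail $value"   # $01-$7F: error code
    else echo "unknown"                                  # not in the table
    fi
}
```

For example, `status_action 0x81` selects the reset path, while `status_action 03` reports a failure with error code 3, which then becomes the process exit code.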

Mechanism B: Screen Stability Detection

Older blargg tests (2005 era) don't use the $6000 protocol. They render results directly to the PPU nametable. The test runner handles these with a multi-step heuristic:

After 120 frames (~2 sec), start sampling the screen buffer every frame
Compute a hash of the framebuffer (sampling every 37th pixel for speed)
When the hash stays identical for 90 consecutive frames (~1.5 sec), the screen is "stable"
Scan the PPU nametable (character map) for known result strings:
- "Passed" / "PASSED" → PASS
- "Failed" / "FAILED" → FAIL
- "$01" (hex on screen) → PASS
- "$02" ~ "$FF" (hex on screen) → FAIL
- "All tests complete" → PASS
- " 0/" (zero error count) → PASS

This approach reads the PPU nametable directly (not OCR on pixels), making it fast and reliable.
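The stability check in the steps above boils down to counting consecutive identical hashes after a warm-up period. The following sketch shows that counting logic only; it assumes the per-frame hashes are already available as one line of text per frame, and the function name find_stable_frame is hypothetical.

```shell
# find_stable_frame WARMUP NEEDED
# Reads one framebuffer hash per line (one line per frame) from stdin.
# Prints the 1-based frame number at which the hash has been identical
# for NEEDED consecutive sampled frames, skipping the first WARMUP frames.
# Returns 1 if the input ends before the screen becomes stable.
find_stable_frame() {
    local warmup="$1" needed="$2"
    local frame=0 run=0 prev="" hash
    while IFS= read -r hash; do
        frame=$((frame + 1))
        if [ "$frame" -le "$warmup" ]; then
            continue
        fi
        if [ "$hash" = "$prev" ]; then
            run=$((run + 1))
        else
            run=1            # a new hash starts a new run of length 1
            prev="$hash"
        fi
        if [ "$run" -ge "$needed" ]; then
            echo "$frame"
            return 0
        fi
    done
    return 1
}
```

With the parameters from the text (`find_stable_frame 120 90`), a screen that never changes would be declared stable at frame 210 in this sketch, since frames 121 through 210 form the first run of 90 samples.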

Mechanism C: CRC Matching

Some test ROMs produce results that depend on random CPU-PPU synchronization at power-on, generating one of several valid CRC values. These tests display a CRC on screen but cannot use a single check_crc call. The --expected-crc parameter accepts a comma-separated list of valid CRCs:

--expected-crc "159A7A8F,5E3DF9C4"

The test runner scans the PPU nametable for an 8-character hexadecimal string (bounded by non-hex characters to avoid partial matches). If found, it compares case-insensitively against the expected set. This mechanism is used in both the screen stability detection path and the timeout fallback path. Currently used by:

dma_2007_read — 2 valid CRCs (CPU-PPU sync dependent)
double_2007_read — 4 valid CRCs (CPU-PPU sync dependent)
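The matching rule can be illustrated with a short shell sketch. It assumes the nametable has already been decoded to ASCII text (the input format is an assumption), and the function name crc_matches is hypothetical. Splitting the text on every non-hex character yields maximal runs of hex digits, so a run of exactly 8 digits is a CRC bounded by non-hex characters.

```shell
# crc_matches TEXT EXPECTED_LIST
# TEXT: screen text decoded to ASCII (assumed input format).
# EXPECTED_LIST: comma-separated CRCs, as passed to --expected-crc.
# Succeeds if TEXT contains a standalone 8-digit hex token in the list.
crc_matches() {
    local text="$1" expected token
    # Uppercase both sides so the comparison is case-insensitive.
    expected=",$(printf '%s' "$2" | tr 'a-f' 'A-F'),"
    for token in $(printf '%s' "$text" | tr 'a-f' 'A-F' | tr -c '0-9A-F' '\n'); do
        if [ "${#token}" -eq 8 ]; then
            case "$expected" in
                *",$token,"*) return 0 ;;
            esac
        fi
    done
    return 1
}
```

A 9-digit run such as `0159A7A8F` is rejected because its length is not exactly 8, which is the partial-match case the boundary rule is meant to exclude.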

5. Automation Features

Auto Soft Reset

Some test ROMs write $81 to $6000 to request a console reset (testing power-on/reset behavior). The runner detects this and automatically performs a soft reset after a 100ms delay, mimicking a human pressing the reset button. Supports up to 10 sequential resets per ROM.

Simulated Controller Input

Controller read tests need actual button presses. The --input parameter schedules timed button events:

--input "A:2.0,B:4.0,Select:6.0,Start:8.0,Up:10.0,Down:12.0,Left:14.0,Right:16.0"

Each button is pressed at the specified time (seconds) and held for 10 frames (~166ms). This lets tests like read_joy3/test_buttons verify that all 8 buttons are correctly detected in sequence.
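To make the timing concrete, the sketch below converts an --input spec into press and release frame numbers. The conversion is an assumption for illustration: it uses a nominal 60 frames per second and rounds to the nearest frame, and the function name schedule_input is hypothetical; the real scheduler in TestRunner.cs may differ.

```shell
# schedule_input SPEC [FPS] [HOLD_FRAMES]
# Prints one line per button: NAME PRESS_FRAME RELEASE_FRAME.
# FPS defaults to 60 (nominal) and HOLD_FRAMES to 10.
schedule_input() {
    local spec="$1" fps="${2:-60}" hold="${3:-10}"
    printf '%s\n' "$spec" | tr ',' '\n' |
        awk -F: -v fps="$fps" -v hold="$hold" '
            NF == 2 {
                press = int($2 * fps + 0.5)   # seconds to nearest frame
                print $1, press, press + hold
            }'
}
```

For the spec `A:2.0,B:4.0` this yields a press at frame 120 and release at frame 130 for A, then frames 240 and 250 for B.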

Screenshot Capture (optional)

When enabled with --screenshots, the final frame of each test is captured as a 256x240 PNG, then converted to lossless WebP (typically 60-80% smaller). Screenshots serve as visual evidence — many test ROMs display their results on screen as text, showing exactly what passed or failed. For quick regression checks, screenshots can be skipped to save time.

Timeout Safety

Each ROM has a configurable --max-wait timeout (default 30 seconds, 120 for merged/multi-sub-test ROMs). If a test ROM enters an infinite loop or hangs, the runner terminates it gracefully and reports the last known state.

6. Test Suite Coverage

Suite	Tests	Description
`apu_mixer`	4	Channel mixing
`apu_reset`	6	APU power/reset
`apu_test`	9	APU frame counter
`blargg_apu_2005`	11	APU timing
`blargg_cpu_test5`	2	CPU instructions
`blargg_ppu_tests`	5	PPU basics
`branch_timing`	3	Branch cycle count
`cpu_dummy_reads`	1	Dummy read cycles
`cpu_dummy_writes`	2	Dummy write cycles
`cpu_exec_space`	2	Execution from I/O
`cpu_interrupts_v2`	6	NMI/IRQ interaction
`cpu_reset`	2	CPU reset behavior
`cpu_timing_test6`	1	Instruction timing
`dmc_dma_during_read`	5	DMC DMA conflicts
`instr_misc`	5	Misc instruction tests
`instr_test-v3`	17	All 6502 instructions
`instr_test-v5`	18	All 6502 instructions (v5)
`instr_timing`	3	Instruction cycle timing
`mmc3_irq_tests`	6	MMC3 IRQ counter
`mmc3_test`	6	MMC3 behavior
`mmc3_test_2`	6	MMC3 behavior (v2)
`nes_instr_test`	11	CPU instructions (alt)
`oam_read`	1	OAM read behavior
`ppu_open_bus`	1	PPU open bus
`ppu_read_buffer`	1	PPU read buffer
`ppu_vbl_nmi`	11	VBlank/NMI timing
`read_joy3`	4	Controller reading
`sprdma_and_dmc_dma`	2	DMA conflicts
`sprite_hit_tests`	11	Sprite 0 hit
`sprite_overflow`	5	Sprite overflow
`vbl_nmi_timing`	7	VBL/NMI timing

7. Command-Line Interface

The headless test runner is invoked directly via the emulator executable:

AprNes.exe --rom <file.nes> [options]

Option	Description
`--rom <path>`	ROM file to load (required)
`--wait-result`	Monitor $6000 / screen for test result
`--max-wait <sec>`	Timeout in seconds (default: 30)
`--time <sec>`	Run for exactly N seconds, then stop
`--screenshot <path>`	Save final frame as PNG
`--log <path>`	Write result line to file
`--soft-reset <sec>`	Trigger soft reset at N seconds
`--input <spec>`	Schedule button presses (e.g. "A:2.0,B:4.0")
`--expected-crc <list>`	Comma-separated valid CRCs for CRC-only tests (e.g. "159A7A8F,5E3DF9C4")
`--debug-log <path>`	Write CPU trace log

Exit codes: 0 = pass, 1-127 = fail (test error code), 255 = timeout/no result.
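A calling script can turn these exit codes into a readable verdict with a few lines of shell. The helper below is a sketch; the name classify_exit and the exact label strings are assumptions, and codes outside the documented ranges are reported as unknown.

```shell
# classify_exit CODE
# Maps an AprNes exit code to a verdict, following the ranges above.
# Helper name and output wording are illustrative.
classify_exit() {
    case "$1" in
        0)   echo "PASS" ;;
        255) echo "TIMEOUT" ;;
        *)
            if [ "$1" -ge 1 ] && [ "$1" -le 127 ]; then
                echo "FAIL (error code $1)"
            else
                echo "UNKNOWN ($1)"
            fi ;;
    esac
}

# Typical use after a headless run:
#   AprNes.exe --rom "$rom" --wait-result --max-wait 30
#   classify_exit $?
```

Keeping timeouts distinct from failures is useful in a report, because a timeout usually means the ROM hung or never printed a result, while a failure code points at a specific sub-test.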

8. Emulator Design Requirements for QA Integration

To adopt this automated QA workflow, an emulator must expose a set of design interfaces. Below is a reference based on AprNes's architecture — the pattern is applicable to any NES emulator regardless of language.

8.1 Dual Entry Point: GUI vs Headless

The emulator should support two modes of operation from a single executable. When command-line arguments are present, it enters headless test mode; otherwise it launches the normal GUI.

// Program.cs — entry point
static int Main(string[] args)
{
    if (args.Length > 0)
        return TestRunner.Run(args);    // headless mode, returns exit code

    Application.Run(new MainForm());    // normal GUI mode
    return 0;
}

Key point: Main() returns int (not void) so the exit code can signal pass/fail to the calling script.

8.2 Headless Mode Flags

The emulator core needs static flags to suppress GUI and audio subsystems when running in test mode:

Flag	Purpose	Effect
`HeadlessMode = true`	Suppress window creation	No Form/Window is instantiated; rendering still runs to fill the framebuffer, but no display output occurs
`AudioEnabled = false`	Suppress audio output	APU still ticks (needed for timing tests), but no audio device is opened
`LimitFPS = false`	Remove frame rate limiter	Emulation runs at maximum CPU speed; a 30-second test completes in <1 second wall time
`exit = true`	Signal the main loop to stop	Set by the test runner when a result is detected; the `run()` loop checks this flag each frame

These flags allow the CPU/PPU/APU to continue ticking normally — only the I/O endpoints (display, speakers) are disabled. This ensures that timing-sensitive tests produce identical results in headless and GUI modes.

8.3 Per-Frame Callback (VideoOutput Event)

The emulator must provide a per-frame hook that fires after each frame is fully rendered. The test runner subscribes to this event to poll for results:

// In NesCore (emulator core):
public static event EventHandler VideoOutput;

// Fired once per PPU frame, after the last visible scanline has been rendered (scanline 240, just before VBlank begins):
VideoOutput?.Invoke(null, null);

// In TestRunner:
NesCore.VideoOutput += (sender, e) => {
    frameCount++;
    byte status = NesCore.NES_MEM[0x6000];  // poll test protocol
    // ... detect result, set NesCore.exit = true when done
};

This callback-driven design avoids tight polling loops and integrates cleanly with both GUI (refresh display) and headless (check test status) modes.

8.4 Memory and PPU RAM Access

The test runner needs direct read access to three memory regions:

Memory Region	Access Pattern	Purpose
`NES_MEM[0x6000..0x6FFF]`	CPU address space (WRAM)	Read $6000 status byte and $6004+ result text (blargg protocol)
`ppu_ram[0x2000..0x23BF]`	PPU nametable 0	Scan for "Passed"/"Failed" text on screen (older test ROMs)
`ScreenBuf1x[0..61439]`	Rendered framebuffer (256x240 ARGB)	Screen stability hash + screenshot capture

These must be exposed as static pointers or arrays — no copying per frame. The test runner reads them synchronously inside the VideoOutput callback, so thread safety is guaranteed by the frame boundary.

8.5 Soft Reset API

Some test ROMs request a console reset by writing $81 to $6000. The emulator must expose a SoftReset() method that resets CPU/APU state without reloading the ROM:

public static void SoftReset()
{
    // Reset CPU: read reset vector from $FFFC/$FFFD, clear registers
    // Reset APU: reinitialize frame counter, silence channels
    // Do NOT reset PPU fully (some tests depend on PPU state surviving reset)
    // Do NOT unload the ROM or reinitialize mapper
}

This is distinct from a hard reset (power cycle). The test runner calls SoftReset() after a 100ms delay (~6 frames) when it detects $6000 == $81.

8.6 Controller Input Injection

Controller tests need programmatic button presses. The emulator must expose button press/release methods:

public static void P1_ButtonPress(byte buttonIndex);   // 0=A,1=B,2=Sel,3=Start,4=Up,5=Down,6=Left,7=Right
public static void P1_ButtonUnPress(byte buttonIndex);

The test runner schedules events by frame number. Each button is pressed at the specified frame and released after a configurable hold duration (default 10 frames ≈ 166ms).

8.7 ROM Loading API

The emulator needs a simple byte-array-based ROM loading interface:

public static bool init(byte[] rom_bytes);  // parse iNES header, set up mapper, reset CPU
public static void run();                    // main emulation loop (blocks until exit==true)

init() returns false for unsupported mappers or corrupt headers. run() is called on a background thread by the test runner, which waits for completion via Thread.Join().

8.8 Architecture Summary

Program.cs (entry point; GUI / headless fork) → TestRunner.cs (argument parsing, frame callback, result detection) → NesCore (init() / run(), HeadlessMode flags, VideoOutput event)

The design principle is minimal coupling: the test runner interacts with the emulator core through a small set of touch points (4 flags, 1 event, 3 memory regions, and 5 static methods covering reset, input injection, ROM loading, and the main loop). The emulator core requires zero knowledge of the test runner: it simply checks HeadlessMode to skip GUI creation and fires VideoOutput each frame. All test logic lives in TestRunner.cs.

Interface	Direction	Type
`HeadlessMode`	TestRunner → Core	Static bool flag
`AudioEnabled`	TestRunner → Core	Static bool flag
`LimitFPS`	TestRunner → Core	Static bool flag
`exit`	TestRunner → Core	Static bool flag
`VideoOutput`	Core → TestRunner	Event (per-frame callback)
`NES_MEM / ppu_ram / ScreenBuf1x`	Core → TestRunner	Static memory pointers (read-only)
`SoftReset()`	TestRunner → Core	Static method
`P1_ButtonPress/UnPress()`	TestRunner → Core	Static methods
`init(byte[]) / run()`	TestRunner → Core	Static methods

This architecture makes the QA system portable: any NES emulator that exposes these interfaces can use the same bash scripts and test ROM collection for automated regression testing, regardless of its internal implementation.

9. AI-Assisted Development Workflow with Claude Code

9.1 Overview

AprNes uses Claude Code (Anthropic's CLI agent) as an AI pair-programmer that directly invokes the test shell scripts, reads failure output, diagnoses root causes, edits source code, and verifies fixes — all within a single iterative loop. The test infrastructure described in Sections 1-8 is the foundation that makes this workflow possible.

TODO.md (pick next bug) → Analyse (run failing tests, read test ROM source) → Plan (root cause + fix design) → Implement (edit emulator code) → Verify (build & run tests) → Document (bugfix/ + TODO.md, git commit & push)

9.2 Prerequisites

Claude Code CLI — installed and authenticated (npm install -g @anthropic-ai/claude-code)
Project memory — .claude/ directory with MEMORY.md recording architecture, build commands, and conventions so the agent retains context across sessions
Test scripts — run_tests.sh (quick validation) and run_tests_report.sh (full report with JSON/screenshots) on PATH
Test ROMs — nes-test-roms-master/checked/ containing all test suites
TODO.md — prioritised bug list with failure details, serving as the task queue

9.3 Step-by-Step Workflow

Phase 1: Task Selection

The developer tells Claude Code to read TODO.md and continue with the next task. Claude picks the highest-priority unfinished bug and identifies the relevant failing tests.

User: Read TODO.MD and continue with the remaining tasks
Claude: [reads TODO.md, identifies "Bug G — Sprite timing" as next target]
        [runs 4 failing tests with --screenshots to capture current state]

Phase 2: Root Cause Analysis

Claude uses the test infrastructure to diagnose the bug:

Run failing test ROMs individually via the headless CLI, capturing exit codes and screen output
Read test ROM assembly source to understand exactly what the test expects (cycle counts, flag timing, etc.)
Read emulator source code (PPU.cs, CPU.cs, APU.cs, etc.) to find the discrepancy
Cross-reference NES hardware documentation (nesdev wiki) with the emulator implementation

# Claude runs each failing test to see the exact failure message
./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
  --rom nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/09.timing_basics.nes
# → FAIL #3: "upper-left corner too late"

Phase 3: Plan Mode

For non-trivial fixes, Claude enters plan mode — a read-only state where it designs the implementation strategy without modifying any files. The plan includes:

Root cause explanation with hardware timing details
Specific code changes with line-level locations
Regression risk analysis (which existing PASS tests might break)
Verification commands to run after implementation

The developer reviews and approves the plan before any code is changed.

Phase 4: Implementation

Claude edits the emulator source files using precise text replacements. Each change is targeted and minimal — only modifying what the plan specified.

Phase 5: Build & Verify

Claude invokes the build toolchain and test scripts directly:

# 1. Build
powershell -NoProfile -Command "MSBuild.exe AprNes.sln /p:Configuration=Debug /t:Rebuild"

# 2. Run target tests (the ones we're trying to fix)
./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
  --rom nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/09.timing_basics.nes

# 3. Run regression suite (related tests that might break)
for rom in 01.basics.nes 02.alignment.nes ... 11.edge_timing.nes; do
  ./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
    --rom "nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/$rom"
done

# 4. Full regression (all 174 tests)
bash run_tests.sh

If any test fails unexpectedly, Claude reads the failure output, adjusts the fix, and re-runs. This inner loop repeats until all target tests pass with zero regressions.

Phase 6: Documentation & Commit

Once verified, Claude:

Updates TODO.md — marks the bug as completed, updates the baseline count
Creates bugfix/YYYY-MM-DD_BUGFIXNN.md — detailed record of problem, root cause, fix, and test results
Commits with a descriptive message: fix sprite timing: per-pixel hit + cycle-accurate overflow: 165 PASS / 9 FAIL (+4)
Pushes to remote

Phase 7: Report Generation (on demand)

The developer can request a full HTML report at any time:

bash run_tests_report.sh --json --screenshots
# → report/index.html (interactive dashboard with screenshots)

9.4 Key Files in the Workflow

File	Role
`TODO.md`	Prioritised bug list — the task queue that drives each session
`.claude/memory/MEMORY.md`	Persistent agent memory: architecture, build commands, conventions
`bugfix/YYYY-MM-DD_BUGFIXNN.md`	Per-fix documentation: problem, root cause, changes, verification
`run_tests.sh`	Quick validation — runs all 174 tests, outputs PASS/FAIL counts
`run_tests_report.sh`	Full report — JSON data + screenshots + HTML dashboard
`report/index.html`	Generated interactive test report with filterable results

9.5 Why This Works

Deterministic feedback — Test ROMs produce binary PASS/FAIL with exact failure descriptions (e.g. "upper-left corner too late"), giving the AI precise signals to work with
Headless CLI — The emulator's --wait-result mode returns machine-readable exit codes (0 = pass, 1-127 = test error code, 255 = timeout/no result), enabling automated verification loops
Structured knowledge — TODO.md provides prioritised task context; MEMORY.md carries architectural knowledge across sessions; bugfix/ records capture the reasoning behind each change
Fast iteration — Build + single test takes ~5 seconds; full 174-test regression under 10 minutes. The AI can try multiple approaches in one session
Human oversight — Plan mode ensures the developer reviews the approach before any code changes. Git commits are explicit checkpoints

9.6 Example Session: BUGFIX17 (Sprite Timing)

A single Claude Code session progressed from 161 PASS → 165 PASS (+4) by:

Reading TODO.md → identified Bug G (4 failing sprite tests)
Running each failing test → captured exact failure messages (#3 "too late", #5 "set too late", #2 "byte offset bug")
Reading test ROM assembly source (09.timing_basics.asm, 3.Timing.a, 4.Obscure.a) → understood expected cycle counts
Reading PPU.cs → found batch rendering at phase 7 (7-dot latency) and cycle-257 overflow detection
Entered plan mode → designed 3 fixes: per-pixel hit detection, cycle-accurate overflow, hardware overflow bug
Implemented all 3 fixes in PPU.cs (~+120 lines, -40 lines)
Built → ran 4 target tests (all PASS) → ran sprite regression (14 tests, all PASS) → ran full suite (165 PASS / 9 FAIL)
Updated TODO.md, created BUGFIX17.md, committed and pushed

Total: 3 root causes identified, 3 fixes implemented, 0 regressions — all within a single conversation.

9.7 Example Session: BUGFIX19 (DMC DMA Cycle Stealing)

A multi-session Claude Code workflow progressed from 169 PASS → 171 PASS (+2) by tackling the most architecturally challenging bug — DMC DMA cycle stealing:

Read TODO.md → identified Bug F (5 failing DMC DMA tests, marked "requires architecture refactor")
Studied NESdev Wiki DMA reference (from ref/ directory) → discovered Load vs Reload DMA distinction
Read test ROM assembly sources (sync_dmc.s, sprdma_and_dmc_dma.s, dma_2007_read.s, dma_4016_read.s)
Entered plan mode → designed 5-part fix: CPU bus state tracking, PPU-only stolen tick, Load/Reload cycle model, phantom reads, OAM DMA bus tracking
Implemented across 3 core files (MEM.cs, APU.cs, PPU.cs) → +200 lines of cycle-accurate DMA emulation
Discovered 2 tests (dma_2007_read, double_2007_read) produce correct CRC but never print "Passed" → added --expected-crc parameter to TestRunner.cs for CRC-only test support
Iterative debugging: parity-based model failed sync_dmc convergence → replaced with Load/Reload type-based model
Full regression: 171 PASS / 3 FAIL, 0 regressions

This fix required understanding NES DMA at the bus-cycle level: GET/PUT cadence, halt scheduling, write-delay rules, and phantom read /OE contiguity. The one remaining DMC test failure (double_2007_read) was identified as a separate PPU buffer latching issue, not a DMA bug.
