AprNes Testing Methodology

NES emulator accuracy verification through hardware test ROMs

174 Test ROMs · 30+ Test Suites · 5 Subsystems · <3 min Full Run

1. What We Test

AprNes uses hardware verification test ROMs originally written to validate real NES console behavior. These are actual NES programs (6502 machine code) that exercise specific hardware features and report pass/fail. The same ROMs are used by every major NES emulator project to measure accuracy.

Test coverage spans all major NES subsystems; Section 6 lists the suites and what each one covers.

2. Test ROM Sources

All ROMs come from the nes-test-roms collection; blargg's suites, referenced throughout this document, feature prominently among those used here.

These are the same ROMs used by Mesen, Nestopia, FCEUX, and other reference emulators. Passing them indicates cycle-accurate or near-cycle-accurate emulation.

3. Test Runner Architecture

Bash Script (run_tests_report.sh) → MSBuild (compile emulator) → Headless Emulator (TestRunner.cs) → Result Detection ($6000 / screen scan) → Output (stdout / JSON / HTML)

Headless Mode

The emulator has a built-in TestRunner.cs that runs in headless mode — no window, no audio, no frame rate limiter. The CPU/PPU/APU all tick at maximum speed. A single test ROM typically completes in under 1 second.

Two Scripts, Two Purposes

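The testing workflow uses two complementary scripts with identical test lists (174 ROMs): run_tests.sh for quick validation (PASS/FAIL counts) and run_tests_report.sh for the full report (JSON data, screenshots, HTML dashboard).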

Report Script Modes

run_tests_report.sh always prints PASS/FAIL results to stdout. Optional flags control additional outputs:

bash run_tests_report.sh                        # Quick: stdout only
bash run_tests_report.sh --json                 # + save report/results.json
bash run_tests_report.sh --screenshots          # + capture screenshots (PNG→WebP)
bash run_tests_report.sh --json --screenshots   # Full: JSON + screenshots + HTML report
bash run_tests_report.sh --no-build             # Skip MSBuild compilation step

The full pipeline (with all flags) performs:

  1. Build the project with MSBuild (unless --no-build)
  2. Run each of 174 ROM files through the headless emulator
  3. Print PASS/FAIL to stdout for every test
  4. If --screenshots: capture the final frame as PNG, convert to lossless WebP
  5. If --json: collect results into report/results.json
  6. If both --json and --screenshots: generate a single-file HTML report (report/index.html) with embedded data and screenshot references

4. Result Detection Mechanisms

Mechanism A: $6000 Memory Protocol

Modern blargg test ROMs use a memory-mapped status protocol. The test runner polls address $6000 every frame:

$6000 Value | Meaning | Action
$80 | Test running | Continue waiting
$81 | Reset requested | Wait 100ms, then soft reset
$00 | Test passed | Exit with code 0
$01-$7F | Test failed (error code N) | Exit with code N

Result text is read from $6004+ as null-terminated ASCII. This gives detailed error messages like "Flag first set too late" or "Length counter not clocked correctly".
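
The status table and the $6004 text read can be sketched in a few lines of C#. This is a minimal illustration, not the code in TestRunner.cs: the names BlarggProtocol, TestState, Classify and ReadMessage are hypothetical, and the memory image is assumed to be a plain byte array indexed by CPU address.

```csharp
using System;
using System.Text;

public enum TestState { Running, ResetRequested, Passed, Failed }

public static class BlarggProtocol
{
    // Maps the status byte at $6000 to a runner state, following the table above.
    public static TestState Classify(byte status)
    {
        if (status == 0x80) return TestState.Running;
        if (status == 0x81) return TestState.ResetRequested;
        if (status == 0x00) return TestState.Passed;
        if (status <= 0x7F) return TestState.Failed;   // $01-$7F: status is the error code
        // Values above $81 are not defined in the table; this sketch keeps waiting (assumption).
        return TestState.Running;
    }

    // Reads the null-terminated ASCII message that starts at $6004.
    // maxLen is a safety bound chosen for this sketch.
    public static string ReadMessage(byte[] mem, int start = 0x6004, int maxLen = 0x1000)
    {
        var sb = new StringBuilder();
        for (int i = start; i < mem.Length && i < start + maxLen && mem[i] != 0; i++)
            sb.Append((char)mem[i]);
        return sb.ToString();
    }
}
```

A runner would call Classify on every frame and only call ReadMessage once the state is Passed or Failed, since the text is not complete while the test is still running.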

Mechanism B: Screen Stability Detection

Older blargg tests (2005 era) don't use the $6000 protocol. They render results directly to the PPU nametable. The test runner handles these with a multi-step heuristic:

  1. After 120 frames (~2 sec), start sampling the screen buffer every frame
  2. Compute a hash of the framebuffer (sampling every 37th pixel for speed)
  3. When the hash stays identical for 90 consecutive frames (~1.5 sec), the screen is "stable"
  4. Scan the PPU nametable (character map) for known result strings:
    • "Passed" / "PASSED" → PASS
    • "Failed" / "FAILED" → FAIL
    • "$01" (hex on screen) → PASS
    • "$02" ~ "$FF" (hex on screen) → FAIL
    • "All tests complete" → PASS
    • " 0/" (zero error count) → PASS

This approach reads the PPU nametable directly (not OCR on pixels), making it fast and reliable.
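
Steps 1 to 3 above amount to a small per-frame state machine. The sketch below shows one way to write it; the class name, the FNV-1a hash and the uint[] framebuffer type are assumptions made for illustration, since the document does not say which hash AprNes uses or how ScreenBuf1x is typed.

```csharp
using System;

public sealed class ScreenStabilityDetector
{
    private const int WarmupFrames = 120;  // ~2 s at 60 fps before sampling starts
    private const int StableFrames = 90;   // ~1.5 s of identical hashes
    private const int SampleStride = 37;   // hash every 37th pixel

    private int _frame;
    private int _unchanged;
    private uint _lastHash;

    // FNV-1a over a sparse sample of the framebuffer (hash choice is an assumption).
    public static uint SampleHash(uint[] frame)
    {
        unchecked
        {
            uint h = 2166136261u;
            for (int i = 0; i < frame.Length; i += SampleStride)
            {
                h ^= frame[i];
                h *= 16777619u;
            }
            return h;
        }
    }

    // Call once per frame; returns true once the hash has been identical
    // for StableFrames consecutive sampled frames.
    public bool OnFrame(uint[] frame)
    {
        _frame++;
        if (_frame <= WarmupFrames) return false;

        uint h = SampleHash(frame);
        if (_unchanged > 0 && h == _lastHash)
        {
            _unchanged++;
        }
        else
        {
            _lastHash = h;      // first sample, or the screen changed: restart the count
            _unchanged = 1;
        }
        return _unchanged >= StableFrames;
    }
}
```

With a screen that never changes, OnFrame first returns true on frame 210 (120 warm-up frames plus 90 identical samples). Sampling every 37th pixel can miss a change that touches only unsampled pixels, which is the trade-off for speed.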

Mechanism C: CRC Matching

Some test ROMs produce results that depend on random CPU-PPU synchronization at power-on, generating one of several valid CRC values. These tests display a CRC on screen but cannot use a single check_crc call. The --expected-crc parameter accepts a comma-separated list of valid CRCs:

--expected-crc "159A7A8F,5E3DF9C4"

The test runner scans the PPU nametable for an 8-character hexadecimal string (bounded by non-hex characters to avoid partial matches). If found, it compares case-insensitively against the expected set. This mechanism is used in both the screen stability detection path and the timeout fallback path. Section 9.7 names the tests that motivated it (dma_2007_read and double_2007_read).
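
The bounded 8-digit match and the case-insensitive comparison could look like the following. It is a sketch under stated assumptions: CrcMatcher and its methods are hypothetical names, and the nametable is assumed to have already been converted to a text string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrcMatcher
{
    // Exactly eight hex digits, neither preceded nor followed by another hex digit.
    private static readonly Regex CrcToken =
        new Regex(@"(?<![0-9A-Fa-f])[0-9A-Fa-f]{8}(?![0-9A-Fa-f])");

    // Parses the comma-separated --expected-crc value into a case-insensitive set.
    public static HashSet<string> ParseExpected(string list)
    {
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var part in list.Split(','))
        {
            var crc = part.Trim();
            if (crc.Length > 0) set.Add(crc);
        }
        return set;
    }

    // True if any bounded 8-digit hex token on screen is one of the expected CRCs.
    public static bool ScreenHasExpectedCrc(string screenText, HashSet<string> expected)
    {
        foreach (Match m in CrcToken.Matches(screenText))
            if (expected.Contains(m.Value)) return true;
        return false;
    }
}
```

The lookbehind and lookahead are what keep a 9-digit run such as "0159A7A8F" from matching on its last eight characters.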

5. Automation Features

Auto Soft Reset

Some test ROMs write $81 to $6000 to request a console reset (testing power-on/reset behavior). The runner detects this and automatically performs a soft reset after a 100ms delay, mimicking a human pressing the reset button. Supports up to 10 sequential resets per ROM.

Simulated Controller Input

Controller read tests need actual button presses. The --input parameter schedules timed button events:

--input "A:2.0,B:4.0,Select:6.0,Start:8.0,Up:10.0,Down:12.0,Left:14.0,Right:16.0"

Each button is pressed at the specified time (seconds) and held for 10 frames (~166ms). This lets tests like read_joy3/test_buttons verify that all 8 buttons are correctly detected in sequence.
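
One way to turn an --input string into frame-scheduled events is sketched below. The ButtonEvent and InputSpec names are hypothetical, the seconds-to-frames conversion assumes 60 frames per second, and the button index order (0=A through 7=Right) follows the P1_ButtonPress comment in Section 8.6.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class ButtonEvent
{
    public byte Button;        // 0=A,1=B,2=Select,3=Start,4=Up,5=Down,6=Left,7=Right
    public int PressFrame;     // frame on which the button goes down
    public int ReleaseFrame;   // frame on which it is released
}

public static class InputSpec
{
    private static readonly string[] Names =
        { "A", "B", "Select", "Start", "Up", "Down", "Left", "Right" };

    // Parses "A:2.0,B:4.0" into events. fps and holdFrames defaults are assumptions
    // based on the 10-frame hold described in the text.
    public static List<ButtonEvent> Parse(string spec, double fps = 60.0, int holdFrames = 10)
    {
        var events = new List<ButtonEvent>();
        foreach (var item in spec.Split(','))
        {
            var parts = item.Split(':');
            if (parts.Length != 2)
                throw new FormatException("Expected Name:seconds, got '" + item + "'");

            string name = parts[0].Trim();
            int index = Array.FindIndex(Names,
                n => n.Equals(name, StringComparison.OrdinalIgnoreCase));
            if (index < 0)
                throw new FormatException("Unknown button '" + name + "'");

            double seconds = double.Parse(parts[1], CultureInfo.InvariantCulture);
            int press = (int)Math.Round(seconds * fps);
            events.Add(new ButtonEvent
            {
                Button = (byte)index,
                PressFrame = press,
                ReleaseFrame = press + holdFrames
            });
        }
        return events;
    }
}
```

Parsing with the invariant culture keeps "2.0" meaning two seconds on machines whose locale uses a comma as the decimal separator.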

Screenshot Capture (optional)

When enabled with --screenshots, the final frame of each test is captured as a 256x240 PNG, then converted to lossless WebP (typically 60-80% smaller). Screenshots serve as visual evidence — many test ROMs display their results on screen as text, showing exactly what passed or failed. For quick regression checks, screenshots can be skipped to save time.

Timeout Safety

Each ROM has a configurable --max-wait timeout (default 30 seconds, 120 for merged/multi-sub-test ROMs). If a test ROM enters an infinite loop or hangs, the runner terminates it gracefully and reports the last known state.

6. Test Suite Coverage

Tests | Suite | Coverage
4 | apu_mixer | Channel mixing
6 | apu_reset | APU power/reset
9 | apu_test | APU frame counter
11 | blargg_apu_2005 | APU timing
2 | blargg_cpu_test5 | CPU instructions
5 | blargg_ppu_tests | PPU basics
3 | branch_timing | Branch cycle count
1 | cpu_dummy_reads | Dummy read cycles
2 | cpu_dummy_writes | Dummy write cycles
2 | cpu_exec_space | Execution from I/O
6 | cpu_interrupts_v2 | NMI/IRQ interaction
2 | cpu_reset | CPU reset behavior
1 | cpu_timing_test6 | Instruction timing
5 | dmc_dma_during_read | DMC DMA conflicts
5 | instr_misc | Misc instruction tests
17 | instr_test-v3 | All 6502 instructions
18 | instr_test-v5 | All 6502 instructions (v5)
3 | instr_timing | Instruction cycle timing
6 | mmc3_irq_tests | MMC3 IRQ counter
6 | mmc3_test | MMC3 behavior
6 | mmc3_test_2 | MMC3 behavior (v2)
11 | nes_instr_test | CPU instructions (alt)
1 | oam_read | OAM read behavior
1 | ppu_open_bus | PPU open bus
1 | ppu_read_buffer | PPU read buffer
11 | ppu_vbl_nmi | VBlank/NMI timing
4 | read_joy3 | Controller reading
2 | sprdma_and_dmc_dma | DMA conflicts
11 | sprite_hit_tests | Sprite 0 hit
5 | sprite_overflow | Sprite overflow
7 | vbl_nmi_timing | VBL/NMI timing

7. Command-Line Interface

The headless test runner is invoked directly via the emulator executable:

AprNes.exe --rom <file.nes> [options]

Option | Description
--rom <path> | ROM file to load (required)
--wait-result | Monitor $6000 / screen for test result
--max-wait <sec> | Timeout in seconds (default: 30)
--time <sec> | Run for exactly N seconds, then stop
--screenshot <path> | Save final frame as PNG
--log <path> | Write result line to file
--soft-reset <sec> | Trigger soft reset at N seconds
--input <spec> | Schedule button presses (e.g. "A:2.0,B:4.0")
--expected-crc <list> | Comma-separated valid CRCs for CRC-only tests (e.g. "159A7A8F,5E3DF9C4")
--debug-log <path> | Write CPU trace log

Exit codes: 0 = pass, 1-127 = fail (test error code), 255 = timeout/no result.
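
A caller that launches the emulator from C# rather than bash could interpret the exit code as follows. The ExitCodes helper is hypothetical; codes 128 to 254 are not defined by the document, so the sketch reports them as unknown rather than guessing.

```csharp
using System;

public static class ExitCodes
{
    // Translates the emulator's process exit code into a short result label.
    public static string Describe(int code)
    {
        if (code == 0) return "PASS";
        if (code >= 1 && code <= 127) return "FAIL (error code " + code + ")";
        if (code == 255) return "TIMEOUT / NO RESULT";
        return "UNKNOWN (" + code + ")";   // not covered by the documented ranges
    }
}
```

In practice the value passed to Describe would come from System.Diagnostics.Process.ExitCode after the emulator process has exited.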

8. Emulator Design Requirements for QA Integration

To adopt this automated QA workflow, an emulator must expose a set of design interfaces. Below is a reference based on AprNes's architecture — the pattern is applicable to any NES emulator regardless of language.

8.1 Dual Entry Point: GUI vs Headless

The emulator should support two modes of operation from a single executable. When command-line arguments are present, it enters headless test mode; otherwise it launches the normal GUI.

// Program.cs — entry point
static int Main(string[] args)
{
    if (args.Length > 0)
        return TestRunner.Run(args);    // headless mode, returns exit code

    Application.Run(new MainForm());    // normal GUI mode
    return 0;
}

Key point: Main() returns int (not void) so the exit code can signal pass/fail to the calling script.

8.2 Headless Mode Flags

The emulator core needs static flags to suppress GUI and audio subsystems when running in test mode:

Flag | Purpose | Effect
HeadlessMode = true | Suppress window creation | No Form/Window is instantiated; rendering still runs to fill the framebuffer, but no display output occurs
AudioEnabled = false | Suppress audio output | APU still ticks (needed for timing tests), but no audio device is opened
LimitFPS = false | Remove frame rate limiter | Emulation runs at maximum CPU speed; a 30-second test completes in <1 second wall time
exit = true | Signal the main loop to stop | Set by the test runner when a result is detected; the run() loop checks this flag each frame

These flags allow the CPU/PPU/APU to continue ticking normally — only the I/O endpoints (display, speakers) are disabled. This ensures that timing-sensitive tests produce identical results in headless and GUI modes.

8.3 Per-Frame Callback (VideoOutput Event)

The emulator must provide a per-frame hook that fires after each frame is fully rendered. The test runner subscribes to this event to poll for results:

// In NesCore (emulator core):
public static event EventHandler VideoOutput;

// Fired at the end of each PPU frame (scanline 240, after VBlank begins):
VideoOutput?.Invoke(null, null);

// In TestRunner:
NesCore.VideoOutput += (sender, e) => {
    frameCount++;
    byte status = NesCore.NES_MEM[0x6000];  // poll test protocol
    // ... detect result, set NesCore.exit = true when done
};

This callback-driven design avoids tight polling loops and integrates cleanly with both GUI (refresh display) and headless (check test status) modes.

8.4 Memory and PPU RAM Access

The test runner needs direct read access to three memory regions:

Memory Region | Access Pattern | Purpose
NES_MEM[0x6000..0x6FFF] | CPU address space (WRAM) | Read $6000 status byte and $6004+ result text (blargg protocol)
ppu_ram[0x2000..0x23BF] | PPU nametable 0 | Scan for "Passed"/"Failed" text on screen (older test ROMs)
ScreenBuf1x[0..61439] | Rendered framebuffer (256x240 ARGB) | Screen stability hash + screenshot capture

These must be exposed as static pointers or arrays — no copying per frame. The test runner reads them synchronously inside the VideoOutput callback, so thread safety is guaranteed by the frame boundary.
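
To make the nametable access concrete, the sketch below converts nametable 0 into text and looks for a verdict. It assumes each tile index equals the ASCII code of the glyph it displays and that ppuRam is a byte array indexed by PPU address; NametableScanner is a hypothetical name, and only the "Passed"/"Failed" rules from Section 4 are covered.

```csharp
using System;
using System.Text;

public static class NametableScanner
{
    public const int NametableBase = 0x2000;   // start of nametable 0
    public const int Columns = 32;
    public const int Rows = 30;                // 32 x 30 tiles = $2000..$23BF

    // Converts nametable 0 to text, one line per tile row.
    // Assumption: tile index == ASCII code; anything non-printable becomes a space.
    public static string ToText(byte[] ppuRam)
    {
        var sb = new StringBuilder(Rows * (Columns + 1));
        for (int row = 0; row < Rows; row++)
        {
            for (int col = 0; col < Columns; col++)
            {
                byte tile = ppuRam[NametableBase + row * Columns + col];
                sb.Append(tile >= 0x20 && tile < 0x7F ? (char)tile : ' ');
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }

    // Returns true for a pass, false for a fail, null if neither word is on screen.
    // The comparison is case-insensitive, slightly broader than the two spellings listed.
    public static bool? Verdict(string text)
    {
        if (text.IndexOf("passed", StringComparison.OrdinalIgnoreCase) >= 0) return true;
        if (text.IndexOf("failed", StringComparison.OrdinalIgnoreCase) >= 0) return false;
        return null;
    }
}
```

Keeping one line per tile row means a word cannot be matched across the right edge of one row and the left edge of the next.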

8.5 Soft Reset API

Some test ROMs request a console reset by writing $81 to $6000. The emulator must expose a SoftReset() method that resets CPU/APU state without reloading the ROM:

public static void SoftReset()
{
    // Reset CPU: read reset vector from $FFFC/$FFFD, clear registers
    // Reset APU: reinitialize frame counter, silence channels
    // Do NOT reset PPU fully (some tests depend on PPU state surviving reset)
    // Do NOT unload the ROM or reinitialize mapper
}

This is distinct from a hard reset (power cycle). The test runner calls SoftReset() after a 100ms delay (~6 frames) when it detects $6000 == $81.
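
The delayed reset can be driven from the per-frame callback with a small counter. The sketch below is illustrative only: ResetScheduler is a hypothetical name, the reset is injected as a delegate so the example stays self-contained, and arming only when $6000 changes to $81 is an assumption added here to avoid re-triggering while the old value is still in memory.

```csharp
using System;

public sealed class ResetScheduler
{
    private const int DelayFrames = 6;   // ~100 ms at 60 fps
    private const int MaxResets = 10;    // limit stated in Section 5

    private readonly Action _softReset;
    private int _countdown = -1;         // -1 means no reset is pending
    private byte _lastStatus;

    public int ResetsPerformed { get; private set; }

    public ResetScheduler(Action softReset)
    {
        _softReset = softReset;
    }

    // Call once per frame with the current value of $6000.
    public void OnFrame(byte status)
    {
        bool requested = status == 0x81 && _lastStatus != 0x81;  // edge, not level (assumption)
        _lastStatus = status;

        if (_countdown < 0)
        {
            if (requested && ResetsPerformed < MaxResets)
                _countdown = DelayFrames;
            return;
        }

        if (--_countdown == 0)
        {
            _countdown = -1;
            ResetsPerformed++;
            _softReset();    // e.g. NesCore.SoftReset in the real runner
        }
    }
}
```

Counting frames instead of wall-clock time keeps the delay meaningful when the frame rate limiter is off and emulation runs far faster than real time.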

8.6 Controller Input Injection

Controller tests need programmatic button presses. The emulator must expose button press/release methods:

public static void P1_ButtonPress(byte buttonIndex);   // 0=A,1=B,2=Sel,3=Start,4=Up,5=Down,6=Left,7=Right
public static void P1_ButtonUnPress(byte buttonIndex);

The test runner schedules events by frame number. Each button is pressed at the specified frame and released after a configurable hold duration (default 10 frames ≈ 166ms).

8.7 ROM Loading API

The emulator needs a simple byte-array-based ROM loading interface:

public static bool init(byte[] rom_bytes);  // parse iNES header, set up mapper, reset CPU
public static void run();                    // main emulation loop (blocks until exit==true)

init() returns false for unsupported mappers or corrupt headers. run() is called on a background thread by the test runner, which waits for completion via Thread.Join().
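
The init / run / Thread.Join pattern can be tried out against a stand-in core. Everything named StubCore or MiniRunner below is hypothetical and exists only so the sketch runs on its own: the stub checks nothing but the iNES magic bytes, pretends the ROM passes on its fifth frame, and returning 255 for a ROM that fails to load is an assumption.

```csharp
using System;
using System.Threading;

// Stand-in for the emulator core (hypothetical; not the real NesCore).
public static class StubCore
{
    public static volatile bool exit;
    public static byte[] NES_MEM = new byte[0x10000];
    public static event EventHandler VideoOutput;

    public static bool init(byte[] romBytes)
    {
        // Only checks the iNES magic "NES\x1A"; a real core also sets up the mapper.
        return romBytes != null && romBytes.Length >= 16 &&
               romBytes[0] == 'N' && romBytes[1] == 'E' &&
               romBytes[2] == 'S' && romBytes[3] == 0x1A;
    }

    public static void run()
    {
        int frames = 0;
        exit = false;
        NES_MEM[0x6000] = 0x80;                            // test running
        while (!exit)
        {
            if (++frames == 5) NES_MEM[0x6000] = 0x00;     // pretend the ROM passed
            VideoOutput?.Invoke(null, EventArgs.Empty);    // one callback per frame
        }
    }
}

public static class MiniRunner
{
    public static int Run(byte[] rom)
    {
        if (!StubCore.init(rom)) return 255;   // assumption: load failure reported as 255

        int exitCode = 255;
        EventHandler onFrame = (sender, e) =>
        {
            byte status = StubCore.NES_MEM[0x6000];
            if (status < 0x80) { exitCode = status; StubCore.exit = true; }
        };

        StubCore.VideoOutput += onFrame;
        var emu = new Thread(StubCore.run);
        emu.Start();
        emu.Join();                            // blocks until run() sees exit == true
        StubCore.VideoOutput -= onFrame;
        return exitCode;
    }
}
```

Because the callback runs on the emulation thread and the caller only reads exitCode after Join returns, no extra locking is needed in this arrangement.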

8.8 Architecture Summary

Program.cs: entry point, GUI / headless fork
TestRunner.cs: argument parsing, frame callback, result detection
NesCore: init() / run(), HeadlessMode flags, VideoOutput event

The design principle is minimal coupling: the test runner interacts with the emulator core through the small set of touch points tabulated below (four flags, one event, three memory regions, and static methods for reset, input injection, and ROM loading). The emulator core requires zero knowledge of the test runner: it simply checks HeadlessMode to skip GUI creation and fires VideoOutput each frame. All test logic lives in TestRunner.cs.

Interface | Direction | Type
HeadlessMode | TestRunner → Core | Static bool flag
AudioEnabled | TestRunner → Core | Static bool flag
LimitFPS | TestRunner → Core | Static bool flag
exit | TestRunner → Core | Static bool flag
VideoOutput | Core → TestRunner | Event (per-frame callback)
NES_MEM / ppu_ram / ScreenBuf1x | Core → TestRunner | Static memory pointers (read-only)
SoftReset() | TestRunner → Core | Static method
P1_ButtonPress/UnPress() | TestRunner → Core | Static methods
init(byte[]) / run() | TestRunner → Core | Static methods

This architecture makes the QA system portable: any NES emulator that exposes these interfaces can use the same bash scripts and test ROM collection for automated regression testing, regardless of its internal implementation.

9. AI-Assisted Development Workflow with Claude Code

9.1 Overview

AprNes uses Claude Code (Anthropic's CLI agent) as an AI pair-programmer that directly invokes the test shell scripts, reads failure output, diagnoses root causes, edits source code, and verifies fixes — all within a single iterative loop. The test infrastructure described in Sections 1-8 is the foundation that makes this workflow possible.

TODO.md (pick next bug) → Analyse (run failing tests, read test ROM source) → Plan (root cause + fix design) → Implement (edit emulator code) → Verify (build & run tests) → Document (bugfix/ + TODO.md, git commit & push)

9.2 Prerequisites

9.3 Step-by-Step Workflow

Phase 1: Task Selection

The developer tells Claude Code to read TODO.md and continue with the next task. Claude picks the highest-priority unfinished bug and identifies the relevant failing tests.

User: Read TODO.MD and continue with the remaining tasks
Claude: [reads TODO.md, identifies "Bug G — Sprite timing" as next target]
        [runs 4 failing tests with --screenshots to capture current state]

Phase 2: Root Cause Analysis

Claude uses the test infrastructure to diagnose the bug:

# Claude runs each failing test to see the exact failure message
./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
  --rom nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/09.timing_basics.nes
# → FAIL #3: "upper-left corner too late"

Phase 3: Plan Mode

For non-trivial fixes, Claude enters plan mode, a read-only state where it designs the implementation strategy without modifying any files. The plan covers the root cause and the fix design.

The developer reviews and approves the plan before any code is changed.

Phase 4: Implementation

Claude edits the emulator source files using precise text replacements. Each change is targeted and minimal — only modifying what the plan specified.

Phase 5: Build & Verify

Claude invokes the build toolchain and test scripts directly:

# 1. Build
powershell -NoProfile -Command "MSBuild.exe AprNes.sln /p:Configuration=Debug /t:Rebuild"

# 2. Run target tests (the ones we're trying to fix)
./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
  --rom nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/09.timing_basics.nes

# 3. Run regression suite (related tests that might break)
for rom in 01.basics.nes 02.alignment.nes ... 11.edge_timing.nes; do
  ./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
    --rom "nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/$rom"
done

# 4. Full regression (all 174 tests)
bash run_tests.sh

If any test fails unexpectedly, Claude reads the failure output, adjusts the fix, and re-runs. This inner loop repeats until all target tests pass with zero regressions.

Phase 6: Documentation & Commit

Once verified, Claude updates TODO.md, writes a per-fix document under bugfix/, and commits and pushes the change.

Phase 7: Report Generation (on demand)

The developer can request a full HTML report at any time:

bash run_tests_report.sh --json --screenshots
# → report/index.html (interactive dashboard with screenshots)

9.4 Key Files in the Workflow

File | Role
TODO.md | Prioritised bug list: the task queue that drives each session
.claude/memory/MEMORY.md | Persistent agent memory: architecture, build commands, conventions
bugfix/YYYY-MM-DD_BUGFIXNN.md | Per-fix documentation: problem, root cause, changes, verification
run_tests.sh | Quick validation: runs all 174 tests, outputs PASS/FAIL counts
run_tests_report.sh | Full report: JSON data + screenshots + HTML dashboard
report/index.html | Generated interactive test report with filterable results

9.5 Why This Works

9.6 Example Session: BUGFIX17 (Sprite Timing)

A single Claude Code session progressed from 161 PASS → 165 PASS (+4) by:

  1. Reading TODO.md → identified Bug G (4 failing sprite tests)
  2. Running each failing test → captured exact failure messages (#3 "too late", #5 "set too late", #2 "byte offset bug")
  3. Reading test ROM assembly source (09.timing_basics.asm, 3.Timing.a, 4.Obscure.a) → understood expected cycle counts
  4. Reading PPU.cs → found batch rendering at phase 7 (7-dot latency) and cycle-257 overflow detection
  5. Entered plan mode → designed 3 fixes: per-pixel hit detection, cycle-accurate overflow, hardware overflow bug
  6. Implemented all 3 fixes in PPU.cs (~+120 lines, -40 lines)
  7. Built → ran 4 target tests (all PASS) → ran sprite regression (14 tests, all PASS) → ran full suite (165 PASS / 9 FAIL)
  8. Updated TODO.md, created BUGFIX17.md, committed and pushed

Total: 3 root causes identified, 3 fixes implemented, 0 regressions — all within a single conversation.

9.7 Example Session: BUGFIX19 (DMC DMA Cycle Stealing)

A multi-session Claude Code workflow progressed from 169 PASS → 171 PASS (+2) by tackling the most architecturally challenging bug — DMC DMA cycle stealing:

  1. Read TODO.md → identified Bug F (5 failing DMC DMA tests, marked "requires architecture refactor")
  2. Studied NESdev Wiki DMA reference (from ref/ directory) → discovered Load vs Reload DMA distinction
  3. Read test ROM assembly sources (sync_dmc.s, sprdma_and_dmc_dma.s, dma_2007_read.s, dma_4016_read.s)
  4. Entered plan mode → designed 5-part fix: CPU bus state tracking, PPU-only stolen tick, Load/Reload cycle model, phantom reads, OAM DMA bus tracking
  5. Implemented across 3 core files (MEM.cs, APU.cs, PPU.cs) → +200 lines of cycle-accurate DMA emulation
  6. Discovered 2 tests (dma_2007_read, double_2007_read) produce correct CRC but never print "Passed" → added --expected-crc parameter to TestRunner.cs for CRC-only test support
  7. Iterative debugging: parity-based model failed sync_dmc convergence → replaced with Load/Reload type-based model
  8. Full regression: 171 PASS / 3 FAIL, 0 regressions

This fix required understanding NES DMA at the bus-cycle level: GET/PUT cadence, halt scheduling, write-delay rules, and phantom read /OE contiguity. The remaining 1 DMC test failure (double_2007_read) was identified as a separate PPU buffer latching issue, not a DMA bug.