NES emulator accuracy verification through hardware test ROMs
AprNes uses hardware verification test ROMs originally written to validate real NES console behavior. These are actual NES programs (6502 machine code) that exercise specific hardware features and report pass/fail. The same ROMs are used by every major NES emulator project to measure accuracy.
Test coverage spans all major NES subsystems:
All ROMs come from the nes-test-roms collection, primarily authored by:
These are the same ROMs used by Mesen, Nestopia, FCEUX, and other reference emulators. Passing them is strong evidence of cycle-accurate or near-cycle-accurate emulation of the behaviors they cover.
The emulator has a built-in TestRunner.cs that runs in headless mode — no window, no audio, no frame rate limiter. The CPU/PPU/APU all tick at maximum speed. A single test ROM typically completes in under 1 second.
The testing workflow uses two complementary scripts with identical test lists (174 ROMs):
- run_tests.sh: Lightweight validation script. Runs all tests and prints PASS/FAIL to stdout with a failure summary. Used for quick regression checks during development.
- run_tests_report.sh: Full-featured report generator with modular output controlled by command-line flags.

run_tests_report.sh always prints PASS/FAIL results to stdout. Optional flags control additional outputs:
bash run_tests_report.sh                        # Quick: stdout only
bash run_tests_report.sh --json                 # + save report/results.json
bash run_tests_report.sh --screenshots          # + capture screenshots (PNG→WebP)
bash run_tests_report.sh --json --screenshots   # Full: JSON + screenshots + HTML report
bash run_tests_report.sh --no-build             # Skip MSBuild compilation step
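The flag combinations above can be read as three independent switches, with the HTML report produced only when both --json and --screenshots are present. A minimal sketch of that decision logic follows; the function and variable names are hypothetical and are not taken from run_tests_report.sh.

```bash
# Hypothetical sketch: parse the documented flags and derive which outputs to produce.
# parse_report_flags and the want_*/do_build variables are illustrative names only.
parse_report_flags() {
    want_json=0; want_shots=0; do_build=1
    for arg in "$@"; do
        case "$arg" in
            --json)        want_json=1 ;;
            --screenshots) want_shots=1 ;;
            --no-build)    do_build=0 ;;
            *) echo "unknown flag: $arg" >&2; return 2 ;;
        esac
    done
    # The HTML report needs both the JSON data and the screenshots.
    want_html=0
    if [ "$want_json" -eq 1 ] && [ "$want_shots" -eq 1 ]; then
        want_html=1
    fi
    echo "build=$do_build json=$want_json screenshots=$want_shots html=$want_html"
}

parse_report_flags --json --screenshots   # prints "build=1 json=1 screenshots=1 html=1"
parse_report_flags --no-build             # prints "build=0 json=0 screenshots=0 html=0"
```

Keeping the switches independent means a quick run (no flags) and a full run share the same code path and differ only in which optional outputs are written.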
The full pipeline (with all flags) performs:
- Build the emulator with MSBuild (skipped with --no-build)
- With --screenshots: capture the final frame as PNG, convert to lossless WebP
- With --json: collect results into report/results.json
- With both --json and --screenshots: generate a single-file HTML report (report/index.html) with embedded data and screenshot references

Modern blargg test ROMs use a memory-mapped status protocol. The test runner polls address $6000 every frame:
| $6000 Value | Meaning | Action |
|---|---|---|
| $80 | Test running | Continue waiting |
| $81 | Reset requested | Wait 100ms, then soft reset |
| $00 | Test passed | Exit with code 0 |
| $01-$7F | Test failed (error code N) | Exit with code N |
Result text is read from $6004+ as null-terminated ASCII. This gives detailed error messages like "Flag first set too late" or "Length counter not clocked correctly".
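The status table above amounts to a small decision function. The sketch below is illustrative only: classify_status is a hypothetical helper, not part of TestRunner.cs, and the "unknown" branch for values $82-$FF is an assumption, since the table does not cover them.

```bash
# Hypothetical helper: maps a $6000 status byte to the action in the table above.
# Accepts decimal or 0x-prefixed hex, for example "classify_status 0x80".
classify_status() {
    value=$(( $1 ))
    if [ "$value" -eq 128 ]; then                            # $80
        echo "running"
    elif [ "$value" -eq 129 ]; then                          # $81
        echo "reset"
    elif [ "$value" -eq 0 ]; then                            # $00
        echo "pass"
    elif [ "$value" -ge 1 ] && [ "$value" -le 127 ]; then    # $01-$7F
        echo "fail $value"
    else
        echo "unknown"    # $82-$FF: not covered by the table (assumption)
    fi
}

classify_status 0x80   # prints "running"
classify_status 0x03   # prints "fail 3"
```

Because failure codes stay within 1-127, they can be passed through unchanged as the process exit code.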
Older blargg tests (2005 era) don't use the $6000 protocol. They render results directly to the PPU nametable. The test runner handles these with a multi-step heuristic:
- "Passed" / "PASSED" → PASS
- "Failed" / "FAILED" → FAIL
- "$01" (hex on screen) → PASS
- "$02" to "$FF" (hex on screen) → FAIL
- "All tests complete" → PASS
- " 0/" (zero error count) → PASS

This approach reads the PPU nametable directly (not OCR on pixels), making it fast and reliable.
Some test ROMs produce results that depend on random CPU-PPU synchronization at power-on, generating one of several valid CRC values. These tests display a CRC on screen but cannot use a single check_crc call. The --expected-crc parameter accepts a comma-separated list of valid CRCs:
--expected-crc "159A7A8F,5E3DF9C4"
The test runner scans the PPU nametable for an 8-character hexadecimal string (bounded by non-hex characters to avoid partial matches). If found, it compares case-insensitively against the expected set. This mechanism is used in both the screen stability detection path and the timeout fallback path. Currently used by:
- dma_2007_read: 2 valid CRCs (CPU-PPU sync dependent)
- double_2007_read: 4 valid CRCs (CPU-PPU sync dependent)

Some test ROMs write $81 to $6000 to request a console reset (testing power-on/reset behavior). The runner detects this and automatically performs a soft reset after a 100ms delay, mimicking a human pressing the reset button. The runner supports up to 10 sequential resets per ROM.
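The CRC matching described earlier in this section (an 8-character hex string bounded by non-hex characters, compared case-insensitively against the --expected-crc list) can be sketched as follows. crc_matches is a hypothetical helper operating on already-decoded screen text; it is not the TestRunner.cs implementation.

```bash
# Hypothetical helper, not taken from TestRunner.cs.
# $1 = text read from the nametable, $2 = comma-separated expected CRCs.
# Returns 0 when an 8-digit hex string on screen is in the expected set.
crc_matches() {
    # Find the first run of exactly 8 hex digits bounded by non-hex characters
    # (or line start/end), strip the boundary characters, and upper-case it.
    found=$(printf '%s\n' "$1" \
        | grep -oE '(^|[^0-9A-Fa-f])[0-9A-Fa-f]{8}([^0-9A-Fa-f]|$)' \
        | head -n 1 | tr -cd '0-9A-Fa-f' | tr 'a-f' 'A-F')
    [ -n "$found" ] || return 1
    expected=$(printf '%s' "$2" | tr 'a-f' 'A-F')
    case ",$expected," in
        *",$found,"*) return 0 ;;
        *)            return 1 ;;
    esac
}

crc_matches "CRC: 159a7a8f" "159A7A8F,5E3DF9C4" && echo "CRC accepted"
```

Wrapping the expected list in commas before matching prevents a shorter value from matching inside a longer one.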
Controller read tests need actual button presses. The --input parameter schedules timed button events:
--input "A:2.0,B:4.0,Select:6.0,Start:8.0,Up:10.0,Down:12.0,Left:14.0,Right:16.0"
Each button is pressed at the specified time (seconds) and held for 10 frames (~166ms). This lets tests like read_joy3/test_buttons verify that all 8 buttons are correctly detected in sequence.
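To make the timing concrete, the sketch below converts an --input spec into press and release frame numbers. It assumes a nominal 60 frames per second and the 10-frame hold described above. schedule_input is a hypothetical helper, and the runner's actual parsing may differ.

```bash
# Hypothetical sketch: turn "Button:seconds,..." into "button press_frame release_frame".
# Assumes 60 frames per second and a 10-frame hold.
schedule_input() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F: -v fps=60 -v hold=10 '
        NF == 2 {
            press = int($2 * fps + 0.5)      # round seconds to the nearest frame
            print $1, press, press + hold    # release after the hold duration
        }'
}

schedule_input "A:2.0,B:4.0"
# prints:
# A 120 130
# B 240 250
```

Working in frame numbers rather than wall-clock time keeps the schedule deterministic when the frame rate limiter is off.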
When enabled with --screenshots, the final frame of each test is captured as a 256x240 PNG, then converted to lossless WebP (typically 60-80% smaller). Screenshots serve as visual evidence — many test ROMs display their results on screen as text, showing exactly what passed or failed. For quick regression checks, screenshots can be skipped to save time.
Each ROM has a configurable --max-wait timeout (default 30 seconds, 120 for merged/multi-sub-test ROMs). If a test ROM enters an infinite loop or hangs, the runner terminates it gracefully and reports the last known state.
The headless test runner is invoked directly via the emulator executable:
AprNes.exe --rom <file.nes> [options]
| Option | Description |
|---|---|
| --rom <path> | ROM file to load (required) |
| --wait-result | Monitor $6000 / screen for test result |
| --max-wait <sec> | Timeout in seconds (default: 30) |
| --time <sec> | Run for exactly N seconds, then stop |
| --screenshot <path> | Save final frame as PNG |
| --log <path> | Write result line to file |
| --soft-reset <sec> | Trigger soft reset at N seconds |
| --input <spec> | Schedule button presses (e.g. "A:2.0,B:4.0") |
| --expected-crc <list> | Comma-separated valid CRCs for CRC-only tests (e.g. "159A7A8F,5E3DF9C4") |
| --debug-log <path> | Write CPU trace log |
Exit codes: 0 = pass, 1-127 = fail (test error code), 255 = timeout/no result.
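A calling script can branch on these exit codes directly. The helper below is a minimal sketch of that mapping. describe_exit is a hypothetical name, and the "UNKNOWN" branch for codes 128-254 is an assumption, since the source does not define them.

```bash
# Hypothetical helper: translate the runner's exit code into a result label.
describe_exit() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "PASS"
    elif [ "$code" -ge 1 ] && [ "$code" -le 127 ]; then
        echo "FAIL (error code $code)"
    elif [ "$code" -eq 255 ]; then
        echo "TIMEOUT"
    else
        echo "UNKNOWN ($code)"   # 128-254 are not defined by the source (assumption)
    fi
}

# Typical use after a run (commented out because it needs the emulator binary):
#   AprNes.exe --rom <file.nes> --wait-result; describe_exit $?
describe_exit 3     # prints "FAIL (error code 3)"
describe_exit 255   # prints "TIMEOUT"
```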
To adopt this automated QA workflow, an emulator must expose a set of design interfaces. Below is a reference based on AprNes's architecture — the pattern is applicable to any NES emulator regardless of language.
The emulator should support two modes of operation from a single executable. When command-line arguments are present, it enters headless test mode; otherwise it launches the normal GUI.
// Program.cs — entry point
static int Main(string[] args)
{
    if (args.Length > 0)
        return TestRunner.Run(args); // headless mode, returns exit code
    Application.Run(new MainForm()); // normal GUI mode
    return 0;
}
Key point: Main() returns int (not void) so the exit code can signal pass/fail to the calling script.
The emulator core needs static flags to suppress GUI and audio subsystems when running in test mode:
| Flag | Purpose | Effect |
|---|---|---|
| HeadlessMode = true | Suppress window creation | No Form/Window is instantiated; rendering still runs to fill the framebuffer, but no display output occurs |
| AudioEnabled = false | Suppress audio output | APU still ticks (needed for timing tests), but no audio device is opened |
| LimitFPS = false | Remove frame rate limiter | Emulation runs at maximum CPU speed; a 30-second test completes in <1 second wall time |
| exit = true | Signal the main loop to stop | Set by the test runner when a result is detected; the run() loop checks this flag each frame |
These flags allow the CPU/PPU/APU to continue ticking normally — only the I/O endpoints (display, speakers) are disabled. This ensures that timing-sensitive tests produce identical results in headless and GUI modes.
The emulator must provide a per-frame hook that fires after each frame is fully rendered. The test runner subscribes to this event to poll for results:
// In NesCore (emulator core):
public static event EventHandler VideoOutput;
// Fired at the end of each PPU frame (scanline 240, the post-render line, just before VBlank begins on scanline 241):
VideoOutput?.Invoke(null, null);
// In TestRunner:
NesCore.VideoOutput += (sender, e) => {
    frameCount++;
    byte status = NesCore.NES_MEM[0x6000]; // poll test protocol
    // ... detect result, set NesCore.exit = true when done
};
This callback-driven design avoids tight polling loops and integrates cleanly with both GUI (refresh display) and headless (check test status) modes.
The test runner needs direct read access to three memory regions:
| Memory Region | Access Pattern | Purpose |
|---|---|---|
| NES_MEM[0x6000..0x6FFF] | CPU address space (WRAM) | Read $6000 status byte and $6004+ result text (blargg protocol) |
| ppu_ram[0x2000..0x23BF] | PPU nametable 0 | Scan for "Passed"/"Failed" text on screen (older test ROMs) |
| ScreenBuf1x[0..61439] | Rendered framebuffer (256x240 ARGB) | Screen stability hash + screenshot capture |
These must be exposed as static pointers or arrays, with no copying per frame. The test runner reads them synchronously inside the VideoOutput callback, which runs on the emulation thread at the frame boundary, so the core is not writing to them during the read and no locking is needed.
Some test ROMs request a console reset by writing $81 to $6000. The emulator must expose a SoftReset() method that resets CPU/APU state without reloading the ROM:
public static void SoftReset()
{
    // Reset CPU: read reset vector from $FFFC/$FFFD, clear registers
    // Reset APU: reinitialize frame counter, silence channels
    // Do NOT reset PPU fully (some tests depend on PPU state surviving reset)
    // Do NOT unload the ROM or reinitialize mapper
}
This is distinct from a hard reset (power cycle). The test runner calls SoftReset() after a 100ms delay (~6 frames) when it detects $6000 == $81.
Controller tests need programmatic button presses. The emulator must expose button press/release methods:
public static void P1_ButtonPress(byte buttonIndex);   // 0=A,1=B,2=Sel,3=Start,4=Up,5=Down,6=Left,7=Right
public static void P1_ButtonUnPress(byte buttonIndex);
The test runner schedules events by frame number. Each button is pressed at the specified frame and released after a configurable hold duration (default 10 frames ≈ 166ms).
The emulator needs a simple byte-array-based ROM loading interface:
public static bool init(byte[] rom_bytes); // parse iNES header, set up mapper, reset CPU
public static void run();                  // main emulation loop (blocks until exit==true)
init() returns false for unsupported mappers or corrupt headers. run() is called on a background thread by the test runner, which waits for completion via Thread.Join().
The design principle is minimal coupling: the test runner interacts with the emulator core through a small set of touch points (four flags, one event, three memory regions, one reset method, two input methods, and the init/run pair). The emulator core requires zero knowledge of the test runner: it simply checks HeadlessMode to skip GUI creation and fires VideoOutput each frame. All test logic lives in TestRunner.cs.
| Interface | Direction | Type |
|---|---|---|
| HeadlessMode | TestRunner → Core | Static bool flag |
| AudioEnabled | TestRunner → Core | Static bool flag |
| LimitFPS | TestRunner → Core | Static bool flag |
| exit | TestRunner → Core | Static bool flag |
| VideoOutput | Core → TestRunner | Event (per-frame callback) |
| NES_MEM / ppu_ram / ScreenBuf1x | Core → TestRunner | Static memory pointers (read-only) |
| SoftReset() | TestRunner → Core | Static method |
| P1_ButtonPress/UnPress() | TestRunner → Core | Static methods |
| init(byte[]) / run() | TestRunner → Core | Static methods |
This architecture makes the QA system portable: any NES emulator that exposes these interfaces can use the same bash scripts and test ROM collection for automated regression testing, regardless of its internal implementation.
AprNes uses Claude Code (Anthropic's CLI agent) as an AI pair-programmer that directly invokes the test shell scripts, reads failure output, diagnoses root causes, edits source code, and verifies fixes — all within a single iterative loop. The test infrastructure described in Sections 1-8 is the foundation that makes this workflow possible.
- Claude Code installed (npm install -g @anthropic-ai/claude-code)
- .claude/ directory with MEMORY.md recording architecture, build commands, and conventions so the agent retains context across sessions
- run_tests.sh (quick validation) and run_tests_report.sh (full report with JSON/screenshots) on PATH
- nes-test-roms-master/checked/ containing all test suites

The developer tells Claude Code to read TODO.md and continue with the next task. Claude picks the highest-priority unfinished bug and identifies the relevant failing tests.
User: Read TODO.MD and continue with the remaining tasks
Claude: [reads TODO.md, identifies "Bug G — Sprite timing" as next target]
[runs 4 failing tests with --screenshots to capture current state]
Claude uses the test infrastructure to diagnose the bug:
# Claude runs each failing test to see the exact failure message
./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
    --rom nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/09.timing_basics.nes
# → FAIL #3: "upper-left corner too late"
For non-trivial fixes, Claude enters plan mode — a read-only state where it designs the implementation strategy without modifying any files. The plan includes:
The developer reviews and approves the plan before any code is changed.
Claude edits the emulator source files using precise text replacements. Each change is targeted and minimal — only modifying what the plan specified.
Claude invokes the build toolchain and test scripts directly:
# 1. Build
powershell -NoProfile -Command "MSBuild.exe AprNes.sln /p:Configuration=Debug /t:Rebuild"
# 2. Run target tests (the ones we're trying to fix)
./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
    --rom nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/09.timing_basics.nes
# 3. Run regression suite (related tests that might break)
for rom in 01.basics.nes 02.alignment.nes ... 11.edge_timing.nes; do
    ./AprNes/bin/Debug/AprNes.exe --wait-result --max-wait 30 \
        --rom "nes-test-roms-master/checked/sprite_hit_tests_2005.10.05/$rom"
done
# 4. Full regression (all 174 tests)
bash run_tests.sh
If any test fails unexpectedly, Claude reads the failure output, adjusts the fix, and re-runs. This inner loop repeats until all target tests pass with zero regressions.
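The verify-and-retry loop depends on a clear pass/fail tally after each run. The sketch below shows one way a script could derive that tally from exit codes alone. run_suite and stub_runner are hypothetical names, and the stub stands in for a wrapper around AprNes.exe --wait-result --rom so the example runs without the emulator.

```bash
# Hypothetical sketch: count passes and failures from exit codes.
# $1 = runner command taking one ROM path and exiting 0 on pass; remaining args = ROMs.
run_suite() {
    runner=$1; shift
    pass=0; fail=0; failed=""
    for rom in "$@"; do
        if "$runner" "$rom" >/dev/null 2>&1; then
            pass=$((pass + 1))
        else
            fail=$((fail + 1))
            failed="$failed $rom"
        fi
    done
    echo "$pass PASS / $fail FAIL"
    if [ "$fail" -gt 0 ]; then
        echo "failed:$failed"
        return 1
    fi
}

# Stub runner for illustration only: any ROM name containing "bad" fails with code 3.
stub_runner() {
    case "$1" in
        *bad*) return 3 ;;
        *)     return 0 ;;
    esac
}

run_suite stub_runner ok.nes bad.nes || true
# prints:
# 1 PASS / 1 FAIL
# failed: bad.nes
```

Returning non-zero when anything fails lets the caller treat the whole suite as a single gate before moving on.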
Once verified, Claude:
- TODO.md: marks the bug as completed, updates the baseline count
- bugfix/YYYY-MM-DD_BUGFIXNN.md: detailed record of problem, root cause, fix, and test results
- fix sprite timing: per-pixel hit + cycle-accurate overflow: 165 PASS / 9 FAIL (+4)

The developer can request a full HTML report at any time:
bash run_tests_report.sh --json --screenshots
# → report/index.html (interactive dashboard with screenshots)
| File | Role |
|---|---|
| TODO.md | Prioritised bug list: the task queue that drives each session |
| .claude/memory/MEMORY.md | Persistent agent memory: architecture, build commands, conventions |
| bugfix/YYYY-MM-DD_BUGFIXNN.md | Per-fix documentation: problem, root cause, changes, verification |
| run_tests.sh | Quick validation: runs all 174 tests, outputs PASS/FAIL counts |
| run_tests_report.sh | Full report: JSON data + screenshots + HTML dashboard |
| report/index.html | Generated interactive test report with filterable results |
- --wait-result mode returns machine-readable exit codes (0 = pass, 1-127 = fail, 255 = timeout/no result), enabling automated verification loops
- TODO.md provides prioritised task context; MEMORY.md carries architectural knowledge across sessions; bugfix/ records capture the reasoning behind each change

A single Claude Code session progressed from 161 PASS → 165 PASS (+4) by:
- TODO.md → identified Bug G (4 failing sprite tests)

Total: 3 root causes identified, 3 fixes implemented, 0 regressions, all within a single conversation.
A multi-session Claude Code workflow progressed from 169 PASS → 171 PASS (+2) by tackling the most architecturally challenging bug — DMC DMA cycle stealing:
- TODO.md → identified Bug F (5 failing DMC DMA tests, marked "requires architecture refactor")
- [...] ref/ directory) → discovered Load vs Reload DMA distinction
- [...] --expected-crc parameter to TestRunner.cs for CRC-only test support

This fix required understanding NES DMA at the bus-cycle level: GET/PUT cadence, halt scheduling, write-delay rules, and phantom read /OE contiguity. The remaining 1 DMC test failure (double_2007_read) was identified as a separate PPU buffer latching issue, not a DMA bug.