Research notes and writeups accumulated during AprNes development, ranging from NES hardware simulation theory to microarchitecture-level performance optimisation in C# / .NET, organised by category. More articles will be added over time.
A Q&A-style coverage of the main implementation techniques behind modern emulators: JIT, DBT, KVM, LLVM backend, Static Recompilation, HLE GPU bridging, AI texture upscaling, rollback netcode, MSU-1 audio replacement, FPGA simulation, Visual6502 transistor-level simulation, and Lean formal verification. The opening section explicitly separates "language JIT" (.NET / .NET Framework / Java) from "emulator JIT / Dynarec" — two concepts most newcomers conflate. The remaining seven sections walk through each technique's applicability, limits, and implementation challenges, plus suggested practice targets for the four core techniques (JIT / DBT / KVM / Static Recomp).
Walks through Nintendo's 12 mainstream home consoles and handhelds in release-date order (NES → Game Boy → SNES → N64 → GBC → GBA → GameCube → NDS → Wii → 3DS → Wii U → Switch). Each console covers four aspects: hardware architecture and design philosophy, release date, core implementation challenges (CPU sync, PPU/GPU pipeline, encryption, HLE/LLE trade-off, etc.), and notable open-source emulator references. Closes with a difficulty ranking and a suggested learning path — for developers wanting to write their own emulator or evaluate which console to take on.
A linear tutorial designed for "programmers who can code but aren't comfortable with hardware." Starts from NES hardware concepts and walks through every subsystem against the AprNes / NesCore implementation: ROM loading, CPU bus, the 6502 core, master-clock synchronisation, PPU memory pipeline, APU, DMA, controllers, and Mappers 0-4. Throughout, it grounds abstract concepts with everyday analogies (kitchen / chef / counter / conveyor). Each chapter follows a consistent shape: hardware concepts → beginner-friendly model → AprNes implementation mapping → common mistakes → recap. The two appendices (A1 computer organization primer and A2 complete 256-opcode reference) work as standalone references.
Catalogues every performance change on AprNes master since 2026-03-15,
focusing on optimisations below the language/runtime level. Covers 11
categories: bitwise tricks (& instead of %),
branchless code, lookup tables (LUT), magic numbers, SWAR, true SIMD,
integer-for-float (Bresenham / fixed-point), loop unrolling and ILP,
function-pointer static dispatch, cache-line-aware data layout, and
redundancy elimination. Each section references real before/after
commits.
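As a rough illustration of the first two categories, the sketch below shows the `&`-for-`%` identity and a branchless select. It is written in C rather than C#, it is not taken from the AprNes commits, and the helper names `wrap_pow2` and `select_u32` are hypothetical. The identity only holds for unsigned values and power-of-two sizes, for example mirroring the NES's 2 KB internal RAM with `addr & 0x7FF`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: wrap an index into a power-of-two sized range.
   For unsigned x and size = 2^k, x % size equals x & (size - 1). */
static inline uint32_t wrap_pow2(uint32_t x, uint32_t size)
{
    /* The identity is only valid when size is a non-zero power of two. */
    assert(size != 0 && (size & (size - 1)) == 0);
    return x & (size - 1);
}

/* Hypothetical helper: branchless select.
   Returns a when cond is non-zero, otherwise b, without a jump. */
static inline uint32_t select_u32(uint32_t cond, uint32_t a, uint32_t b)
{
    /* mask is all ones when cond != 0, all zeros otherwise. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Whether either form actually beats what the compiler or JIT already emits depends on the surrounding code, so measuring before and after remains the deciding step.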
Starting from the game loop, walks through CPU cache hierarchy, hot/cold path splitting, multi-core pipelining, and thread affinity, with the actual PMU / ETW analysis workflow used in AprNes. Covers the inlining-vs-I-cache trade-off, quantitative tools and a laddered strategy, what to do when the hot path overflows, inter-core communication cost, how to ensure C# threads actually land on different cores, and finishes by extending these ideas to high-concurrency web services. Targets developers who want to understand performance at the level of JIT behaviour and CPU microarchitecture.
How finely should an emulator simulate "time"? It looks like a performance question, but it's also an architecture, correctness, engineering-cost, and maintainability question. The article classifies common timing models from frame-based, scanline, cycle, dot, all the way to sub-cycle / master clock — comparing what each tier buys, what it costs, and which test ROMs it can pass. A starting reference for designing a new emulator or assessing an existing one's accuracy.
Catch-up is the dominant cost-saving approach in many emulators — let one
component run ahead and let other components catch up later when needed.
But in cycle-accurate designs, the side effects of catch-up (delayed
decisions, patchwork logic, accuracy degradation) become a liability.
The article explains why AprNes chose the
Mem_r → tick() → 3× ppu_step global tick model and what
hot-loop optimisations that structure enables.
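The shape of that call chain can be sketched as follows. This is a minimal model in C, not the AprNes source: the counters, the RAM-only bus, and the choice to tick before the read are all assumptions made for illustration. The 3:1 ratio reflects the NTSC NES, where the PPU advances three dots per CPU cycle.

```c
#include <stdint.h>

/* Hypothetical state for the sketch. */
static uint64_t cpu_cycles;   /* CPU cycles elapsed */
static uint64_t ppu_dots;     /* PPU dots elapsed */
static uint8_t  ram[0x800];   /* 2 KB internal RAM only; no other bus devices modelled */

/* Stand-in for one PPU dot of work. */
static void ppu_step(void)
{
    ppu_dots++;
}

/* One CPU cycle: every other component advances in lockstep here,
   so nothing ever needs to catch up later. */
static void tick(void)
{
    cpu_cycles++;
    ppu_step();
    ppu_step();
    ppu_step();               /* NTSC: 3 PPU dots per CPU cycle */
}

/* Every bus read costs exactly one tick (assumed here to happen before the read). */
static uint8_t mem_r(uint16_t addr)
{
    tick();
    return ram[addr & 0x7FF]; /* RAM mirroring; the sketch ignores addresses above $1FFF */
}
```

Because the PPU state is always current when the CPU touches the bus, a register read sees the right value without any deferred bookkeeping.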
Many people think a per-scanline emulator is just coarser, less precise,
and easier to write. In practice, getting a per-scanline emulator to run
most games without breaking requires layering many hacks and special
cases: MMC3 IRQ timing, PPU internal state when $2007 is
written, the precise cycle of sprite-0-hit, DMA cycle stealing, and so
on. The article enumerates these hidden costs and explains why going
cycle-accurate is actually the cleaner design.
New tutorials and writeups will keep coming, including NTSC signal simulation, CRT shader design, SIMD programming practices, cross-platform porting experience, and more. Discussions and topic suggestions on GitHub issues are also welcome.
Visit GitHub