TheHans255.com

Brainstorming a Modern Retro Console, Part 1

Or: How I would take an ancient 8-bit processor and make it multithreaded.

by: TheHans255

April 6, 2023

It's time for some mad-scientist project ideas! A few weeks ago, the work I've been doing with the MOS 6502 gave me some inspiration for a modern retro console design that uses the processor, much like the NES and other consoles of the 1980s. I thought: what would I build if I wanted something that carried the same design ethos as those consoles, with retro processors and bespoke graphics and audio hardware, yet had the benefit of modern technology and could run more unique and capable games than those original consoles could?

Ideally, of course, I would build this design myself, but since I have such a huge backlog of projects, and this particular project requires both several FPGAs and knowledge of Verilog that I haven't really used since college, I thought I might post these ideas and plans to the Internet instead, so that I can share some of my software engineering thoughts about these designs and hopefully inspire someone to go forward and build!

This article comes in 4 parts:

All of these components are more or less independent ideas, especially the first and second parts, so feel free to mix and match for your own creations!

(And of course, all of this comes with a disclaimer: I have not built or tested any of this. All of this comes from my knowledge of the 6502 and other electrical engineering that I have picked up passively over the years, and while I believe that these designs would work to the best of my knowledge, I provide no warranty on any of these designs' effectiveness or their safety in mission-critical systems. If I ever get around to physically building these designs, I will update this article with my results.)

Background: The WDC65C02 Processor

First, a quick primer about the main processor we intend to use, the 6502 processor. Or, more specifically, its CMOS-based 65C02 variant. Why would we want to use it, and what benefits does it provide us as console designers?

The 65C02 processor (full name WDC65C02) is an 8-bit microprocessor that's been around since the early 1980s. It is fully compatible with the original NMOS 65xx line, the chips that powered the Atari 2600, NES, Apple II, Commodore 64, and others, and can run at the same speeds and specs (1.1 MHz, 16 address lines, 8 data lines, 8-bit arithmetic, binary-coded decimal mode), meaning that developers used to those 65xx-based systems can transition more readily to a system using this chip. However, this model has several advantages over the original 6502:

And for that 10 USD a pop, what kind of processor do you get? A processor that can access 64 KiB of memory and other devices, 8 bits at a time, at a rate of up to 14 million times per second (or really 7 million instructions per second at best, considering that most instructions take at least 2 cycles to complete). It can read and write memory through a handful of registers, do addition and subtraction, and jump/branch/call code with a 256-byte stack. Absolutely nothing compared to what modern x86_64 or ARM processors can do today, but nothing to sneeze at for the value either. And since any of those 64 KiB of "memory" addresses can actually be wired to any device that responds well to the 65C02's 5 volt signals, it can easily operate many devices and support endless hardware configurations.
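To make those figures concrete, here is a small back-of-the-envelope calculation in Python. It only restates the numbers above; the one extra fact it leans on is that the 6502 family keeps its hardware stack in page 1 ($0100-$01FF).

```python
# Back-of-the-envelope figures for a single 65C02, using the numbers from the text.
ADDRESS_LINES = 16
CLOCK_HZ = 14_000_000            # maximum clock rate quoted above
MIN_CYCLES_PER_INSTRUCTION = 2   # the shortest instructions take 2 cycles
STACK_PAGE = 0x0100              # the 6502 family's hardware stack lives in page 1

address_space_bytes = 2 ** ADDRESS_LINES
peak_instructions_per_second = CLOCK_HZ // MIN_CYCLES_PER_INSTRUCTION
stack_range = (STACK_PAGE, STACK_PAGE + 0xFF)

print(address_space_bytes // 1024, "KiB addressable")
print(peak_instructions_per_second, "instructions per second at best")
print(f"stack occupies ${stack_range[0]:04X}-${stack_range[1]:04X}")
```

Real code will land well below that peak, since many instructions take 3 to 7 cycles.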

Making A 6502 Frankenstein

Now for the fun part - since you've been primed on what the 65C02 is, we can talk about what we're going to do with it: strap a bunch of them together in a demented attempt to let them run multithreaded code!

When raw clock speeds stalled in the late 2000s (modern CPUs have been stuck between 2 and 4 GHz for over a decade), chip manufacturers responded by stuffing multiple "cores" into the same chip, each one a separate processor that can run code at the same time as the processors next to it. Writing code to take advantage of multiple processors is harder than writing traditional, single-threaded code, but it allows much more raw power than a single processor can provide, and also lays the groundwork for orchestrating far more distributed computing power across a network like the Internet. It stands to reason that the same principle could be applied to these low-power CPUs as well, giving developers more raw power with multiple, independent processors at their fingertips.

There is not much historical precedent for sharing the same memory across multiple 6502s - most devices that use a 6502 or 65C02 use only one. However, several 65xx-based devices, such as the Apple II, do have a history of sharing the same RAM with another device: the video controller. Everything in the Apple II - the CPU, the RAM, and the video controller - would run off the same overarching clock, and the CPU and the video controller would take turns, one cycle at a time, accessing the RAM to either modify memory or generate the output image. We will use this same principle to orchestrate our 65C02 processors - each of them will take turns with a central bank of public RAM, each running at its own clock speed while the RAM runs at that speed times the number of CPUs we orchestrate this way.

To show an example of what I mean by this, let's say that an array of 3 processors is running this 65C02 assembly code below:

1000: LDX #$00
1002: LDA $2000,X
1005: CMP #$80
1007: BEQ $1010
1009: INX
100A: BNE $1002

Each processor has its own internal state that puts it at a different part of the program, yet all of them share the same memory, including both this program and the data at $2000 that it reads.

  1. A clock cycle begins on the RAM and on processor 1 (the clock lines on processors 2 and 3 are unchanged). Processor 1 has its program counter set to $1009 and is at the instruction fetch step, so it interacts with the RAM to fetch the byte at $1009. It retrieves the byte for the INX instruction and sets itself up to execute that instruction on the next cycle, also incrementing its program counter.
  2. The clock cycle ends, and both the processor and RAM rest.
  3. Another clock cycle begins, this time on the RAM and on processor 2 (the clock lines on processors 1 and 3 remain unchanged). This processor has its program counter set to $1000 - it has just entered the subroutine - and is also in the instruction fetch phase. It retrieves the byte for the immediate-mode LDX instruction and sets itself up to execute that instruction, also incrementing its program counter.
  4. The clock cycle ends again.
  5. Another clock cycle begins, this time on the RAM and on processor 3. This processor has its program counter set to $1007, but this time is performing the last step of the CMP instruction at $1005. This involves subtracting $80 from the value it has already stored in the accumulator (let's say it's $7F, though that doesn't matter for this example) and setting the resulting processor status flags.
  6. The clock cycle ends again.
  7. Yet another clock cycle begins, this time on the RAM and processor 1 again. The processor has its program counter set to $100A, but is ready to execute the INX instruction. It increments its internal X register (which is separate from those of the other two processors) and sets the flags.
  8. The above cycle continues, with each processor fetching and executing instructions in turn. At any given time, at most one of them has full control of the public memory, meaning that all accesses will remain coherent, as if only one processor were accessing them.
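The turn-taking in the steps above can be modeled in a few lines of Python. This is only a sketch of the scheduling idea, not a hardware description; the function names are made up for illustration.

```python
from itertools import cycle, islice

def bus_schedule(num_processors, num_ram_cycles):
    """Return which processor owns the shared RAM on each RAM clock cycle."""
    owners = cycle(range(1, num_processors + 1))
    return list(islice(owners, num_ram_cycles))

def cycles_seen_by(processor, schedule):
    """Count how many of its own clock cycles a processor received."""
    return schedule.count(processor)

# Seven RAM cycles shared by three processors: processor 1 gets the
# first and fourth cycles, just as in the walkthrough.
schedule = bus_schedule(num_processors=3, num_ram_cycles=7)
print(schedule)  # [1, 2, 3, 1, 2, 3, 1]
```

Because exactly one owner is listed per RAM cycle, no two processors can ever touch the shared memory at the same time.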

An illustration of multiple processors sharing central RAM

The actual full clock speed would be either 14 MHz times the number of processors or the maximum speed supported by the central RAM chips, whichever is slower. Any number of other devices that need access to the public RAM, such as a video controller, can also take turns as needed in this round-robin configuration.
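As a quick worked example of that rule, the sketch below computes the RAM clock and the resulting per-device clock. The 40 MHz RAM limit is an assumed figure chosen only to show both cases, and `shared_clock_plan` is a hypothetical helper name.

```python
CPU_MAX_HZ = 14_000_000  # per-processor ceiling quoted in the text

def shared_clock_plan(num_bus_users, ram_max_hz, cpu_max_hz=CPU_MAX_HZ):
    """Return (ram_clock_hz, per_user_clock_hz) for a round-robin bus."""
    ram_clock = min(cpu_max_hz * num_bus_users, ram_max_hz)
    per_user = ram_clock / num_bus_users
    return ram_clock, per_user

# Assumed RAM that can sustain 40 million accesses per second.
print(shared_clock_plan(2, 40_000_000))  # (28000000, 14000000.0)
print(shared_clock_plan(4, 40_000_000))  # (40000000, 10000000.0)
```

With two bus users the processors are the limit and each still runs at 14 MHz; with four, the assumed RAM becomes the bottleneck and each device drops to 10 MHz.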

As for what exactly is in the central memory, it could really be anything you want, just as it would be with any other 6502-based design. A good chunk of it would likely be RAM, though it would also necessarily include some ROM to initialize your processors (including both the interrupt vectors at $FFFA-$FFFF and the code to get the system moving), and may also include some memory-mapped IO to control external devices.

Making it Practical with Private Memory

While the above likely works in principle, it would not be practical from an actual system design standpoint, since there is more state than just the internal processor registers that would need to be kept separate between processors. In particular:

All of these issues can be solved by wiring a small bank of "private memory" to each processor - for addresses below a certain threshold, memory accesses will go to this local memory chip instead of the public chips. This memory will be at least 512 bytes to contain the zero page and stack page, and could contain much more depending on the needs of the system (perhaps up to 16 KB). Somewhere in that area, a single address would instead go to a read-only contraption that gives the processor its identifying number, allowing it to take whichever path it needs to in order to start following orders.
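Here is a minimal Python model of that address decoding. The 16 KiB threshold and the location of the ID register are assumptions picked for illustration (the text fixes neither), and the class names are hypothetical.

```python
PRIVATE_LIMIT = 0x4000   # assumed threshold: addresses below this are private (16 KiB)
ID_REGISTER = 0x3FFF     # assumed address of the read-only processor ID

class SharedBus:
    """The central memory that every processor sees."""
    def __init__(self):
        self.memory = bytearray(0x10000)

class Processor:
    def __init__(self, cpu_id, bus):
        self.cpu_id = cpu_id
        self.bus = bus
        self.private = bytearray(PRIVATE_LIMIT)

    def read(self, address):
        if address == ID_REGISTER:
            return self.cpu_id               # hard-wired identifier
        if address < PRIVATE_LIMIT:
            return self.private[address]     # zero page, stack, local RAM
        return self.bus.memory[address]      # public memory

    def write(self, address, value):
        if address == ID_REGISTER:
            return                           # read-only: writes are ignored
        if address < PRIVATE_LIMIT:
            self.private[address] = value & 0xFF
        else:
            self.bus.memory[address] = value & 0xFF

bus = SharedBus()
cpu1, cpu2 = Processor(1, bus), Processor(2, bus)
cpu1.write(0x0010, 0xAA)   # zero page: only cpu1 sees this
cpu1.write(0x8000, 0x55)   # public memory: both processors see this
print(cpu2.read(0x0010), cpu2.read(0x8000), cpu2.read(ID_REGISTER))  # 0 85 2
```

Startup code held in the shared ROM could read the ID register and branch on the result, so every processor can boot from the same reset vector and still end up doing a different job.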

Beyond this, the private memory area can also be designed any way you want, much in the same way the central memory is designed (besides the fact that you would probably not include any ROM besides the chip identifier). This can allow you to control various thread-local or processor-exclusive devices, such as giving each processor a co-processor or serial communications line, or giving one processor exclusive access to the audio controller or external storage media.

Accelerating Processing with Private-Exclusive Mode

With this setup in place, we are already well on our way to a multithreading-capable 65C02 system. However, if we're operating on a system where our shared clock speed is limited by the central public memory, each processor is still held back from its full potential. Therefore, if a processor doesn't need access to the central memory and can do its processing just fine on its own private memory, then it should be welcome to do so and go up to its full speed without disturbing central memory.

We will do this by including a soft switch in each processor's private memory that allows it to switch between a shared mode and a full speed mode, which the processor can manipulate by accessing two side-by-side memory addresses.

It is likely that the process to switch from one to the other would be a several-cycle process, in which time the processor cannot expect any strict wall-clock timing to its instructions and should not access central memory until the process is complete. If this takes an indeterminate amount of time, the processor would also benefit from a memory-mapped IO address that tells it whether it is connected to central memory.
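The soft switch and its status flag could behave roughly like the Python model below. All three addresses, the 3-cycle transition time, and the class itself are assumptions for illustration; real hardware would pick its own values.

```python
SWITCH_SHARED = 0x3FF0   # assumed: touching this address requests shared mode
SWITCH_FULL = 0x3FF1     # assumed: touching this address requests full-speed mode
STATUS = 0x3FF2          # assumed: reads 1 when connected to central memory, else 0

class ModeSwitch:
    def __init__(self, transition_cycles=3):
        self.mode = "shared"
        self.target = "shared"
        self.transition_cycles = transition_cycles
        self.remaining = 0   # cycles left in the current transition

    @property
    def connected(self):
        """True only when settled in shared mode."""
        return self.mode == "shared" and self.remaining == 0

    def access(self, address):
        """Model a memory access to one of the switch addresses."""
        if address == SWITCH_SHARED:
            self._request("shared")
        elif address == SWITCH_FULL:
            self._request("full")
        elif address == STATUS:
            return 1 if self.connected else 0
        return 0

    def _request(self, target):
        if target != self.target:
            self.target = target
            self.remaining = self.transition_cycles

    def tick(self):
        """Advance the transition by one clock cycle."""
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.mode = self.target

switch = ModeSwitch()
switch.access(SWITCH_FULL)          # leave the shared bus
while switch.mode != "full":
    switch.tick()

switch.access(SWITCH_SHARED)        # ask to come back
polls = 0
while switch.access(STATUS) == 0:   # poll until central memory is safe to use
    switch.tick()
    polls += 1
print(polls)  # 3
```

The polling loop at the end is the pattern a program would follow: request shared mode, then spin on the status address before touching central memory again.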

The primary use of this acceleration mode would be complex batch processing - a processor could, for instance, copy a set of game objects into its private RAM, switch into full-speed mode to run an O(n²) collision check or sorting operation on them, then go back into shared mode to copy the results back into central RAM. A processor might also put itself into full-speed mode in order to perform a timing-critical operation - for instance, load an audio sample into private RAM and go into full-speed mode to play it, knowing that interrupts would not mess up the timing.
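Whether the round trip pays off depends on how much work there is relative to the copying. The rough cost model below makes that trade-off explicit. Every number in it is an assumption (a 5 MHz shared clock, a 14 MHz private clock, and made-up cycle counts), and it charges the copy and switch cycles at the slower shared rate.

```python
def shared_only_seconds(work_cycles, shared_hz):
    """Time to do all the work in place, at the shared clock rate."""
    return work_cycles / shared_hz

def offload_seconds(work_cycles, copy_cycles, switch_cycles, shared_hz, full_hz):
    """Time to copy in and out, switch modes, and do the work at full speed."""
    overhead = (copy_cycles + switch_cycles) / shared_hz
    return overhead + work_cycles / full_hz

SHARED_HZ = 5_000_000    # assumed per-processor speed while sharing the bus
FULL_HZ = 14_000_000     # full speed on private memory

# A heavy job: lots of computation compared to the copying.
heavy_in_place = shared_only_seconds(1_000_000, SHARED_HZ)
heavy_offload = offload_seconds(1_000_000, 20_000, 10, SHARED_HZ, FULL_HZ)

# A tiny job: the copying costs more than the speedup saves.
tiny_in_place = shared_only_seconds(1_000, SHARED_HZ)
tiny_offload = offload_seconds(1_000, 20_000, 10, SHARED_HZ, FULL_HZ)

print("heavy job faster when offloaded:", heavy_offload < heavy_in_place)
print("tiny job faster when offloaded:", tiny_offload < tiny_in_place)
```

Under these assumed numbers the heavy job benefits from the switch and the tiny one does not, which matches the intuition that full-speed mode suits batch work rather than quick one-off operations.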

Illustration of one chip with private memory attached

Other Notes and Limitations


That's it for part 1! Join me next time for Part 2, where I talk about the even crazier GPU I've designed!


Copyright © 2022-2024, TheHans255. This work is licensed under the CC BY 4.0 license - permission is granted to share and adapt this content for personal and commercial use as long as credit is given to the creator.