James revisits the world of 8-bit game programming in creating the Gameduino, an 8-bit graphics and sound shield for the Arduino.
The 80s: big hair, muscle cars, and graphics systems that were... interesting. To save precious bytes, early systems like Nintendo had graphics displays that were not directly pixel-addressable. Instead, they let you define a few hundred background tiles, typically 8x8 pixel bitmaps. The display was made by reference to these tiles, using a dedicated RAM called the tile map.
Systems like this were very efficient for rendering Mario World, with its repeating block elements, but had trouble showing arbitrary images. Eighties artists spent hours—sometimes days—working around this restriction, carefully crafting a screen image to fit in the limited tile budget.
Since we now live in the glittering future, is it possible to automate the process, and show arbitrary images on this hardware?
Gameduino: Tiles Are Back!
I recently revisited this world when I made the Gameduino, an 8-bit graphics and sound shield for the Arduino. The background graphics are tile-based; the tiles are 8x8 pixels, and the screen is 50 tiles wide by 38 tiles high. As you might expect from an 8-bit graphics system, there are 256 unique tiles.
Instead of a storage location for every pixel on the screen, the system only needs storage for a tile map—the array of bytes that give tile numbers— and the graphics for the tiles themselves. This saves a lot of RAM, and has other advantages.
First, it can make drawing graphical elements very efficient: for example to paint a 32x56-pixel element, the CPU needs to update a 4x7 rectangle to the tile map, so it only writes 28 bytes of video memory. In the days of 8-bit CPUs running at a Megahertz, this would have been a significant economy.
Second, 8-bit game authors used the tile map hardware for animation effects. A small graphics update can cause a dramatic on-screen change by using a tile repeatedly on the screen and then changing the tile’s graphics. Conveyor belts, ocean waves, lightning strikes: this trick is one of the staples of 8-bit game animation.
Displaying Any Image
Since there are 256 tiles available, drawing an image is straightforward if it uses 256 or fewer tiles. If the image is small, say 128x128 pixels, then it requires 16x16=256 tiles. Easy enough.
The challenge is how to deal with a larger image. The maximum case, of an image that covers the whole screen, uses 2000 tiles. With only 256 unique tiles available, each tile should be used an average eight times. Hopefully, some parts of the image contain less detail—for example a dark shadow—and can share the same tile. Some parts of an image contain such important detail—an eye, for example—that they should be allocated a tile that is used only once.
The Math: K-means Clustering
This is a classic scarcity problem: we’d like 2000 unique tiles, but only have 256. The perfect tool for this job is k-means clustering. The k-means algorithm clusters a collection into groups based on likeness: we supply it with 2000 tiles, and after a prodigious amount of computation, it gives back 256 clusters of tiles, where the tiles in each cluster are reasonably alike.
The k-means algorithm operates on points distributed in space; in the illustration above it is 2-dimensional, but k-means can use any N-dimensional space we choose. The input to k-means is this collection of points in N-dimensional space—in our case each point represents a tile—and the algorithm returns a cluster: each original tile is in a group of tiles that are visually similar.
For implementation, I chose to work in Python with OpenCV. OpenCV is a popular computer vision library, and has a really excellent Python interface (it should be mentioned here that I’m the original author of its Python interface.) In addition to an implementation of the k-means algorithm, OpenCV has various other imaging operations that came in handy as I refined the implementation. OpenCV from Python gives very fast graphics operations, with the easy prototyping that comes from Python. Happily, Gameduino’s data preparation tools are Python-based, as well.
The key to getting good results from k-means is to choose a good space for it to operate in.
Choosing a Space
One simple approach would be to have k-means work in an RGB color space. The coordinates of a tile are simply the (R,G,B) average of the tile’s pixels. So tiles that are approximately the same color are clustered together. But this ignores all the actual pixel detail in each tile, and our perception of color changes. For example, we are much more sensitive to green shading differences than blue ones.
Much better to use a couple of well-known image processing tricks to place the tiles in a perceptually metric space. “Perceptually metric” means that visually similar tiles are close together in this space. Put another way, if two tiles look alike, they should be near to each other. This means that k-means can know that they are neighbors, and treat them as candidates for clustering.
In order to do this perfectly, you’d need a complete simulation of the human visual system, so that you could compute just what it means for two tiles to “look alike.” Fortunately an approximation does quite well enough. Representing pixels in a color space like YUV makes similar-looking colors have similar YUV coordinates—which is why it is used extensively in image compressions systems like JPEG and MPEG:
The next trick is to not use the tile’s pixel values directly, but to apply the DCT—Discrete Cosine Transform—to them:
Without going into too much detail, using the DCT expresses the tiles’ features from coarse to fine. For example, two tiles that have a dark blob on the left will have similar DCT values, even if the shapes of the blobs differ.
What do these things give us? For each tile, originally containing 64 pixels of RGB data, we convert the pixels to YUV space and apply the DCT to each channel. This results in a string of 192 numbers—coordinates in a 192-dimensional space where visually similar tiles are close together. After the samples are prepared, the k-means call itself is:
The labels argument is the k-means output: a list of labels, one per input tile, that says which cluster the tile belongs to.
Who’s Afraid of the 192nd Dimension?
The k-means implementation works just fine in any space—including 192 dimensional ones—and we might as well use all the data available so that it can make good choices about tile clustering. The fact that each tile is represented by a point in 192-dimensional space is really just an implementation detail—while we might have a lot of difficulty imagining a 192-dimensional space, the math of distances works just the same in higher dimensions, which is all k-means needs to know about.
This is not to say that very high dimensionality is always a good idea—in many cases it can be a recipe for numerical disaster. For example a tiny amount of noise in every dimension can utterly thwart the search for a signal. But in this case the goal is the expression of proximity, and higher dimensionality is fine.
Average of Clusters
After k-means gives us 256 clusters, the rest of the work is straightforward. For each cluster, the code merges the member tiles—just a simple average of pixels—and uses this average tile in place of all the original members:
This completes the process: for every 8x8 tile in the original image we now have a ‘closest match’ in the carefully-chosen set of 256 tiles.
How does it look? Well, it depends on the image. For images that contain a lot of fine detail, the algorithm does not do well: it tries to reuse tiles even though they are not visually similar, resulting in a scrambled mess. But for images that have a limited amount of detail, like this portrait, it does quite well.
The k-means clustering has fully used its ‘budget’ of 256 tiles. A handful of tiles reused many times make up the low-detail segments of the image—the background and the table. And it has used a few tiles once for the high-contrast details like eyes and the outline of the shirt.
How Much Does It Cost?
The encoding process takes about 1 second and 22 megabytes on a modern workstation-class computer. A rough estimate, given 30 years of Moore’s Law, is that the encode program would have taken about 12 days on a 1981 machine, assuming you could have found one with enough memory, and enough time between hair-care and car-polishing duties.
Many thanks to Scott Lembcke for his original suggestion and implementation for using k-means for tile allocation. CC source image by Steve Evans.
James Bowman wrote games for 8- and 16-bit consoles, and later made graphics software and hardware at SGI, 3dfx, and NVIDIA. James now consults on graphics, imaging, and hardware.