Apr 2021
Made with Node.js
Click here for an interactive web demo of the autoencoder.
This project delves into the generative capabilities of autoencoders, a type of neural network. The basic premise is a network made up of an encoder and a decoder. You can imagine this system as two people sitting on opposite sides of a wall. The encoder receives a picture of something and tries to describe it as concisely as it can to the decoder on the other side of the wall. The decoder then tries to redraw the picture as well as it can based on the description the encoder gave it. At the end of each round, the decoder gets to see the original picture, and the two people refine their strategy to describe and draw the picture as efficiently as possible.
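To make that back-and-forth a little more concrete, here's a rough sketch of one "round" in JavaScript. It's purely illustrative: names like feedForward are placeholders standing in for my library's real calls, and the error measure shown is just one common choice (mean squared error).

```javascript
// Conceptual sketch only -- "encoder" and "decoder" stand in for two trained
// networks; feedForward is a placeholder name, not the library's actual API.
function autoencoderRound(encoder, decoder, imagePixels) {
  const description = encoder.feedForward(imagePixels); // picture -> short description
  const redrawing = decoder.feedForward(description);   // description -> redrawn picture
  return { description, redrawing };
}

// "Refining the strategy" means nudging both networks to shrink this error.
function reconstructionError(original, redrawing) {
  let sum = 0;
  for (let i = 0; i < original.length; i++) {
    const diff = original[i] - redrawing[i];
    sum += diff * diff;
  }
  return sum / original.length; // mean squared difference between the two pictures
}
```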
This project used a library of neural network code that I created (also from scratch) in a previous project, which meant I could focus more on the structure of the autoencoder and on refining other aspects of the whole project.
Some technical improvements
This section delves into some nuances of the code that aren’t necessary for understanding what’s going on. If it doesn’t interest you, feel free to skip to the next section, where some of the cool demonstrations are.
One of the issues with the old project was that the networks ran inside of Chrome, which meant a lot of needless energy was spent running a browser that could instead have been used to speed up the network. To fix this, I switched to Node.js, which basically let me run the JavaScript directly from my computer's console.

Another great benefit of running this directly on my computer was that I had a lot more freedom with saving things to files, so I could build a more robust way to save trained models. I settled on storing the networks as JSON objects (basically text forms of JavaScript objects) and adding a loading function to my code. This system gave me a ton of control over my training, since I could keep track of the performance of each model, test for issues like overfitting, and even pause and resume training if necessary.
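The whole save/load idea boils down to serializing the network's weights into text. Here's a minimal sketch of what that looks like; the real code's method names and file layout may differ.

```javascript
const fs = require('fs');

// Sketch of saving a trained model: JSON.stringify turns the network's
// weights and biases into text that can be written straight to disk.
function saveModel(network, path) {
  fs.writeFileSync(path, JSON.stringify(network));
}

// Loading reverses the process; the library would then re-wrap this plain
// object as a usable network before resuming training or inference.
function loadModel(path) {
  return JSON.parse(fs.readFileSync(path, 'utf8'));
}
```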


However, now that everything was running directly on my computer, I needed a new way to see and interact with the network. Thankfully, Node.js is somewhat similar to Python in that there are tons of packages that can do all sorts of things. I used a popular package called Electron, which builds Chromium-based application windows, to create a window for my autoencoder.
While the end result looks fairly clean, there was a decent amount of plumbing I needed to do behind the scenes to make it work. The Chromium window is basically a completely separate application from the actual script running the network, so the whole thing ended up being similar to a client-server system (just like ordinary webpages). This meant a whole lot of asynchronous requests and retrievals between the network and the window. A byproduct of using these sorts of two-sided systems is that it can be really, really hard to figure out which side a problem is coming from.
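The request/retrieval traffic looks roughly like the sketch below, using Electron's ipcMain/ipcRenderer channels. The channel name and the placeholder decoder object are made up for illustration; the actual project's wiring isn't shown here.

```javascript
// main.js -- the process that owns the neural network
const { app, BrowserWindow, ipcMain } = require('electron');

// Placeholder standing in for the decoder network loaded from its JSON file.
const decoder = { feedForward: (code) => code };

app.whenReady().then(() => {
  const win = new BrowserWindow({ width: 900, height: 600 });
  win.loadFile('index.html');

  // Answer asynchronous requests coming from the window (the "client" side).
  ipcMain.handle('decode', (event, codeValues) => decoder.feedForward(codeValues));
});

// renderer.js -- runs inside the Chromium window, in a separate process:
//   const { ipcRenderer } = require('electron');
//   const pixels = await ipcRenderer.invoke('decode', sliderValues);
```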
Generating images

Okay, but how exactly does this thing generate anything? And what are those blue sliders doing at the bottom of the window?
Recall the “two people split by a wall” analogy and the way the encoder describes some picture to the decoder. To be more specific, the encoder is only allowed to give the decoder a fixed number of descriptors. In my model, it’s 30 specific numbers that the encoder sends to the decoder each time it receives an image.
Once we have a trained model, instead of having the encoder describe images to the decoder, we can just give the decoder our own 30 numbers and have it try to figure out what to draw. To use another analogy: if you asked an artist who was really good at drawing people to draw you a "6-foot-tall skinny man wearing a red top hat and plaid suit," they could probably do it even if they had never seen a person who looked exactly like that. Similarly, a decoder that's really good at drawing letters can make other things it hasn't seen before, like symbols that could pass for letters.
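In code, "giving the decoder our own 30 numbers" really is that simple: run just the decoder half on whatever values the sliders currently hold. A sketch, with feedForward again standing in for the real forward-pass call:

```javascript
// Once trained, the decoder alone can turn any 30 numbers into an image.
function generateFromSliders(decoder, sliderValues) {
  // sliderValues: the 30 descriptors the encoder would normally produce
  return decoder.feedForward(sliderValues); // -> pixel values for the output image
}

// Feeding in made-up descriptions often produces letter-ish symbols:
const randomCode = Array.from({ length: 30 }, () => Math.random() * 2 - 1);
// const pixels = generateFromSliders(decoder, randomCode);
```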




These are some "letter-looking symbols" that the model generated when I messed around with the sliders. This type of generation does have certain limitations that happen to fit the human-artist analogy.
For example, an autoencoder trained specifically on letters wouldn't be good at either describing or redrawing pictures of dogs. In addition, strange or extreme requests typically produce weird results, since the model doesn't really know what to do with them.
Interpolating images
Another cool effect from this “description-based” generation is that you can morph between pictures based on their characteristics.

Since these letters are all just described by numbers on sliders, we can smoothly slide each number from one picture's value to the other's and get this sort of morphing effect.
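A minimal sketch of that morph: blend the two 30-number descriptions step by step and decode each blend (again with a placeholder decoder call).

```javascript
// Linearly blend two 30-number descriptions and decode every in-between step.
function interpolate(decoder, codeA, codeB, steps) {
  const frames = [];
  for (let s = 0; s <= steps; s++) {
    const t = s / steps; // 0 -> picture A, 1 -> picture B
    const blended = codeA.map((a, i) => a * (1 - t) + codeB[i] * t);
    frames.push(decoder.feedForward(blended)); // placeholder forward-pass call
  }
  return frames; // a sequence of images morphing from A to B
}
```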
Principal Component Analysis
A dive into the interesting math behind the project.
While the sliders do correspond to the descriptors from the encoder, I omitted something when I described them earlier. The sliders are not actually the original values from the encoder, but an optimized form of them found through a technique called principal component analysis.
If you think about each of the 30 "descriptors" as a dimension, then interpolating becomes more or less drawing a line between two points. In 2D or 3D it's easier to picture drawing a line between two points, but the idea is the same in our 30-dimensional model. Based on this idea, we can imagine that every single picture we have can be plotted on a 30-dimensional graph using its encoded values.
Something that can happen in autoencoders is that not every one of the 30 descriptors is completely independent of the others; some of them end up partially encoding the same information. Principal component analysis lets us fix that issue.

In essence, this technique finds the perpendicular directions along which the data varies the most, and then reorients the data to match those directions. By doing this, we can see along which lines, or axes, the data has the most variation (i.e., which descriptors are the most important for describing an image). In fact, each of the 30 new principal-component axes has an associated value that describes exactly what percent of the total variation it accounts for (basically, how much of the total information it represents).
For example, the two most important axes in my model combined accounted for about 17.2% of the information used to describe each image. While that isn’t super high, it’s enough that you can see some of the variation when graphing the encoded values.
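Each of those percentages is just one axis's variance divided by the total variance across all 30 axes (these per-axis values are the eigenvalues that come up in a moment). A tiny sketch of the calculation:

```javascript
// Turn the per-axis variance values (the eigenvalues from PCA) into the
// "percent of total variation" figures quoted above.
function explainedVariancePercent(axisVariances) {
  const total = axisVariances.reduce((sum, v) => sum + v, 0);
  return axisVariances.map((v) => (100 * v) / total);
}
// The 17.2% figure above is the sum of the first two entries of this list.
```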

This is a graph using the values from images of the letters D, G, O, and Q. The graph is pretty cluttered, and there’s a ton of overlap, which actually makes sense. All of those letters are pretty similar in shape: they all are somewhat circular with some empty space in the center.

In contrast, here’s a graph of the letters I, O, S, W, and M. This has way more distinct clusters for each letter, which again makes sense given how differently shaped these letters are. However, the W and M (red and pink) clusters are on top of each other, which reflects how similarly shaped the two are.
The specifics of this process involve a bit of statistics and linear algebra: calculating the covariance of every pair of dimensions, storing those values in a matrix, and then finding an eigendecomposition of that matrix. The resulting eigenvectors explain how to reorient the original data onto the new principal-component axes, and the corresponding eigenvalues show how important each axis is.
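As a sketch of the first step, here's how the covariance matrix could be built from the encoded images, each one an array of 30 numbers. The eigendecomposition itself would then be handed off to a numerical eigen-solver, which I haven't shown here.

```javascript
// Build the 30x30 covariance matrix from an array of encoded images,
// where each entry of `data` is a 30-number description.
function covarianceMatrix(data) {
  const dims = data[0].length;
  const n = data.length;

  // Mean of each descriptor across the whole dataset.
  const means = Array.from({ length: dims }, (_, d) =>
    data.reduce((sum, row) => sum + row[d], 0) / n
  );

  // Covariance of every pair of dimensions.
  const cov = Array.from({ length: dims }, () => new Array(dims).fill(0));
  for (let i = 0; i < dims; i++) {
    for (let j = 0; j < dims; j++) {
      let sum = 0;
      for (const row of data) {
        sum += (row[i] - means[i]) * (row[j] - means[j]);
      }
      cov[i][j] = sum / (n - 1);
    }
  }
  return cov; // eigenvectors of this matrix are the principal-component axes
}
```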
If you want to see this work in real time, I put up a live demo page for the autoencoder, with a tab for the standard “image reconstruction” and another for “interpolation.”