These are my notes from Karpathy’s makemore series. He builds up character-level language models from bigram models to RNNs & all the way through to Transformers. The concepts in each subsequent model build upon the previous ones.
The culminating work is GPT-2 built from scratch in Jupyter notebooks.
The n-gram model is a character-level language model that, given a sequence of characters, predicts the next character. We use the average negative log likelihood as our loss function: maximizing the probability of the training data is equivalent to minimizing the negative log likelihood. Dividing by N (the number of examples) is just normalization, so the loss doesn’t depend on dataset size. We explore a couple of different model implementations here.
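A minimal sketch of computing this loss (the probabilities below are made up, not taken from the notebooks):

```python
import torch

# Hypothetical probabilities the model assigns to the correct next character
# of each training example (3 examples here, so N = 3).
probs = torch.tensor([0.2, 0.5, 0.1])
nll = -probs.log()       # negative log likelihood per example
avg_nll = nll.mean()     # divide by N to normalize across dataset sizes
print(avg_nll)           # lower is better; maximizing probability minimizes this
```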
The first model we look at uses a matrix of counts to make predictions: entry (i, j) counts how many times character j follows character i, so each row holds the counts for a given preceding character. The . character marks both the beginning & the end of each word in our example. The predictions from this count matrix end up identical to those of the single-layer NN, because the NN’s input is a one-hot vector, which just picks out a row of the W matrix. The big downside is that scaling to an n-gram model would mean adding a dimension to the occurrence matrix for each extra character of context, which does not scale.
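A rough sketch of building such a count matrix, assuming a makemore-style word list (the words & variable names here are placeholders):

```python
import torch

words = ["emma", "olivia", "ava"]   # placeholder data; makemore uses names.txt
chars = sorted(set("".join(words)))
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi["."] = 0                        # '.' marks the start & end of each word
itos = {i: c for c, i in stoi.items()}

N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    cs = ["."] + list(w) + ["."]
    for c1, c2 in zip(cs, cs[1:]):
        N[stoi[c1], stoi[c2]] += 1   # row = preceding char, column = char that follows it
```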
The bigram count model does not, in & of itself, have any way to iteratively maximize the probability of the input. We can evolve this concept into a neural network & then apply gradient descent to minimize the average negative log likelihood of the model input. This is a much more flexible model than the bigram occurrence matrix, which will become evident as we evolve toward the n-gram model: it improves on the generation quality of the previous model because it scales efficiently to more features.
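A minimal sketch of that single-layer network, assuming a 27-character vocabulary & a few hypothetical (previous char, next char) index pairs:

```python
import torch
import torch.nn.functional as F

vocab_size = 27
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((vocab_size, vocab_size), generator=g, requires_grad=True)

xs = torch.tensor([0, 5, 13])   # hypothetical input character indices
ys = torch.tensor([5, 13, 13])  # hypothetical target (next) character indices

xenc = F.one_hot(xs, num_classes=vocab_size).float()
logits = xenc @ W                              # one-hot input just selects a row of W
counts = logits.exp()                          # interpret logits as log-counts
probs = counts / counts.sum(1, keepdim=True)   # softmax: each row is a next-char distribution
loss = -probs[torch.arange(len(ys)), ys].log().mean()
```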
The next thing to do is adjust our weights so that we minimize the average negative log likelihood loss. We need a training loop with gradient descent for that.
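A bare-bones version of that loop, reusing W, xs, ys & vocab_size from the sketch above (the learning rate of 50 is just a placeholder):

```python
for step in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=vocab_size).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()

    # backward pass & parameter update
    W.grad = None            # reset gradients from the previous step
    loss.backward()
    W.data += -50 * W.grad   # gradient descent step; lr of 50 is a placeholder
```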
We can abstract this functionality into a PyTorch module to make it easy to integrate & reuse.
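One possible way to wrap this up as an nn.Module (class & variable names here are my own, not from the series):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLM(nn.Module):
    def __init__(self, vocab_size: int = 27):
        super().__init__()
        self.W = nn.Parameter(torch.randn(vocab_size, vocab_size))

    def forward(self, xs: torch.Tensor) -> torch.Tensor:
        xenc = F.one_hot(xs, num_classes=self.W.shape[0]).float()
        return xenc @ self.W   # logits over the next character

# usage: one optimization step on a few hypothetical index pairs
model = BigramLM()
optimizer = torch.optim.SGD(model.parameters(), lr=50.0)
logits = model(torch.tensor([0, 5, 13]))
loss = F.cross_entropy(logits, torch.tensor([5, 13, 13]))  # softmax + NLL in one call
optimizer.zero_grad()
loss.backward()
optimizer.step()
```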