3blue1brown : Attention in transformers (early view)

can’t wait to watch the third part of the LLM series!

Hangyu Zhou

2024-05-30 11:13:10 +0000 UTC

Can’t wait for your third video in this series. Rough idea of when it will be available… :)

Gaurang Khetan

2024-05-29 00:45:03 +0000 UTC

Will there be an episode that explains more about the feed-forward network part? Perhaps also in Mixture of Experts type models? Thanks!

Shy Peleg

2024-05-14 09:50:05 +0000 UTC

Can you use backpropagation as described for a multi layer perceptron for a transformer, too? I would expect some adaptation is required.

Jan Prummel

2024-05-08 08:17:41 +0000 UTC

There's a small, unimportant nuance to mention. In one of the animations showing the entire text to be analyzed, a few of the lines starting from a given token seem to point to future tokens in the text. The technique described here, however, discards all the values under the diagonal in the attention matrix.

Tomas Javurek

2024-05-02 08:58:21 +0000 UTC
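
The masking discussed in the comment above can be sketched in a few lines. This is a minimal illustration rather than anything from the video: the scores are made-up numbers, `causal_softmax` is a hypothetical helper name, and rows are assumed to be queries with columns as keys, so the discarded entries sit above the diagonal here (in the transposed layout they sit below it).

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where query i may only attend to keys j <= i.

    Assumed layout: scores[i][j] is the score of query i against key j.
    """
    weights = []
    for i, row in enumerate(scores):
        # Set scores for later positions to -inf so they become 0 after softmax.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up 3x3 score grid, purely for illustration.
scores = [
    [2.0, 1.0, 0.5],
    [0.3, 1.5, 4.0],
    [1.0, 1.0, 1.0],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row still sums to 1, and every entry for a later token is exactly 0, so no token is influenced by the tokens after it.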

Looking forward to the next chapters!

Carlos Morais

2024-04-17 07:47:10 +0000 UTC

I would love that too. Would also include Prompt Tuning (Soft Prompts), as it is the standard fine-tuning technique for most hyperscalers.

Carlos Morais

2024-04-17 07:46:43 +0000 UTC

Thanks Chris, this was very helpful!

Pradeep Madhavarapu

2024-04-08 02:07:01 +0000 UTC

Hi Grant, great video. So the attention layers allow for cross-contextualization between tokens but are all linear transformations, and the FFNN layers allow for non-linear transformations of tokens but don't include cross-contextualization? Will you cover the intuition behind the separation of these two aspects in the next video?

Meghan

2024-04-08 01:33:14 +0000 UTC

Why not animate at 60fps? You make so many high effort animations & making them smoother seems like an easy way to improve quality. (On YouTube the video only appears for me at 30fps / not sure how much more effort it would be to upload at 4k60)

Max

2024-04-07 19:25:49 +0000 UTC

Great video! One question I still have is why does the final token become the prediction for the next token? I'm guessing it happens because the network is trained to do that (that is, the loss function compares the updated final token with the true next token). But what are the benefits of doing that? Are there other ways to do it? Could the network instead be trained to make the updated first token be the prediction of the next token? Could we append a dummy vector of all 0's in the position of the next token, and update that?

Eren Nimzowitsch

2024-04-07 01:26:08 +0000 UTC

The glass ball example is more intuitive and makes more sense alongside the blue fluffy animal. Personally, I found the video a bit exhausting, but that's acceptable considering the complicated nature of the topic. Thanks for picking this topic to explain! The fact that the number of parameters in the value matrix is reduced using a low-rank transformation blew my mind. It saves a lot computationally. I think this fact should also be underlined in your video... Is this the last part of the Transformers/GPT videos? The count of parameters is also not finished, and I think decoders are missing, so I reckon there is a continuation video that explains things through to the final prediction.

Sai Nishanth

2024-04-07 01:14:35 +0000 UTC

They're created through random initialisation. They are tuned using a similar method to the ones described earlier in the original series. In fact, there really isn't a whole lot going on differently here, just a lot more operations in total.

finlay morrison

2024-04-06 17:37:10 +0000 UTC

Just to push back on this a bit, I don't believe it's been demonstrated that value matrices are superfluous. I took a look at both papers and their results were significantly narrower than demonstrating that the value matrix is not useful in general. Their findings were in narrow use cases and with relatively small models. I think this may be a case of non-results typically not being worth publishing, so if most other use cases found that removing value matrices harms performance, nobody is going to write a paper about it. And if you look at most transformer literature, the value matrices remain despite both the training and memory benefits of not using them. Not to say that this is a guarantee it is correct, but it does influence my guess. I work in this space and my research peers would be incredibly eager to publish on a finding as juicy as value matrices not impacting performance in the general case (across architectures, model sizes, and tasks).

Chris Duvarney

2024-04-06 15:09:21 +0000 UTC

To clarify this point a bit, in a decoder-only architecture (which ChatGPT uses), each token is essentially transformed into what the NEXT token is predicted to be. So by the end of the transformer, the 3rd token, for example, has been transformed into a prediction of the 4th token. You can view the “enriching” process as first collecting information to understand what each token actually means in context, but then deeper in the network (after many layers) the tokens are transformed into what they predict the next token will be. Since each token is transformed into what it predicts the next token will be, you can take the output of the last token (a bit of a simplification) and smoosh that onto your input sentence. Then the output of that new token can be used in the same way, and so on.

Chris Duvarney

2024-04-06 14:59:15 +0000 UTC
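
The generation loop described in the comment above can be sketched with a toy stand-in for the transformer. Everything here is hypothetical: `toy_model` is just a lookup table that returns one predicted next token per input position and ignores context entirely, which a real network would not. It is only meant to show that every position produces a prediction while only the last one is appended.

```python
# Hypothetical stand-in for a decoder-only model: a fixed "next word" table.
NEXT = {"the": "cat", "cat": "sat", "sat": "down", "down": "."}

def toy_model(tokens):
    """Return one prediction per position: output[i] is the guess for token i+1."""
    return [NEXT.get(tok, ".") for tok in tokens]

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        predictions = toy_model(tokens)   # one prediction for every position
        tokens.append(predictions[-1])    # only the last one extends the text
    return tokens

result = generate(["the", "cat"], steps=2)
print(result)
```

No extra vector is added for the new token before it is predicted; the prediction is read off the final position, and the chosen token then becomes a new input position for the next pass.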

To clarify on this bit, encoder and decoder models are often determined by how they are trained/ used. A decoder model specifically tries to transform the embedding of some input token into the embedding of the token that comes after it so that it is effectively predicting the word. Since the task is predicting the next token it requires the use of causal masking (masking tokens that come after the current token). This is essential for generating text since the output of each token is the next word which you can add on to your input sentence and then generate the next token etc until you have your entire output. An encoder only model uses the same architecture of transformer and technically is independent of masking. But the output representations are not predicting what the next token should be and are instead representing some other information that will be used. They are also typically associated with not using masking for reasons I’ll describe below. You can think of ViTs (vision transformers which are transformer models used for processing images) as encoder only models since they aren’t trying to create representations that are predicting the next part. Encoder-decoder models try and combine these. Imagine if someone asks you a question and you’re trying to formulate a response. In a decoder only model you group the question and response into one block and just try and predict the response from the question directly since those can just be viewed as successive tokens. Alternatively, you can use an encoder on the question (since it is given all at once) so all tokens can attend to each other. Then you use a separate decoder model to actually generate your response. The decoder model is able to attend to all the representations of the encoder model and then causally attend to what it generated. 
You could imagine that this could be theoretically useful since the tokens in the encoder don’t have to try to predict what the next token is and only have to focus on collecting information useful to the decoder AND they aren’t causally masked so all the tokens can understand their meaning better by knowing the full context that they are in. Hope that helped!

Chris Duvarney

2024-04-06 14:53:09 +0000 UTC

Great video! What I am still missing is how the Wq, Wk, Wv matrices are created and tuned. Without that, the process still seems like magic (magically there are such matrices that give the expected results). If this will be explained in future videos, it would help to mention it. If not, it would help to give focused references for where this is explained.

David Bar-On

2024-04-06 08:12:04 +0000 UTC

Glass ball example would be great, especially if you included some visualizations similar to the fluffy creature one. Very good illustration of how the subsequent steps are using the key vectors corresponding to the derived embeddings (ball > glass ball) (table > steel table) to affect alignment with the query vector for "shattered" in the subsequent layer. I think that's what you mean at least, I'm learning this for the first time myself. Re: cross attention, if I understand correctly that's not generally used in decoder-only LLM's, I'd leave it out for this GPT focused series but maybe introduce it in a following series on diffusion models. I'm not very knowledgeable about this subject, but it seems like that would be a more interesting tie in than illustrating with translation models.

Theo Rolle

2024-04-05 20:06:51 +0000 UTC

Watched the updated version and it is much clearer now. Thank you so much!

Parker Whitfill

2024-04-05 19:27:34 +0000 UTC

Thank you so much for the support, Grant! Excited to hear your thoughts on my other comment as well

Gautam Desai

2024-04-05 18:41:20 +0000 UTC

It's all a question of convention, I suppose, sort of like the big-endian vs. little-endian wars. It's probably my bias from a math background, but I'm partial to thinking in a column-major way.

3blue1brown

2024-04-05 18:00:44 +0000 UTC

The Q, K and V matrices really are just matrices, there is no ReLU or anything applied to their outputs. Why does the value vector not incorporate the vector it's changing? I suppose you could make it a bigger matrix that does this, but it seems not to be necessary. Each head can change _all_ of the token vector, not just a 128-long slice. The many different layers all have distinct parameters, so when you pass the embeddings through them all, they are undergoing a distinct operation each time.

3blue1brown

2024-04-05 17:58:36 +0000 UTC

You don't always apply this masking, and in particular, for transformers encoding images, it wouldn't be masked like this. I tried to add a few more words about this masking in the updated version, since it seemed to be a common source of confusion.

3blue1brown

2024-04-05 17:54:46 +0000 UTC

Agreed this was not at all clear in the initial draft. I've swapped out the video above for an updated version which hopefully spells it out more clearly.

3blue1brown

2024-04-05 17:52:58 +0000 UTC

Thanks so much!

3blue1brown

2024-04-05 17:52:10 +0000 UTC

I think the first draft was a bit fast, I've tried to add more spacing and reflection in the updated version.

3blue1brown

2024-04-05 17:49:44 +0000 UTC

I tried to rewrite that section to be a little clearer about how it's motivated by what makes training more efficient. Essentially, once you're running it as, say, a chatbot, it feels like you would in principle want all the context to be able to talk to all the rest of the context. But because you can effectively scale up the number of training examples by thousands by having each run do many, many predictions in parallel, it's better on the whole to apply this masking. As to how it would handle cases like the adverb one, I imagine in such cases it would bake all the relevant combined meanings into the latest word.

3blue1brown

2024-04-05 17:47:40 +0000 UTC
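
One way to picture the training-efficiency argument above: with masking, a single passage supplies a prediction target at every position. A minimal sketch, where whitespace-split words stand in for tokens and `training_pairs` is a hypothetical helper, not part of any library:

```python
def training_pairs(tokens):
    """Each prefix of the passage is a context; the token after it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "a fluffy blue creature roamed the verdant forest".split()
pairs = training_pairs(tokens)
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Because no position can see the tokens after it, all of these predictions can be scored in one forward pass without any of them peeking at its own answer.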

In the updated video, I added a bit more reflection and stepping back to give more of a chance for things to sink in. Hopefully that helps!

3blue1brown

2024-04-05 17:45:39 +0000 UTC

Nice, thanks for making this! I'll include it in the description.

3blue1brown

2024-04-05 17:44:12 +0000 UTC

Ah, well-spotted, in the updated version I've restructured it so that "crash" precedes car.

3blue1brown

2024-04-05 17:41:35 +0000 UTC

Hi Alexander, thanks for the thorough comment. I won't be able to structurally reformat things, and I do have some biases of my own, but I've definitely incorporated some of these thoughts into the updated version.

3blue1brown

2024-04-05 17:40:56 +0000 UTC

Thanks for joining, I'm glad you enjoyed it!

3blue1brown

2024-04-05 17:38:57 +0000 UTC

Hey Neal, thanks for the note. In the newer version I just added above, I included a little reflection about how the attention pattern scales with context size, and how it's an active area of research to make that more scalable. Hopefully that helps.

3blue1brown

2024-04-05 17:38:42 +0000 UTC

I'll probably throw in a 1-minute reference to cross attention in the final version of this one.

3blue1brown

2024-04-05 17:37:37 +0000 UTC

Hey Kevin, in a new version I just uploaded above I tried to zoom in to make that arrow point clearer, thanks for the note.

3blue1brown

2024-04-05 17:37:06 +0000 UTC

great video. One item i feel was skimmed over (and maybe intended for a later video) is how are those attention heads trained/determined?

Derek House

2024-04-05 16:00:50 +0000 UTC

I think i finally understood

Aslan Vatsaev

2024-04-05 15:24:03 +0000 UTC

I was similarly confused about the "car crashed" example given the triangular masking. If that is the case, how is it possible that "crashed" would ever influence the intended meaning of "car"?

Ben Tanen

2024-04-05 05:00:16 +0000 UTC

I think ‘the glass ball’ example is great for explaining the rationale behind the stacking of attention layers. I just have trouble reconciling it with the findings of the following paper: https://arxiv.org/abs/2012.14913.

Nariman Mammadli

2024-04-05 02:05:28 +0000 UTC

Fantastic video! One comment: in the video it's not clear whether W_Q is a single matrix, or whether each W_Q is a different matrix.

Martin Fixman

2024-04-04 20:25:11 +0000 UTC

Looking forward to your MLP video

Sathyaprakash Narayanan

2024-04-04 19:34:01 +0000 UTC

"Attention is all you need" - The title hits you in many ways than one.

Victor

2024-04-04 18:00:57 +0000 UTC

I also wanted to ask what it means to have a decoder-only Transformer, an encoder-only Transformer, or an encoder-decoder Transformer, how each is more suited to a different task (e.g. word completion, translation, etc.), and how (if at all) the attention masking process differs.

Youssef Mohamed

2024-04-04 17:07:07 +0000 UTC

It seems like it was google after all who created the GPTs lol

Youssef Mohamed

2024-04-04 17:05:25 +0000 UTC

Thank you so much Grant, this is the best explanation of embeddings and Transformer models ever. I have had a look at Wolfram's description and Lena Voita's; these are nice, but for me they don't even come close to decoding the important points you mentioned here, and of course the visualizations. Thanks!

Youssef Mohamed

2024-04-04 17:05:03 +0000 UTC

Hi Grant, thanks for the great video explaining this topic. I particularly appreciate the use of morphing images showing how the embeddings of the words/tokens are modified to change their meaning (direction). With respect to the value matrices, I appreciate the clarification that this is in fact 2 matrices (I have not seen that before), but I wasn't sure if the linear map meant that these were in effect being multiplied to create a matrix where the rows and columns do match the embedding size. Is the motivation for breaking this into 2 matrices just to save additional parameters? Lastly, you mentioned that the embeddings pass through multiple attention heads (in parallel, I think), which would produce a concatenated output of multiple modified embeddings, but it wasn't mentioned how that higher dimensional vector would be compressed back down to the size of the original input sequence in order to go into the MLP (if that is in fact necessary). Thanks again for all the great content and always finding novel ways of presenting complex topics!

Trevor Plassman

2024-04-04 16:58:33 +0000 UTC
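
A rough parameter count may help with the question above about splitting the value map in two. This is only back-of-the-envelope arithmetic under stated assumptions: the embedding size of 12288 is GPT-3's figure and is not given in the comment, while 128 is the per-head size mentioned elsewhere in this thread. The variable names are made up for illustration.

```python
d_embed = 12288  # assumed embedding size (GPT-3's figure); not stated in the comment
d_head = 128     # per-head key/query/value size mentioned elsewhere in this thread

# A single square value map would take embeddings straight to embeddings.
full_params = d_embed * d_embed

# Factored form: a "down" map (d_embed -> d_head) followed by an "up" map (d_head -> d_embed).
factored_params = d_embed * d_head + d_head * d_embed

print(full_params)                     # 150994944
print(factored_params)                 # 3145728
print(full_params // factored_params)  # 48
```

Multiplying the two factors together does give a matrix whose rows and columns both match the embedding size, but one whose rank is at most 128, so it is described by far fewer numbers than a general square matrix of that size.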

Thanks for the great introduction! My head hurts a little, but for this topic I deem that to be a good sign of a working head. What I would suggest for the final version: give an intro to the concepts of key, query and value matrices beforehand. I at least stumbled a little bit each time you introduced a new one and was wondering "where does this come from now?".

Frank Hansen

2024-04-04 15:23:52 +0000 UTC

Thank you for this excellent video. Can't wait to watch the next chapter.

Kane

2024-04-03 23:49:15 +0000 UTC

I think at the beginning of the talk, a slide with the following content would be useful for a "bird's eye" perspective on attention:
- Words ask questions (Queries)
- Words describe themselves (Keys)
- Compare Queries & Keys to find connections
- Use connections to predict following words (Values)
- Adjust rules (W matrices) to get even better at this
I think it'd also be good to explicitly describe the residual stream aspect of the network (perhaps in the next video?). I think you implicitly described it here:
- The attention weights are multiplied element-wise with the value vectors to obtain a weighted sum of values for each word.
- This weighted sum is then added to the original embedding, incorporating contextual information.
If we denote the input embeddings by x, and the output of the attention layer as Attention(x), then this is mathematically output = Attention(x) + x.
Also, I'll plug two videos I thought were useful for me:
- On Residual Networks, by their inventor Kaiming He (who also invented Kaiming Initialization): https://youtu.be/D_jt-xO_RmI?si=3gY76evGuLy5v3hd&t=2959
- LLMs in 5 formulas, by Sasha Rush: https://www.youtube.com/watch?v=k9DnQPrfJQs
They might be useful for your 3rd video. Also a video on multimodal AI would be great too!

Sashmit Bhaduri

2024-04-03 22:59:19 +0000 UTC
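
The formula output = Attention(x) + x from the comment above can be sketched as follows. This is a toy under loud assumptions: a single head, no masking, and identity matrices in place of the learned query, key and value maps, so the embeddings themselves act as queries, keys and values. The function names are hypothetical.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(v - peak) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x):
    """Toy single-head attention with identity W_Q, W_K, W_V (an assumption)."""
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) for k in x]  # dot products
        w = softmax(scores)
        # Weighted sum of the value vectors (here the embeddings themselves).
        out.append([sum(w[j] * x[j][d] for j in range(len(x))) for d in range(len(q))])
    return out

def residual_block(x):
    """output = Attention(x) + x, added component by component."""
    delta = attention(x)
    return [[xi + di for xi, di in zip(xrow, drow)] for xrow, drow in zip(x, delta)]

x = [[1.0, 0.0], [0.0, 1.0]]  # two made-up 2-dimensional embeddings
y = residual_block(x)
print(y)
```

The attention output is treated as a change to be added to each embedding, not a replacement for it, which is what lets information from earlier layers keep flowing through.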

My understanding based on the video, is that after all the attention layers, the 8th token will be an enriched version of "forest". Let us say that the predicted 9th token is something like the end of the sentence period (".") with some probabilities for other things as well. Does this predicted 9th token come from a 9th vector, or does it magically come from the 8th vector itself? If it comes from a 9th vector - where does that get added to the picture? If it comes from the 8th vector, where is the magic that transforms it from a better / enriched version of "forest" to "."?

Pradeep Madhavarapu

2024-04-03 22:53:46 +0000 UTC

It is not clear how the last vector corresponds to the predicted next word. In the given example "a fluffy blue creature roamed the verdant forest" there are 8 tokens and 8 vectors corresponding to them. The attention layers enrich/modify each of these 8 vectors to give them more meaning. Now to predict the next token we need a 9th vector right? Where does that come from?

Pradeep Madhavarapu

2024-04-03 22:46:35 +0000 UTC

There is no mention of biases, only weights. I think biases deserve some mention, even if to say they are absent for simplicity's sake.

Pradeep Madhavarapu

2024-04-03 22:36:22 +0000 UTC

Brilliant! Thank you for this great video. I hope somewhere further in this series you'll also explain LoRA and quantization. Thanks!

Alon Ashkenazi

2024-04-03 21:47:10 +0000 UTC

Sure, but suppose, for example, that one attention head is dedicated to handling "plurality". Then we would perhaps like the embedding for the word "kings" to be turned into something that simply represents plurality and nothing else, so the value vectors for the word "kings" and the word "queens" should be close. However, in another attention head maybe we are focusing on gender, which means that the value vectors for the words in the previous example should be very different, perhaps even opposite one another. If we have no value matrix and just the original embeddings, something like this would not be possible, I think? At least not as straightforwardly, since then the information passed along by any given word is always the same (just scaled by different amounts but always in the same direction in the space).

Gustaf Pihl

2024-04-03 21:16:37 +0000 UTC

But each head gets its own learned Q and K matrices. Wouldn't you expect them to drift into separate pattern detection specialties, similar to how any two peer neurons in a single layer do?

Marcus Phillips

2024-04-03 16:39:38 +0000 UTC

This example is simplified: it is always the same Q and K for each attention head. You need two distinct matrices, since you need to create matching vectors from different inputs. The query and the key must match, so they can find each other. The key vector K is a description of a token (like "verb") and the query vector Q describes what a token is looking for (like "I am a noun and I am looking for a verb"). If you are now also looking for numbers and names (to get the age of a person), you need a different Q and K, so you will have to add another attention head. This is why GPT-3 has 96 attention heads in each layer. In reality, an attention head is very likely looking for more than one property (not only numbers but also question marks, for example).

unhappy with patreon

2024-04-03 15:41:02 +0000 UTC

@Stuart - This is part of my confusion as well. It might be that K and Q are separate so they can be trained separately, since the information they encode is different, but at inference time it feels like one should be able to effectively combine them.

Matt Welsh

2024-04-03 14:27:25 +0000 UTC

Thanks @Unhappy - That is a good way to think about it! Obviously the reality of these matrices is that they encode more sophisticated relationships than just nouns and adjectives, but I like the example.

Matt Welsh

2024-04-03 14:26:00 +0000 UTC

Great question. In my understanding of the video, the first token would never change, since it is not influenced by any other token during inference. Is this correct? Or does the blinding of other tokens only happen during training?

unhappy with patreon

2024-04-03 09:10:42 +0000 UTC

I got the impression from the video that K and Q were pairs of matrices to be used. In the example of 'adjectives' it was like a K for adjectives and a matching Q for adjectives... is that actually what happens, or is it 'K for adjectives' and Q1 for 'is it fluffy' and a different Q2 for 'is it blue'? I.e. reusing the same K with multiple Q, otherwise I'm struggling to grasp why K & Q need to be two different matrices given K . Q will be a constant... I'm probably just needing more time to wrap my head around this bit though. Asking just in case anyone can help speed up my understanding on this bit... going to watch the video a few more times. (Btw, I absolutely love these videos)

Stuart Radforth

2024-04-03 08:19:05 +0000 UTC

For the K and Q matrices: we need a matrix Q which says "I am a noun and I am looking for adjectives in front of me" and another matrix K which says "I am an adjective at position X". Since one matrix is marking nouns and the other is marking adjectives, but both need to create the same vector (from different inputs), you need two matrices, because the key and query vectors need to match closely for related tokens.

unhappy with patreon

2024-04-03 07:46:51 +0000 UTC

Thank you so much for this video, Grant. I have read and re-read the original Transformers paper multiple times and never quite grasped the attention head concept until watching this. Great work. Normally in your videos you pause a bit to review and ask questions where the viewer might have some confusion, and I think that would be warranted in a couple of places here as well. As some of the other comments here indicate, there is some confusion about the intuition behind the need for separate K and Q matrices. In particular, since at the end of the day all we care about is the dot product $K \cdot Q$, it's unclear why we need two separate weight matrices ($W_K$ and $W_Q$) to be trained. Another big confusion I have is around the masking. In various places you show the magical dancing yellow lines streaming between words in a passage, which lead to words both before and after the word being considered, but the masking process would seem to preclude the possibility of a word being influenced by words coming after it in the input sequence. That is counterintuitive to me, since many languages have exactly that property: we might say "The furry blue monster" in English, but in French it would be perhaps "Le monstre poilu bleu", where the adjectives typically appear after the noun. We know that Transformers work fine for languages with different word orderings than English, so the idea that we disallow later words from influencing earlier words is confusing. (Even in English this can be a problem, depending on the phrasing.) Thank you so much for your amazing work here!

Matt Welsh

2024-04-03 05:57:48 +0000 UTC
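
On the question above of whether the key and query matrices could be folded into one: algebraically they can, as the sketch below checks with made-up integer weights. The sizes (embeddings of length 4, keys and queries of length 2) and all names are assumptions for illustration only.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Made-up integer weights: embeddings have 4 entries, keys/queries have 2.
W_Q = [[1, 0], [2, 1], [0, 3], [1, 1]]   # 4 x 2
W_K = [[0, 1], [1, 1], [2, 0], [1, 2]]   # 4 x 2
x = [[1, 2, 0, 1]]                        # embedding of the token asking (1 x 4)
y = [[0, 1, 3, 2]]                        # embedding of the token being asked (1 x 4)

# Two-matrix route: project each embedding, then take the dot product.
q = matmul(x, W_Q)
k = matmul(y, W_K)
score_two_matrices = matmul(q, transpose(k))[0][0]

# One-matrix route: fold the pair into a single 4 x 4 matrix M = W_Q W_K^T.
M = matmul(W_Q, transpose(W_K))
score_combined = matmul(matmul(x, M), transpose(y))[0][0]

print(score_two_matrices, score_combined)
```

Both routes give the same score, so the split does not change what can be computed. What it changes is the size: with embedding length d and key length k, the two factors hold 2dk numbers while the folded matrix holds d squared, which is far larger when k is much smaller than d.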

This was my question as well - how do the multiple attention heads in each layer get combined?

Matt Welsh

2024-04-03 05:42:41 +0000 UTC
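
A sketch of the usual answer to the question above: the head outputs are concatenated and multiplied by one output matrix, which is the same as giving each head its own "up" map back to the embedding size and summing the results. The numbers and names below are made up, and the sizes are far smaller than in any real model.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# Made-up outputs of two heads for one token, each of size 2.
head_1 = [[1, 2]]
head_2 = [[3, 4]]

# Per-head "up" maps back to an embedding of size 3 (made-up integers).
up_1 = [[1, 0, 2], [0, 1, 1]]   # 2 x 3
up_2 = [[2, 1, 0], [1, 0, 1]]   # 2 x 3

# View 1: each head proposes its own change to the embedding; the changes are summed.
c1 = matmul(head_1, up_1)[0]
c2 = matmul(head_2, up_2)[0]
summed = [a + b for a, b in zip(c1, c2)]

# View 2: concatenate the head outputs, then apply one stacked output matrix.
concatenated = [head_1[0] + head_2[0]]   # 1 x 4
W_O = up_1 + up_2                        # 4 x 3, the "up" maps stacked row-wise
projected = matmul(concatenated, W_O)[0]

print(summed, projected)
```

The two views agree, which is why the same layer can be described either as "concatenate, then project" or as "every head adds its own change to the embedding".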

When you state that the attention matrix has values between negative infinity and infinity before the softmax, the example in the video seems to have values between -100 and 100. Not sure if that is just coincidence, but it might be confusing to someone else. Otherwise, this video is incredible!

Ryan O'Kuinghttons

2024-04-03 03:31:43 +0000 UTC

Nice and poetic! Although I wouldn't call anything that uses backpropagation brute force, since that is exactly what backpropagation enables us to avoid. We can tweak the parameters in the direction of the error gradient, the fastest way to improve the model (locally at least). Without it deep learning would never work and we would just be stumbling around in the dark. :)

Gustaf Pihl

2024-04-02 23:37:59 +0000 UTC

I think it could also just be the fact that these are inherently quite confusing concepts. Regardless of how great you are at presenting things pedagogically. I think that if you're going to actually get into the details of how it works as Grant usually does, it might not be possible to convey without some initial confusion.

Gustaf Pihl

2024-04-02 23:32:28 +0000 UTC

Hi Grant, Amazing videos! I have one small piece of feedback. Perhaps for future videos in the series it would be helpful to use more basic language in the examples. I ended up googling "verdant" when presented with "a fluffy blue creature roamed the verdant forest". Although the exact meaning of each adjective is not super important to understand, I found that not understanding certain words added unnecessary mental overhead. :)

Gustaf Pihl

2024-04-02 23:28:55 +0000 UTC

Is attention all you need?

Gustaf Pihl

2024-04-02 23:19:27 +0000 UTC

Hi Neil, I'm not 100% sure about this so take it with a grain of salt. From what I can understand intuitively, another benefit of having the value matrix is that we can have different attention heads focus on different things. For example maybe one head could be dedicated to figuring out plurality. In that case the value matrix could just turn each token embedding into a value vector that just contains information about whether that embedding is plural or not, while other attention heads could focus on different types of information. Additionally it seems reasonable that we want to extract different types of information at deeper layers which could reasonably be assumed to be dealing with higher and higher level concepts. What do you think?

Gustaf Pihl

2024-04-02 23:18:08 +0000 UTC

Hi David! Space in this context refers to a vector space, which is one of the fundamental concepts in linear algebra. When talking about the key space, Grant is referring to the vector space that the key vectors "live" in. The dimension of this space will be the same as the number of values in a key vector. The beautiful thing about natural language processing (NLP) is that after training a model, these different spaces are kind of imbued with semantic meaning. This means that the different directions of the key space capture concepts relevant to the prediction task. Although technically, any vector space with the same dimension as this key space is equivalent (in a sense just the set of all possible vectors of this dimension), the key matrix that transforms embeddings into this space is what imbues it with the semantics. The key matrix is what takes in a word (token) and returns a vector in the key space. Hope that helps.

Gustaf Pihl

2024-04-02 23:09:27 +0000 UTC

Interesting. I guess by keeping the tokens from influencing previous ones you do save a lot of computational work, despite maybe losing a bit of context. Didn’t even think about the vector for “red” getting influenced by “tree” though. Good point!

Bryce Lambert

2024-04-02 22:59:47 +0000 UTC

Intuitively I was thinking that the point of having the value matrices is that different attention heads could specialize on different things? It seems to me that if we get rid of the value matrices and just use the original embeddings then the semantics of each attention head/block would be the same? This is just what I had gleaned intuitively and I don't mean this as a rejection of what you said. Curious to know how you see it?

Gustaf Pihl

2024-04-02 22:58:26 +0000 UTC

Hi Bryce, I don't know the answer to this but I had a little thought that could be relevant. Although the context of the lushness of red leaves will not be passed to the tree vector, the tree vector will be influencing the rest right? So at least the vector for red will be "baking in" tree-ness! I suspect the reason for covering up the "future" tokens is to do with the way the model is trained efficiently, with each token predicting the next. I do agree that it seems like context is being wasted though. I heard recently that Meta have been experimenting with training LLMs with reversed text as well as non-reversed text. Although this was related to a different specific issue that models have, I guess this may help alleviate the issue?

Gustaf Pihl

2024-04-02 22:48:26 +0000 UTC

Hi Dan L. Sorry to say there are no waveforms involved. The knobs on a synthesizer control things like oscillator frequency, amplitude, cut-off filter and so on. The "knobs" here are simply weights that scale the input from previous layers up or down by different amounts. The whole structure of artificial neural networks is inspired by the neural networks we see in the brain though, so although they are not analogous to brain waves, there are still analogies to be made with respect to the brain.

Gustaf Pihl

2024-04-02 22:41:56 +0000 UTC

Hi Priyanshu. The answer to your question is no. The temperature is not something that is dynamically adjusted by the model. When you are talking to ChatGPT via the standard web interface, the temperature is fixed and cannot be changed. As far as I know you can only tweak it when using ChatGPT via the API.

Gustaf Pihl

2024-04-02 22:39:16 +0000 UTC
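
For readers unsure what the temperature setting in the comment above does, here is a minimal sketch: the model's scores for each candidate token are divided by a fixed number before the softmax. The logits are made up and `softmax_with_temperature` is a hypothetical helper, not any particular API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature before the usual softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the top-scoring token, while a higher one spreads it more evenly. The value is chosen by whoever runs the sampling, not learned or adjusted by the model.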

same here

unhappy with patreon

2024-04-02 22:26:07 +0000 UTC

Hi Grant, once again, this video is amazing. After re-watching it, I have a couple of comments:
1. It might be worth mentioning to viewers what the added benefits of the value vectors are. I think the need for the keys and queries is obvious, since they are used to calculate our attention scores, but it may not be obvious to a viewer why we don't just multiply the T x T attention scores against the "raw" embeddings that were passed in to the attention block. We get more parameters and overall more complexity out of the model, and I also like to think of the "value" as being what each token chooses to actually expose, or share with other tokens; if the key is all the information a token has, the value could be thought of as a "subset" (loosely speaking) of the key. I think there are some intuitive examples where a token might not choose to share all its information, since all of it may not be relevant for the ultimate (training) objective: maximizing the likelihood of the next token.
2. I think it could be made slightly clearer that the outputs of each head in multi-headed attention are concatenated before being passed on to the MLPs, but that obviously isn't the main focus.
Lastly, one thing I wanted to highlight in the list of coding problems I previously shared over DMs is this particular problem I created: https://neetcode.io/problems/self-attention Students can implement Self-Attention from scratch and run their code against test cases. Basic PyTorch knowledge is required, but I have a prior problem in the list that teaches the basics of PyTorch. It would be amazing if you could share this full list of coding problems! https://www.gptandchill.ai/codingproblems Thank you for all you do, Grant.

Gautam Desai

2024-04-02 21:44:27 +0000 UTC

Love the new videos Grant! Suggestion: I believe it's been demonstrated that the value matrix is superfluous to the whole process, which explains how difficult it is to provide a good motivating story for why we need it. You can just use the original embeddings in lieu of the tensors you'd get from the value matrix (see https://arxiv.org/pdf/2311.01906.pdf and https://arxiv.org/pdf/2005.10463.pdf). For this reason, when I teach transformer architecture, I find it extremely useful to omit any mention of the value matrix at first, and describe the process as simply creating a weighted average of all the embeddings from the context window, where relevant tokens' embeddings impart their flavor onto the target token embedding. Then, at the end of my lecture, I return to add an addendum about the original architecture's choice to include a learned value matrix transformation, and suggest that it was a belt-and-suspenders attempt to get the architecture working. Other findings include that many of the normalization and projection steps can be safely omitted, and the MLP can be run in parallel to the attention layer. I haven't demonstrated the superfluousness myself, so it's possible I've been misled by these papers and by my friends who have run experiments directly, but I'm pretty sure V is incidental complexity you can cut out of the story except as a footnote!

Marcus Phillips

2024-04-02 21:38:59 +0000 UTC

As an addendum, because everything is row-major in practice, I usually think about the linear transformation AV (where A is the self-attention or cross-attention matrix and V is the set of value vectors) the reverse of the way you described in the linear algebra series. Meaning, I view the rows of AV as: 1. take the first row of A. 2. Use that row as coefficients to weight the rows of V. 3. Sum the weighted rows of V. That is the first row of AV. 4. Repeat to get all rows of AV. This view of matrix multiplication is essentially the transposed version of what you described in the linear algebra series, and I find it gives more intuition about what the attention matrix is doing than the column-major view.
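To make the four steps above concrete, here is a small sketch in plain Python. The matrices are made-up numbers and the helper names are my own; the point is only that building each row of AV as a weighted sum of the rows of V gives the same result as an ordinary matrix product.

```python
def matmul(A, B):
    """Ordinary matrix product, with matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def weighted_sum_of_rows(weights, V):
    """Steps 2 and 3: use `weights` as coefficients on the rows of V, then sum."""
    out = [0.0] * len(V[0])
    for w, v_row in zip(weights, V):
        for j, v in enumerate(v_row):
            out[j] += w * v
    return out

# Hypothetical attention weights (each row sums to 1) and value vectors (one per row).
A = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5]]
V = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

# Step 4: repeat for every row of A.
AV_by_rows = [weighted_sum_of_rows(a_row, V) for a_row in A]
AV_direct = matmul(A, V)
```

Each row of `AV_by_rows` is one token's output: a blend of the value vectors, mixed according to that token's row of attention weights.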

Alex Loftus

2024-04-02 20:44:50 +0000 UTC

So, I think in the original Vaswani et al. paper, the $QK^\top$ matrices are row-major by convention, not column-major. Meaning, Q, K, and V are matrices where each row corresponds to a token embedding -- they are each shape (num_tokens, d_k). At 8:44 in the video (and onwards), you have them as column-major. You can see this in line 57 of Andrej Karpathy's transformer implementation as well: https://github.com/karpathy/nanoGPT/blob/325be85d9be8c81b436728a420e85796c57dba7e/model.py#L57

Alex Loftus

2024-04-02 20:38:39 +0000 UTC

Thank you for your great video. It is the first video that made me understand transformers. My questions about the attention heads: Are the K, Q, and V matrices really just matrices, or is each a (single-layer) fully connected neural net that outputs the key / query / value vector? Why does the value vector not take into account which word it is transforming? That is, why is it V(E1) instead of V(E1,E2) if word 1 and word 2 are related? Too much computation? If there are 96 heads and each one changes 128 elements of each token vector, can't you say each head changes ~1% of the resulting vector (the destination)? The first head changes the first 1%, the second head the next 1%, etc. Are there multiple (96) attention blocks so that the example with the glass ball will work? So the layers which correlate "glass ball" with "it" won't be ineffective if they are in the wrong order? Are the attention blocks all the same, meaning you pass the same token vectors 100 times through the same block? Edit: I am sorry for my bad English as a non-native speaker.

unhappy with patreon

2024-04-02 20:07:57 +0000 UTC

I have a doubt related to the Temperature factor you described. When I ask ChatGPT to "think like a playwright at Walt Disney", does this adjust the Temperature in the back-end? As in making the responses more "creative sounding"?

Priyanshu Gupta

2024-04-02 18:57:31 +0000 UTC

I too have been reading up about the ESM model. Nice to meet another user. :) As for the training, they do describe it in their article- "We were able to complete this characterization in 2 weeks on a heterogeneous cluster of 2000 graphics processing units (GPUs), which demonstrates scalability to far larger databases." Source: Evolutionary-scale prediction of atomic-level protein structure with a language model However, my understanding is that this only discusses the final training before publication, and completely ignores the resources+time spent on training/testing in order to get to the final model.

Priyanshu Gupta

2024-04-02 18:53:14 +0000 UTC

+1. I get that we can do all that in parallel but how do we put it back together?

Max Goldstein

2024-04-02 15:32:20 +0000 UTC

I just had one question about how the keys of tokens are ignored by the queries of the tokens coming before it in the input prompt. Wouldn’t this be unnecessary if, say, the model was being used to generate an image? For example, if I wanted to turn the phrase “The tree was lush with red leaves,” into an image, wouldn’t the adjectives “lush” and “red” be important in the context of the tree, despite occurring after the word “tree”? Just something I was wondering. Great stuff otherwise!

Bryce Lambert

2024-04-02 15:24:01 +0000 UTC

For the multi-headed attention, can you work through an example? I didn't understand if you took the original embedding and then sum the 'change' according to each value.

Parker Whitfill

2024-04-02 14:27:32 +0000 UTC

Amazing series, thanks for creating these. So here's a bit of Woo I was thinking of: all of those tunable gpt parameters remind me of a synthesizer, and in a similar way, does it produce an overall waveform? And if so, is that "thought" and analogous to brain waves?

Dan L

2024-04-02 13:46:34 +0000 UTC

Nevermind. On second viewing, it seems clear enough. I must have not paid enough attention.

Petr Čertík

2024-04-02 12:35:06 +0000 UTC

I subscribed on patreon just to get access to this video immediately!

Riley Harris

2024-04-02 12:32:56 +0000 UTC

Thanks for the comment, I'll try to alleviate confusion in a rewrite, and with some improved animations. In the meantime, here's a quicker answer. Whether or not this particular example is used as part of training (meaning the weights will get tweaked a little to improve) or if it's just a run of the model with user input after it's been trained, the way the model processes data will have to look the same. This means features which are mostly motivated by what makes training more efficient, and are less relevant to runs with user input, still apply when you're doing it with user input. In particular, even if masking out the lower right triangle of the attention pattern and doing the many predictions for each initial subsequence are more relevant during training than they are after, they both still occur when you run it on user input. --- To your question above about Wq / Wk, it might be worth stepping back to emphasize that with deep learning in general, you will have lots of operations parameterized by matrices filled with tunable weights, and you have some cost function, a measure of how poorly the model does on a given example, and the goal of training is to somehow tweak all of the weights to reduce the cost function. The earlier videos in this series talk all about gradient descent and backpropagation, for example. In practice, it's often very hard to dissect _what_ these matrices are actually doing. For the sake of explanation, I find it helpful to have a hypothetical example of the kind of thing you might _hope_ it will do, and to walk through how the given computations could conceivably, with the right parameters, do this task. A given attention head has a single Wq matrix, and a single Wk matrix, and each one gets multiplied by all the embeddings in the context, producing a sequence of query vectors and key vectors respectively, one for each position in the context.
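In case it helps, here is a toy sketch in plain Python of a single head's attention pattern with masking. All names and numbers are made up, and it stores embeddings as rows (one row per query position), so it is the transposed picture relative to the grids in the video. One Wq and one Wk are applied to every embedding, and each position only gets weights on itself and earlier positions.

```python
import math

def project(E, W):
    """Apply the same matrix W (d_model x d_head) to every row-vector embedding in E."""
    return [[sum(e[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for e in E]

def causal_attention_pattern(E, W_q, W_k):
    Q = project(E, W_q)  # one query vector per position
    K = project(E, W_k)  # one key vector per position
    d_head = len(W_q[0])
    n = len(E)
    pattern = []
    for i in range(n):
        # Position i only scores keys at positions 0..i. Leaving the later ones out
        # has the same effect as setting them to -infinity before the softmax.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_head)
                  for j in range(i + 1)]
        largest = max(scores)
        exps = [math.exp(s - largest) for s in scores]
        total = sum(exps)
        pattern.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return pattern

# Hypothetical 3-token context with 2-dimensional embeddings and a 2-dimensional head.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[0.5, -1.0], [1.0, 0.5]]
W_k = [[1.0, 0.0], [0.5, 1.0]]
pattern = causal_attention_pattern(E, W_q, W_k)
```

The same function would be used whether the input is a training example or a user prompt; only what happens with the result afterwards differs.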

3blue1brown

2024-04-02 12:13:01 +0000 UTC

i'm at 11:29, and my conclusion is that i really, REALLY want a very high level conceptual overview at the start of the video. tell me about how words can relate to each other. give a specific and motivating example sentence that shows this. introduce "Query Vector" and "Key Vector" in this, and the chart that shows big circles for Keys that answer queries [without mentioning the dot product, or maybe not even numbers]. Talk about attending and the Value matrix here. i want a high level overview that presents to me the concepts first, and the MOST important items of jargon (eg "Query Vector" and "Key Vector") and what concepts they concern. i'd like this high level overview to cut out all implementation details (eg Wq, Wk, normalization). this high level, very conceptual overview would then help me understand the rest of the video. but actually, for me as a layperson, i'm not really interested in the math details. i'm primarily instead interested in how transformers work conceptually. the previous video showing me how semantic information is somehow "baked into" the encoding of each word in the multi-dimensional space was very exciting. (the many examples you showed, eg Japan->Germany and then adding that to Sushi gives Bratwurst, was exactly the kind of conceptual information that illuminated how transformers work, to a layperson like myself)

David Nijjar

2024-04-02 12:01:35 +0000 UTC

9:32 i had to rewind and listen and think really carefully, to understand "The model doesn't just predict what will come immediately next. It also predicts every possible next token for each initial subsequence of tokens in the passage." the animation sort of helps my brain understand what "each initial subsequence of tokens in the passage" means, but it took me pausing the video and thinking to understand what "every possible next token" means. all i see in the animation are "???" boxes, which were too abstract for me to realise what they meant. i think i needed an explicit example here. say, the narration and animation saying: "for example, for the subsequence 'The fluffy blue creature', the model might predict the next tokens 'ate', 'roamed', 'slept', and so on, in a softmax matrix". (at least, am i understanding it correctly? each "???" box isn't predicting just a single next token, but a probability distribution of next tokens?) ---- the following comment ("This ends up very nice for training, because otherwise what would be a single training example, effectively acts as very many.") i completely don't understand. w.. what are we training?? i thought we were talking about Query vectors and Key vectors and their dot product. are we .. training .. Wq or Wk?? And .. this sentence that we're asking the model to complete ("The fluffy blue creature roamed the verdant forest") is something that the user is typing in, and asking ChatGPT to complete. is .. user input .. a form of "training" .. somehow???? and what is a "training example"? is the user typing in "The fluffy blue creature roamed the verdant forest" a training example???
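If it helps anyone else with the same confusion, here is a tiny sketch in plain Python of what "each initial subsequence" could look like when the sentence is used as training text. It treats whole words as tokens, which is a simplification, and the function name is made up.

```python
def next_token_examples(tokens):
    """Pair every initial subsequence with the token that actually follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["The", "fluffy", "blue", "creature", "roamed", "the", "verdant", "forest"]
examples = next_token_examples(tokens)

for prefix, target in examples:
    print(" ".join(prefix), "->", target)
```

So one sentence of 8 tokens yields 7 (prefix, next token) pairs, and for each prefix the model outputs a probability distribution over all possible next tokens that gets compared with the real one.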

David Nijjar

2024-04-02 11:48:16 +0000 UTC

i'm still so confused about what Wq and Wk do, or how. somehow, i (think?) Wq is a matrix that, when you multiply it with a word's matrix, it creates that word's Query matrix. but, every animation that has "data" flying over Wq, saying that Wq is tunable and is part of the model, goes WAY over my head. is the same single Wq matrix used for EVERY sentence? conceptually, what does Wq even represent? what does it mean to "tune" it? is it like "The rules of grammar: adjectives can modify nouns"? is it something else?? (do i even need to understand what Wq and Wk are, or can they be eliminated entirely from the explanation; where you just say that "magically" the Key and Query vectors are created (after explaining the kinds of use Query vectors have etc)?)

David Nijjar

2024-04-02 11:37:53 +0000 UTC

6:23 the narration gives a very abstract sentence, which my brain couldn't understand: "You can think of the keys as potentially answering the given Queries". i had to rewind, and then only understood when i saw the animation where a Query is asking the question, and a Key is answering it. in other words, the specific example in the animation was what my brain needed; the generalised abstract statement in the narration was too difficult for my brain to understand. i'm starting to think that i would need a much higher level view first, to help me note the "key jargon" or "key items" that i will have to look out for. a higher level view that doesn't mention Wq, or any other math, but instead says something like "here's a bunch of words. these words relate to each other. for example, in this sentence, these two adjectives relate to this noun: the adjectives modify the noun. the Attention algorithm analyses how all the words relate to each other, and stores the results of this analysis by giving each word a Query vector. a Query vector holds questions that are relevant to each word. for example, nouns might have a query vector that holds a question: 'are there any adjectives preceding me?'" etc.

David Nijjar

2024-04-02 11:33:54 +0000 UTC

5:41 i listened to this sentence three times, and my brain still has trouble parsing it: "We'll suppose that this Query Matrix maps all of the embeddings of nouns to certain specific directions in the smaller query space that somehow encode the notion of looking for adjectives in preceding positions" there are a lot of words that my brain has to understand, that aren't trivial! Some things my brain has to understand: - what does "maps" mean? - what were "embeddings" again? - what does he mean by "certain specific" directions - why is he calling it a "smaller query space"? is there a "larger query space"? - what does "encode" mean? and what is the subject of this verb? is it "specific directions encode", or is it "smaller query space encode"? - what does "notion" mean in "notion of looking for adjectives in preceding positions"? has he .. shown me how a "notion" can be "encoded" into a matrix .. and i just missed it?? at minimum, this sentence could really benefit from being broken up into 2 or 3 sentences, i think. --- even with the animation showing the noun "Creature" being floated to the right side of the screen, i still am not sure. the best i can understand is that "Creature" is a noun, and it is associated ("mapped") to this yellow vector. and this yellow vector is "a specific direction that somehow encodes the notion of looking for adjectives in preceding positions". but i don't know what that means! (or what Wq has to do with any of this). are you saying that that yellow vector is saying: "hey, the last two words are adjectives that modify 'Creature' "?

David Nijjar

2024-04-02 11:23:59 +0000 UTC

Great video! It's useful as I'm beginning to use ESM (a protein language model with a significantly smaller vocab of ~20 instead of 12,288). I'm curious how arbitrary these numbers are. Why 96 heads and 96 layers? Why 128 dimensions for the query and key matrices? I wonder if performance or training will change significantly with different numbers. Maybe you'll address this in the training video. Other thoughts I'm having: 1) Training this must've taken a year on thousands of GPUs! 2) I struggle to believe that these large matrix operations can run so fast on the cloud when I ask ChatGPT something.

Samuel Lobo

2024-04-02 10:10:48 +0000 UTC

Awesome work! There's one thing I miss from the explanation of attention heads, though: as each head changes the embeddings in a different way and in parallel, how is the effect of each head combined? My guess is that every attention head produces different value vectors and these are then simply added to the embedding. But that is not clear to me from the explanation, which makes it seem like the result of an attention head is a modified set of embeddings. I think it's worth explaining how that works a little bit more closely.

Petr Čertík

2024-04-02 08:44:43 +0000 UTC

I was thinking of the French language, where adjectives tend to come after the noun. I get the motivation for not wanting later words to influence earlier ones, but I'd have liked to understand how the model deals with cases like these.

Petr Čertík

2024-04-02 08:36:05 +0000 UTC

Agreed, this is unclear, especially since later in the video it shows an example of a later word influencing a former word.

Edan Maor

2024-04-02 08:21:33 +0000 UTC

This is the first time I am going through the Transformer network and amazingly this very elaborate concept is making sense already. After going through the video, I come to understand that all this exercise with Key, Query, Value, Embedding etc. is just to update the embeddings of the input words. Perhaps you can add it as an explicit comment. After adjusting the embedding vector with this gigantic exercise, it seems we are then feeding it to our neural network (multilayer perceptron) and then repeating the whole process 96 times.

Virendra Singh

2024-04-02 07:45:20 +0000 UTC

Great video! Would love if you could explain why we need to learn a Value matrix for each head (why can't we just use the original embedding & weight them using the similarity between Query & Key)? My guess is that the Values are what let us create a "new" direction that we couldn't have obtained from the original embedding (or linear combinations of that embedding + others in the context). That might be wrong / be a bit hand wavy though! Another topic that might be worth clarifying/including: CS224N describes multiple heads as "not being more costly" because we're just reshaping such that the # of heads is like a batch axis (https://youtu.be/LWMzyfvuehA?t=3213).

Neil Johari

2024-04-02 07:11:03 +0000 UTC

in 5:49, what is a "Query/key space"? is that the "Query vector" that each noun has (which holds information about things like if that noun is modified by adjectives)? why is it called "key" space? we didn't talk about "keys" yet.. or did we and i didn't remember?? [why is it called "space" at all? is that just a fancy way of saying "vector"?]. [i think the word "space" would be okay for my brain, given that in the previous chapter, you did have wonderful examples showing Mother->Father vectors etc in a 3d space. but given my many points of confusion at this point in the video, my brain couldn't remember what "space" meant]

David Nijjar

2024-04-02 05:59:16 +0000 UTC

note: my comment is a perspective of someone without a mathematics background, and who is watching only with 70% attention. a) the first time in the video that i started to feel lost, was at 4:57. "That question is somehow encoded as yet another vector: what we'll call the 'Query' for this word". - if i was watching at 100% attention, this would be no problem for me. - but at 70% attention, the main idea: "Each word has a query vector; each word's query vector asks questions about how that word is being modified by the rest of the words around it" got lost in my brain. i think it's because directly prior to it, the talk of nouns being modified by adjectives made sense to me, but i didn't realize that it was going to be a "main idea" that i would have to remember; or that this "main idea" would be encapsulated by a jargon word ("query vector"). it somehow happens so fast, to be told "what we'll call the Query for this word", that my 70%-attention brain has trouble suddenly encoding the meaning of "Query vector". i have to backtrack in my memory to remember what was just talked about to find the meaning to encode, and then associate it with the jargon "Query vector". b) even re-watching at 100% attention, the mention of Wq at around 5:20, still makes me confused. i have thoughts like: "wait, what? was i supposed to know what Wq is? was it mentioned before?" and "should i understand why it's called Wq?". when i hear "The entries of this Query matrix are weights of the model", i am VERY confused. is Wq pre-created even before the user (for example) types some input for GPT2 to continue? or is it created anew after the user types their input? if it is pre-created, how is Wq smart enough to create the Query vectors; ie, just what is it? should i care? [another confusing part is that when you say "The entries of this Query matrix", there is a highlight around the Wq matrix in the animation. but we just talked about a "Query vector", and my brain is getting confused between the two. my brain is being asked to understand that "Query matrix" is not a "Query vector", and my brain asks: "Wait, what? 'Query Matrix'? was this explained before? and why does it sound so similar to 'Query vector'? are they the same thing but i'm too dumb to understand this?"] i think my confusion was about how much i should already understand about Wq. if you said that it was pre-created, and it just somehow "magically" works to create a Query vector, then i would understand.

David Nijjar

2024-04-02 05:57:09 +0000 UTC

That's the bit I had to rewatch, pause, take notes. It's all there, it's just that it's very dense, a lot packed in to a short segment. too much to understand in real time

Quirkz

2024-04-02 04:56:05 +0000 UTC

Got a bit confused around the query, key and value part - would suggest adding a bit more information

Sunaje

2024-04-02 04:23:16 +0000 UTC

Just an update: Watching a second time, with pauses, repeats and taking notes helped me get a much better understanding. Again, the video is excellent, it's just dense.

Quirkz

2024-04-02 04:07:07 +0000 UTC

I agree. The first chapter was no problem but this one I'll need to watch a couple more times to digest more. Maybe give a mini-recap and some time to process Q and K before diving directly into V? I'm used to listening to things at 2x speed but started to feel a bit overwhelmed (at 1x speed) when V came up. There's a lot to absorb if you haven't seen it before, so I recommend erring on the side of overexplaining in a longer video rather than trying to keep it concise and streamlined, but ultimately I trust your judgment, Grant.

Alexis Olson

2024-04-02 03:36:20 +0000 UTC

Yeah, I get the sense that there's wide agreement that ChatGPT has essentially passed the Turing test, and some sentiment that the Turing test wasn't that good a test actually. "AI" is whatever the models still can't do.. xD

adfaklsdjf

2024-04-02 03:34:03 +0000 UTC

I found myself wondering why exactly we don't want later words in the prompt to influence earlier words. What about phrases where an adverb modifies a verb, and it comes after the verb? "She ran quickly".

Duncan Fairbanks

2024-04-02 03:11:57 +0000 UTC

Yes indeed, all of these are very much harder to parse in practice. They are filled with weights that are initialized at random, and learned based on backpropagation against some cost function, in this case one that punishes the network for giving a low probability to the true next word. I can try to better emphasize how the motivating example here is meant to illustrate the mechanism, but that in reality these matrices will do whatever the training process leads them to do so as to better accomplish the task at hand.
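As a rough illustration of that kind of cost function, here is a sketch in plain Python using the negative log of the probability assigned to the true next word (the usual cross-entropy form). The two probability tables are made-up numbers, not real model outputs.

```python
import math

def next_token_loss(probs, true_token):
    """Cost for one prediction: large when the true next token was given low probability."""
    return -math.log(probs[true_token])

# Two hypothetical predictions for what follows "The fluffy blue creature".
confident = {"roamed": 0.70, "ate": 0.20, "slept": 0.10}
unsure = {"roamed": 0.05, "ate": 0.50, "slept": 0.45}

loss_confident = next_token_loss(confident, "roamed")
loss_unsure = next_token_loss(unsure, "roamed")
```

Training nudges every weight, including the entries of the query, key, and value matrices, in whatever direction lowers this number on average across many examples.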

3blue1brown

2024-04-02 02:37:50 +0000 UTC

The experience of watching each of the videos--first of all I saw video 5, then went thru 1-4, 5 again, and back here for 6, so I have the full progression loaded into my head right now. In 1-4 you hint at how back propagation could effectively generate a piecewise process for pattern recognition across layers, for a text recognition example. What we get to by the end of this video is a similar structure, but expanded in one dimension with parallel attention blocks and another with extra layers. Within those attention blocks you have key and query operations providing additional processing, migrating words thru context space by way of their positions in vector embeddings. What might be a nice way to tie this all together is to explore how each of the matrix multiplications adds space for a reasoning process to take place that looks like the semi-brute force process the layered models in the first four videos set up via weight adjustment and back propagation. Somehow, inside all those matrix multiplications, abstractions and world models are able to influence narrative construction. Let's explore how that happens?

Mark Strachan

2024-04-02 02:33:20 +0000 UTC

I 100% agree it needs to be clearer. The outputs of the heads don't interact with each other, except insofar as they are added together. There are two ways to think about it: 1 (the normal way it's presented): For each position in the context, each head produces a little vector, say 128-dimensional. These are all then concatenated, stacking up to make one big vector, say 96 * 128 = 12,288 dimensional. You then multiply that by a big 12,288x12,288 matrix to get the final amount you'd add to the embedding vector in that position of the context. 2 (what I think lends itself to a clearer visualization of what's happening, though I realize it does look confusing all written out): For each position in the context, each head produces a proposed change to the embedding in that position. This is a large 12,288-dimensional vector. All of those proposed changes are added together to get the total change that we add to the final embedding. You can get from the first framing to the second by taking that large 12,288x12,288 output matrix and breaking it up into a bunch of blocks 12,288x128. Each of these blocks is acting on just one of the 128-dimensional outputs of the attention heads that got stacked together. I realize that might still be confusing, I'll try to animate something that helps make it all clearer.
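Until there's an animation, here is a tiny numerical sketch in plain Python of why the two framings agree. It uses 2 heads of dimension 2 and a 4x4 output matrix instead of 96 heads of dimension 128 and a 12,288x12,288 matrix, and all the numbers are made up.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d_head = 2
head_outputs = [[1.0, 2.0], [3.0, -1.0]]  # one small vector per head, for one position
W_O = [[1.0, 0.0, 2.0, 1.0],              # output matrix: d_model x (n_heads * d_head)
       [0.0, 1.0, 0.0, 3.0],
       [2.0, 2.0, 1.0, 0.0],
       [1.0, -1.0, 0.5, 0.5]]

# Framing 1: concatenate the head outputs, then multiply by the big output matrix.
concatenated = [x for out in head_outputs for x in out]
change_concat = matvec(W_O, concatenated)

# Framing 2: split W_O into one d_model x d_head block of columns per head, let each
# head propose a full-size change, and add the proposals together.
change_sum = [0.0] * len(W_O)
for h, out in enumerate(head_outputs):
    block = [row[h * d_head:(h + 1) * d_head] for row in W_O]
    proposed = matvec(block, out)
    change_sum = [a + b for a, b in zip(change_sum, proposed)]
```

Both `change_concat` and `change_sum` come out as the same vector, which is what gets added to the embedding at that position.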

3blue1brown

2024-04-02 02:28:46 +0000 UTC

This chapter is tougher. I've grasped half, and am going to need to go through it again once or twice, taking notes. I don't think it's because of the quality of your explanations: they're excellent, as always. It's just this is a bit tougher to really understand compared to the previous parts. There's a lot behind the simple pictures that I'm having a hard time building up an intuitive understanding with just one watch of the video.

Quirkz

2024-04-02 02:08:42 +0000 UTC

Thanks for the great work Grant! I think more focus should be added to how the heads combine - from my brief reading of what you said, this output-matrix based approach lets the heads of attention interact with each other, yes? This was not at all clear during the video. I also understand the educational value of just focusing on one attention block at a time, but I think the scaling of attention blocks with multiple different ones could've been made more clear; it felt surprising during the video instead of an obvious generalisation (which to me it feels like it is, on reflection).

_ericBG

2024-04-02 00:19:37 +0000 UTC

Excellent video, like all your others! This one is a bit beyond me though. I can't say I completely understand all the math, but I think I get the general idea, which is that this kind of machine learning, based on neural networks, seems to be the most promising approach to AI, and it could even lead to AI passing the Turing test, as it seems to me that it already has. I'm intrigued to watch more on this, so that perhaps someday it will all make sense to me. Keep up the great work!

David Terr

2024-04-02 00:18:57 +0000 UTC

For your very first question, there's a famous mini essay called "the bitter lesson" which I think is basically standard reading for AI engineers these days. The tldr ends up being "do not try and specialise for the context, add more parameters/more compute, that will scale better in the long run"

_ericBG

2024-04-02 00:16:14 +0000 UTC

Hi Grant, Firstly, both Chapter 5 and Chapter 6 are amazing. But how much depth to give to Cross Attention for someone's first pass at Attention is something I've debated as well, in the "Build a GPT from Scratch" series I created: https://www.youtube.com/playlist?list=PLf2BgkdQjMYsyPx7HPO4M6aqTfPl2iEPx I think the depth that you decided on is perfect. One thing that might be worth mentioning is that Cross Attention is useful for building multimodal generative models, i.e. an image captioning model, where images are passed into the encoder of the network, and queries come from the images. Additionally, I have created programming exercises that students can solve in browser, that build all the way up to implementing GPT. I shared more about it over DMs, and it would be amazing if you could share it in the video descriptions of Chapters 5 and 6!

Gautam Desai

2024-04-02 00:08:55 +0000 UTC

15:24 typo on "suddenly" great video series as always. for my first time being exposed to these concepts they're fairly heady, I think more examples would be a good idea and wouldn't mind the increase in video length :) I think the hardest part for me to keep track of is exactly what each matrix represents (Q, K, O*, V) and how they're updated by training. If I understand correctly, the sample Query matrix representing "adjectives before the noun" was a lot more literal than any real Query matrix would be, right? Is it correct that they're not typically parseable in a human-readable way? And then how exactly is the K matrix generated? Does it just start randomly and get updated by training like everything else?

genericbandname

2024-04-01 23:47:40 +0000 UTC

Absolutely fabulous series! Your steel/glass table/ball example seems like it would be great to include. The output matrix reworking felt fine as is although I admit I got a bit confused about where the concat/reshape slipped off to until I read your comment. I'll also say that the "car crashed" example seems a bit off given the triangular masking gets mentioned just a minute before. Would it make sense to move that earlier around the mole/tower examples? Or maybe in the middle just to help remind us of the end goal? Any chance we'll get a vision transformer example later sometime? Maybe when you get to positional embeddings? I'm doing my PhD in diffusion stuff so image ML stuff is particularly close to my heart.

Sam Sartor

2024-04-01 23:13:33 +0000 UTC

I’m happy to see you cover such an important yet complex model in detail. Great work! For context of the following: I initially learned about transformers from Jurafsky & Martin (2020), a great book which I recall to have chosen to go with BERT (non-masked SA) in the 2020 edition, so my pov might be biased: 1. Following this bias, my initial impression was that it could be easier to take in, if one went a bit more top-down at the beginning, and abstracted away the K,Q,V transformations to introduce the overall mechanism first, i.e. the “weighted sum part”, but that might of course come with other pitfalls. 2. I found out that I completely overlooked the down- & up-projection and would love to see a more elaborate pointer on it. I’m asking myself why this transformation should be low-rank. Wouldn’t it be more prone to lose information? 3. Ideas for an outlook on cross-attention (CA): A) Encoder/Decoder models that circumvent the bottleneck of previous RNN architectures. B) Multimodal fusion, i.e. text-to-image/image-to-text CA; or, to stick with the weighted sum level of abstraction: “rebuilding one modality from the other” (see e.g. GroundingDINO: https://arxiv.org/abs/2303.05499) 4. I’d like to see the additional motivation for multiple layers. Assuming you’re not 100% satisfied with the comparison you suggested, maybe just go a bit more general towards higher layers being capable of higher levels of abstraction? 5. I would highly recommend to also call out the latest edition (of 2023) of the mentioned book. It assumes little to no knowledge of DL, and can be read chronologically. Plus, it's open source: https://web.stanford.edu/~jurafsky/slp3/ Hope some of this is helpful; would love to talk more about it! Generally, I hope you go all in with “short on screen full text explanations”, so that even people who already know the model can get a rock hard fundament all the way to the bottom (see thought 2). Thank you for everything you do Grant. Keep up the good work!
Warm regards, Alexander

Alexander B

2024-04-01 22:28:43 +0000 UTC

I am very impressed by the attention (pun alert) to detail in the illustrations, for example how verdoyant + forest turns into a green forest. FWIW, I think the glass ball example would be great to explain the usefulness of multiple layers. Cross attention would also be worth mentioning, and I think language translation is a natural example to illustrate its usefulness.

Michel Speiser

2024-04-01 22:06:10 +0000 UTC

Just wanted to say a sincere thank you for this video series, watched manyyyy of your math videos a year or so ago when I was independently trying to learn the basics of transformers from a variety of sources. This series is so well explained and what pushed me to subscribe to the patreon. Looking forward to the rest!

John Dorian

2024-04-01 22:02:10 +0000 UTC

Fabulous - many thanks. And I love the parameter size math. Everyone these days seems to be working to expand the window size so LLMs can reason over larger segments of text. So I'm curious how window size relates to the other parameters you describe.

Neal McBurnett

2024-04-01 21:54:55 +0000 UTC

This is absolutely awesome. 😎 I hope you'll mention the difference between attention and self-attention in the 3rd part as that can be confusing.

Milan Todorović

2024-04-01 21:40:22 +0000 UTC

Looks great so far, Grant! I'm currently learning about attention and transformers for my job right now, so I got very excited about this set of videos. I also just finished reading chapter 12 of "Understanding Deep Learning" by S. Prince (draft PDF is available here for any other interested readers: https://udlbook.github.io/udlbook/ ) Not sure if you are aware of it already, but I found the diagrams in that chapter to be particularly helpful in understanding the relationships between Keys, Queries, and Values.

Kyle M. Kabasares

2024-04-01 20:44:34 +0000 UTC

Only minor comments!
* "Examples to motivate multiple layers": yes! I think one major strength of your videos is precisely that they motivate these kinds of questions well and it'd be really valuable to include the example you give.
* At 5:16, I had to rewatch your explanation of "anytime you see me put a matrix next to an arrow like this" several times, because I thought that by "an arrow like this" you meant the arrows on the vector symbols. I don't know if that's an idiosyncratic confusion, but thought I'd mention it.

Kevin L

2024-04-01 20:40:37 +0000 UTC

* I'm curious to what extent the designers of such a language model must plan out features like "we need some attention heads for adjectives, for pronouns, for (insert English grammar features here)" vs the opposite end of the spectrum of "let's set up this many dozens of attention heads and train it and then maybe look for grammar rules we're already familiar with in the mess of neurons after training is complete"? All of my arms-length research up until this point has suggested that absolutely zero understanding of grammar was inferred by the original coders, which if true would at least bode well for the model to properly learn every real-world nuance of grammar that our school lessons oversimplify and obfuscate, as well as improving its ability to understand grammar and other cultural associations in an endless number of possible foreign languages and dialects.
* I'm also curious about the "rotation" that must occur between the initial state of assigning a column to each token, and the final state of columns representing predictions of next tokens with each row representing a probability in each column for said next token being some embedding-dictionary-token-index. Like, the dimensions in embedding don't sound as though they would pull double-duty as probabilities for other token indices, but by the time all of the cooking is done that somehow has to be the case. So: is that something that happens gradually throughout the model, or is it a single step (or series of steps) of linear algebra performed near the end to fold things partly over, or are there some elements of both strategies involved? Asked another way: At step zero (each column is a pure token embedding) how much meaning does each embedding's coefficient along dimension K have relating to the token at index K? None? Some? And at step N, have the coefficients along dimension K picked up some or none of the meaning relating to token index K, at least in terms of how the entire shape of the model was initially engineered?
* And finally I am curious about this 12k dimensional "concept space" in the same way that a graphic designer might be curious about a color space. We can name a lot of colors, and then plot them into a color space. But if we want to name still more colors, after we have exhausted our most obscure single-word descriptions like "vermilion, burgundy, papaya", etc., we must eventually break down into phrases such as "burnt umber, mother of pearl", etc. It would be interesting to see if one could perform the opposite of an embedding step (translating a token into the concept space and getting a vector) by starting with a vector, and then using some possibly relatively simple variant of a language model to output a multi-token description of the concept pointed to by that vector. I know that with Word2Vec one can pull a word-cloud (actually a token-cloud, but IIUC for most embedding schemes some really well known entire phrases are tokens too) weighted by distance-from-point for any vector one wishes to query, but I feel it would be really diagnostically enlightening to get plain language descriptions of the encoded conceptual nature of a given exact point in the space instead of simply a list of landmarks and distances away from the point. There would obviously be some embellishment, and running the same query multiple times with different seeds would naturally offer different descriptions, but hopefully they would all conceptually concur with one another to a fairly reliable extent. And it would allow one to better explore that semantic space and how it might be seen by a language model trained against it.

Jesse Thompson

2024-04-01 20:31:05 +0000 UTC

Comments

They're created through random initialisation. They are tuned using a similar method to the ones described earlier in the original series. In fact, there really isn't a whole lot going on differently here, just a lot more operations in total.

Watched the updated version and it is much clearer now. Thank you so much!

Thank you so much for the support, Grant! Excited to hear your thoughts on my other comment as well

It's all a question of convention, I suppose, sort of like the big-endian vs. little-endian wars. It's probably my bias from a math background, but I'm partial to thinking in a column-major way.

You don't always apply this masking, and in particular, for transformers encoding images, it wouldn't be masked like this. I tried to add a few more words about this masking in the updated version, since it seemed to be a common source of confusion.
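For readers who want to see the masking concretely, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code behind the video: the function name `causal_softmax` and the toy scores are made up, and queries are laid out on rows (so the masked entries sit above the diagonal; in a layout with queries on columns they would sit below it).

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over scores[i][j] (query i, key j),
    with every later key j > i masked out."""
    weights = []
    for i, row in enumerate(scores):
        # Masked entries become -inf, so they contribute exp(-inf) = 0.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 3-token context.
scores = [[0.5, 2.0, -1.0],
          [1.0, 0.0, 3.0],
          [0.2, 0.2, 0.2]]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

With masking on, the first token can only attend to itself, and every row still sums to 1 because the normalization happens after the masked entries are zeroed out. Dropping the `j <= i` condition gives the unmasked variant mentioned above.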

Agreed this was not at all clear in the initial draft. I've swapped out the video above for an updated version which hopefully spells it out more clearly.

Thanks so much!

I think the first draft was a bit fast, I've tried to add more spacing and reflection in the updated version.

In the updated video, I added a bit more reflection and stepping back to give more of a chance for things to sink in. Hopefully that helps!

Nice, thanks for making this! I'll include it in the description.

Ah, well-spotted, in the updated version I've restructured it so that "crash" precedes car.

Hi Alexander, thanks for the thorough comment. I won't be able to structurally reformat things, and I do have some biases of my own, but I've definitely incorporated some of these thoughts into the updated version.

Thanks for joining, I'm glad you enjoyed it!

Hey Neal, thanks for the note. In a newer version I just added above, I included a little reflection about how the attention pattern scales with context size, and how it's an active area of research to make that more scalable. Hopefully that helps.

I'll probably throw in a 1-minute reference to cross attention in the final version of this one.

Hey Kevin, in a new version I just uploaded above I tried to zoom in to make that arrow point clearer, thanks for the note.

Great video. One item I feel was skimmed over (and maybe intended for a later video) is how those attention heads are trained/determined.

I think I finally understood

I was similarly confused about the "car crashed" example given the triangular masking. If that is the case, how is it possible that "crashed" would ever influence the intended meaning of "car"?

I think ‘the glass ball’ example is great to explain the rationale behind the stacking of attention layers. I just have trouble reconciling it with the findings of the following paper: https://arxiv.org/abs/2012.14913.

Fantastic video! One comment: in the video it's not clear whether W_Q is a single matrix, or whether each W_Q is a different matrix.

Looking forward to your MLP video

"Attention is all you need" - The title hits you in more ways than one.

I also wanted to ask what it means to have a decoder-only Transformer, an encoder-only Transformer, or an encoder-decoder Transformer, how each is more suited to a different task (e.g. word completion, translation, etc.), and how (if at all) the attention masking process is different.

It seems like it was google after all who created the GPTs lol

Thank you so much Grant. This is the best explanation of Embeddings and Transformer models ever. I have had a look at Wolfram's description and Lena Voita's; these are nice, but for me they don't even come close to decoding the important points you mentioned here, and ofc the visualizations. Thanks!

Thank you for this excellent video. Can't wait to watch the next chapter.

There is no mention of biases, only weights. I think biases deserve some mention, even if to say they are absent for simplicity's sake.

Brilliant! Thank you for this great video. I hope somewhere further in this series you'll also explain LoRA and quantization. Thanks!

But each head gets its own learned Q and K matrices. Wouldn't you expect them to drift into separate pattern detection specialties, similar to how any two peer neurons in a single layer do?

@Stuart - This is part of my confusion as well. It *might* be that K and Q are separate so they can be trained separately, since the information they encode is different, but at inference time it feels like one should be able to effectively combine them.
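A small numeric check of the "combine them" idea, as a sketch with made-up toy matrices (the sizes, values, and helper names are hypothetical). It only uses the identity (X W_Q)(X W_K)^T = X (W_Q W_K^T) X^T, so the raw attention scores come out the same either way.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Toy sizes (assumed): 3 tokens, embedding dimension 4, head dimension 2.
X = [[1, 0, 2, -1],
     [0, 1, 1, 3],
     [2, -1, 0, 1]]                        # one row per token embedding
W_Q = [[1, 0], [0, 1], [1, 1], [2, -1]]    # 4 x 2 query map
W_K = [[0, 1], [1, 0], [-1, 2], [1, 1]]    # 4 x 2 key map

# Separate route: project to queries and keys, then take dot products.
Q = matmul(X, W_Q)
K = matmul(X, W_K)
scores_separate = matmul(Q, transpose(K))

# Combined route: fold both maps into one 4 x 4 matrix first.
M = matmul(W_Q, transpose(W_K))
scores_combined = matmul(matmul(X, M), transpose(X))

print(scores_separate == scores_combined)
```

The combined matrix has embedding-dimension by embedding-dimension entries but rank at most the head dimension, so keeping the two thin factors stores far fewer numbers when the head dimension is much smaller than the embedding dimension. That is one plausible reason to keep them separate, stated here as an observation rather than as the designers' rationale.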

Thanks @Unhappy - That is a good way to think about it! Obviously the reality of these matrices is that they encode more sophisticated relationships than just nouns and adjectives, but I like the example.

Great question. In my understanding of the video, the first token would never change, since it is not influenced by any other token during inference. Is this correct? Or does the blinding of other tokens only happen during training?

This was my question as well - how do the multiple attention heads in each layer get combined?

When you state that the attention matrix has values between negative infinity and infinity before the softmax, the example in the video seems to have values between -100 and 100. Not sure if that is just coincidence, but it might be confusing to someone else. Otherwise, this video is incredible!

Is attention all you need?

Interesting. I guess by keeping the tokens from influencing previous ones you do save a lot of computational work, despite maybe losing a bit of context. Didn’t even think about the vector for “red” getting influenced by “tree” though. Good point!

same here

I have a doubt related to the Temperature factor you described. When I ask ChatGPT to "think like a playwright at Walt Disney", does this adjust the Temperature in the back-end? As in making the responses more "creative sounding"?

+1. I get that we can do all that in parallel but how do we put it back together?

For the multi-headed attention, can you work through an example? I didn't understand if you took the original embedding and then sum the 'change' according to each value.
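A rough worked example of one way to read the video's description: each head proposes a change to a token's embedding (a weighted sum of value vectors, using that head's attention weights), and the changes from all heads are summed and added to the original embedding. The function names and numbers below are hypothetical, and the value vectors are assumed to already live in the embedding space (after the up-projection).

```python
def head_delta(weights, values):
    """Weighted sum of value vectors: the change one head proposes for one token."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

def combine_heads(embedding, deltas):
    """Add every head's proposed change to the original embedding."""
    updated = list(embedding)
    for delta in deltas:
        updated = [u + d for u, d in zip(updated, delta)]
    return updated

embedding = [1.0, 0.0, -2.0]  # toy 3-dimensional embedding of one token

# Head A attends only to the first of two tokens.
delta_a = head_delta([1.0, 0.0], [[0.5, 0.0, 0.0], [9.0, 9.0, 9.0]])
# Head B splits its attention evenly between the two tokens.
delta_b = head_delta([0.5, 0.5], [[0.0, 2.0, 0.0], [0.0, 0.0, 4.0]])

updated = combine_heads(embedding, [delta_a, delta_b])
print(delta_a, delta_b, updated)
```

In many implementations the heads' outputs are instead concatenated and passed through a single output matrix; that is algebraically the same as summing per-head contributions, just organized differently.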

Amazing series, thanks for creating these. So here's a bit of Woo I was thinking of: all of those tunable gpt parameters remind me of a synthesizer, and in a similar way, does it produce an overall waveform? And if so, is that "thought" and analogous to brain waves?

Nevermind. On second viewing, it seems clear enough. I must have not paid enough attention.

I subscribed on patreon just to get access to this video immediately!

I was thinking of the French language, where adjectives tend to come after the noun. I get the motivation for not wanting later words to influence earlier ones, but I'd have liked to understand how the model deals with cases like these.

Agreed, this is unclear, especially since later in the video it shows an example of a later word influencing a former word.

That's the bit I had to rewatch, pause, take notes. It's all there, it's just that it's very dense, with a lot packed into a short segment. Too much to understand in real time.

Got a bit confused around the query, key and value part - would suggest adding a bit more information

Just an update: Watching a second time, with pauses, repeats and taking notes helped me get a much better understanding. Again, the video is excellent, it's just dense.

Yeah, I get the sense that there's wide agreement that ChatGPT has essentially passed the Turing test, and some sentiment that the Turing test wasn't that good a test actually. "AI" is whatever the models still can't do.. xD

I found myself wondering why exactly we don't want later words in the prompt to influence earlier words. What about phrases where an adverb modifies a verb, and it comes *after* the verb? "She ran quickly".

For your very first question, there's a famous mini essay called "the bitter lesson" which I think is basically standard reading for AI engineers these days. The tldr ends up being "do not try and specialise for the context, add more parameters/more compute, that will scale better in the long run"
