background-image: url('../figs/title.png')

---

class: center, middle

# Chapter 11 - Training and Sampling for Sequences

---

# Generating with RNNs
graph TB
  x1["[start]"] --> r1[RNN]
  r1 --> y1[1]
(source)
Usually a special [start] token to begin generation

Feed output into input of RNN at next time step

Generated: [start] 1

---

# Generating with RNNs
graph TB
  x1["[start]"] --> r1[RNN]
  r1 --> y1[1]
  x2[1] --> r2[RNN]
  r2 --> y2[+]
(source)
Usually a special [start] token to begin generation

Feed output into input of RNN at next time step

Generated: [start] 1 +

---

# Generating with RNNs
graph TB
  x1["[start]"] --> r1[RNN]
  r1 --> y1[1]
  x2[1] --> r2[RNN]
  r2 --> y2[+]
  x3[+] --> r3[RNN]
  r3 --> y3[1]
(source)
Usually a special [start] token to begin generation

Feed output into input of RNN at next time step

Generated: [start] 1 + 1

---

# Generating with RNNs
graph TB
  x1["[start]"] --> r1[RNN]
  r1 --> y1[1]
  x2[1] --> r2[RNN]
  r2 --> y2[+]
  x3[+] --> r3[RNN]
  r3 --> y3[1]
  x4[1] --> r4[RNN]
  r4 --> y4[=]
(source)
Usually a special [start] token to begin generation

Feed output into input of RNN at next time step

Generated: [start] 1 + 1 =

But wait, how are we even getting the output?

---

# Generating with RNNs
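Concretely, each RNN step combines the current input token with the hidden state and produces a score for every token in the vocabulary; a softmax turns those scores into a distribution. A toy numpy sketch of one step (random weights stand in for a trained model; names like `rnn_step` are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ['0','1','2','3','4','5','6','7','8','9','+','-','*','/','=']
HIDDEN = 32

# random weights standing in for a trained RNN
W_xh = rng.normal(size=(HIDDEN, len(VOCAB)))
W_hh = rng.normal(size=(HIDDEN, HIDDEN))
W_hy = rng.normal(size=(len(VOCAB), HIDDEN))

def rnn_step(x_onehot, h):
    h = np.tanh(W_xh @ x_onehot + W_hh @ h)        # new hidden state
    scores = W_hy @ h                               # one score per vocabulary token
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax: Pr(x_n | x_1 ... x_{n-1})
    return probs, h
```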
graph TB
  x1["[start]"] --> r1[RNN]
  x2[1] --> r2[RNN]
  x3[+] --> r3[RNN]
  x4[1] --> r4[RNN]
  x5[=] --> r5[RNN]
  r5 --> y5[??]
(source)
Output is a probability distribution: `$$Pr(x_n \mid x_1, x_2, \dots x_{n-1})$$`

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | + | - | * | / | = |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0.05|0.1 |0.7 |0.05|0.05|0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.05|0.0 |0.0 |0.0 |

---

# Take the Max
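"Take the max" is just an argmax in a loop. A minimal sketch, assuming a hypothetical `next_token_probs(seq)` that stands in for the trained RNN and returns the model's distribution over the vocabulary:

```python
import numpy as np

VOCAB = ['0','1','2','3','4','5','6','7','8','9','+','-','*','/','=']

def greedy_decode(next_token_probs, max_len=10):
    seq = ['[start]']
    for _ in range(max_len):
        probs = next_token_probs(seq)                 # Pr(x_n | x_1 ... x_{n-1})
        seq.append(VOCAB[int(np.argmax(probs))])      # always take the most likely token
    return seq
```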
graph TB
  x1["[start]"] --> r1[RNN]
  x2[1] --> r2[RNN]
  x3[+] --> r3[RNN]
  x4[1] --> r4[RNN]
  x5[=] --> r5[RNN]
  r5 --> y5[??]
(source)
One option: take the max over possible next tokens

| 0 | 1 | **2** | 3 | 4 | 5 | 6 | 7 | 8 | 9 | + | - | * | / | = |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0.05|0.1 |**0.7** |0.05|0.05|0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.05|0.0 |0.0 |0.0 |

Generated: [start] 1 + 1 = 2

---

# Problem with Max
graph TB
  x1["[start]"] --> r1[RNN]
  r1 --> y1[1]
  x2[1] --> r2[RNN]
  r2 --> y2[+]
  x3[+] --> r3[RNN]
  r3 --> y3[1]
  x4[1] --> r4[RNN]
  r4 --> y4[+]
  x5[+] --> r5[RNN]
  r5 --> y5[1]
  x6[1] --> r6[RNN]
  r6 --> y6[+]
  x7[+] --> r7[RNN]
  r7 --> y7[...]
(source)
Problem: some tokens are much more likely at all times (think: 'the', 'a', 'and', 'on', etc. in text data)

End up generating boring output, e.g.

Generated: 1 + 1 + 1 + 1 + ...

---

# Sampling from RNNs
graph TB
  x1["[start]"] --> r1[RNN]
  x2[1] --> r2[RNN]
  x3[+] --> r3[RNN]
  x4[1] --> r4[RNN]
  x5[=] --> r5[RNN]
  r5 --> y5[4]
(source)
Different option: sample from the probability distribution randomly

| 0 | 1 | 2 | 3 |**4** | 5 | 6 | 7 | 8 | 9 | + | - | * | / | = |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0.05|0.1 |0.7 |0.05|**0.05**|0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.05|0.0 |0.0 |0.0 |

Generated: [start] 1 + 1 = 4

---

# Sampling from RNNs
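The only change from greedy decoding is the selection step: draw from the distribution instead of taking the argmax. A sketch, again with a hypothetical `next_token_probs(seq)` standing in for the model:

```python
import numpy as np

VOCAB = ['0','1','2','3','4','5','6','7','8','9','+','-','*','/','=']
rng = np.random.default_rng()

def sample_decode(next_token_probs, max_len=10):
    seq = ['[start]']
    for _ in range(max_len):
        probs = next_token_probs(seq)              # Pr(x_n | x_1 ... x_{n-1})
        seq.append(rng.choice(VOCAB, p=probs))     # sample a token instead of argmax
    return seq
```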
graph TB
  x1["[start]"] --> r1[RNN]
  x2[1] --> r2[RNN]
  x3[+] --> r3[RNN]
  x4[1] --> r4[RNN]
  x5[=] --> r5[RNN]
  r5 --> y5[4]
  x6[4] --> r6[RNN]
  r6 --> y6[-]
(source)
Different option: sample from the probability distribution randomly

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | + | - | * | / | = |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0.05 |0.0 |0.0 |0.05 |0.0 |0.0 |0.0 |0.1 |0.0 |0.0 |0.1 |**0.6** |0.0 |0.0 |0.0 |

Generated: [start] 1 + 1 = 4 -

---

# Sampling from RNNs
graph TB
  x1["[start]"] --> r1[RNN]
  x2[1] --> r2[RNN]
  x3[+] --> r3[RNN]
  x4[1] --> r4[RNN]
  x5[=] --> r5[RNN]
  r5 --> y5[4]
  x6[4] --> r6[RNN]
  r6 --> y6[-]
  x7[-] --> r7[RNN]
  r7 --> y7[2]
(source)
Different option: sample from the probability distribution randomly

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | + | - | * | / | = |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0.0 |0.05 |**0.9** |0.05 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |

Generated: [start] 1 + 1 = 4 - 2

---

# Benefits of Sampling

More diverse output!

Usually safe when output is obvious (Pr > .9)

Can generate synonymous statements:

The dog went to the park

The dog walked to the park

The dog trotted to the park

...

---

# Problem with Sampling
graph TB
  x1["[start]"] --> r1[RNN]
  x2[1] --> r2[RNN]
  x3[+] --> r3[RNN]
  x4[1] --> r4[RNN]
  x5[=] --> r5[RNN]
  r5 --> y5[4]
  x6[4] --> r6[RNN]
  r6 --> y6[-]
  x7[-] --> r7[RNN]
  r7 --> y7[2]
(source)
Sometimes you make mistakes! Randomness can bite you

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | + | - | * | / | = |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0.0 |**0.05** |0.9 |0.05 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |

Generated: [start] 1 + 1 = 4 - 1

---

# Problem with Sampling

Want a way to generate diverse output

BUT

Don't want it to be too out there (i.e. still want correct grammar/syntax/sentence structure, etc.)

---

# Solution: Beam Search

Generate tokens via sampling to increase diversity

Remember multiple possible generated options

Prune unlikely options, expand more likely ones

Return most likely option (or something smarter?)

---

# Beam Search Algorithm

    beam_search(model, n, length):
      beams = [ [start] ]          # begin with just the start token
      for i in 0..length:
        new_beams = []
        for beam in beams:
          compute Pr(x | beam)
          sample n items from x:
            new_beam = beam
            new_beam += sample
            new_beams.append(new_beam)
        sort(new_beams)            # sort by sum of log prob
        beams = new_beams[:n]      # only keep n total beams

---

# Beam Search

Prob 0.5: [start] 1

Prob 0.3: [start] 2

Prob 0.1: [start] 4

---

# Beam Search Expand

[start] 1 +

[start] 1 -

[start] 1 =

[start] 2 -

[start] 2 +

[start] 2 *

[start] 4 -

[start] 4 /

[start] 4 =

---

# Beam Search Evaluate

Prob 0.2: [start] 1 +

Prob 0.1: [start] 1 -

Prob 0.4: [start] 1 =

Prob 0.2: [start] 2 -

Prob 0.1: [start] 2 +

Prob 0.1: [start] 2 *

Prob 0.3: [start] 4 -

Prob 0.5: [start] 4 /

Prob 0.1: [start] 4 =

---

# Beam Search Prune

Prob 0.5: [start] 4 /

Prob 0.4: [start] 1 =

Prob 0.3: [start] 4 -

---

# Beam Search Expand

[start] 4 / 2

[start] 4 / 1

[start] 4 / 1

[start] 1 = 3

[start] 1 = 2

[start] 1 = 1

[start] 4 - 4

[start] 4 - 0

[start] 4 - 2

.... etc.

---

# Temperature: Controlling the Randomness

Remember, our model predicts unnormalized probabilities

We take softmax to convert to probabilities

`$$\sigma(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$`

| $$x$$ | 1 | -4 | 2 | 4 | 3 |
|----|----|----|----|----|----|
|$$\sigma(x)$$|.0321 | .0002 | .0871 | .6438 | .2368 |

No real "reason" this is the right end distribution.

We may want to encourage diversity, make our distribution more random.

Or we may want to focus only on most likely elements.
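One common knob for this is a temperature parameter (formalized on the next slide): divide the scores by `\(t\)` before the softmax. A small numpy sketch that reproduces the tables on this and the following slides:

```python
import numpy as np

def softmax(x, t=1.0):
    z = np.asarray(x, dtype=float) / t   # divide by temperature first (t=1 is plain softmax)
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = [1, -4, 2, 4, 3]
print(softmax(x))          # ~ [.0321, .0002, .0871, .6438, .2368]
print(softmax(x, t=0.3))   # ~ [.0000, .0000, .0012, .9643, .0344]
print(softmax(x, t=10.0))  # ~ [.1893, .1148, .2092, .2555, .2312]
```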
---

# Temperature: Controlling the Randomness

Softmax with temperature parameter, first divide by temperature then apply softmax:

`$$\sigma_t(x)_i = \frac{e^{x_i/t}}{\sum_j e^{x_j/t}}$$`

| $$x$$ | 1 | -4 | 2 | 4 | 3 |
|----|----|----|----|----|----|
|$$\sigma_1(x)$$|.0321 | .0002 | .0871 | .6438 | .2368 |

---

# Low Temperature: More Like Argmax

Softmax with temperature parameter, first divide by temperature then apply softmax:

`$$\sigma_t(x)_i = \frac{e^{x_i/t}}{\sum_j e^{x_j/t}}$$`

| $$x$$ | 1 | -4 | 2 | 4 | 3 |
|----|----|----|----|----|----|
|$$\sigma_1(x)$$|.0321 | .0002 | .0871 | .6438 | .2368 |
|$$\sigma_{0.3}(x)$$|.0000| .0000| .0012| .9643| .0344 |

---

# High Temperature: More Uniform

Softmax with temperature parameter, first divide by temperature then apply softmax:

`$$\sigma_t(x)_i = \frac{e^{x_i/t}}{\sum_j e^{x_j/t}}$$`

| $$x$$ | 1 | -4 | 2 | 4 | 3 |
|----|----|----|----|----|----|
|$$\sigma_1(x)$$|.0321 | .0002 | .0871 | .6438 | .2368 |
|$$\sigma_{0.3}(x)$$|.0000| .0000| .0012| .9643| .0344 |
|$$\sigma_{10}(x)$$|.1893| .1148| .2092| .2555| .2312|

---

# Softmax Temperature

Temperature doesn't affect the ordering of elements, just their relative magnitude

High temperature: more "random", less reliant on model output

Low temperature: less "random", magnifies relative differences in model output

| $$x$$ | 1 | -4 | 2 | 4 | 3 |
|----|----|----|----|----|----|
|$$\sigma_1(x)$$|.0321 | .0002 | .0871 | .6438 | .2368 |
|$$\sigma_{0.3}(x)$$|.0000| .0000| .0012| .9643| .0344 |
|$$\sigma_{10}(x)$$|.1893| .1148| .2092| .2555| .2312|

[Colab demo here!](https://colab.research.google.com/drive/1Q1AIGsizj3fGX3uao_jkygheyNfCz_q_?usp=sharing)

---

# Evaluating RNNs

How do we evaluate our sequence models?

One option: accuracy

Given a sequence `\(x_1, x_2, \dots, x_n\)`, how often does our model predict the ground truth label `\(x_{n+1}\)`?

However, our models often have a large vocabulary and there are ambiguities! Accuracy may be very small even for good models

---

# Evaluating RNNs

Another option: likelihood

Instead, measure the probability our model assigns to the ground truth:

`$$ \text{likelihood} = \prod_{i} \Pr(x_i \mid x_1, x_2, \dots, x_{i-1})$$`

However, these probabilities are often quite small (especially with large vocabularies) and will underflow floats. So in practice we often use the log-likelihood:

`$$ \text{log-likelihood} = \sum_{i} \log[ \Pr(x_i \mid x_1, x_2, \dots, x_{i-1})]$$`

Sometimes instead of maximizing likelihood we want to minimize something, so you can also think about the negative log-likelihood (which is a common loss function).

---

# Negative log-likelihood and cross-entropy

There's another term you may have heard of: entropy, or cross-entropy. The cross-entropy between two probability distributions p and q is:

`$$H(p,q) = - \sum_x p(x) \log q(x)$$`

In our case we can think of p(x) as the ground truth labels and q(x) as our model predictions. Since our labels are one-hot encodings of the correct word/token, p(x) will be 1 if x is the next token and 0 for all other tokens, so cross-entropy is the same as our negative log-likelihood:

`$$ H = \text{NLL} = - \sum_{i} \log[ \Pr(x_i \mid x_1, x_2, \dots, x_{i-1})]$$`

---

# Perplexity

It seems like cross-entropy or NLL would be fine for determining how good a model is. We use them for our loss function!

But noooo, NLP people have to be special so they use something else called **perplexity**.
It's not even that different though: you just exponentiate the average negative log-likelihood:

`$$PP = 2^{-\frac{1}{N} \sum_i \log_2 \Pr(x_i)} $$`

but you can also calculate it with base `\(e\)`:

`$$PP = e^{-\frac{1}{N} \sum_i \ln \Pr(x_i)} $$`

Or, if your model already calculates cross-entropy for its loss function, you can just exponentiate that:

`$$PP = e^H $$`

(where `\(H\)` is the average per-token cross-entropy, using the same base for the log and the exponent)

You can use likelihood or cross-entropy or perplexity. Just make sure that if you compare to other models you have the details consistent! (like which base/exponent, etc.)
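As a sanity check on the bookkeeping, a small sketch with made-up per-token probabilities, showing that the base-2 and base-`\(e\)` versions agree and that perplexity is just the exponentiated average NLL:

```python
import numpy as np

# made-up probabilities the model assigned to the ground-truth tokens
p = np.array([0.7, 0.6, 0.9, 0.4])

nll = -np.log(p).sum()            # negative log-likelihood (natural log)
H = nll / len(p)                  # average per-token cross-entropy
pp_e = np.exp(H)                  # PP = e^H
pp_2 = 2 ** (-np.log2(p).mean())  # same quantity computed in base 2

print(H, pp_e, pp_2)              # pp_e and pp_2 are equal
```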