CS224N
Word Vectors
Word2vec: prediction function
1.Skip-gram
Given the center word, predict the surrounding context words.
L(θ): likelihood of predicting the words in the context window, given each center word
m: size of the context window
w_t: the center word at position t
For each position t = 1, …, T, multiply together the probabilities of predicting every word w_{t+j} inside the window, given the center word w_t
J(θ): objective function / loss function
Maximizing a product of probabilities => minimizing a sum of negative log-probabilities
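Written out, the likelihood and the objective are:

$$
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P\big(w_{t+j} \mid w_t; \theta\big),
\qquad
J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P\big(w_{t+j} \mid w_t; \theta\big)
$$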
Softmax Function: scales arbitrary values x_i into a probability distribution
- max: amplifies the probability of the largest x_i
- soft: still assigns some probability to the smaller x_i
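For reference:

$$
\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}
$$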
Prediction Function
Use two vectors per word w: this simplifies the math and makes optimization easier.
- v_w: vector for w when it is the center word
- u_w: vector for w when it is a context word
c: center word
o: context word
u_o^T v_c: dot-product similarity of o and c
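Putting these together, the prediction function is a softmax over dot products:

$$
P(o \mid c) = \frac{\exp\big(u_o^{\top} v_c\big)}{\sum_{w \in V} \exp\big(u_w^{\top} v_c\big)}
$$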
The division normalizes over the entire vocabulary to give a probability distribution.
Gradient Descent
After gradient descent, v_w and u_w end up very similar, so in practice their average is usually taken as the final word vector for w.
Problems:
- Very expensive to compute, because J(θ) involves every window in the entire corpus
- Hard to escape saddle points / local optima
Solution: Stochastic gradient descent (SGD)
- Update on a small batch: within each batch, the "corpus" contains only the words inside the sampled window(s)
2. SGNS: Skip-gram with Negative Sampling
Use word probabilities to choose k negative samples
- P(w): word probabilities (the distribution negatives are sampled from)
- k: number of negative samples
For each positive example (the center word paired with one word from its context), sample several negative examples (the center word paired with random other words) and train a binary logistic regression.
The goal is to maximize the probability that the center word co-occurs with its context word and to minimize the probability that it co-occurs with the other (negative) words.
Questions:
- [x] Why is only one positive sample chosen per center word? Maybe it is a heavy simplification? Check the code.
- The code indeed uses only one positive example per training pair.
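A minimal numpy sketch of one SGNS update, assuming one positive pair plus k negatives drawn from some distribution P(w); the function name, toy vocabulary, and hyperparameters are made up for illustration, not taken from the assignment code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(V, U, center, positive, neg_probs, k=5, lr=0.05):
    """One SGNS update: binary logistic regression on one positive pair and k negatives.

    V: (vocab, d) center-word vectors, U: (vocab, d) outside-word vectors.
    neg_probs: sampling distribution P(w) over the vocabulary.
    """
    negatives = rng.choice(len(neg_probs), size=k, p=neg_probs)

    v_c = V[center]                    # center vector
    u_o = U[positive]                  # true outside (context) vector
    u_neg = U[negatives]               # (k, d) negative outside vectors

    pos_score = sigmoid(u_o @ v_c)     # want this close to 1
    neg_score = sigmoid(u_neg @ v_c)   # want these close to 0

    loss = -np.log(pos_score) - np.log(1.0 - neg_score).sum()

    # Gradients of the binary logistic losses.
    grad_v = (pos_score - 1.0) * u_o + neg_score @ u_neg
    grad_uo = (pos_score - 1.0) * v_c
    grad_uneg = neg_score[:, None] * v_c[None, :]

    V[center] -= lr * grad_v
    U[positive] -= lr * grad_uo
    np.add.at(U, negatives, -lr * grad_uneg)   # handles repeated negatives safely
    return loss

# Toy usage: random vectors and a uniform "unigram" distribution over 10 words.
vocab, d = 10, 8
V = 0.01 * rng.standard_normal((vocab, d))
U = 0.01 * rng.standard_normal((vocab, d))
P_w = np.full(vocab, 1.0 / vocab)
print(sgns_step(V, U, center=3, positive=7, neg_probs=P_w))
```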
Language Modeling
Assigns a probability to a piece of text.
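For reference, the probability of a text x^(1), …, x^(T) factorizes by the chain rule:

$$
P\big(x^{(1)}, \dots, x^{(T)}\big) = \prod_{t=1}^{T} P\big(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}\big)
$$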
N-gram Language Models
4-gram Language Model / 3rd-order Markov Model
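Under the 3rd-order Markov assumption, the next-word probability depends only on the previous three words and is estimated from counts; the numerator and denominator below are the counts referred to in the sparsity problems:

$$
P\big(x^{(t+1)} \mid x^{(t)}, x^{(t-1)}, x^{(t-2)}\big) \approx
\frac{\operatorname{count}\big(x^{(t-2)}, x^{(t-1)}, x^{(t)}, x^{(t+1)}\big)}{\operatorname{count}\big(x^{(t-2)}, x^{(t-1)}, x^{(t)}\big)}
$$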
Sparsity Problems:
1. The count in the numerator is 0
- Smoothing: add a small count δ for every w ∈ V
2. The count in the denominator is 0
- Backoff: back off from the 4-gram to a 3-gram, and if necessary all the way down to a bigram
Storage Problems
Need to store counts for all n-grams in the corpus
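A tiny count-based sketch to make the counting and add-δ smoothing concrete; the toy corpus, δ value, and function names are illustrative only:

```python
from collections import Counter

def train_ngram_counts(tokens, n=3):
    """Count n-grams and their (n-1)-gram prefixes over a token list."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, prefixes

def prob(word, context, ngrams, prefixes, vocab, delta=0.1):
    """P(word | context) with add-delta smoothing over the vocabulary."""
    num = ngrams[tuple(context) + (word,)] + delta
    den = prefixes[tuple(context)] + delta * len(vocab)
    return num / den

corpus = "the students opened their books the students opened their minds".split()
vocab = set(corpus)
ngrams, prefixes = train_ngram_counts(corpus, n=3)
print(prob("their", ("students", "opened"), ngrams, prefixes, vocab))
```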
RNN: Recurrent Neural Networks
Advantages:
- Can process any length input
- Computation for step t can use information from many steps back (in theory)
- Model size doesn't increase for longer input context
- Same weights applied on every timestep, so there is symmetry in how inputs are processed
Disadvantages:
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
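For reference, the simple-RNN recurrence these points refer to, with the same weights applied at every timestep (the symbol names here are the usual ones, not necessarily the slides'; e^(t) is the input embedding at step t):

$$
h^{(t)} = \sigma\big(W_h h^{(t-1)} + W_e e^{(t)} + b_1\big), \qquad
\hat{y}^{(t)} = \mathrm{softmax}\big(U h^{(t)} + b_2\big)
$$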
Train
Loss
SGD can be used: e.g., compute the loss J(θ) for a sentence (or a batch of sentences) and update.
Backpropagation
The gradient with respect to a repeated weight is the sum of the gradients from each time it appears
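In symbols, for the repeated weight W_h:

$$
\frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left.\frac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}
$$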
Vanishing and Exploding Gradients
Vanishing gradients:
- Caused by some of the derivatives in the chain rule being very small, which makes the overall gradient very small
- As a result, the model cannot learn long-distance dependencies
Exploding gradients:
- The gradient becomes too large, so the update step overshoots the point of convergence
Generating
- Give a beginning word/token to the RNN
- The RNN then generates text by repeated sampling
Evaluate
The standard evaluation metric for language models is perplexity
- lower is better
If perplexity = 7, the uncertainty of the prediction is the same as:
- rolling a 7-sided die and having to get the one correct answer
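Perplexity is the inverse probability of the corpus, normalized by the number of words; it equals the exponential of the cross-entropy loss:

$$
\mathrm{perplexity} = \prod_{t=1}^{T} \left( \frac{1}{P_{\mathrm{LM}}\big(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}\big)} \right)^{1/T} = \exp\big(J(\theta)\big)
$$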
Importance
- Benchmark task: measure our progress on understanding language
- Subcomponent of many NLP tasks, especially involving generating text or estimating the probability of text
Multi-layer RNNs
- RNNs are already deep in one dimension (over many timesteps)
- By stacking multiple RNNs, the model can learn higher-level features (sentence structure, sentiment polarity)
- Multi-layer RNNs are also called stacked RNNs
LSTM: Long Short-Term Memory RNNs
A solution to the vanishing gradient problem
- On step t, there is a hidden state h(t) and a cell state c(t)
- Both are vectors of length n
- The cell stores long-term information
- The LSTM can read, erase, and write information from the cell (like RAM)
- The selection of which information is erased/written/read is controlled by three corresponding gates
- The gates are also vectors of length n
- On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
- Gates are dynamic: value is computed based on the current context
Key: new information is accumulated in the cell by addition, not by multiplication
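Written out (⊙ is elementwise multiplication; the symbol names are the conventional ones, not necessarily the slides'); note that c^(t) is built by gated addition:

$$
\begin{aligned}
f^{(t)} &= \sigma\big(W_f h^{(t-1)} + U_f x^{(t)} + b_f\big) \\
i^{(t)} &= \sigma\big(W_i h^{(t-1)} + U_i x^{(t)} + b_i\big) \\
o^{(t)} &= \sigma\big(W_o h^{(t-1)} + U_o x^{(t)} + b_o\big) \\
\tilde{c}^{(t)} &= \tanh\big(W_c h^{(t-1)} + U_c x^{(t)} + b_c\big) \\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)} \\
h^{(t)} &= o^{(t)} \odot \tanh\big(c^{(t)}\big)
\end{aligned}
$$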
Summary:
- LSTM makes it easier for the RNN to preserve information over many timesteps
- LSTM doesn’t guarantee that there is no vanishing/exploding gradient
- An LSTM only sees information from earlier timesteps, so a separate backward LSTM can be stacked on top of it (a bidirectional LSTM)
NMT: Neural Machine Translation - the seq2seq model
Train
The whole seq2seq model is optimized as a single system; backpropagation operates “end-to-end”
Greedy Decoding
When decoding, always generate the target sentence by taking the argmax at each step.
Problem:
- The locally best word at each step is not necessarily part of the best target sentence
- No way to undo decisions
Fix: Beam search decoding
On each step of the decoder, keep track of the k most probable partial translations (called hypotheses)
- k is the beam size (in practice around 5 to 10)
- Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search
Example:
Beam size = k = 2
For each of the k hypotheses, find top k next words and calculate scores.
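Each hypothesis y_1, …, y_t is scored by its log-probability under the model:

$$
\mathrm{score}(y_1, \dots, y_t) = \log P_{\mathrm{LM}}(y_1, \dots, y_t \mid x) = \sum_{i=1}^{t} \log P_{\mathrm{LM}}(y_i \mid y_1, \dots, y_{i-1}, x)
$$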
Stop Criterion
In greedy decoding, stop when the model produces an <END> token
In beam search decoding, different hypotheses may produce the <END> token on different timesteps
- When a hypothesis produces an <END> token it is complete; place it aside and continue exploring the other hypotheses via beam search
Stop when:
- We reach timestep T (a pre-defined cutoff), or
- We have at least n completed hypotheses (a pre-defined cutoff)
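A minimal sketch of this loop, assuming a `log_prob_next(prefix)` function that returns log-probabilities over the vocabulary; that function, the token names, and the cutoffs are placeholders, not the course's code:

```python
import math

def beam_search(log_prob_next, k=2, max_steps=20, n_completed=2, end="<END>"):
    """Beam search decoding with the two stopping criteria above (a sketch)."""
    hypotheses = [(["<START>"], 0.0)]            # (prefix, summed log-probability)
    completed = []

    for _ in range(max_steps):                   # pre-defined cutoff T
        candidates = []
        for prefix, score in hypotheses:
            logps = log_prob_next(prefix)        # dict: word -> log P(word | prefix)
            for word in sorted(logps, key=logps.get, reverse=True)[:k]:
                candidates.append((prefix + [word], score + logps[word]))
        candidates.sort(key=lambda h: h[1], reverse=True)

        hypotheses = []
        for prefix, score in candidates[:k]:     # keep the k best partial translations
            if prefix[-1] == end:
                completed.append((prefix, score))   # complete: set it aside
            else:
                hypotheses.append((prefix, score))
        if len(completed) >= n_completed or not hypotheses:
            break

    finished = completed or hypotheses
    # normalize by length so longer hypotheses are not unfairly penalized
    return max(finished, key=lambda h: h[1] / len(h[0]))

# Toy next-word model: prefers "hi", then "<END>" once "hi" has been produced.
def toy_log_prob_next(prefix):
    scores = {"hi": -2.0, "there": -1.0, "<END>": -0.1} if prefix[-1] == "hi" \
        else {"hi": -0.1, "there": -1.0, "<END>": -2.0}
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {w: s - norm for w, s in scores.items()}

print(beam_search(toy_log_prob_next))
```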
Finishing up
Problem:
Each hypothesis y_1, …, y_t on the list has a score, but longer hypotheses have lower scores (every additional log-probability term is negative)
Fix:
Normalize by length
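The length-normalized score used to compare completed hypotheses:

$$
\frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}(y_i \mid y_1, \dots, y_{i-1}, x)
$$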
Evaluate: BLEU (Bilingual Evaluation Understudy)
BLEU compares the machine-written translation to one or several human-written translation(s) and computes a similarity score based on:
- n-gram precision (usually for 1-, 2-, 3- and 4-grams)
- Plus a penalty for too-short system translations
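Roughly, with p_n the n-gram precisions, BP the brevity penalty, r the reference length and c the candidate length (this is the common geometric-mean form; exact weights vary between implementations):

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{4} \frac{1}{4} \log p_n \right),
\qquad
\mathrm{BP} = \min\big(1, \; e^{\,1 - r/c}\big)
$$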
BLEU is useful but imperfect
- There are many valid ways to translate a sentence, so a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation
Difficulties
- OOV (out-of-vocabulary) words
- Domain mismatch between train and test data
- Maintaining context over long text
- Low-resource language pairs
- Failures to accurately capture sentence meaning
- Pronoun (or zero pronoun) resolution errors (e.g. Chinese)
- Morphological agreement errors (form of words and phrases)
Self-Attention and Transformers
Self-Attention
Problems with RNNs:
- Linear interaction distance: O(sequence length)
- Lack of parallelizability
Fix: Word window
- Give each embedding a word window for computing its context, removing the dependence across timesteps
- With word window size = 5, the first layer covers a context of length 5 and the second layer covers a context of length 9
- But when the sequence is very long, distant context is still lost
Improvement: Self-Attention
- In self-attention, the queries qi, keys ki, values vi are drawn from the same source
- Self-attention operation is as follows:
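Compute scores between queries and keys, normalize them with a softmax, and output a weighted sum of the values:

$$
e_{ij} = q_i^{\top} k_j, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad
\mathrm{output}_i = \sum_{j} \alpha_{ij}\, v_j
$$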
Advantages (compared to a full-length word window):
Dynamic connectivity:
- The weights of a fully connected layer are learned iteratively during training: they encode which units to attend to, but different sentences have different structure and need attention on different units
- Self-attention is a dot product of queries and keys, so it depends on the actual text
Interaction:
- In a fully connected layer, the parts are connected independently: there is no interaction with the current query and no interaction with the other keys
Barriers and Solutions for Self-Attention as a building block
1.Doesn’t have an inherent notion of order
⓵ Position representation vectors through sinusoids
Pros:
- Periodicity indicates that maybe “absolute position” isn’t as important
- Maybe it can extrapolate to longer sequences as the periods restart
Cons:
- Not learnable
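The sinusoidal position vectors in ⓵ concatenate sines and cosines of varying frequency (as in the original Transformer paper), for position i and dimension index j:

$$
p_{i,\,2j} = \sin\!\left(\frac{i}{10000^{2j/d}}\right), \qquad
p_{i,\,2j+1} = \cos\!\left(\frac{i}{10000^{2j/d}}\right)
$$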
⓶ Position representation vectors learned from scratch: learn a matrix
Pros:
- Flexibility: each position gets to be learned to fit the data
Cons:
- Definitely can't extrapolate to indices outside 1, …, T
2.No nonlinearities for deep learning magic
Just apply the same feedforward network to each self-attention output
3. Need to ensure we don't “look at the future” when predicting
Mask the future in self-attention by setting the attention scores of future positions to −∞
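A small sketch of the masking step, assuming a score matrix `scores` of shape (T, T) where entry (i, j) is how much position i attends to position j (numpy used purely for illustration):

```python
import numpy as np

def causal_mask(scores):
    """Set scores for future positions (j > i) to -inf before the softmax."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # strictly upper triangle
    masked = np.where(future, -np.inf, scores)
    # softmax over the last axis; future positions get zero attention weight
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

print(causal_mask(np.random.randn(4, 4)).round(2))
```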
Transformers (what's more)
Key-Query-Value Attention
x_1, …, x_T are input vectors to the Transformer encoder, x_i ∈ R^d
- k_i = K x_i, K ∈ R^{d×d} is the key matrix
- q_i = Q x_i, Q ∈ R^{d×d} is the query matrix
- v_i = V x_i, V ∈ R^{d×d} is the value matrix
Calculate:
Let X = [x_1; …; x_T] ∈ R^{T×d} be the stacked input vectors
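Then the whole self-attention output is computed in matrix form as:

$$
\mathrm{output} = \mathrm{softmax}\big(XQ\,(XK)^{\top}\big)\, XV \;\in\; \mathbb{R}^{T \times d}
$$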
Multi-headed attention
Look at multiple places in the sentence at once
h is the number of attention heads, and l ranges from 1 to h
- Does not increase the actual amount of computation
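Each head gets its own lower-dimensional projections, so the total work stays roughly the same as one full-dimensional head (one common way to write it):

$$
Q_l, K_l, V_l \in \mathbb{R}^{d \times \frac{d}{h}}, \qquad
\mathrm{output}_l = \mathrm{softmax}\big(XQ_l K_l^{\top} X^{\top}\big)\, XV_l, \qquad
\mathrm{output} = [\mathrm{output}_1; \dots; \mathrm{output}_h]\, Y, \; Y \in \mathbb{R}^{d \times d}
$$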
Training Tricks:
Residual Connections
Help models train better
- Instead of X^(i) = Layer(X^(i-1)), where i indexes the layer,
- let X^(i) = X^(i-1) + Layer(X^(i-1))
Layer Normalization
Help models train faster
Scaled Dot Product
When the vector dimensionality d is large, dot products easily become very large, which makes the softmax peaky and leaves the remaining gradients very small
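The fix is to divide the scores by √(d/h) before the softmax:

$$
\mathrm{output}_l = \mathrm{softmax}\!\left(\frac{XQ_l K_l^{\top} X^{\top}}{\sqrt{d/h}}\right) XV_l
$$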
Cross-attention
The keys and values are drawn from the encoder(like a memory)
The queries are drawn from the decoder
- Let H = [h_1; …; h_T] ∈ R^{T×d} be the encoder output vectors
- Let Z = [z_1; …; z_T] ∈ R^{T×d} be the decoder input vectors
- k_i = K h_i, v_i = V h_i, q_i = Q z_i