Word Vectors

Word2vec: prediction function

1. Skip-gram

Given the current (center) word, predict the words in its context.

L(θ): the data likelihood, i.e., the likelihood of predicting the words in the context

m: size of the prediction window

wt: given center word

For each position t, take the product of the probabilities of predicting every context word wt+j inside the window, given the center word wt:

L(\theta)=∏_{t=1}^{T}∏_{-m≤j≤m,j≠0}P(w_{t+j}|w_t;\theta)

J(θ): objective function / loss function

J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}∑_{t=1}^{T}∑_{-m≤j≤m,j≠0}\log P(w_{t+j}|w_t;\theta)

Maximizing the product ⇒ minimizing the sum of negative log probabilities.

Softmax Function: maps arbitrary scores to a probability distribution

  • max: amplifies the probability of the largest xi
  • soft: still assigns some probability to smaller xi

softmax(x_i) = \frac{exp(x_i)}{∑^n_{j=1}exp(x_j)}
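A minimal numpy sketch of the softmax above; the function name and the example scores are illustrative:

```python
import numpy as np

def softmax(x):
    """Map a vector of scores to a probability distribution."""
    x = x - np.max(x)              # subtract max for numerical stability
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))             # larger scores get amplified, smaller ones still get some mass
```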

Prediction Function

We use two vectors per word w to make the math and optimization simpler.

  • vw: vector when w is a center word
  • uw: vector when w is a context word

c : center word

o : context word

P(o|c) = \frac{exp(u_o^Tv_c)}{∑_{w∈V}exp(u^T_wv_c)}

u_o^T v_c : similarity of o and c

The division normalizes over the whole vocabulary to give a probability distribution.
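A small sketch of this prediction function, assuming `U` and `V` are the context and center embedding matrices (the names and shapes are illustrative):

```python
import numpy as np

def p_context_given_center(o, c, U, V):
    """P(o|c): softmax over the dot products u_w^T v_c for all words w in the vocabulary."""
    scores = U @ V[c]                                # u_w^T v_c for every w, shape (|V|,)
    scores -= scores.max()                           # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()    # normalize over the vocabulary
    return probs[o]

vocab_size, dim = 1000, 50
U = np.random.randn(vocab_size, dim) * 0.01          # context vectors u_w
V = np.random.randn(vocab_size, dim) * 0.01          # center vectors v_c
print(p_context_given_center(o=42, c=7, U=U, V=V))
```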

Gradient Descent

\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)

After training with gradient descent, vw and uw become very similar, so their average is usually taken as the final word vector for w.

Problems:

  1. Computation is very expensive, because J(θ) involves the entire corpus
  2. Hard to escape saddle points / local optima

Solution: Stochastic gradient descent (SGD) 随机梯度下降

  • Update on small batches: within each batch, the "corpus" contains only the words inside the sampled windows (see the sketch below)
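A rough sketch of the SGD loop, assuming a hypothetical `grad_J` function that returns the gradient of the loss for one sampled window rather than the whole corpus:

```python
import numpy as np

def sgd(theta, windows, grad_J, lr=0.025, epochs=5):
    """Stochastic gradient descent: update theta using one window (mini-batch) at a time."""
    for _ in range(epochs):
        np.random.shuffle(windows)
        for window in windows:
            theta -= lr * grad_J(theta, window)   # theta_new = theta_old - alpha * gradient
    return theta
```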

2. SGNS: Skip-gram with Negative Sampling

J(\theta) = \frac{1}{T}∑^T_{t=1}J_t(\theta)

J_t(\theta) = \log \sigma(u_o^Tv_c) + ∑^k_{i=1}E_{j\sim P(w)}[\log\sigma(-u_j^Tv_c)]

\sigma(x) = \frac{1}{1+e^{-x}}

Use word probabilities to choose k negative samples:

  • P(w): word probabilities

  • k: number of negative samples

For each positive example (the center word paired with one word from its context), sample several negative examples (the center word paired with random other words) and train a binary logistic regression (see the sketch below).

This maximizes the probability of the center word co-occurring with its context words, and minimizes its probability of co-occurring with the randomly sampled words.
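A minimal sketch of the negative-sampling objective J_t(θ) for one (center, context) pair, assuming the k negative samples were already drawn from P(w); all names and sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_c, u_o, U_neg):
    """J_t = log sigma(u_o^T v_c) + sum_j log sigma(-u_j^T v_c) over k negative samples."""
    positive = np.log(sigmoid(u_o @ v_c))           # push the true context word up
    negative = np.log(sigmoid(-U_neg @ v_c)).sum()  # push the k random words down
    return positive + negative                       # maximized during training

dim, k = 50, 5
v_c, u_o = np.random.randn(dim), np.random.randn(dim)
U_neg = np.random.randn(k, dim)                      # rows u_j for the k negative samples
print(sgns_objective(v_c, u_o, U_neg))
```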

Questions:

  • [x] Why is only one positive sample chosen per center word? Maybe it is a heavy simplification; check the code
    • The code indeed uses only a single positive example

Language Modeling

Assigns probability to a piece of text:

P(x^{(1)},...,x^{(T)})=P(x^{(1)})*P(x^{(2)}|x^{(1)})*...*P(x^{(T)}|x^{(T-1)},...,x^{(1)}) =\prod^T_{t=1}P(x^{(t)}|x^{(t-1)},...,x^{(1)})

N-gram Language Models

4-gram Language Model / 3rd-order Markov Model

P(w_4|w_1w_2w_3) = \frac{count(w_1w_2w_3w_4)}{count(w_1w_2w_3)}

Sparsity Problems:

1. The numerator count is zero

  • smoothing: add a small count δ for every w ∈ V

2. The denominator count is zero

  • backoff: fall back from the 4-gram to a 3-gram, and further down to a bigram if needed (a toy sketch follows this list)
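A toy sketch of a 4-gram estimate with add-δ smoothing and a simple backoff, assuming `count` is a dict of n-gram counts (tuples of words) built from the corpus; the names are illustrative:

```python
def ngram_prob(w4, context, count, vocab_size, delta=0.1):
    """P(w4 | w1 w2 w3) = (count(w1 w2 w3 w4) + delta) / (count(w1 w2 w3) + delta * |V|)."""
    while context:                       # back off to a shorter context if this one was never seen
        if count.get(context, 0) > 0:
            num = count.get(context + (w4,), 0) + delta
            den = count[context] + delta * vocab_size
            return num / den
        context = context[1:]            # 4-gram -> 3-gram -> bigram ...
    return 1.0 / vocab_size              # fallback: uniform over the vocabulary, for simplicity
```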

Storage Problem

Need to store counts for all n-grams in the corpus

RNN: Recurrent Neural Networks


Advantages:

  1. Can process any length input
  2. Computation for step t can (in theory) use information from many steps back
  3. Model size doesn't increase for longer input context
  4. Same weights applied on every timestep, so there is symmetry in how inputs are processed

Disadvantages:

  1. Recurrent computation is slow
  2. In practice, difficult to access information from many steps back

Train

Loss

Loss = J(\theta)=\frac{1}{T}\sum^T_{t=1}J^{(t)}(\theta)

SGD can be used: e.g., compute the loss J(θ) for one sentence (or a batch of sentences) at a time, as in the sketch below
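A compact sketch of the RNN language-model loss for one sentence; the weight names (`E`, `W_h`, `W_e`, `U`, `b1`, `b2`) and shapes are illustrative:

```python
import numpy as np

def rnn_lm_loss(tokens, E, W_h, W_e, U, b1, b2):
    """Average cross-entropy J(theta) = 1/T * sum_t J^(t) over one sentence of token ids."""
    h = np.zeros(W_h.shape[0])
    loss = 0.0
    for t in range(len(tokens) - 1):
        h = np.tanh(W_h @ h + W_e @ E[tokens[t]] + b1)   # recurrent hidden state update
        logits = U @ h + b2                               # scores over the vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        loss += -np.log(probs[tokens[t + 1]])             # J^(t): -log prob of the next word
    return loss / (len(tokens) - 1)
```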

Backpropagation

\frac{\rm d J^{(t)}}{\rm d W_h} = \sum^{t}_{i=1}\frac{\rm d J^{(t)}}{\rm d W_h}\Big|_{(i)}

The gradient with respect to a repeated weight is the sum of the gradient each time it appears.

Vanishing and Exploding Gradients

Vanishing gradients:

  • Caused by some derivatives in the chain rule being very small, which makes the overall gradient vanish
  • As a result, the model cannot learn dependencies on distant context

Exploding gradients:

  • The gradient is too large, so the update overshoots the point of convergence

Generating

  1. Give a beginning word/token to the RNN
  2. The RNN generates text by repeated sampling


Evaluate

The standard evaluation metric for language models is perplexity

  • lower is better

perplexity = \prod^T_{t=1}\left(\frac{1}{P_{LM}(x^{(t+1)}|x^{(t)},...,x^{(1)})}\right)^{1/T}

=\prod^T_{t=1}\left(\frac{1}{\hat{y}^{(t)}_{x_{t+1}}}\right)^{1/T} =\exp(J(\theta))

If perplexity = 7, the uncertainty of the prediction is equivalent to:

  • rolling a 7-sided die and hoping to land on the single correct answer (see the sketch below)
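Since perplexity = exp(J(θ)), it can be computed directly from the average cross-entropy loss; a quick sketch with illustrative numbers:

```python
import numpy as np

# per-timestep negative log-probabilities -log P_LM(x^(t+1) | ...), illustrative values
nll = np.array([1.2, 2.5, 0.9, 3.1, 1.7])

avg_loss = nll.mean()          # J(theta)
perplexity = np.exp(avg_loss)  # geometric mean of 1 / P_LM over the sequence
print(perplexity)              # e.g. ~7 means uncertainty like a 7-sided die
```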

Importance

  1. Benchmark task: measure our progress on understanding language
  2. Subcomponent of many NLP tasks, especially involving generating text or estimating the probability of text

Multi-layer RNNs

  1. RNNs are already deep in one dimension (over many timesteps)
  2. Stacking multiple RNNs lets the model learn higher-level features (sentence structure, sentiment polarity)
  3. Multi-layer RNNs are also called stacked RNNs

LSTM: Long Short-Term Memory RNNs

A solution to the vanishing gradient problem

  1. On each timestep t, there is a hidden state h(t) and a cell state c(t)

    • Both are vectors of length n

    • The cell stores long-term information

    • The LSTM can read, erase, and write information from the cell (like RAM)

  2. The selection of which information is erased/written/read is controlled by three corresponding gates

    • The gates are also vectors of length n
    • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
    • Gates are dynamic: value is computed based on the current context


Key: new information passing through is accumulated in the cell additively, rather than multiplied in.

summary:

  1. The LSTM makes it easier for the RNN to preserve information over many timesteps
  2. The LSTM doesn't guarantee that there is no vanishing/exploding gradient
  3. An LSTM only sees information from earlier timesteps, so a separate backward LSTM can be stacked on top (a bidirectional LSTM)

NMT: Neural Machine Translation (the seq2seq model)


Train

The encoder and decoder are optimized as a single system; backpropagation operates end-to-end.


Greedy Decoding

In decoding, the target sentence is generated by taking the argmax at each step.

Problem:

  1. The locally best choice at each step is not necessarily part of the best overall translation
  2. There is no way to undo decisions

Fix: Beam search decoding

On each step of the decoder, keep track of the k most probable partial translations (called hypotheses)

  • k is the beam size (in practice, around 5 to 10)

  • Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search

Example:

Beam size = k = 2

score(y_1,....,y_t)=\sum^t_{i=1}\log P_{LM}(y_i|y_1,....,y_{i-1},x)


For each of the k hypotheses, find the top k next words and calculate their scores (see the sketch below).
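A simplified beam-search sketch, assuming a hypothetical `log_prob_next(prefix)` that returns log P_LM(y | prefix, x) for every word in the vocabulary; it also applies the length normalization discussed further down:

```python
import numpy as np

def beam_search(log_prob_next, start_token, end_token, k=2, max_len=20):
    """Keep the k highest-scoring partial hypotheses at each step."""
    beams = [([start_token], 0.0)]                 # (hypothesis, cumulative log-prob score)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = log_prob_next(prefix)      # scores for every possible next word
            for w in np.argsort(log_probs)[-k:]:   # top-k next words for this hypothesis
                candidates.append((prefix + [w], score + log_probs[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == end_token:
                completed.append((prefix, score / len(prefix)))   # length-normalized score
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(completed, key=lambda c: c[1]) if completed else beams[0]
```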

Stop Criterion

In greedy decoding, stop when the model produces an <END> token.

In beam search decoding, different hypotheses may produce the <END> token on different timesteps.

  • When a hypothesis produces an <END> token, it is complete: set it aside and continue exploring the other hypotheses via beam search

Stop when:

  1. We reach timestep T (a pre-defined cutoff), or
  2. We have at least n completed hypotheses (a pre-defined cutoff)

Finishing up

Problem:

Each hypothesis y1, …, yt on the list has a score, but longer hypotheses have lower scores (they sum more negative log probabilities)

Fix:

Normalize by length

\frac{1}{t}\sum^t_{i=1}\log P_{LM}(y_i|y_1,....,y_{i-1},x)=score_h/length_h

Evaluate: BLEU (Bilingual Evaluation Understudy)

BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on (a toy sketch follows the list):

  1. n-gram precision (usually for 1-, 2-, 3- and 4-grams)

  2. Plus a penalty for too-short system translations
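A toy sketch of modified n-gram precision, the core of BLEU, for a single sentence pair; real BLEU clips counts against multiple references, combines several n-gram orders, and adds the brevity penalty:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference (with count clipping)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(ngram_precision(cand, ref, 2))   # bigram precision
```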

BLEU is useful but imperfect

  • There are many valid ways to translate a sentence, so a good translation can get a poor BLEU score simply because it has low n-gram overlap with the human translation

Difficulties

  1. OOV (out-of-vocabulary) words
  2. Domain mismatch between train and test data
  3. Maintaining context over long text
  4. Low-resource language pairs
  5. Failures to accurately capture sentence meaning
  6. Pronoun(or zero pronoun) resolution errors (Chinese)
  7. Morphological agreement errors (form of words and phrases)

Self-Attention and Transformers

Self-Attention

Problems with RNNs

  1. Linear interaction distance: O(sequence length)
  2. Lack of parallelizability

Fix: Word window

  • Give each embedding a word window over which to compute its context, removing the dependence on sequential timesteps
  • With a word window size of 5, the first layer covers a context of length 5 and the second layer covers a context of length 9
  • But when the sequence is very long, distant context is still lost


Improvement: Self-Attention

  • In self-attention, the queries qi, keys ki, values vi are drawn from the same source
  • Self-attention operation is as follows:

e_{ij}=q_i^Tk_j

\alpha_{ij}=\frac{exp(e_{ij})}{\sum_{j'}exp(e_{ij'})}

output_i = \sum_j\alpha_{ij}v_j
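A direct numpy sketch of the three equations above, operating on a matrix of token embeddings; the shapes and projection matrices are illustrative:

```python
import numpy as np

def self_attention(X, Q, K, V):
    """output_i = sum_j alpha_ij * v_j, with alpha = softmax over e_ij = q_i^T k_j."""
    queries, keys, values = X @ Q, X @ K, X @ V   # queries, keys, values all come from the same X
    e = queries @ keys.T                          # e_ij = q_i^T k_j, shape (T, T)
    e = e - e.max(axis=-1, keepdims=True)
    alpha = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)   # softmax over j
    return alpha @ values                         # weighted sum of values

T, d = 6, 8
X = np.random.randn(T, d)
Q, K, V = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Q, K, V).shape)           # (T, d)
```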


Advantages (compared with a full-length word window):

Dynamic connectivity

  1. The weights of a fully connected layer are learned iteratively during training; they encode which neurons should be attended to, but different sentences have different structures and need attention on different positions
  2. Self-attention scores are dot products of queries and keys, so they depend on the actual text

Interaction

  1. In a fully connected layer, the parts are connected independently: there is no interaction with the current query, nor with the other keys

Barriers and Solutions for Self-Attention as a building block

1. Doesn't have an inherent notion of order

⓵ Position representation vectors through sinusoids

Pros:

  • Periodicity indicates that maybe “absolute position” isn’t as important
  • Maybe it can extrapolate to longer sequences as the period restarts

Cons:

  • Not learnable

⓶ Position representation vectors learned from scratch: learn a matrix

Pros:

  • Flexibility: each position gets to be learned to fit the data

Cons:

  • Definitely can't extrapolate to indices outside 1, …, T

2. No nonlinearities for deep learning magic

Just apply the same feedforward network to each self-attention output

3. Need to ensure we don't “look at the future” when predicting

Mask the future in self-attention by setting the attention scores:

e_{ij} = \begin{cases} q_i^Tk_j, & j<i \\ -\infty, & j≥i \end{cases}
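A small sketch of applying this mask to a score matrix, assuming `e` has shape (T, T) with e[i, j] = q_i^T k_j:

```python
import numpy as np

def mask_future(e):
    """Keep e[i, j] only for j < i; set entries with j >= i to -inf so no position sees the future.
    (Many implementations also allow j == i, letting each position attend to itself.)"""
    T = e.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool))   # True where j >= i (upper triangle + diagonal)
    return np.where(mask, -np.inf, e)

e = np.random.randn(4, 4)                          # e[i, j] = q_i^T k_j, illustrative values
print(mask_future(e))                              # masked entries get zero weight after softmax
```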

Transformers (what’s more )

Key-Query-Value Attention

x_1, …, x_T are input vectors to the Transformer encoder, x_i ∈ R^d

  • k_i = Kx_i, K ∈ R^{d×d}
  • q_i = Qx_i, Q ∈ R^{d×d}
  • v_i = Vx_i, V ∈ R^{d×d}

Calculate:

Let X = [x_1; … ; x_T] ∈ R^{T×d}

output = softmax(XQ(XK)^T)\times XV


Multi-headed attention

look in multiple places in the sentence at once

Let \quad Q_l,K_l,V_l ∈ R^{d\times\frac{d}{h}}

h is the number of attention heads, and l ranges from 1 to h

output_l = softmax(XQ_l(XK_l)^T)\times XV_l, \quad output_l∈R^{d/h}

output = Y[output_1,..., output_h], \quad Y∈R^{d\times d}
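A sketch of multi-headed attention in the matrix form above, splitting d into h heads of size d/h; the shapes and the per-head projection lists are illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)

def multi_head_attention(X, Qs, Ks, Vs, Y):
    """output = Y [output_1, ..., output_h], each head attending with its own Q_l, K_l, V_l."""
    heads = []
    for Q_l, K_l, V_l in zip(Qs, Ks, Vs):
        scores = (X @ Q_l) @ (X @ K_l).T            # (T, T) attention scores for this head
        heads.append(softmax(scores) @ (X @ V_l))   # (T, d/h) per-head output
    return np.concatenate(heads, axis=-1) @ Y       # combine heads, project back to (T, d)

T, d, h = 6, 8, 2
X = np.random.randn(T, d)
Qs = [np.random.randn(d, d // h) for _ in range(h)]
Ks = [np.random.randn(d, d // h) for _ in range(h)]
Vs = [np.random.randn(d, d // h) for _ in range(h)]
Y = np.random.randn(d, d)
print(multi_head_attention(X, Qs, Ks, Vs, Y).shape)   # (T, d)
```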


This does not increase the actual amount of computation.

Training Tricks:

Residual Connections

Help models train better

  • Instead of X(i) = Layer(X(i-1)), where i indexes the layer,
  • let X(i) = X(i-1) + Layer(X(i-1))

Layer Normalization

Help models train faster


Scaled Dot Product

When the vector dimensionality is large, dot products easily become very large, which makes the softmax peaky and shrinks the gradients for the remaining entries.

output_l = softmax\left(\frac{XQ_l(XK_l)^T}{\sqrt{d/h}}\right)\times XV_l, \quad output_l∈R^{d/h}

Cross-attention

The keys and values are drawn from the encoder (like a memory)

The queries are drawn from the decoder

  • Let H = [h_1; … ; h_T] ∈ R^{T×d} (encoder outputs)
  • Let Z = [z_1; … ; z_T] ∈ R^{T×d} (decoder inputs)

k_i = Kh_i , v_i = Vh_i , q_i = Qz_i

output = softmax(ZQ(HK)^T)\times HV