Word Vectors

Word2vec: prediction function



L(θ): data likelihood, i.e. how well the model predicts the words in context

m: size of the context (prediction) window

w_t: the given center word



J(θ): objective function / loss function


Maximizing a product => minimizing a sum (take the negative log of the likelihood)

Softmax Function: Scale

  • max: amplifies the probability of larger x_i
  • soft: still assigns some probability to smaller x_i

softmax(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}
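A minimal NumPy sketch of the softmax above. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not in these notes; it leaves the result unchanged. The input values are made up for illustration.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, computed in a numerically stable way."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)  # does not change the result
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 5.0])   # illustrative scores
probs = softmax(scores)
print(probs)
```

The largest score gets most of the mass ("max"), while the smaller scores keep a non-zero probability ("soft").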

Prediction Function

Use two vectors per word w: this simplifies the math and the optimization, and is easy to build.

  • v_w: vector for w when it is the center word
  • u_w: vector for w when it is a context word

c : center word

o : context word

P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}

u_o^T v_c: similarity of o and c
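The prediction function can be sketched as a softmax over dot products. The matrices `U` and `V`, the toy vocabulary size and the random initialisation are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                      # toy sizes (assumptions)
U = rng.normal(size=(vocab_size, dim))      # row w is u_w (context vectors)
V = rng.normal(size=(vocab_size, dim))      # row w is v_w (center vectors)

def prob_context_given_center(c, U, V):
    """Return P(o | c) for every candidate context word o."""
    scores = U @ V[c]                       # u_w^T v_c for all w in the vocabulary
    scores = scores - scores.max()          # stabilise the exponentials
    exp = np.exp(scores)
    return exp / exp.sum()

p = prob_context_given_center(2, U, V)
print(p)
```

The word whose context vector has the largest dot product with v_c receives the highest probability.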


Gradient Descent

\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)



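Problems:

  1. Computation is very expensive, because J(θ) involves the whole corpus
  2. It is hard to escape saddle points / local optima

Solution: stochastic gradient descent (SGD)

  • Update on a small batch: in each batch, the "corpus" consists only of the words inside the sampled window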
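A toy sketch of the SGD update, assuming a stand-in objective J(θ) = mean((θ - x)^2) over some data rather than the word2vec loss. The data, learning rate and batch size are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)   # stand-in for the corpus

def grad(theta, batch):
    # Gradient of mean((theta - x)^2) with respect to theta, on this batch only.
    return 2.0 * np.mean(theta - batch)

theta, alpha, batch_size = 0.0, 0.05, 32
for step in range(500):
    batch = rng.choice(data, size=batch_size)         # sample a small batch
    theta = theta - alpha * grad(theta, batch)        # theta_new = theta_old - alpha * gradient

print(theta, data.mean())
```

Each update touches only 32 samples instead of all 10,000, yet θ ends up close to the full-data minimiser (the data mean), with some noise from the sampling.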

2. SGNS: Skip-gram with Negative Sampling

J(\theta) = \frac{1}{T}\sum_{t=1}^{T} J_t(\theta)

J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^{k} E_{j \sim P(w)}[\log \sigma(-u_j^T v_c)]

\sigma(x) = \frac{1}{1+e^{-x}}

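Use word probabilities to choose k negative samples

  • P(w): word probabilities

  • k: number of negative samples

For each positive example (the center word and one word from its context), sample several negative examples (the center word paired with other random words) and train a binary logistic regression.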
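A small sketch of J_t for one positive pair and k sampled negatives. The expectation is approximated by the k samples; the uniform `word_probs`, the toy sizes and the random vectors are assumptions for illustration, and a sampled negative may occasionally coincide with the true context word.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(c, o, negatives, U, V):
    """J_t for center word c, context word o and an array of negative word ids."""
    positive = np.log(sigmoid(U[o] @ V[c]))                       # log sigma(u_o^T v_c)
    negative = np.sum(np.log(sigmoid(-(U[negatives] @ V[c]))))    # sum of log sigma(-u_j^T v_c)
    return positive + negative

rng = np.random.default_rng(0)
vocab_size, dim, k = 8, 5, 3
U = rng.normal(scale=0.1, size=(vocab_size, dim))
V = rng.normal(scale=0.1, size=(vocab_size, dim))
word_probs = np.full(vocab_size, 1.0 / vocab_size)    # P(w); uniform only for simplicity
negatives = rng.choice(vocab_size, size=k, p=word_probs)
j_t = sgns_objective(c=1, o=2, negatives=negatives, U=U, V=V)
print(j_t)
```

With all-zero vectors every sigmoid equals 0.5, so J_t reduces to (k + 1) · log 0.5, which is a handy sanity check.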



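  • [x] Why is only one positive sample chosen per center word? Maybe this is a heavy simplification; check the code
    • All that can be said is that the code does indeed use only one positive example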

Language Modeling

Assigns probability to a piece of text:

P(x^{(1)}, ..., x^{(T)}) = P(x^{(1)}) \times P(x^{(2)} | x^{(1)}) \times ... \times P(x^{(T)} | x^{(T-1)}, ..., x^{(1)}) = \prod_{t=1}^{T} P(x^{(t)} | x^{(t-1)}, ..., x^{(1)})

N-gram Language Models

4-gram Language Model / 3rd-order Markov Model

P(w_4 | w_1 w_2 w_3) = \frac{count(w_1 w_2 w_3 w_4)}{count(w_1 w_2 w_3)}

Sparsity Problem:


  • smoothing: add a small count δ for every w ∈ V


  • backoff: fall back from the 4-gram to the 3-gram, and so on down to the bigram
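A minimal sketch of a count-based 4-gram estimate with add-δ smoothing. The tiny corpus, the value of `delta` and the helper names are made up for this example; backoff is not implemented here.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_prob(word, context, tokens, vocab, delta=0.1):
    """P(word | context) with add-delta smoothing; context is a tuple of n-1 words."""
    n = len(context) + 1
    numer = ngram_counts(tokens, n)[context + (word,)] + delta
    denom = ngram_counts(tokens, n - 1)[context] + delta * len(vocab)
    return numer / denom

tokens = "the cat sat on the mat the cat sat on the sofa".split()
vocab = sorted(set(tokens))
context = ("sat", "on", "the")
p_mat = smoothed_prob("mat", context, tokens, vocab)   # seen continuation
p_cat = smoothed_prob("cat", context, tokens, vocab)   # unseen continuation
print(p_mat, p_cat)
```

Without smoothing the unseen continuation would get probability 0; with δ it gets a small non-zero share, and the probabilities over the vocabulary still sum to 1 for this context.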

Storage Problems

Need to store count for all n-grams in the corpus

RNN: Recurrent Neural Networks



Advantages:

  1. Can process input of any length
  2. Computation for step t can (in theory) use information from many steps back
  3. Model size doesn't increase for longer input context
  4. The same weights are applied on every timestep, so there is symmetry in how inputs are processed

Disadvantages:

  1. Recurrent computation is slow
  2. In practice, it is difficult to access information from many steps back
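The properties listed above can be seen in a sketch of a vanilla RNN forward pass. The update h_t = tanh(W_h h_{t-1} + W_x x_t + b) is the standard textbook form and is assumed here (the notes do not spell it out); the weight names and sizes are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, h0):
    """Run a vanilla RNN over a sequence of input vectors; return all hidden states."""
    h, states = h0, []
    for x in xs:                                  # sequential loop: works for any length, but slow
        h = np.tanh(W_h @ h + W_x @ x + b)        # the same weights are reused at every step
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
W_x = rng.normal(scale=0.5, size=(d_h, d_in))
b = np.zeros(d_h)

short = rnn_forward(rng.normal(size=(2, d_in)), W_h, W_x, b, np.zeros(d_h))
long = rnn_forward(rng.normal(size=(10, d_in)), W_h, W_x, b, np.zeros(d_h))
print(short.shape, long.shape)
```

The parameter count is the same for the 2-step and the 10-step input; only the number of loop iterations grows, which is also why the computation cannot be parallelised over time.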



Loss = J(\theta) = \frac{1}{T}\sum_{t=1}^{T} J^{(t)}(\theta)

SGD can be used: for example, compute the loss J(θ) for a single sentence (or a batch of sentences), compute the gradients, and update the weights.


\frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left.\frac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}

The gradient with respect to a repeated weight is the sum of the gradients from each time it appears.

Vanishing and Exploding Gradients


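Vanishing gradient:

  • Caused by some derivatives in the chain rule being too small, which makes the overall gradient too small
  • As a result, the model cannot learn long-distance dependencies

Exploding gradient:

  • The gradient is too large, so the update overshoots the point of convergence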


  1. Give a start word/token to the RNN
  2. The RNN generates text by repeated sampling


The standard evaluation metric for language models is perplexity

  • lower is better

perplexity = \prod_{t=1}^{T} \left( \frac{1}{P_{LM}(x^{(t+1)} | x^{(t)}, ..., x^{(1)})} \right)^{1/T}

= \prod_{t=1}^{T} \left( \frac{1}{\hat{y}^{(t)}_{x_{t+1}}} \right)^{1/T} = \exp(J(\theta))


  • Intuition: a perplexity of 7 is the uncertainty of rolling a fair 7-sided die and having to get the right answer
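A short sketch of perplexity as the exponential of the average cross-entropy. The probability lists are invented inputs, standing for the probability the model assigned to each word that actually came next.

```python
import numpy as np

def perplexity(probs_of_true_words):
    """Perplexity from the model probabilities of the words that actually occurred."""
    p = np.asarray(probs_of_true_words, dtype=float)
    cross_entropy = -np.mean(np.log(p))     # J(theta)
    return np.exp(cross_entropy)

uniform_7 = perplexity([1 / 7] * 20)        # model that is uniform over 7 choices
mixed = perplexity([0.5, 0.25])             # equals (1/0.5 * 1/0.25) ** (1/2)
print(uniform_7, mixed)
```

A model that always spreads its probability uniformly over 7 options has perplexity 7, matching the die intuition.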


  1. Benchmark task: measure our progress on understanding language
  2. Subcomponent of many NLP tasks, especially involving generating text or estimating the probability of text

Multi-layer RNNs

  1. RNNs are already deep in one dimension (they unroll over many timesteps)
  2. Stacking multiple RNNs lets the model learn higher-level features (sentence structure, sentiment polarity)
  3. Multi-layer RNNs are also called stacked RNNs

LSTM: Long Short-Term Memory RNNs

A solution to the vanishing gradients problem

  1. On step t, there is a hidden state h(t) and a cell state c(t)

    • Both are vectors of length n

    • The cell stores long-term information

    • The LSTM can read, erase, and write information from the cell (like RAM)

  2. The selection of which information is erased/written/read is controlled by three corresponding gates

    • The gates are also vectors of length n
    • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
    • Gates are dynamic: their value is computed based on the current context

Key: new information is accumulated in the cell by addition, not by multiplication
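A sketch of one LSTM step, assuming the standard formulation with forget, input and output gates (the exact equations are not written out in these notes). The parameter dictionary, its key names and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM timestep; returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x])                          # current context
    f = sigmoid(params["W_f"] @ z + params["b_f"])           # forget gate: what to erase
    i = sigmoid(params["W_i"] @ z + params["b_i"])           # input gate: what to write
    o = sigmoid(params["W_o"] @ z + params["b_o"])           # output gate: what to read
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])     # candidate cell content
    c = f * c_prev + i * c_tilde                             # new content is ADDED to the cell
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n, d_in = 4, 3
params = {}
for gate in "fioc":
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(n, n + d_in))
    params[f"b_{gate}"] = np.zeros(n)

h, c = lstm_step(rng.normal(size=d_in), np.zeros(n), np.zeros(n), params)
print(h, c)
```

If the forget gate is close to 1 and the input gate close to 0, the cell state is carried over almost unchanged, which is how information can survive many timesteps.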


  1. LSTM makes it easier for the RNN to preserve information over many timesteps
  2. LSTM doesn’t guarantee that there is no vanishing/exploding gradient
  3. An LSTM can only learn from earlier information, so a separate backward LSTM can be added and combined with it

NMT: Neural Machine Translation (the seq2seq model)

The seq2seq model is optimized as a single system. Backpropagation operates “end-to-end”

Greedy Decoding

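When decoding, always generate the target sentence by taking the argmax on each step.

Problems:

  1. The locally best choice at each step does not necessarily lead to the target sentence
  2. There is no way to undo decisions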

Fix: Beam search decoding

On each step of the decoder, keep track of the k most probable partial translations (called hypotheses)

  • k is the beam size (in practice around 5 to 10)

  • Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search


Beam size = k = 2



For each of the k hypotheses, find top k next words and calculate scores.

Stop Criterion

In greedy decoding, stop when the model produces an <END> token

In beam search decoding, different hypotheses may produce the <END> token on different timesteps

  • When a hypothesis produces the <END> token, it is complete; place it aside and continue exploring other hypotheses via beam search

Stop when:

  1. We reach timestep T (a pre-defined cutoff), or
  2. We have at least n completed hypotheses (a pre-defined cutoff)

Finishing up


Each hypothesis y_1, …, y_t on the list has a score, but longer hypotheses have lower scores, because every extra token adds another negative log-probability


Fix: normalize the score by length
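A compact sketch of beam search with k = 2, the two stopping criteria, and length normalization at the end. `toy_model` is a hypothetical hand-written next-token distribution, not a real decoder, and the scores are summed log-probabilities.

```python
import math

END = "<END>"

def toy_model(prefix):
    """Hypothetical next-token log-probabilities, hand-written for this example."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, END: 0.45},
        ("b",): {END: 0.9, "x": 0.1},
    }
    probs = table.get(prefix, {END: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_log_probs, k=2, max_steps=10, n_completed=2):
    beams = [((), 0.0)]                              # (tokens, summed log-probability)
    completed = []
    for _ in range(max_steps):                       # cutoff 1: maximum number of timesteps
        candidates = [
            (tokens + (tok,), score + lp)
            for tokens, score in beams
            for tok, lp in next_log_probs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:         # keep the k best hypotheses
            if tokens[-1] == END:
                completed.append((tokens, score))    # finished: set it aside
            else:
                beams.append((tokens, score))
        if len(completed) >= n_completed or not beams:   # cutoff 2: enough completed
            break
    return completed or beams

hypotheses = beam_search(toy_model)
best_raw = max(hypotheses, key=lambda h: h[1])
best_normalized = max(hypotheses, key=lambda h: h[1] / len(h[0]))
print(best_raw[0], best_normalized[0])
```

In this toy setup the raw score prefers the shorter hypothesis, while the length-normalized score prefers the longer one, which illustrates why normalization is applied before picking the final translation.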


Evaluate: BLEU (Bilingual Evaluation Understudy)

BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:

  1. n-gram precision (usually for 1, 2, 3 and 4-grams)

  2. Plus a penalty for too-short system translations

BLEU is useful but imperfect

  • There are many valid ways to translate a sentence, so a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation
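A deliberately simplified, sentence-level sketch of the two ingredients (clipped n-gram precision and a brevity penalty) against a single reference. It is not the official BLEU implementation: real BLEU is computed at corpus level, supports multiple references, and usually goes up to 4-grams. The function name and example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped matches
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))  # penalise short output
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
exact = simple_bleu(reference, reference)
too_short = simple_bleu("the cat".split(), reference)
paraphrase = simple_bleu("a feline rested upon a rug".split(), reference)
print(exact, too_short, paraphrase)
```

The paraphrase scores 0 despite being a plausible translation, which is the imperfection noted above.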


  1. Out-of-vocabulary (OOV) words
  2. Domain mismatch between train and test data
  3. Maintaining context over long text
  4. Low-resource language pairs
  5. Failures to accurately capture sentence meaning
  6. Pronoun (or zero pronoun) resolution errors (e.g. Chinese)
  7. Morphological agreement errors (form of words and phrases)

Self-Attention and Transformers



  1. Linear interaction distance: O(sequence length)
  2. Lack of parallelizability

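Fix: word window

  • Give each embedding a word window to compute its context, which removes the dependence on timesteps
  • With a word window size of 5, the first layer covers a context of length 5 and the second layer covers a context of length 9
  • But when the sequence is very long, distant context is still lost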


  • In self-attention, the queries qi, keys ki, values vi are drawn from the same source
  • Self-attention operation is as follows:



output_i = \sum_j \alpha_{ij} v_j
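A minimal sketch of the self-attention operation, assuming the usual intermediate steps: scores e_ij = q_i^T k_j, weights α_ij = softmax of e_ij over j, and output_i as the weighted sum of values. Here queries, keys and values are simply the same random matrix.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(q, k, v):
    e = q @ k.T            # e[i, j] = q_i^T k_j
    alpha = softmax(e)     # alpha[i, j]; each row sums to 1
    return alpha @ v, alpha

rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.normal(size=(T, d))
output, alpha = self_attention(x, x, x)   # queries, keys, values from the same source
print(output.shape, alpha.sum(axis=1))
```

Every position attends to every other position in a single step, so the interaction distance is O(1) and all rows can be computed in parallel.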


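Advantages over a full-length word window (i.e. a fully connected layer):

dynamic connectivity

  1. The weights of a fully connected layer are learned during training, so what is learned is a fixed pattern of which neurons to attend to; but different sentences have different structures and need attention in different places
  2. Self-attention weights are dot products of queries and keys, so they depend on the actual text


  1. In a fully connected layer, the parts are connected independently of one another, with no interaction with the current query or with the other keys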

Barriers and Solutions for Self-Attention as a building block

1. Doesn't have an inherent notion of order

⓵ Position representation vectors through sinusoids


  • Periodicity indicates that maybe “absolute position” isn't as important
  • May be able to extrapolate to longer sequences as the periods restart


  • Not learnable
⓶ Position representation vectors learned from scratch: learn a matrix


  • Flexibility: each position gets to be learned to fit the data


  • Definitely can't extrapolate to indices outside 1, …, T.
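A sketch of both options. The sinusoidal formula is the one from the original Transformer paper (sin on even dimensions, cos on odd ones, base 10000), which is an assumption since the notes do not give it; the learned variant is shown only as a randomly initialised matrix that training would update. `d` is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Fixed position vectors of shape (T, d); d is assumed to be even."""
    positions = np.arange(T)[:, None]                    # (T, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))    # (d/2,)
    angles = positions * freqs                           # (T, d/2)
    p = np.zeros((T, d))
    p[:, 0::2] = np.sin(angles)                          # even dimensions
    p[:, 1::2] = np.cos(angles)                          # odd dimensions
    return p

T, d = 6, 8
p_fixed = sinusoidal_positions(T, d)                     # not learnable, defined for any T
rng = np.random.default_rng(0)
p_learned = rng.normal(scale=0.02, size=(T, d))          # would be trained; only T rows exist

x = rng.normal(size=(T, d))
x_with_position = x + p_fixed                            # position added to each input vector
print(x_with_position.shape)
```

`sinusoidal_positions` can be called with any larger T, whereas `p_learned` simply has no row for an index beyond the T it was created with.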

2. No nonlinearities for deep learning magic

Just apply the same feedforward network to each self-attention output

3. Need to ensure we don't “look at the future” when predicting

Mask the future in self-attention by setting the attention scores of future positions to -\infty:

e_{ij} = \begin{cases} q_i^T k_j, & j \le i \\ -\infty, & j > i \end{cases}
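A sketch of masked self-attention, under the assumption that position i may attend to itself and to earlier positions (j ≤ i) while all later positions are set to -∞ before the softmax. The inputs are random and only illustrative.

```python
import numpy as np

def causal_self_attention(q, k, v):
    T = q.shape[0]
    e = q @ k.T                                              # e[i, j] = q_i^T k_j
    future = np.triu(np.ones((T, T), dtype=bool), k=1)       # True where j > i
    e = np.where(future, -np.inf, e)                         # mask out the future
    e = e - e.max(axis=-1, keepdims=True)                    # row max is finite (diagonal is kept)
    w = np.exp(e)                                            # exp(-inf) = 0
    alpha = w / w.sum(axis=-1, keepdims=True)
    return alpha @ v, alpha

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
output, alpha = causal_self_attention(x, x, x)
print(np.round(alpha, 3))
```

The attention matrix is lower triangular: the first position can only attend to itself, and no row puts weight on a later position.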

Transformers (what's more)

Key-Query-Value Attention

x_1, …, x_T are input vectors to the Transformer encoder, x_i ∈ R^d

  • k_i = K x_i, K ∈ R^{d×d}
  • q_i = Q x_i, Q ∈ R^{d×d}
  • v_i = V x_i, V ∈ R^{d×d}


Let X = [x_1; … ; x_T] ∈ R^{T×d}

output = softmax(XQ(XK)^T) \times XV

Multi-headed attention

Look in multiple places in the sentence at once

Let Q_l, K_l, V_l \in R^{d \times \frac{d}{h}}

h is the number of attention heads, and l ranges from 1 to h

output_l = softmax(XQ_l(XK_l)^T) \times XV_l, \quad output_l \in R^{d/h}

output = Y[output_1, ..., output_h], \quad Y \in R^{d \times d}


Training Tricks:

Residual Connections

Help models train better

  • Instead of X^{(i)} = Layer(X^{(i-1)}), where i indexes the layer
  • Let X^{(i)} = X^{(i-1)} + Layer(X^{(i-1)})

Layer Normalization

Help models train faster
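A sketch of both tricks. The layer normalization shown is the commonly used form (normalize each vector to zero mean and unit variance, then apply a gain and bias); that exact formula is an assumption, as the notes only include it as an image. The function names and `eps` value are illustrative.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each row of x over its features, then rescale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def residual(layer, x):
    """X^(i) = X^(i-1) + Layer(X^(i-1))."""
    return x + layer(x)

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(loc=3.0, scale=2.0, size=(T, d))
W = rng.normal(scale=0.1, size=(d, d))

hidden = residual(lambda t: np.tanh(t @ W), x)           # the layer only learns a correction to x
normed = layer_norm(hidden, gain=np.ones(d), bias=np.zeros(d))
print(normed.mean(axis=-1), normed.std(axis=-1))
```

After normalization each position's vector has mean close to 0 and standard deviation close to 1, regardless of the scale of the incoming activations.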


Scaled Dot Product


output_l = softmax(\frac{XQ_l(XK_l)^T}{\sqrt{d/h}}) \times XV_l, \quad output_l \in R^{d/h}
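A sketch of multi-headed attention with the scaled dot product. Rows of `X` are the input vectors, so the output projection `Y` is applied by right-multiplication here; the random weights, the sizes and the function name are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Qs, Ks, Vs, Y):
    """X: (T, d); Qs, Ks, Vs: lists of h matrices of shape (d, d/h); Y: (d, d)."""
    d_head = Qs[0].shape[1]                                   # d / h
    heads = []
    for Q, K, V in zip(Qs, Ks, Vs):
        scores = (X @ Q) @ (X @ K).T / np.sqrt(d_head)        # scaled dot product, (T, T)
        heads.append(softmax(scores) @ (X @ V))               # output_l, (T, d/h)
    return np.concatenate(heads, axis=-1) @ Y                 # combine the heads, (T, d)

rng = np.random.default_rng(0)
T, d, h = 5, 8, 2
X = rng.normal(size=(T, d))
Qs = [rng.normal(size=(d, d // h)) for _ in range(h)]
Ks = [rng.normal(size=(d, d // h)) for _ in range(h)]
Vs = [rng.normal(size=(d, d // h)) for _ in range(h)]
Y = rng.normal(size=(d, d))
out = multi_head_attention(X, Qs, Ks, Vs, Y)
print(out.shape)
```

Dividing by sqrt(d/h) keeps the scores from growing with the head dimension, which would otherwise push the softmax towards a one-hot distribution with very small gradients.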


The keys and values are drawn from the encoder (like a memory)

The queries are drawn from the decoder

  • Let H = [h_1; … ; h_T] ∈ R^{T×d} be the encoder output vectors
  • Let Z = [z_1; … ; z_T] ∈ R^{T×d} be the decoder input vectors

k_i = K h_i, v_i = V h_i, q_i = Q z_i

output = softmax(ZQ(HK)^T) \times HV
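A sketch of this encoder-decoder (cross) attention. The notes use the same length T on both sides; the example deliberately uses different encoder and decoder lengths to show which side determines which dimension. Weights and sizes are random and illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(Z, H, Q, K, V):
    """Z: decoder vectors (T_dec, d); H: encoder vectors (T_enc, d)."""
    scores = (Z @ Q) @ (H @ K).T          # queries from the decoder, keys from the encoder
    alpha = softmax(scores)               # (T_dec, T_enc)
    return alpha @ (H @ V), alpha         # values also come from the encoder

rng = np.random.default_rng(0)
T_enc, T_dec, d = 6, 3, 4
H = rng.normal(size=(T_enc, d))
Z = rng.normal(size=(T_dec, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
output, alpha = cross_attention(Z, H, Q, K, V)
print(output.shape, alpha.shape)
```

There is one output vector per decoder position, and each one is a weighted mix of encoder values.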