ELMo: Model Details and Usage

Model details

The paper describes two training stages:

  1. Generate character-level embeddings and use them to build context-independent word embeddings.
  2. Use a bi-LSTM language model to produce context-dependent word embeddings.

Below we walk through the models behind these two training stages.

Char CNN embedding

Reference paper: Character-Aware Neural Language Models

[Figure: char-CNN architecture]

The CNN structure in the figure above produces a 2048-dimensional word vector for each word.
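As a quick check on where the 2048 comes from, the sketch below sums the per-width filter counts. The filter list is the one I believe is used in the released ELMo options file, so treat the exact numbers as an assumption:

import_free_sketch = True  # plain Python, no dependencies

# Assumed char-CNN filter configuration: [filter width, number of filters] pairs.
filters = [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]]

# Each width contributes `num_filters` channels after max-pooling over character
# positions; concatenating all of them gives the word vector.
total_dim = sum(n for _, n in filters)
print(total_dim)  # 2048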

On top of this CNN structure, a couple of additional tricks are applied, namely a highway network and a projection layer.

  • Highway network

    Suppose the convolution layers produce the vector $y^k$ for word $k$. The highway network computes $z^k = t \odot g(W_H y^k + b_H) + (1 - t) \odot y^k$, where $g$ is a non-linear activation (usually ReLU) and the transform gate $t = \sigma(W_T y^k + b_T)$ uses the sigmoid function (see the sketch after this list).

  • Projection layer

    The projection layer simply maps the word vectors above to the dimension required by the biLSTM language model (in the released ELMo model, from 2048 down to 512).
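The following is a minimal NumPy sketch of these two tricks, not the bilm-tf implementation; the weight names (W_H, W_T, W_P) and the toy initialization are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_H, b_H, W_T, b_T):
    """One highway layer: z = t * g(W_H y + b_H) + (1 - t) * y."""
    g = np.maximum(0.0, W_H @ y + b_H)   # ReLU non-linearity
    t = sigmoid(W_T @ y + b_T)           # transform gate
    return t * g + (1.0 - t) * y

# Toy run with the dimensions used by the released ELMo model (2048 -> 512).
rng = np.random.default_rng(0)
d_cnn, d_proj = 2048, 512
y_k = rng.normal(size=d_cnn)                           # char-CNN output for word k
W_H, b_H = rng.normal(size=(d_cnn, d_cnn)) * 0.01, np.zeros(d_cnn)
W_T, b_T = rng.normal(size=(d_cnn, d_cnn)) * 0.01, np.zeros(d_cnn)
z_k = highway(y_k, W_H, b_H, W_T, b_T)

# Projection layer: a plain linear map down to the LM input size.
W_P = rng.normal(size=(d_proj, d_cnn)) * 0.01
x_k = W_P @ z_k
print(x_k.shape)  # (512,)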

biLSTM language model

ELMo trains the language model with a two-layer biLSTM, then linearly combines the word vectors from the different LSTM layers to obtain the final word embedding vectors. The linear combination is

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}^{LM} \tag{1}$$

where $\mathbf{h}_{k,j}^{LM}$ is the representation of token $k$ at biLM layer $j$, the $s_j^{task}$ are softmax-normalized layer weights, and $\gamma^{task}$ is a scalar that scales the whole vector.

Equation (1) is illustrated in the figure below:

[Figure: weighted combination of biLM layer outputs]

The language model itself is just a simple two-layer biLSTM; the remaining question is how the parameters $s_j^{task}$ and $\gamma^{task}$ in Equation (1) are learned.
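Before looking at the real code, here is a minimal NumPy sketch of Equation (1); the layer count and dimension are illustrative assumptions, not values read from a trained model:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

L = 2                                # two biLSTM layers + the token layer => L + 1 layers
n_layers, dim = L + 1, 1024
h = np.random.randn(n_layers, dim)   # h[j] = biLM layer j representation of token k

s_raw = np.zeros(n_layers)           # trainable scalars, one per layer
s = softmax(s_raw)                   # s_j^{task}: softmax-normalized weights
gamma = 1.0                          # gamma^{task}: trainable scalar

elmo_k = gamma * np.sum(s[:, None] * h, axis=0)   # Equation (1)
print(elmo_k.shape)  # (1024,)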

Reading the code (the weight_layers function) shows how these parameters are learned:

import tensorflow as tf


def weight_layers(name, bilm_ops, l2_coef=None,
                  use_top_only=False, do_layer_norm=False):
    '''
    Weight the layers of a biLM with trainable scalar weights to
    compute ELMo representations.

    For each output layer, this returns two ops. The first computes
    a layer specific weighted average of the biLM layers, and
    the second the l2 regularizer loss term.
    The regularization terms are also added to tf.GraphKeys.REGULARIZATION_LOSSES.

    Input:
        name = a string prefix used for the trainable variable names
            (not important; any string works)
        bilm_ops = This is the return value from BidirectionalLanguageModel(...)
        l2_coef: the l2 regularization coefficient
        use_top_only: if True, then only use the top layer.
        do_layer_norm: if True, then apply layer normalization to each biLM
            layer before normalizing

    Output:
        {
            'weighted_op': op to compute weighted average for output,
            'regularization_op': op to compute regularization term
        }
    '''
    def _l2_regularizer(weights):
        if l2_coef is not None:
            return l2_coef * tf.reduce_sum(tf.square(weights))
        else:
            return 0.0

    # Get ops for computing LM embeddings and mask
    lm_embeddings = bilm_ops['lm_embeddings']
    mask = bilm_ops['mask']

    n_lm_layers = int(lm_embeddings.get_shape()[1])  # number of biLSTM layers
    lm_dim = int(lm_embeddings.get_shape()[3])       # word vector dimension

    with tf.control_dependencies([lm_embeddings, mask]):
        # Cast the mask and broadcast for layer use.
        mask_float = tf.cast(mask, 'float32')
        broadcast_mask = tf.expand_dims(mask_float, axis=-1)

        def _do_ln(x):
            # do layer normalization excluding the mask
            x_masked = x * broadcast_mask
            N = tf.reduce_sum(mask_float) * lm_dim
            mean = tf.reduce_sum(x_masked) / N
            variance = tf.reduce_sum(((x_masked - mean) * broadcast_mask)**2
                                     ) / N
            return tf.nn.batch_normalization(
                x, mean, variance, None, None, 1E-12
            )

        if use_top_only:
            layers = tf.split(lm_embeddings, n_lm_layers, axis=1)
            # just the top layer
            sum_pieces = tf.squeeze(layers[-1], squeeze_dims=1)
            # no regularization
            reg = 0.0
        else:
            # W holds the trainable scalars behind s_j in Equation (1)
            W = tf.get_variable(
                '{}_ELMo_W'.format(name),
                shape=(n_lm_layers, ),
                initializer=tf.zeros_initializer,
                regularizer=_l2_regularizer,
                trainable=True,
            )

            # normalize the weights with a softmax
            normed_weights = tf.split(
                tf.nn.softmax(W + 1.0 / n_lm_layers), n_lm_layers
            )
            # split LM layers
            layers = tf.split(lm_embeddings, n_lm_layers, axis=1)

            # compute the weighted, normalized LM activations
            pieces = []
            for w, t in zip(normed_weights, layers):
                if do_layer_norm:
                    pieces.append(w * _do_ln(tf.squeeze(t, squeeze_dims=1)))
                else:
                    pieces.append(w * tf.squeeze(t, squeeze_dims=1))
            sum_pieces = tf.add_n(pieces)

            # get the regularizer
            reg = [
                r for r in tf.get_collection(
                    tf.GraphKeys.REGULARIZATION_LOSSES)
                if r.name.find('{}_ELMo_W/'.format(name)) >= 0
            ]
            if len(reg) != 1:
                raise ValueError

        # scale the weighted sum by gamma (the gamma^{task} in Equation (1))
        gamma = tf.get_variable(
            '{}_ELMo_gamma'.format(name),
            shape=(1, ),
            initializer=tf.ones_initializer,
            regularizer=None,
            trainable=True,
        )
        weighted_lm_layers = sum_pieces * gamma

        ret = {'weighted_op': weighted_lm_layers, 'regularization_op': reg}

    return ret
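For context, this is roughly how weight_layers is wired together with BidirectionalLanguageModel in the usage_*.py scripts; the file paths are placeholders, and the exact call pattern should be checked against usage_character.py in the repo:

import tensorflow as tf
from bilm import Batcher, BidirectionalLanguageModel, weight_layers

# Placeholder paths -- substitute your own files.
options_file = 'elmo_options.json'
weight_file = 'elmo_weights.hdf5'
vocab_file = 'vocab.txt'

# Character-id input for a batch of tokenized sentences (max 50 chars per token).
batcher = Batcher(vocab_file, 50)
context_character_ids = tf.placeholder('int32', shape=(None, None, 50))

# The biLM graph: returns ops including 'lm_embeddings' and 'mask'.
bilm = BidirectionalLanguageModel(options_file, weight_file)
context_embeddings_op = bilm(context_character_ids)

# Weight the biLM layers as in Equation (1); 'input' is just a variable-name prefix.
elmo_context = weight_layers('input', context_embeddings_op, l2_coef=0.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tokenized = [['Pretrained', 'biLMs', 'compute', 'representations'],
                 ['They', 'are', 'contextual']]
    char_ids = batcher.batch_sentences(tokenized)
    elmo_vectors = sess.run(elmo_context['weighted_op'],
                            feed_dict={context_character_ids: char_ids})
    print(elmo_vectors.shape)  # (batch, time, lm_dim)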

Three ways to use ELMo

  1. Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.
  2. Precompute and cache the context-independent token representations, then compute context-dependent representations using the biLSTMs for input data. This method is less computationally expensive than #1, but is only applicable with a fixed, prescribed vocabulary.
  3. Precompute the representations for your entire dataset and save them to a file (a sketch of this mode follows the list).
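For the third mode, a sketch along the lines of usage_cached.py might look like the following; the helper name dump_bilm_embeddings, its argument order, and the HDF5 key layout are assumptions to verify against the repo:

import h5py
from bilm import dump_bilm_embeddings

# Placeholder paths -- substitute your own files.
vocab_file = 'vocab.txt'
dataset_file = 'dataset.txt'        # one whitespace-tokenized sentence per line
options_file = 'elmo_options.json'
weight_file = 'elmo_weights.hdf5'
embedding_file = 'elmo_embeddings.hdf5'

# Run the biLM once over the whole dataset and cache all layer activations.
dump_bilm_embeddings(vocab_file, dataset_file, options_file, weight_file,
                     embedding_file)

# Later, read the cached representations instead of re-running the model.
with h5py.File(embedding_file, 'r') as fin:
    sentence_0 = fin['0'][...]      # assumed key layout: one dataset per sentence
    print(sentence_0.shape)         # (n_layers, sentence_length, lm_dim)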

The official TensorFlow implementation of ELMo has the following structure:

├── bilm                    // ELMo is implemented under this folder.
│   ├── __init__.py
│   ├── data.py             // Data loading & batch generation.
│   ├── elmo.py             // ``weight_layers`` for step (3).
│   ├── model.py            // ``BidirectionalLanguageModel`` for step (3).
│   └── training.py         // Model definition (steps (1) and (2)).
├── bin                     // CLI scripts & config.
│   ├── dump_weights.py     // Dump weight file for AllenNLP.
│   ├── restart.py          // For step (2).
│   ├── run_test.py         // Check the perplexity on the heldout dataset.
│   └── train_elmo.py       // For step (1).
...
├── usage_cached.py         // usage_*.py are for step (3).
├── usage_character.py
└── usage_token.py

For more details on how ELMo is used, refer to the three usage_*.py scripts listed above.

Usage example

GitHub: elmo-trainging-tutorial

References