Recurrent Neural Network based Language Modeling in Meeting Recognition (foreign-language translation material)


Recurrent Neural Network based Language Modeling in Meeting Recognition

We use recurrent neural network (RNN) based language models to improve the BUT English meeting recognizer. On the baseline setup using the original language models, we decrease word error rate (WER) by more than 1% absolute through n-best list rescoring and language model adaptation. When n-gram language models are trained on the same moderately sized data set as the RNN models, the improvements are higher, yielding a system that performs comparably to the baseline. A noticeable improvement was observed with unsupervised adaptation of RNN models. Furthermore, we examine the influence of word history on WER and show how to speed up rescoring by caching common prefix strings.

Index Terms: automatic speech recognition, language modeling, recurrent neural networks, rescoring, adaptation

1. Introduction

Neural network (NN) based language models have been continuously reported to perform well amongst other language modeling techniques. The best results on some smaller tasks were obtained by using recurrent NN-based language models. In RNNs, the feedback between the hidden and input layer allows the hidden neurons to remember the history of previously processed words.

Neural networks in language modeling offer several advantages. In contrast to commonly used n-gram language models, smoothing is applied in an implicit way, and due to the projection of the entire vocabulary into a small hidden layer, semantically similar words get clustered. This explains why n-gram counts of data sampled from the distribution defined by NN-based models could lead to better estimates for n-grams that may never have been seen during training: words get substituted by other words which the NN learned to be related. While no such relation could be learned by a standard n-gram model from the original sparse training data, we have already shown how some of the improvements gained by RNN language models can be incorporated into systems using just standard n-gram language models: by generating a large amount of additional training data from the RNN distribution.

The purpose of this paper is to show to what extent the current RNN language model is suitable for mass application in common LVCSR systems. We will show that the promising results of previously conducted experiments on smaller setups generalize to our state-of-the-art meeting recognizer and can in fact be applied in any other ASR system without too much effort. While RNN models effectively complement standard n-grams, they can also be used efficiently, even in systems where speed or memory consumption is an issue.

In the following, we briefly introduce the utilized class-based RNN architecture for language modeling. A system description and details about the language models used follow. Finally, we present our experiments in detail and conclude with a summary of our findings.
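The data-generation idea mentioned above can be illustrated with a short sketch. This is not code from the paper; it assumes a hypothetical next_word_distribution(history) function standing in for the trained RNN, and simply samples text that is later counted like ordinary n-gram training data.

```python
import numpy as np

def sample_corpus(next_word_distribution, vocab, n_sentences, max_len=100, seed=0):
    """Sample sentences from an RNN LM distribution as extra n-gram training data.

    next_word_distribution(history) is a hypothetical interface that returns a
    probability vector over `vocab` for the next word; it is not part of the paper.
    """
    rng = np.random.default_rng(seed)
    corpus = []
    for _ in range(n_sentences):
        history, sentence = ["<s>"], []
        for _ in range(max_len):
            p = next_word_distribution(history)
            w = vocab[rng.choice(len(vocab), p=p)]
            if w == "</s>":                      # sentence-end symbol terminates sampling
                break
            sentence.append(w)
            history.append(w)
        corpus.append(" ".join(sentence))
    return corpus                                # counted later like ordinary n-gram data
```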

Figure 1: Architecture of the class-based recurrent NN.

2. Class-based RNN language model

The RNN language model operates as a predictive model for the next word given the previous ones. As in n-gram models, the joint probability of a given sequence of words is factorized into the product of probability estimates of all words w_i conditioned on their history h_i = w_1 w_2 ... w_{i-1}:

P(w_1 w_2 \ldots w_N) = \prod_{i=1}^{N} P(w_i \mid h_i)

The utilized RNN architecture is shown in Figure 1. The previous word w_{i-1} is fed to the input of the net using 1-of-N encoding, together with the information encoded in the state vector s_{i-1} from processing the previous words. By propagating the input layer we obtain the updated state vector s_i, so that we can write:

P(w_i \mid h_i) \approx P(w_i \mid w_{i-1}, s_{i-1})
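As an illustration of this forward step, a minimal sketch follows. It is not the authors' implementation: the weight matrices U, W, V and the sigmoid/softmax choices follow the common simple-RNN formulation rather than anything stated explicitly here.

```python
import numpy as np

def rnn_step(w_prev_index, s_prev, U, W, V):
    """One step of a simple recurrent LM in the spirit of Figure 1.

    w_prev_index : index of the previous word (1-of-N encoding is implicit, so
                   the input matrix-vector product reduces to selecting a column).
    s_prev       : previous state vector s_{i-1}.
    Returns the new state s_i and the softmax distribution over the vocabulary.
    """
    # hidden layer: sigmoid over input word plus recurrent state
    s_i = 1.0 / (1.0 + np.exp(-(U[:, w_prev_index] + W @ s_prev)))
    # output layer: softmax over the full vocabulary
    a = V @ s_i
    y = np.exp(a - a.max())
    return s_i, y / y.sum()
```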

Usually, the posterior probability of the predicted word is estimated by using a softmax activation function on the output layer, which has the size of the vocabulary. The posterior probability for any given w_i can be read immediately from the corresponding output. Although often just the posterior probability of a particular w_i is required, the entire distribution has to be computed because of the softmax. By assuming that words can be mapped to classes surjectively, we can add a part for estimating the posterior probability of classes to the output layer, and hence estimate the probability of the predicted word as the product of two independent probability distributions, one over classes and the other over words within a class:

P(w_i \mid h_i) = P(c_i \mid h_i) \cdot P(w_i \mid c_i, h_i)

This leads to a speed-up in both training and testing, because only the distribution over classes and then the distribution over the words belonging to the class c_i of the predicted word have to be computed.
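A sketch of this factorized output computation follows. It is illustrative only; the word-to-class mapping and the per-class weight blocks carry hypothetical names not taken from the paper.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def word_probability(s_i, w_i, word2class, class_words, V_class, V_words):
    """P(w_i | history) = P(c_i | history) * P(w_i | c_i, history).

    Only the class distribution and the distribution over the words of class c_i
    are computed, instead of a softmax over the full vocabulary.
    V_class    : weights from the state vector to the class outputs.
    V_words[c] : weights from the state vector to the outputs of the words in class c.
    """
    c_i = word2class[w_i]
    p_class = softmax(V_class @ s_i)               # distribution over classes
    p_word_in_class = softmax(V_words[c_i] @ s_i)  # distribution within class c_i
    pos = class_words[c_i].index(w_i)              # position of w_i inside its class
    return p_class[c_i] * p_word_in_class[pos]
```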

3. Setup

3.1 System description

Our state-of-the-art baseline speech recognition system uses acoustic and language models from the AMIDA Rich Transcription 2009 system. Standard speaker adaptation techniques (VTLN and per-speaker CMLLR), fMPE+MPE trained acoustic models, and NN-bottleneck features with CVN/CMN and HLDA are used. The output of two complementary branches (one based on PLP and the other based on posterior features) served for cross-adapting the system. In both branches, lattices are generated using a 2-gram language model and subsequently expanded up to 4-gram order. The estimated adaptation transformations are used in a lattice rescoring stage, whose lattices finally serve as input to the RNN rescoring performed later in the experiments.
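The RNN rescoring of n-best lists extracted from these lattices can be sketched as follows. The scoring interfaces are hypothetical, and the interpolation weight, LM scale, and word insertion penalty are placeholders rather than the values tuned in the experiments.

```python
def rescore_nbest(nbest, ngram_logprob, rnn_logprob,
                  lm_weight=0.5, lm_scale=12.0, wip=0.0):
    """Rerank an n-best list with a log-linear combination of two LMs.

    nbest         : list of (acoustic_score, words) pairs taken from the lattices.
    ngram_logprob,
    rnn_logprob   : hypothetical functions returning the log probability of a
                    word sequence under each language model.
    """
    def total(acoustic, words):
        lm = lm_weight * rnn_logprob(words) + (1.0 - lm_weight) * ngram_logprob(words)
        return acoustic + lm_scale * lm + wip * len(words)

    # keep the hypothesis with the best combined score
    return max(nbest, key=lambda h: total(h[0], h[1]))
```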


Corpus        Words    RT09   RT11   RNN
Web data      931M      ✓
Hub4          152M      ✓
—              33M
Fisher 1/2     21M      ✓      ✓      ✓
