This is the first of the two papers the Salesforce Einstein lab published last week. Both of them require some understanding of MT, NNMT and purely attention-based NNMT. Since this first one is not too difficult to understand, I will just give you some background on NNMT first.
When NNMT was first conceived, the original form started with an Encoder that converts the source text into what is usually known as a "thought vector". The thought vector is then decoded by the Decoder. In the original setting, both the Encoder and the Decoder are usually LSTMs.
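Here is a minimal sketch of that classic setup in PyTorch, just to make the Encoder/Decoder/thought-vector picture concrete. The sizes are toy values I made up; this is not code from either paper.

```python
# Minimal encoder-decoder sketch: an LSTM encoder compresses the source
# sentence into a "thought vector" (its final hidden state), which an
# LSTM decoder then unrolls into the target sentence.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
project = nn.Linear(hidden_dim, vocab_size)

src = torch.randint(0, vocab_size, (2, 7))   # (batch, src_len) token ids
tgt = torch.randint(0, vocab_size, (2, 5))   # (batch, tgt_len) token ids

# Encode: the final (h, c) state is the "thought vector".
_, thought_vector = encoder(embed(src))

# Decode: condition the decoder on the thought vector (teacher forcing).
dec_out, _ = decoder(embed(tgt), thought_vector)
logits = project(dec_out)                    # (batch, tgt_len, vocab_size)
print(logits.shape)
```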
Then there is the idea of attention. You can think of it as an extra layer on the decoder side: instead of relying only on the single thought vector, the decoder gets to decide, at each step, how much attention it wants to pay to each of the encoder's states.
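A toy sketch of the idea, in the style of simple dot-product (Luong-style) attention, assuming the dimensions from the snippet above. Again these are illustrative values, not anything from the papers.

```python
# At each decoder step: score every encoder state against the current
# decoder state, softmax the scores into attention weights, and take a
# weighted average (the "context") of the encoder states.
import torch
import torch.nn.functional as F

batch, src_len, hidden_dim = 2, 7, 128
enc_states = torch.randn(batch, src_len, hidden_dim)  # all encoder outputs
dec_state = torch.randn(batch, hidden_dim)            # current decoder state

scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)   # (batch, src_len)
weights = F.softmax(scores, dim=-1)            # how much attention per source word
context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # (batch, hidden_dim)
print(weights.shape, context.shape)
```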
Now of course, people have since played with various architectures for this Enc-Dec structure. The first thing to notice is that such a structure usually contains a giant LSTM or CNN, and no one really likes them: LSTMs are hard to parallelize and CNNs can consume a lot of memory.
That makes Google's work from the middle of this year, "Attention Is All You Need", a stunning and useful result. What the authors propose is to build the whole system out of the idea of attention alone; they call it the Transformer. There are multiple tricks to get it to work, but perhaps the most important one is "multi-head attention". In a way this is like the concept of channels in a ConvNet: instead of doing one single attention, we now attend to multiple places at once, and each head learns to attend differently.
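Here is a rough sketch of what multi-head self-attention does: project queries, keys and values into several smaller subspaces ("heads"), attend in each one separately, then concatenate the head outputs. Sizes are toy values, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_heads = 128, 8
head_dim = d_model // num_heads

q_proj = nn.Linear(d_model, d_model)
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)
out_proj = nn.Linear(d_model, d_model)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model) self-attention input

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    return t.view(t.size(0), t.size(1), num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # each head attends on its own
attn = F.softmax(scores, dim=-1)
heads = attn @ v                                      # (batch, heads, seq, head_dim)

# Standard Transformer: concatenate the heads, then mix with one linear layer.
concat = heads.transpose(1, 2).reshape(x.size(0), x.size(1), d_model)
output = out_proj(concat)
print(output.shape)
```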
Naturally the method is fast because you can parallelize it, but Google's researchers also found it to be better in BLEU score. That's why the top houses are switching to purely attention-based methods these days.
Now I can finally talk about what the Salesforce paper is about. In Google's original paper, the representations learned by the attention heads are simply concatenated with each other to form one "supervector". The authors of this paper instead learn an extra set of weights to combine the heads. This further improves performance on WMT14 by about 0.4 BLEU, which is quite significant.
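To illustrate the general flavour of the idea (this is only my loose illustration, not the exact formulation in the Weighted Transformer paper), you can imagine learning one weight per head and combining the head outputs with it instead of treating every head equally at concatenation time:

```python
# A very loose sketch of "weighting the heads": learn a scalar per head,
# normalize across heads, and take a weighted combination of head outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_heads, head_dim = 8, 16
heads = torch.randn(2, num_heads, 10, head_dim)   # per-head outputs (batch, heads, seq, head_dim)

head_weights = nn.Parameter(torch.zeros(num_heads))   # learned, one scalar per head
alpha = F.softmax(head_weights, dim=0)                # normalized weighting across heads

# Weighted combination over heads instead of plain, equal-weight concatenation.
weighted = (alpha.view(1, num_heads, 1, 1) * heads).sum(dim=1)   # (batch, seq, head_dim)
print(weighted.shape)
```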
This is the second of the two papers from Salesforce, "Non-Autoregressive Neural Machine Translation". Unlike the "Weighted Transformer", I don't believe it improves the SOTA results. But it introduces a cute idea into purely attention-based NNMT, so I would suggest you read my previous post before you read on.
Okay. The key idea introduced in the paper is fertility. It addresses one of the issues with a purely attention-based model like the one from "Attention Is All You Need". When you are translating, a source word can 1) expand into multiple target words, or 2) end up in a totally different position in the target sentence.
In the older world of statistical machine translation, or what we call the IBM models, the latter is handled by "Model 2", which decides the "absolute alignment" of a source/target language pair. The former is handled by the fertility model, or "Model 3". Of course, in the world of NNMT these two models were thought to be obsolete: why not just use an RNN in the Encoder/Decoder structure to solve the problem?
(Btw, there are 5 IBM Models in total. If you are into SMT, you should probably read up on them.)
But in the world of purely attention-based NNMT, ideas such as absolute alignment and fertility become important again, because you no longer have recurrence (memory) within your model. So the original "Attention Is All You Need" paper already has the idea of "positional encoding", which is there to model absolute position.
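For reference, this is the sinusoidal positional encoding from "Attention Is All You Need": it injects each token's position into its embedding, since the model itself has no recurrence to remember word order.

```python
import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()                   # even embedding dimensions
    angles = pos / torch.pow(10000.0, i / d_model)            # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                           # sine on even dims
    pe[:, 1::2] = torch.cos(angles)                           # cosine on odd dims
    return pe

pe = positional_encoding(seq_len=10, d_model=128)
print(pe.shape)   # added to the word embeddings before the first layer
```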
So the new Salesforce paper introduces another layer which brings fertility back. Instead of feeding the output of the encoder directly into the decoder, it first feeds it to a fertility layer to decide how fertile each source word should be. E.g. a fertility of 2 means the word's representation should be copied twice; 0 means the word shouldn't be copied at all.
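A toy sketch of that copying mechanism (my illustration of the idea, not the paper's code or its exact training procedure): predict an integer fertility per source word, then repeat each encoder output that many times to build the decoder's input sequence.

```python
import torch
import torch.nn as nn

hidden_dim, max_fertility = 128, 4
fertility_head = nn.Linear(hidden_dim, max_fertility + 1)   # classes 0..max_fertility

enc_out = torch.randn(7, hidden_dim)                  # encoder outputs for one sentence (src_len, hidden)
fertility = fertility_head(enc_out).argmax(dim=-1)    # e.g. tensor([1, 2, 0, 1, 1, 3, 1])

# Fertility 2 -> the word's representation is copied twice; 0 -> it is dropped.
decoder_input = torch.repeat_interleave(enc_out, fertility, dim=0)
print(fertility.tolist(), decoder_input.shape)
```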
I think the cute thing about the paper is two-fold. One is that it is a natural expansion of the whole idea of attention-based NNMT. The other is that Socher's group is reintroducing a classical SMT idea back into NNMT.
The result, though, does not work as well as standard NNMT. As you can see in Table 1, there is still some degradation with the non-autoregressive approach. That's perhaps why, when the Google Research Blog mentioned the Salesforce results, it said "towards non-autoregressive translation", implying that the results are not yet fully satisfying.