Input Embedding
To be updated...
Word Vector Embeddings
A conceptual understanding of word vector embeddings is fundamental to understanding natural language processing. In essence, a word vector embedding takes an individual word and translates it into a vector that, in some way, represents the word's meaning.
The details vary from implementation to implementation, but the end result can be thought of as a “space of words”, where the space obeys certain convenient relationships. Words are hard to do math on, but vectors that contain information about a word, and about how it relates to other words, are significantly easier to do math on. This task of converting words to vectors is often referred to as an “embedding”.
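To make "doing math on words" concrete, here is a minimal sketch that compares words by the cosine similarity of their vectors. The three-dimensional vectors and the name toy_embeddings are made up for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Hypothetical, hand-picked vectors for illustration only (not learned from data).
toy_embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

# With these toy vectors, "king" sits closer to "queen" than to "apple".
print(sim_royal > sim_fruit)
```

Cosine similarity is only one possible choice of comparison; it is used here because it ignores vector length and looks only at direction.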
A common way to generate word vector embeddings is to use a neural network. The network is trained on a large corpus of text and learns to predict the words surrounding a given word, as in the skip-gram variant of Word2Vec. After training, the learned weights of the network are used as the word vector embeddings.
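The "predict the surrounding words" objective can be illustrated by the training pairs it is built from. The sketch below only generates (center word, context word) pairs from a token list; it does not train a network. The function name skipgram_pairs and the window parameter are assumptions chosen for this example, not part of any particular library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A network trained on such pairs is asked, for each center word, to assign high probability to its observed context words; words that appear in similar contexts end up with similar weights.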
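As the state of the art has progressed, word embeddings have remained an important tool, with GloVe, Word2Vec, and FastText all being popular choices. Sub-word embeddings, which build a word's vector from pieces of the word such as character n-grams, are often more robust than full-word embeddings because they can still represent rare or previously unseen words.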
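As a rough illustration of the sub-word idea, the sketch below splits a word into character n-grams, loosely modeled on the scheme FastText describes (boundary markers "<" and ">" around the word, n-grams of length 3 to 6). The function name char_ngrams and its defaults are assumptions for this example; it does not reproduce FastText's hashing or training.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)
```

In a sub-word model, each n-gram has its own vector and a word's vector is composed from the vectors of its n-grams, so an unseen word can still be embedded as long as some of its n-grams were seen during training.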
The landmark paper “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2014) popularized the general concept of attention and was the conceptual precursor to the multi-head self-attention mechanisms used in transformers.
In other words, an attention mechanism decides which inputs are currently relevant and which are not.
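The idea of deciding which inputs are relevant can be sketched as a weighted average: each input gets a score against a query, the scores are turned into weights with a softmax, and the output is the weighted sum of the inputs' values. This sketch uses simple dot-product scoring, which is a simplification and not the additive scoring function of the paper mentioned above; the names attend, query, keys, and values are assumptions for this example.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score each key against the query, then average the values by those weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy inputs: the first key points the same way as the query, the last points away.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, context = attend(query, keys, values)
print(weights)
print(context)
```

The input whose key best matches the query receives the largest weight, so it contributes most to the output, while poorly matching inputs are down-weighted rather than discarded outright.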