
Transformer architecture Trivia Questions

How much do you really know about the Transformer architecture? Below are 8 true-or-false statements, each followed by the answer and an explanation.

1.

The Transformer architecture was introduced in a 2017 paper titled 'Attention Is All You Need.'


Easy
✓ TRUE

This is correct. The Transformer debuted in June 2017 in the paper by Vaswani et al., revolutionizing NLP by replacing recurrence with attention mechanisms.

2.

The original Transformer had a single attention mechanism, not multi-head attention.


Easy
✗ FALSE

False. The original paper introduced multi-head attention (8 heads in the base model) to let the model attend to different representation subspaces simultaneously.
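
To make "different representation subspaces" concrete, here is a minimal NumPy sketch of the head split, using the base model's dimensions from the paper; the variable names are ours, and the learned Q/K/V projections that precede the split in a real implementation are omitted.

```python
import numpy as np

d_model, n_heads = 512, 8          # base model from the 2017 paper
d_head = d_model // n_heads        # 64 dimensions per head

x = np.random.randn(10, d_model)   # 10 tokens, one 512-dim vector each

# Each head sees its own 64-dim slice, so it can attend according to a
# different subspace of the representation.
heads = x.reshape(10, n_heads, d_head).transpose(1, 0, 2)
print(heads.shape)                 # (8, 10, 64)
```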

3.

The number of attention heads in a Transformer must always be a power of two.


Medium
✗ FALSE

Myth. Powers of two are common (e.g., 8, 16), but nothing requires them; GPT-3's largest model uses 96 heads, which isn't a power of two. In typical implementations the only hard constraint is that the head count evenly divide the model dimension.
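
A quick sanity check of that divisibility constraint, plugging in GPT-3's published dimensions (a toy snippet, not any library's code):

```python
# Powers of two are conventional, not mandatory; in typical
# implementations the hard constraint is that n_heads divides d_model.
d_model, n_heads = 12288, 96   # GPT-3 175B's published dimensions
assert d_model % n_heads == 0
print(d_model // n_heads)      # 128 dimensions per head
```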

4.

Transformers use a fixed positional encoding based on sine and cosine functions of different frequencies.


Medium
✓ TRUE

True. The original paper used sinusoidal positional encodings to inject order information, though learned embeddings are now common too.
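
For reference, here is a minimal NumPy sketch of the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) with cosine on the odd dimensions (the function name is ours):

```python
import numpy as np

def sinusoidal_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Sine on even dimensions, cosine on odd ones, per the 2017 paper."""
    positions = np.arange(n_positions)[:, None]        # (pos, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # the 2i values
    angles = positions / (10000.0 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_encoding(50, 512).shape)              # (50, 512)
```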

5.

Transformers process tokens in sequence, one word at a time, like RNNs do.


Medium
✗ FALSE

False. Transformers process all tokens in parallel using self-attention, not sequentially. This parallelization is a key advantage over RNNs.
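
A toy sketch of why this is parallel: scaled dot-product attention over a whole sequence reduces to a couple of matrix products, with no loop over positions (shapes and names illustrative):

```python
import numpy as np

seq_len, d = 6, 64
Q = np.random.randn(seq_len, d)   # queries for all 6 tokens at once
K = np.random.randn(seq_len, d)   # keys
V = np.random.randn(seq_len, d)   # values

# One matrix product scores every token against every other token;
# an RNN would need a step-by-step loop instead.
scores = Q @ K.T / np.sqrt(d)                    # (6, 6)
scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                                # all 6 outputs in parallel
```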

6.

During self-attention, the Transformer decoder can attend to all positions in the target sequence, including future ones.


Hard
✗ FALSE

False. The decoder's masked self-attention prevents each position from attending to future target tokens, preserving autoregressive generation during training. (Its cross-attention over the encoder output, by contrast, is unmasked.)
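
Here is a minimal sketch of the causal mask in practice (pure NumPy, values illustrative): future positions are set to negative infinity before the softmax, so they receive exactly zero attention weight.

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)  # stand-in attention scores

# Positions j > i are the "future"; mask them out before the softmax.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular attention weights
```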

7.

In the original Transformer, layer normalization is applied after the residual connection, not before.


Hard
✓ TRUE

True. The original design ('post-norm') places layer normalization after the residual addition. Many modern models use 'pre-norm' instead for training stability.
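
Both orderings side by side, as a schematic sketch; `sublayer` stands for attention or the feed-forward network and `norm` for layer normalization, passed in as plain functions rather than any particular library's modules.

```python
def post_norm_block(x, sublayer, norm):
    # Original 2017 design: normalize *after* adding the residual.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Common modern variant: normalize first, keeping an untouched
    # residual path, which tends to train more stably in deep stacks.
    return x + sublayer(norm(x))
```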

8.

Transformers cannot handle variable-length inputs without padding or truncation.


Hard
✗ FALSE

Myth. They inherently handle variable lengths via attention masks; padding is just for batching efficiency, not a fundamental limitation.
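
A minimal sketch of a padding mask in action, batching a length-4 and a length-2 sequence together (all values illustrative):

```python
import numpy as np

# Two "sentences" padded to a common length of 4; True marks real tokens.
pad_mask = np.array([[1, 1, 1, 1],
                     [1, 1, 0, 0]], dtype=bool)   # (batch, key)

scores = np.random.randn(2, 4, 4)                 # (batch, query, key)

# Block attention *to* padded positions so they never influence outputs.
scores = np.where(pad_mask[:, None, :], scores, -np.inf)

weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)         # padded keys get weight 0
```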


Want to test yourself in real time?

Swipe right for True, left for False. New questions every day on PopBluff.

Play PopBluff Free →