Beyond Attention: How the Differential Transformer is Revolutionizing Language Models
A discussion of "Differential Transformer" (2024) by Tianzhu Ye et al.
Introduction
Imagine you’re going through a document, trying to find an answer, but there is so much going on that you get lost in irrelevant details and cannot pinpoint the exact answer to your question. LLMs face the same problem: they process so much information that they may struggle to focus on the exact piece of information needed to respond to a user query. This is where the Differential Transformer comes in: it is a new architecture for large language models that mitigates this exact issue by cancelling out the “noise” that drowns out the accurate answer, allowing the LLM to focus on what’s most important.
What is the Core Issue?
Transformer architectures have become the backbone of many powerful LLMs such as GPT and Gemini, but their strengths come with limitations. One of these limitations is attention noise: the standard Transformer architecture may allocate some of its “attention” to irrelevant data.
For example, suppose you ask an AI support system built on the standard Transformer architecture, “Why is my laptop overheating when I run a program?” The system has already processed a customer user manual covering FAQs such as troubleshooting Wi-Fi connectivity, system setup, and device warranty, and it allocates some of its attention to these topics among others. Because the query mentions “overheating”, the system may latch onto the device’s general heating limits rather than guidance on managing overheating while running particular programs. It may then produce an unrelated output instead of answering the question the user actually asked. This leads to customer frustration: the system cannot differentiate between the various pieces of data that merely mention “overheating” and the exact piece of data that addresses overheating when running a program.
This is exactly what the DIFF transformer addresses in its implementation.
The DIFF Transformer
If we liken the standard Transformer’s “search” mechanism to a flashlight, the DIFF Transformer’s mechanism is more like a laser pointer. Instead of processing irrelevant details and then struggling to recover the key points needed to answer a query, the DIFF Transformer focuses on the most important information, capturing only what is needed to respond.
To do this, the DIFF Transformer implements what is called a differential attention mechanism, whose sole purpose is to cancel out this “noise”. This is the main difference from the standard Transformer introduced in the paper “Attention Is All You Need” (2017). The standard Transformer computes attention scores, which indicate how relevant each word in a sequence is to the current query. It does this with Scaled Dot-Product Attention: the dot products between the query and key vectors are scaled and passed through the softmax function, producing a softmax attention map whose weights are then applied to the value vectors when responding to queries and generating the most accurate answer possible.
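To make this concrete, here is a minimal sketch of standard softmax attention in Python (using PyTorch). The function name, tensor shapes, and single-head framing are illustrative assumptions, not the paper’s code.

import torch
import torch.nn.functional as F

def softmax_attention(Q, K, V):
    # Q, K, V: (sequence_length, d) tensors produced by the model's
    # learned query, key, and value projections.
    d = Q.shape[-1]
    # Scaled dot product: how strongly each query position matches each key.
    scores = Q @ K.transpose(-1, -2) / d ** 0.5
    # Softmax turns the scores into an attention map whose rows sum to 1.
    attn_map = F.softmax(scores, dim=-1)
    # Each output position is a weighted mix of the value vectors.
    return attn_map @ V

Every token attends to every other token here, which is exactly why some attention weight can leak onto irrelevant context.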
In the DIFF Transformer, however, the query and key vectors are split into two groups. Each group is run through the softmax function, creating two separate attention maps. The final attention scores are then obtained by subtracting one map from the other, scaled by a learnable factor. This filters out the “distractions” that both maps share and retains only the most relevant information.
Furthermore, the DIFF Transformer retains the same multi-head structure as the standard Transformer, but each head performs this differential attention independently. Each head computes its own pair of softmax attention maps and takes their difference, and each head’s output is then normalized before the heads are combined into the final result.
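The sketch below extends the previous one to a single differential attention head, following the mechanism described above: the query and key projections are split into two groups, two softmax maps are computed, and their difference, weighted by a learnable scalar lambda, is applied to the values, with a per-head normalization at the end. The lambda reparameterization and the exact normalization used in the paper are simplified here, so treat this as an illustrative sketch rather than the reference implementation.

import torch
import torch.nn.functional as F

def differential_attention(Q, K, V, lam):
    # Q and K each stack two groups of projections along the feature
    # dimension; split them into (Q1, Q2) and (K1, K2).
    Q1, Q2 = Q.chunk(2, dim=-1)
    K1, K2 = K.chunk(2, dim=-1)
    d = Q1.shape[-1]
    # Two independent softmax attention maps.
    map1 = F.softmax(Q1 @ K1.transpose(-1, -2) / d ** 0.5, dim=-1)
    map2 = F.softmax(Q2 @ K2.transpose(-1, -2) / d ** 0.5, dim=-1)
    # Subtracting one map from the other (weighted by lambda) cancels the
    # attention that both maps assign to irrelevant context, i.e. the
    # shared "noise".
    diff_map = map1 - lam * map2
    out = diff_map @ V
    # Normalize each head's output independently (the paper uses a
    # headwise RMSNorm); a plain RMS normalization is shown here.
    rms = out.pow(2).mean(dim=-1, keepdim=True).add(1e-6).sqrt()
    return out / rms

A multi-head layer would run this once per head on that head’s slice of the projections, then concatenate the head outputs and apply the usual output projection.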
The figure below illustrates both the multi-head structure and the difference-of-two-softmax-maps approach that DIFF employs.
Advantages and Results
DIFF shows remarkable improvements over the standard Transformer, with highlights including (but not limited to):
Language Modeling: A 6.8B-parameter DIFF Transformer matched the validation loss of an 11B-parameter standard Transformer while using only 62.2% of the parameters, indicating that DIFF is considerably more parameter-efficient as models are scaled up.
Key Information Retrieval: In multi-needle retrieval over a 4K context with N = 6 needles and R = 2 query cities, DIFF achieved 30% higher accuracy than the standard Transformer. Moreover, when the needles are placed at 25% depth within a 64K context, DIFF shows a 76% accuracy improvement over the standard Transformer. This suggests that DIFF handles complex queries over long, cluttered contexts with far fewer retrieval errors.
In-Context Learning: In many-shot classification, DIFF consistently scored higher than the standard Transformer, with average accuracy improvements ranging from 5.2% to 21.6%, a significant gap between the two architectures.
Contextual Hallucination Mitigation: DIFF also hallucinates less (i.e., it is more accurate and makes up less content) than the standard Transformer, consistently achieving higher hallucination-free rates on the summarization datasets XSum, CNN/DM, and MultiNews. The same pattern holds in the accuracy comparisons on the question-answering datasets Qasper, HotpotQA, and 2WikiMQA.
Activation Outliers: Activation outliers, i.e. unusually large values appearing in the model’s internal activations, are much lower in DIFF than in the standard Transformer. For attention logits, the DIFF Transformer has a top-1 activation value of 38.8 versus 318.0 for the standard Transformer; for hidden states, DIFF’s top-1 activation value is 1688.2 versus 3608.6. Fewer extreme activations make the model more stable and easier to quantize to lower bit-widths.
Training Efficiency: DIFF’s throughput is also comparable to the standard Transformer’s; with a model size of 3B and a context length of 2K, DIFF’s throughput is only about 9% lower.
Final Thoughts
The results of this new architecture indicate a Transformer that is more robust and accurate, as it minimizes the inaccuracies brought about by irrelevant details in text. This has the potential to enable smarter, more efficient, and more accurate AI systems that are, ultimately, more user friendly. It could also have significant impact across industries such as education, healthcare, and customer service, where users rely on AI systems to provide accurate and succinct answers to their queries.