FlashAttention: Fast Transformer Training with Long Sequences

FlashAttention: Speeding Up Attention for Transformers

FlashAttention is an algorithm that has dramatically sped up the training and inference of Transformers. It addresses the compute and memory bottleneck that arises when training Transformers on long sequences: by significantly reducing the runtime and memory requirements of the attention layer, it enables the training of large language models with longer context.

Since its release six months ago, FlashAttention has gained widespread adoption among organizations and research labs seeking to accelerate their training and inference processes. Its ability to speed up attention and reduce memory footprint without any approximation has made it a game-changer in the field of AI.

In collaboration with Adept as a part-time research fellow, I have had the opportunity to contribute to the development of FlashAttention and introduce key improvements to enhance its performance. In this blog post, I will delve into one of these improvements that we are particularly excited about: making FlashAttention fast for long sequences to enable training large language models with extended context.

Scaling Up the Context Length of Transformers

One of the challenges in scaling up the context length of Transformers is the quadratic runtime and memory requirement of the multihead attention layer at their core. The standard 2k sequence length limit is often insufficient for tasks that require understanding books, high-resolution images, webpages, multi-turn user interactions, and long-form videos.
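To make the quadratic cost concrete, here is a minimal single-head attention sketch in PyTorch (the function name and shapes are illustrative, not taken from any particular implementation): the (N, N) score matrix is materialized explicitly, so both runtime and memory grow quadratically with the sequence length N.

```python
import torch

def naive_attention(q, k, v):
    """Standard single-head attention for a sequence of length N.
    The (N, N) score matrix is materialized explicitly, so runtime
    and memory both scale quadratically with N."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.T) * scale             # (N, N): the quadratic term
    probs = torch.softmax(scores, dim=-1)
    return probs @ v                       # (N, d)
```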

FlashAttention addresses this challenge by reordering the attention computation and leveraging classical techniques such as tiling and recomputation to speed up the process and reduce memory usage. While FlashAttention works exceptionally well for most cases, it was not initially optimized for super long sequences with small batch sizes or a small number of heads. This is the specific scenario that we aim to optimize for.

FlashAttention achieves a significant speedup by using classical tiling techniques: it loads blocks of query, key, and value from the GPU’s main memory (HBM) into its fast on-chip memory (SRAM), computes attention for that block, and writes the output back to main memory. This reduction in memory reads and writes brings about a 2-4x speedup in most cases.
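As a rough illustration of the tiling idea (not the actual CUDA kernel), the sketch below processes one block of keys and values at a time and keeps the softmax exact with running max/sum statistics, so only a small score tile exists at any point. For readability it tiles only over keys and values; the real kernel also tiles over queries and keeps the working set in on-chip SRAM.

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Block-wise attention with a running (online) softmax.
    Only an (N, block) tile of scores exists at any time, never the
    full (N, N) matrix; running statistics keep the softmax exact."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, n, block):               # one key/value block at a time
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                     # (N, block) score tile
        tile_max = s.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, tile_max)
        correction = torch.exp(row_max - new_max)  # rescale old statistics
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum
```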

Parallelization for Long Sequences

The first version of FlashAttention parallelizes over batch size and the number of heads. Each attention head is processed by one thread block, with a total of batch_size * num_heads thread blocks. This scheduling is efficient when batch_size * num_heads is large, as it allows for the effective utilization of compute resources on the GPU.
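A back-of-the-envelope view of that scheduling (purely illustrative, not the real launch code): with one thread block per (batch, head) pair, a long-sequence configuration with a small batch launches far fewer blocks than an A100 has streaming multiprocessors.

```python
def num_thread_blocks_v1(batch_size, num_heads):
    # FlashAttention v1 scheduling: one thread block per (batch, head) pair.
    return batch_size * num_heads

# An A100 has 108 streaming multiprocessors (SMs).
print(num_thread_blocks_v1(32, 16))  # 512 blocks: plenty of parallelism
print(num_thread_blocks_v1(2, 16))   # 32 blocks: most SMs sit idle
```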

However, in the case of long sequences, which typically involve small batch sizes or a small number of heads, we have introduced an additional level of parallelization. By parallelizing over the sequence length dimension, we can better utilize the multiprocessors on the GPU, resulting in a significant speedup for this specific scenario.

In the forward pass, multiple workers (thread blocks) process one attention head, with each worker handling a block of rows of the attention matrix. Since the rows of the attention matrix are independent of each other, the workers do not need to communicate.
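Here is a sketch of that row-block parallelism in plain PyTorch, with workers simulated by a loop; on the GPU each iteration would be an independent thread block, and each worker would additionally tile over K and V as in the forward sketch above.

```python
import torch

def forward_row_parallel(q, k, v, row_block=128):
    """Each 'worker' owns a block of query rows and writes only its own
    slice of the output, so no communication between workers is needed."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.empty_like(q)
    for i in range(0, n, row_block):               # each iteration = one worker
        qi = q[i:i + row_block]
        s = (qi @ k.T) * scale
        out[i:i + row_block] = torch.softmax(s, dim=-1) @ v
    return out
```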

In the backward pass, the workers now handle a block of columns of the attention matrix. Communication is required to aggregate the gradient with respect to the query, which can be achieved through atomic operations. Parallelizing by columns in the backward pass has proven to be faster than parallelizing by rows due to reduced communication between the workers.
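The following sketch mirrors that column-block scheme, again with workers simulated by a loop. Variable names are illustrative: `lse` stands for the per-row log-sum-exp of the scaled scores saved from the forward pass, and the `dq +=` accumulation is the step the real kernel performs with atomic adds; dK and dV are written locally with no communication.

```python
import torch

def backward_col_parallel(q, k, v, o, do, lse, col_block=128):
    """Each 'worker' owns a block of key/value columns: its dK/dV slices are
    written locally, and only the partial dQ updates are accumulated into a
    shared buffer (atomic adds on the GPU). `o`/`do` are the forward output
    and its incoming gradient; `lse` has shape (N, 1)."""
    scale = q.shape[-1] ** -0.5
    dq = torch.zeros_like(q)                       # shared across workers
    dk, dv = torch.zeros_like(k), torch.zeros_like(v)
    delta = (do * o).sum(dim=-1, keepdim=True)     # D_i = dO_i . O_i
    for j in range(0, k.shape[0], col_block):      # each iteration = one worker
        kb, vb = k[j:j + col_block], v[j:j + col_block]
        s = (q @ kb.T) * scale                     # recomputed score tile
        p = torch.exp(s - lse)                     # softmax tile from saved stats
        dv[j:j + col_block] = p.T @ do
        dp = do @ vb.T
        ds = p * (dp - delta)
        dk[j:j + col_block] = (ds.T @ q) * scale
        dq += (ds @ kb) * scale                    # cross-worker accumulation
    return dq, dk, dv
```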

Impressive Performance Benchmarks

FlashAttention has demonstrated remarkable performance improvements compared to the PyTorch and Megatron-LM attention implementations. In benchmark tests measuring the time taken by the forward + backward pass, FlashAttention is between 2.2x and 2.7x faster for longer sequences (8k) with a small batch size and a fixed number of heads.

In end-to-end training benchmarks, FlashAttention has shown exceptional efficiency when training Transformers of sizes up to 2.7B on sequences of length 8k. With a training efficiency of up to 175 TFLOPS per A100 GPU, FlashAttention outperforms Megatron-LM by 2.2x. Training with an 8k context length using FlashAttention is only 7% less hardware efficient than training with a 2k context length, whereas increasing the context length from 2k to 8k with Megatron-LM drops hardware efficiency by 1.9x. FlashAttention makes training on long sequences significantly easier and more efficient.

Furthermore, models trained with longer context have consistently been of higher quality. GPT-3-style models with 1.3B and 2.7B parameters, trained on the Pile dataset with either 2k or 8k context, show that the longer-context versions perform better on both pretraining metrics and downstream evaluation tasks.

Enabling the Future of AI

FlashAttention is a crucial step towards equipping models with long context, enabling them to capture the history of user interactions and remember past actions and feedback. As AI models become more personalized and multi-modal, the ability to model long context will play an even more significant role. Long context comprehension will empower models to understand books, high-resolution images, and videos.

We are incredibly excited about the possibilities that FlashAttention opens up for the future of AI. If you have an application that could benefit from these advancements, we encourage you to reach out and share your ideas with us.

Join the waitlist to be among the first to try out Adept’s groundbreaking technology.

Note

  • This article was automatically generated by AI (gpt-3.5-turbo).
  • It is based on the following article posted on Hacker News:
    FlashAttention: Fast Transformer training with long sequences
  • If you believe there is a problem with the content of this automatically generated article, please let us know in the comments section.
