Linear attention-based models are gaining attention for their faster processing speed and performance comparable to Softmax transformers. However, large language models (LLMs), because of their large size and long sequence lengths, put significant strain on modern GPU hardware: a single GPU's memory limits the maximum sequence length a model can be trained on.
Sequence Parallelism (SP) techniques are often applied to divide a long sequence into multiple sub-sequences and train them on multiple GPUs separately. However, existing SP methods do not take advantage of linear attention's features, resulting in inefficient parallelism and poor usability.
Researchers from Shanghai AI Laboratory and TapTap present Linear Attention Sequence Parallelism (LASP), a technique that optimizes sequence parallelism for linear transformers. It employs point-to-point (P2P) communication for efficient state exchange among GPUs within or across nodes, and it makes full use of the right-product kernel trick of linear attention. Importantly, it does not rely on attention head partitioning, making it applicable to multi-head, multi-query, and grouped-query attention.
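To see why the right product matters, here is a minimal PyTorch sketch of the kernel trick (non-causal case, with feature map and normalization omitted); the variable names are illustrative and not taken from the LASP codebase:

```python
import torch

n, d = 1024, 64  # sequence length, per-head dimension
Q = torch.randn(n, d, dtype=torch.float64)
K = torch.randn(n, d, dtype=torch.float64)
V = torch.randn(n, d, dtype=torch.float64)

# Left product, Softmax-style ordering: (Q K^T) V
# materializes an n x n attention matrix -> O(n^2 d) time, O(n^2) memory.
out_left = (Q @ K.T) @ V

# Right product, linear-attention ordering: Q (K^T V)
# materializes only a d x d state -> O(n d^2) time, O(d^2) state.
kv_state = K.T @ V  # d x d, independent of the sequence length n
out_right = Q @ kv_state

# By associativity of matrix multiplication, the two orderings agree.
assert torch.allclose(out_left, out_right)
```

It is this d x d state, rather than anything that scales with sequence length, that LASP's GPUs exchange.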
LASP employs a tiling strategy to partition input sequences into sub-sequence chunks distributed across GPUs. It separates the attention computation into intra-chunk and inter-chunk parts to exploit linear attention's right-product advantage: intra-chunks use conventional attention computation, while inter-chunks exploit the kernel trick. The method also includes data distribution, forward pass, and backward pass mechanisms to improve parallel processing efficiency.
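The split can be illustrated on a single device with a short sketch (causal case, normalization and feature maps omitted); the chunking and the running KV state follow the description above, but the function itself is our simplification, not the authors' kernel:

```python
import torch

def chunked_linear_attention(Q, K, V, chunk_size=256):
    """Causal linear attention computed chunk by chunk.

    Intra-chunk: conventional (masked) attention inside each chunk.
    Inter-chunk: each chunk attends to all previous chunks through a
    running d x d KV state, using the right-product kernel trick.
    """
    n, d = Q.shape
    out = torch.empty_like(Q)
    kv_state = torch.zeros(d, d, dtype=Q.dtype)  # accumulated K^T V of past chunks
    causal_mask = torch.tril(torch.ones(chunk_size, chunk_size, dtype=torch.bool))

    for start in range(0, n, chunk_size):
        q = Q[start:start + chunk_size]
        k = K[start:start + chunk_size]
        v = V[start:start + chunk_size]
        m = len(q)  # last chunk may be shorter

        # Intra-chunk: ordinary left-product attention with a causal mask.
        scores = (q @ k.T).masked_fill(~causal_mask[:m, :m], 0.0)
        intra = scores @ v

        # Inter-chunk: right product against the state of all earlier chunks.
        inter = q @ kv_state

        out[start:start + m] = intra + inter
        kv_state = kv_state + k.T @ v  # update state for later chunks

    return out
```

In LASP each chunk lives on a different GPU and the d x d `kv_state` is what travels between ranks; the loop here runs sequentially on one device only to expose the math.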
LASP achieves significant throughput gains for linear attention through its efficient communication design, surpassing DeepSpeed-Ulysses by 38% and Megatron by 136% in throughput at a 256K sequence length on a 1B-parameter model. Moreover, with system optimizations such as kernel fusion and KV state caching, LASP supports longer sequence lengths on the same cluster, reaching 2048K for the 1B model and 512K for the 7B model.
The key contributions of this research are as follows:
- A new SP strategy tailored to linear attention: it enables linear attention-based models to scale to long sequences without being limited by the memory of a single GPU.
- Sequence-length-independent communication overhead: the communication mechanism harnesses the right-product kernel trick of linear attention so that the intermediate states exchanged between GPUs do not grow with sequence length (see the sketch after this list).
- GPU-friendly implementation: LASP's execution on GPUs is optimized through careful system engineering, including kernel fusion and KV state caching.
- Data-parallel compatibility: LASP is compatible with all batch-level DDP methods, such as PyTorch/Legacy DDP, FSDP, and ZeRO-series optimizers.
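As referenced in the second bullet, a hedged sketch of what the sequence-length-independent P2P exchange could look like with `torch.distributed` is shown below: each rank holds one chunk, receives the accumulated KV state from the previous rank, and forwards the updated state to the next. The pattern follows the paper's description, but the actual LASP implementation (with fused kernels and backward-pass machinery) is more involved, and the names here are ours:

```python
import torch
import torch.distributed as dist

def lasp_forward_chunk(q, k, v, rank, world_size):
    """Forward pass for one rank's local sub-sequence chunk.

    The only inter-GPU traffic is the d x d KV state, so the
    communication volume does not grow with sequence length.
    (Illustrative sketch; names are not from the LASP repository.)
    """
    n, d = q.shape
    kv_state = torch.zeros(d, d, dtype=q.dtype, device=q.device)

    # P2P receive: accumulated K^T V state of all preceding chunks.
    if rank > 0:
        dist.recv(kv_state, src=rank - 1)

    # Intra-chunk: conventional causal attention within the local chunk.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    intra = (q @ k.T).masked_fill(~mask, 0.0) @ v

    # Inter-chunk: right product against the received state.
    out = intra + q @ kv_state

    # P2P send: pass the updated state to the next rank.
    new_state = kv_state + k.T @ v
    if rank < world_size - 1:
        dist.send(new_state.contiguous(), dst=rank + 1)
    return out
```

KV state caching, mentioned in the third bullet, roughly amounts to keeping `kv_state` around for the backward pass instead of recomputing or re-communicating it.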
In conclusion, LASP is introduced to overcome the limitations of existing SP methods on linear transformers by leveraging linear attention's features to improve parallelism efficiency and usability. Its use of P2P communication, kernel fusion, and KV state caching reduces communication traffic and improves GPU cluster utilization. Compatibility with batch-level DDP methods makes it practical for large-scale distributed training. Experiments highlight LASP's advantages in scalability, speed, memory usage, and convergence performance compared with existing SP methods.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.