Current long-context large language models (LLMs) can process inputs of up to 100,000 tokens, yet they struggle to generate outputs exceeding even a modest length of 2,000 words. Controlled experiments reveal that a model’s effective generation length is inherently limited by the examples seen during supervised fine-tuning (SFT). In other words, this output limitation stems from the scarcity of long-output examples in existing SFT datasets.
Recent advancements in long-context LLMs have led to the development of models with significantly expanded memory capacities, capable of processing histories exceeding 100,000 tokens in length. However, despite their ability to handle extensive inputs, current long-context LLMs struggle to generate equally lengthy outputs.
To explore this limitation, LongWriter probes the maximum output length of state-of-the-art long-context models with multiple queries that require responses of varying lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” The results show that all models consistently fail to produce outputs beyond 2,000 words in length. Meanwhile, analysis of user interaction logs reveals that over 1% of user prompts explicitly request outputs exceeding this limit, highlighting a pressing need in current research to overcome this limitation.
To address this, LongWriter introduces AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, LongWriter constructs LongWriter-6k, a dataset containing 6,000 SFT data samples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, LongWriter successfully scales the output length of existing models to over 10,000 words while maintaining output quality.
LongWriter also develops LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. The 9B-parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models.
In this article, we will discuss the LongWriter framework, explore its architecture, and compare its performance against state-of-the-art long-context large language models. Let’s get started.
Recent advancements in long-context large language models (LLMs) have led to the creation of models with significantly increased memory capacities, capable of processing histories that exceed 100,000 tokens. Despite this ability to handle extensive inputs, current long-context LLMs struggle to generate outputs of comparable length. To investigate this limitation, LongWriter examines the maximum output length of state-of-the-art long-context models through various queries that require different response lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” Based on the findings, LongWriter observes that all models consistently fail to generate outputs longer than 2,000 words. Furthermore, an analysis of user interaction logs indicates that over 1% of user prompts specifically request outputs beyond this limit, highlighting an urgent need in current research to address this issue.
LongWriter’s study reveals a key insight: the constraint on output length is primarily rooted in the characteristics of the supervised fine-tuning (SFT) datasets. Specifically, LongWriter finds that a model’s maximum generation length is effectively capped by the upper limit of output lengths present in its SFT dataset, despite its exposure to much longer sequences during the pretraining phase. This finding explains the ubiquitous 2,000-word generation limit across current models, as existing SFT datasets rarely contain examples exceeding this length. Furthermore, because many datasets are distilled from state-of-the-art LLMs, they also inherit the output length limitation of their source models.
To address this limitation, LongWriter introduces AgentWrite, a novel agent-based pipeline designed to leverage off-the-shelf LLMs to automatically construct extended, coherent outputs. AgentWrite operates in two stages. First, it crafts a detailed writing plan outlining the structure and target word count for each paragraph based on the user’s input. Then, following this plan, it prompts the model to generate content for each paragraph sequentially. LongWriter’s experiments validate that AgentWrite can produce high-quality, coherent outputs of up to 20,000 words.
Building upon the AgentWrite pipeline, LongWriter leverages GPT-4o to generate 6,000 long-output SFT samples, named LongWriter-6k, and adds this data to the training of existing models. Notably, LongWriter-6k successfully unlocks the model’s ability to generate well-structured outputs exceeding 10,000 words in length. To rigorously evaluate the effectiveness of this approach, LongWriter develops the LongBench-Write benchmark, which contains a diverse set of user writing instructions with output length specifications in four ranges: 0-500 words, 500-2,000 words, 2,000-4,000 words, and beyond 4,000 words. Evaluation on LongBench-Write shows that LongWriter’s 9B-parameter model achieves state-of-the-art performance, even compared to larger proprietary models. LongWriter further constructs preference data and uses DPO to help the model better follow long writing instructions and generate higher-quality written content, which experiments also confirm to be effective.
To summarize, LongWriter’s work makes the following novel contributions:
- Analysis of Generation Length Limits: LongWriter identifies the primary factor limiting the output length of current long-context LLMs, namely the constraint on output length in the SFT data.
- AgentWrite: To overcome this limitation, LongWriter proposes AgentWrite, which uses a divide-and-conquer approach with off-the-shelf LLMs to automatically construct SFT data with ultra-long outputs. Using this method, LongWriter constructs the LongWriter-6k dataset.
- Scaling Output Window Size of Current LLMs: LongWriter incorporates the LongWriter-6k dataset into its SFT data, successfully scaling the output window size of existing models to 10,000+ words without compromising output quality. LongWriter also shows that DPO further enhances the model’s long-text writing capabilities.
AgentWrite: Automatic Data Construction
To use off-the-shelf LLMs for automatically generating SFT data with longer outputs, LongWriter designs AgentWrite, a divide-and-conquer style agent pipeline. AgentWrite first breaks down a long writing task into multiple subtasks, with each subtask requiring the model to write only one paragraph. The model then executes these subtasks sequentially, and LongWriter concatenates the subtask outputs to obtain the final long output. This approach of breaking down a complex task into multiple subtasks using LLM agents has already been applied in various fields, such as problem-solving, software development, and model evaluation. LongWriter’s work is the first to explore integrating planning to enable models to complete complex long-form writing tasks. Each step of AgentWrite is introduced in detail below.
Step I: Plan
Inspired by the thought process of human writers, who typically start by making an overall plan for long writing tasks, LongWriter uses the planning capabilities of LLMs to output such a writing outline given a writing instruction. This plan includes the main content and word count requirements for each paragraph. The prompt used by LongWriter is as follows:
“I need you to help me break down the following long-form writing instruction into multiple subtasks. Each subtask will guide the writing of one paragraph in the essay and should include the main points and word count requirements for that paragraph. The writing instruction is as follows: {User Instruction}. Please break it down in the following format, with each subtask taking up one line:
Paragraph 1 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 400 words]
Paragraph 2 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 1000 words]
Make sure that each subtask is clear and specific, and that all subtasks cover the entire content of the writing instruction. Do not split the subtasks too finely; each subtask’s paragraph should be no less than 200 words and no more than 1000 words. Do not output any other content.”
Step II: Write
After obtaining the writing plan from Step I, LongWriter calls the LLM serially to complete each subtask, generating the writing content section by section. To ensure the coherence of the output, when LongWriter calls the model to generate the n-th section, the previously generated n−1 sections are also provided as input, allowing the model to continue writing the next section based on the existing writing history. Although this serial approach prevents parallel calls to the model to complete multiple subtasks simultaneously, and the input length grows with each call, LongWriter shows in validation that the overall coherence and quality of writing obtained this way are far superior to output generated in parallel. The prompt used by LongWriter is:
“You are an excellent writing assistant. I will give you an original writing instruction and my planned writing steps. I will also provide you with the text I have already written. Please help me continue writing the next paragraph based on the writing instruction, writing steps, and the already written text.
Writing instruction:
{User Instruction}
Writing steps:
{The writing plan generated in Step I}
Already written text:
{Previously generated (n-1) paragraphs}
Please integrate the original writing instruction, writing steps, and the already written text, and now continue writing {The plan for the n-th paragraph, i.e., the n-th line in the writing plan}.”
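The two steps above can be sketched as a short plan-then-write loop. This is a minimal illustration, not LongWriter's released code: the `llm` callable is a hypothetical stand-in for any chat-completion call, the two templates are abbreviated versions of the prompts quoted above, and `parse_plan` assumes the model follows the "Paragraph N – Main Point ..." line format.

```python
import re
from typing import Callable, List

# Abbreviated stand-ins for the two prompts quoted in the article (assumption).
PLAN_TEMPLATE = (
    "Break down the following long-form writing instruction into subtasks, "
    "one per line.\nThe writing instruction is as follows: {instruction}\n"
    "Format: Paragraph N - Main Point: [...] - Word Count: [...]"
)
WRITE_TEMPLATE = (
    "Writing instruction:\n{instruction}\n"
    "Writing steps:\n{plan}\n"
    "Already written text:\n{written}\n"
    "Now continue writing {step}"
)


def parse_plan(plan_text: str) -> List[str]:
    """Keep only lines that look like 'Paragraph N - ...' (hyphen or en dash)."""
    pattern = re.compile(r"^Paragraph\s+\d+\s*[-–]")
    lines = (line.strip() for line in plan_text.splitlines())
    return [line for line in lines if pattern.match(line)]


def agent_write(instruction: str, llm: Callable[[str], str]) -> str:
    """Step I: ask for a plan. Step II: write each paragraph serially."""
    steps = parse_plan(llm(PLAN_TEMPLATE.format(instruction=instruction)))
    sections: List[str] = []
    for step in steps:
        prompt = WRITE_TEMPLATE.format(
            instruction=instruction,
            plan="\n".join(steps),
            # All previously generated sections are fed back for coherence.
            written="\n\n".join(sections),
            step=step,
        )
        sections.append(llm(prompt).strip())
    # The final long output is the concatenation of the subtask outputs.
    return "\n\n".join(sections)
```

Because each call receives everything written so far, the calls cannot run in parallel and the prompt grows with every section, which is the trade-off the article describes.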
Validation
LongWriter tests the generation length and quality of the proposed AgentWrite method on two long-form writing datasets. The first, LongWrite-Ruler, measures exactly how long an output the method can produce. The second, LongBench-Write, is mainly used to evaluate how well the model-generated content aligns with user instructions in terms of length and writing quality.
LongBench-Write: To evaluate the model’s performance on a more diverse range of long-form writing instructions, LongWriter collects 120 varied user writing prompts, with 60 in Chinese and 60 in English. To better assess whether the model’s output length meets user requirements, LongWriter ensures that all of these instructions include explicit word count requirements. The instructions are divided into four subsets based on the word count requirement: 0-500 words, 500-2,000 words, 2,000-4,000 words, and over 4,000 words. Additionally, the instructions are categorized into seven types based on the output type: Literature and Creative Writing, Academic and Monograph, Popular Science, Functional Writing, News Report, Community Forum, and Education and Training.
During evaluation, LongWriter adopts two metrics: one for scoring the output length and another for scoring the output quality. The model’s output length is scored based on how close it is to the requirement specified in the instruction. For output quality, LongWriter uses the LLM-as-a-judge approach, selecting the state-of-the-art GPT-4o model to score the output across six dimensions: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. The final score is computed by averaging the length score and the quality score.
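To make the scoring concrete, here is a small sketch of how a length score and the final averaged score could be computed. The article only says the length score reflects closeness to the requested length and that the final score averages the two; the piecewise-linear shape below is an assumption, chosen so that the score reaches 0 once the output falls to one-third of the required length (matching the remark in the results section). The slope for over-long outputs and the function names are likewise illustrative.

```python
def length_score(required_words: int, output_words: int) -> float:
    """Assumed piecewise-linear length score on a 0-100 scale."""
    if required_words <= 0:
        raise ValueError("required_words must be positive")
    if output_words <= 0:
        return 0.0
    if output_words > required_words:
        # Over-long outputs: decays to 0 at 4x the required length (assumption).
        ratio = output_words / required_words
        return 100.0 * max(0.0, 1.0 - (ratio - 1.0) / 3.0)
    # Too-short outputs: decays to 0 at one-third of the required length.
    ratio = required_words / output_words
    return 100.0 * max(0.0, 1.0 - (ratio - 1.0) / 2.0)


def overall_score(s_l: float, s_q: float) -> float:
    """Final score: the average of the length score and the quality score."""
    return (s_l + s_q) / 2.0


# A 3,000-word request answered with 1,000 words gets a length score of 0,
# so even a quality score of 90 yields an overall score of 45.
print(overall_score(length_score(3000, 1000), 90.0))
```

The quality score is taken as a given number on the same 0-100 scale here; how the six judged dimensions are aggregated into it is not specified in the article.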
Validation results: LongWriter reports the output length measurements on LongWrite-Ruler and finds that AgentWrite successfully extends the output length of GPT-4o from a maximum of 2k words to roughly 20k words. LongWriter also assesses both output quality and adherence to the required output length on LongBench-Write, noting that when evaluating AgentWrite’s performance, GPT-4o on its own can already complete tasks requiring outputs under 2,000 words.
Supervised Fine-Tuning
LongWriter conducts training based on two of the latest open-source models, namely GLM-4-9B and Llama-3.1-8B. Both are base models and support a context window of up to 128k tokens, making them naturally suitable for training on long outputs. To make training more efficient, LongWriter adopts packing training with loss weighting. Training the two models yields LongWriter-9B (short for GLM-4-9B-LongWriter) and LongWriter-8B (short for Llama-3.1-8B-LongWriter).
At the same time, LongWriter notices that if the loss is averaged by sequence, i.e., taking the mean of each sequence’s average loss within a batch, each target token in a long-output sample contributes significantly less to the loss than a token in a shorter output. LongWriter’s experiments also find that this leads to suboptimal model performance on tasks with long outputs. Therefore, LongWriter chooses a loss weighting strategy that averages the loss by token, where the loss is computed as the mean of the losses across all target tokens within the batch.
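The difference between the two averaging schemes is easy to see on a toy batch. The sketch below uses plain Python lists of per-token losses rather than any training framework; the function names and numbers are illustrative only.

```python
from typing import List


def sequence_mean_loss(batch: List[List[float]]) -> float:
    """Average each sequence's loss first, then average across sequences."""
    per_sequence = [sum(seq) / len(seq) for seq in batch]
    return sum(per_sequence) / len(per_sequence)


def token_mean_loss(batch: List[List[float]]) -> float:
    """Average over all target tokens in the batch, regardless of sequence."""
    total = sum(sum(seq) for seq in batch)
    count = sum(len(seq) for seq in batch)
    return total / count


# One short output (2 target tokens) and one long output (8 target tokens).
batch = [[1.0, 1.0], [2.0] * 8]

# Sequence averaging: each short-output token carries weight 1/4, while each
# long-output token carries only 1/16.
print(sequence_mean_loss(batch))

# Token averaging: every target token carries the same weight, 1/10.
print(token_mean_loss(batch))
```

With token averaging, the long sample's eight tokens dominate the batch loss in proportion to their count, which is the behavior the article argues is needed for long-output training.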
All models are trained on a node with 8xH800 80G GPUs using DeepSpeed+ZeRO3+CPU offloading. LongWriter uses a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. The models are trained for 4 epochs, which takes roughly 2,500-3,000 steps.
Alignment (DPO)
To further improve the model’s output quality and enhance its ability to follow length constraints in instructions, LongWriter performs direct preference optimization (DPO) on the supervised fine-tuned LongWriter-9B model. The DPO data comes from GLM-4’s chat DPO data (approximately 50k entries). Additionally, LongWriter constructs 4k pairs of data specifically targeting long-form writing instructions. For each writing instruction, LongWriter samples 4 outputs from LongWriter-9B and scores these outputs following a specific method. A length-following score is also computed and combined with this score. The highest-scoring output is then selected as the positive sample, and one of the remaining three outputs is randomly chosen as the negative sample.
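The pair-construction procedure described above can be sketched as follows. Both `sample_fn` (draws one output from the fine-tuned model) and `score_fn` (returns the combined quality and length-following score) are hypothetical callables; the article does not specify how the scores are computed or combined, and the output dictionary keys are illustrative.

```python
import random
from typing import Callable, Dict, Optional


def build_preference_pair(
    instruction: str,
    sample_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    n_samples: int = 4,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Sample n outputs, keep the best as 'chosen' and a random other as 'rejected'."""
    rng = rng or random.Random()
    outputs = [sample_fn(instruction) for _ in range(n_samples)]
    scores = [score_fn(instruction, out) for out in outputs]
    # Highest-scoring output becomes the positive sample.
    best = max(range(n_samples), key=lambda i: scores[i])
    # The negative sample is drawn uniformly from the remaining outputs.
    rejected = rng.choice([i for i in range(n_samples) if i != best])
    return {
        "prompt": instruction,
        "chosen": outputs[best],
        "rejected": outputs[rejected],
    }
```

Picking the negative at random rather than always taking the lowest-scoring output follows the description in the text; passing a seeded `random.Random` makes the selection reproducible.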
The resulting model, LongWriter-9B-DPO, is trained for 250 steps on the above data mixture. LongWriter follows a specific recipe for DPO training.
LongWriter: Experiments and Results
LongWriter evaluates 4 proprietary models and 5 open-source models on LongBench-Write, along with the trained LongWriter models. To the best of LongWriter’s knowledge, Suri-I-ORPO is the only prior model that is also aligned for long-form text generation. It is trained based on Mistral-7B-Instruct-v0.2 using LoRA. Consistent with the evaluation setup on LongWrite-Ruler, LongWriter sets the output temperature to 0.5 and configures each model’s max generation tokens parameter to the maximum allowed by its API. For open-source models, it is set to 32,768.
Most previous models are unable to meet the length requirement of over 2,000 words, while LongWriter models consistently provide longer and richer responses to such prompts.
Observing the output length score $S_l$ for prompts in each required length range, LongWriter finds that previous models generally perform poorly (scoring below 70) on prompts in the [2k, 4k) range, with only Claude 3.5 Sonnet achieving a decent score. For prompts in the [4k, 20k) range, almost all previous models are completely unable to reach the target output length, even scoring 0 (meaning all output lengths are less than one-third of the required length). By adding training data from LongWriter-6k, LongWriter’s trained model can effectively reach the required output length while maintaining good quality, as suggested by the scores in the [2k, 20k) range and the scatter plots.
DPO effectively improves both the model’s output quality and its ability to follow length requirements in long generation.
By comparing the scores of LongWriter-9B and LongWriter-9B-DPO, LongWriter finds that DPO significantly improves both the $S_l$ (+4%) and $S_q$ (+3%) scores, and the improvement is consistent across all ranges. This shows that in long-generation scenarios, DPO still helps to improve the model’s output quality and can better align the model’s output length with the requested length. The latter conclusion has also been recently observed in Yuan et al. (2024) for shorter generations. LongWriter also manually annotates pairwise wins and losses for GPT-4o and three LongWriter models on their outputs in LongBench-Write and visualizes the results in Figure 9. Humans prefer the DPO-trained model over LongWriter-9B in 58% of the cases. Moreover, despite having fewer parameters, LongWriter-9B-DPO achieves a tie with GPT-4o.
Figure 7: Cumulative average NLL loss of GLM-4-9B and Llama-3.1-8B at different positions of LongWriter models’ outputs.
Figure 8: LongWrite-Ruler test results of LongWriter models, showing their maximum generation lengths between 10k-20k words.
The output length limit of the LongWriter models is extended to between 10k and 20k words, while more data with long outputs is required to support even longer outputs.
Following the LongWrite-Ruler test, LongWriter also presents the LongWrite-Ruler results for the LongWriter models. The results suggest that their maximum generation lengths are between 10k and 20k words. The lack of SFT data with longer outputs is likely the primary reason preventing the models from achieving longer output lengths.
Final Thoughts
In this article, we have discussed LongWriter, which identifies a 2,000-word generation limit for current LLMs and proposes increasing their output window size by adding long-output data during alignment. To automatically construct long-output data, LongWriter develops AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks and uses off-the-shelf LLMs to create extended, coherent outputs. With the constructed LongWriter-6k dataset, LongWriter successfully scales the output window size of current LLMs to over 10,000 words. Extensive ablation studies on the training data demonstrate the effectiveness of this approach. For future work, LongWriter suggests the following three directions:
1. Expand the AgentWrite framework to construct data with longer outputs, further extending LLMs’ output window size.
2. Refine the AgentWrite framework to achieve higher-quality long-output data.
3. Longer model outputs bring challenges to inference efficiency. Several methods have been proposed to improve inference efficiency, and it is worth investigating how these methods can improve model efficiency without compromising generation quality.