Disrupting the AI Building Paradigm: The Rise of Distributed Large Language Model Training
I. Introduction
Researchers have harnessed GPUs scattered across the globe to train a new large language model (LLM) on both private and public data, an effort that signals a potential disruption to the prevailing way artificial intelligence is built.
II. The Collaborative Creation of Collective-1
Flower AI and Vana, two startups adopting unconventional AI-building strategies, joined forces to develop the model, named Collective-1.
1. Flower AI’s Contribution
Flower AI devised techniques that enable training to be spread across hundreds of internet-connected computers. Its technology is already used by some firms to train AI models without needing to pool compute resources or data.
2. Vana’s Contribution
Vana furnished data sources, including private messages from platforms such as X, Reddit, and Telegram.
III. The Scale of Collective-1
By contemporary standards, Collective-1 is relatively small, with 7 billion parameters (the values that collectively give the model its abilities). In contrast, today's most advanced models, such as those powering ChatGPT, Claude, and Gemini, have hundreds of billions of parameters.
IV. The Promise of the Distributed Approach
Nic Lane, a computer scientist at the University of Cambridge and co-founder of Flower AI, contends that the distributed approach holds the promise of scaling far beyond the size of Collective-1. Flower AI is currently training a 30-billion-parameter model using conventional data and plans to train a 100-billion-parameter model, approaching the scale of industry leaders, later this year. Lane emphasizes, “It could truly transform the way everyone perceives AI, so we’re vigorously pursuing this.” The startup is also integrating images and audio into the training process to develop multimodal models.
V. Impact on AI Industry Power Dynamics
Distributed model-building has the potential to disrupt the power dynamics that have defined the AI industry.
1. Current Industry Model
Presently, AI companies construct their models by amassing vast amounts of training data and substantial compute resources concentrated within data centers filled with advanced GPUs interconnected via super-fast fiber-optic cables. They also rely heavily on datasets scraped from publicly accessible, albeit sometimes copyrighted, materials like websites and books.
2. Leveling the Playing Field
This conventional approach restricts the development of powerful and valuable models to only the wealthiest companies and nations with access to large quantities of the most powerful chips. Even open-source models, such as Meta’s Llama and DeepSeek’s R1, are developed by companies with access to large data centers. In contrast, distributed approaches could enable smaller companies and universities to build advanced AI by pooling diverse resources. Additionally, countries lacking traditional infrastructure could network multiple data centers to create more powerful models.
VI. Distributed AI Training Methodology
- Rethinking Calculation Division
Distributed AI training requires rethinking how the calculations used to build powerful AI systems are divided up. Creating an LLM involves feeding huge amounts of text into a model, which adjusts its parameters to generate useful responses to prompts. Inside a data center, the training process is split so that parts of it can run on different GPUs and then be periodically consolidated into a single master model.
- The New Distributed Approach
The novel approach allows the tasks typically carried out within a large data center to be performed on hardware located miles apart and connected via a relatively slow or variable internet connection; the sketch below illustrates the general recipe.
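To make the division concrete, here is a minimal sketch of the recipe that both the data-center and distributed variants build on, often called local SGD or federated averaging: each worker trains its own copy of the model on its own data, and only small parameter vectors cross the network for periodic consolidation. The linear model, synthetic shards, worker count, and merge interval are illustrative assumptions, not the actual Collective-1 setup.

```python
import numpy as np

# Minimal sketch of local training with periodic consolidation:
# each worker runs several gradient steps on its own private shard,
# and only the small model vectors cross the (slow) network to be
# averaged into a single master model. All values are illustrative.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])   # hypothetical "ground truth"

def make_shard(n=256):
    """Synthetic private data shard held by one worker."""
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

def local_steps(w, X, y, steps=10, lr=0.05):
    """A few local gradient steps; raw data never leaves the worker."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

shards = [make_shard() for _ in range(4)]   # four workers, four shards
master = np.zeros(3)                        # shared master model

for _round in range(20):
    # Each worker trains independently from the current master copy,
    # then only the updated parameters are sent back and averaged.
    local_models = [local_steps(master.copy(), X, y) for X, y in shards]
    master = np.mean(local_models, axis=0)  # periodic consolidation

print("recovered weights:", np.round(master, 2))  # ~ [ 2.  -1.   0.5]
```

Because workers synchronize only once per round rather than after every gradient step, relatively little traffic crosses the network, which is what makes slow or variable links tolerable.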
VII. Industry Players’ Exploration of Distributed Learning
Some major industry players are also delving into distributed learning. For instance, last year, Google researchers demonstrated a new method for dividing and consolidating computations called DIstributed PAth COmposition (DiPaCo), which enables more efficient distributed learning.
VIII. Photon: A Tool for Efficient Distributed Training
To build Collective-1 and other LLMs, Lane and his academic collaborators in the UK and China developed a new tool named Photon, which makes distributed training more efficient. According to Lane, Photon improves upon Google’s approach through a more efficient way of representing the data within a model and a more effective scheme for sharing and consolidating training. Although the process is slower than conventional training, it is more flexible, allowing new hardware to be added to ramp up training. Photon was developed in collaboration with researchers from Beijing University of Posts and Telecommunications and Zhejiang University in China and was released under an open-source license in March, enabling anyone to use the approach.
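The article does not describe Photon’s internals, so the following is only a sketch of one standard way to make update sharing cheaper over a slow link: quantizing each model update to 8 bits (plus an offset and scale) before transmission. The function names and parameters are assumptions for illustration, not Photon’s actual API.

```python
import numpy as np

# Generic 8-bit quantization of a model update before it crosses the
# network: 4x fewer bytes per parameter at a small cost in precision.
# Illustrative only; this is not Photon's actual representation.

def compress(update: np.ndarray):
    """Encode float32 values as uint8 plus a (lo, scale) pair."""
    lo = float(update.min())
    span = float(update.max()) - lo
    scale = span / 255 if span > 0 else 1.0
    q = np.round((update - lo) / scale).astype(np.uint8)
    return q, lo, scale

def decompress(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Approximately reconstruct the original float update."""
    return q.astype(np.float32) * scale + lo

update = np.random.default_rng(1).normal(size=1_000).astype(np.float32)
q, lo, scale = compress(update)
restored = decompress(q, lo, scale)
print("bytes on the wire:", q.nbytes, "vs", update.nbytes)   # 1000 vs 4000
print("max rounding error:", float(np.abs(update - restored).max()))
```

Shrinking each update to a quarter of its size matters most exactly where distributed training operates: on links far slower than a data center’s fiber-optic interconnects.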
IX. Vana’s Initiative in Data Sharing
Vana, Flower AI’s partner in building Collective-1, is pioneering new ways for users to share personal data with AI developers. Vana’s software enables users to contribute private data from platforms like X and Reddit to large-language-model training. Users can potentially specify the permitted end uses and even receive financial benefits from their contributions. Anna Kazlauskas, co-founder of Vana, states that the aim is to make untapped data available for AI training while giving users more control over how their information is utilized for AI. “This is data that is typically not incorporated into AI models due to its non-public availability,” Kazlauskas notes. “This is the first instance where data directly contributed by users is being used to train a foundation model, with users having ownership of the AI model created from their data.”
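Vana’s actual schema is not described here, but as a rough illustration, a user contribution with self-specified permissions might be recorded along these lines; every field name below is hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a user data contribution with user-specified
# permissions. Field names are illustrative; Vana's real schema may differ.

@dataclass
class DataContribution:
    user_id: str
    source: str                 # e.g. "x", "reddit", "telegram"
    content_ref: str            # pointer to the contributed data
    allowed_uses: list[str] = field(
        default_factory=lambda: ["foundation-model-training"]
    )
    revocable: bool = True      # whether the user can later withdraw consent

item = DataContribution(
    user_id="user-42",
    source="reddit",
    content_ref="vault://user-42/messages",
)
print(item.source, item.allowed_uses)
```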
X. Expert Insights
Mirco Musolesi, a computer scientist at University College London, highlights that a significant advantage of the distributed approach to AI training is likely its ability to unlock new types of data. “Scaling this to frontier models would enable the AI industry to leverage vast amounts of decentralized and privacy-sensitive data, such as in healthcare and finance, for training without the risks associated with data centralization,” he says. Helen Toner, an AI governance expert at the Center for Security and Emerging Technology, deems Flower AI’s approach “interesting and potentially very relevant” to AI competition and governance, adding, “It will probably continue to struggle to keep up with the frontier but could be an interesting fast-follower approach.”
XI. Call to Action
What are your thoughts on distributed machine learning? Would you contribute your data to a model like Collective – 1? Share your views by sending an email to hello@wired.com or leaving a comment below.
