Introducing the largest open-science multilingual language model ever trained

While they regularly deliver excellent results, large AI models are generally black boxes: it is not known exactly how they compute their answers, and many of their components are not made public. The BigScience project, which brings together a thousand researchers in an open, collaborative scientific effort, is changing the game with Bloom.

It is the largest fully open and transparently trained multilingual language model. This type of AI learns both to generate and to represent text by repeatedly performing a single task: predicting the next word of a text whose beginning is known, the same principle that makes smartphone keyboards "smart". Beyond handling 46 languages, ranging from English to Basque, its fully open character will help scientists from all backgrounds explore how language models work in order to improve them. The BigScience project, initiated by Hugging Face, was supported by the CNRS, GENCI and the French Ministry of Higher Education and Research, allowing Bloom to be trained on the "Jean Zay" machine, one of the most powerful supercomputers in Europe. Philippe Lavocat, Chairman and Chief Executive Officer of GENCI, announces:

"BigScience achieves a world first and paves the way for new scientific discoveries. It took advantage of the resources of the Jean Zay converged supercomputer, one of the most powerful computers in Europe, commissioned in 2019 in the wake of the AI for Humanity plan. Today, more than 1,000 research projects rely on its resources. Key to this success, the extension of Jean Zay at the beginning of the year is the result of joint work between the Ministry of Higher Education and Research, the French National Centre for Scientific Research through its Institute for Development and Resources in Intensive Scientific Computing (IDRIS), and GENCI."

Language models are artificial intelligences whose first applications concern natural-language text: question answering, automatic sentence generation, sentiment detection, automatic summarization and simplification, or even machine translation. Most current models were designed by the giants of the tech industry, trained only on texts written in English, and built according to principles and methods that are difficult to reproduce in full detail. For example, when such a model answers a question, it is impossible to know whether the answer is the result of an actual computation or whether it simply already appeared in the model's training data.

The BigScience project was launched in the spring of 2021 by the Franco-American artificial intelligence company Hugging Face to tackle these problems by training a new model: Bloom. It learns from a large volume of text according to a simple principle: predicting the continuation of a sentence word by word. Each prediction of the model is compared with the correct word, which makes it possible to adjust the model's internal parameters. In Bloom's case, learning involved evaluating trillions of words, resulting in a model with 176 billion parameters. This training took several months and required hundreds of GPUs operating in parallel, the equivalent of 5 million hours of computation. Such computing power can only be obtained on supercomputers like the Jean Zay machine. Thomas Wolf, co-founder and chief science officer of the startup Hugging Face, explains:
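The training principle described above, predict the next word, compare with the correct word, adjust the parameters, can be illustrated with a deliberately tiny sketch. This is purely illustrative and bears no resemblance to Bloom's actual architecture (a transformer neural network with 176 billion parameters): here the "model" is just one weight per pair of words, trained by gradient updates on a twelve-word corpus.

```python
import math

# Toy next-word predictor: one weight per (previous word, next word) pair.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# The "internal parameters" of this miniature model.
W = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

lr = 0.5
for epoch in range(200):
    for prev, nxt in zip(corpus, corpus[1:]):
        probs = softmax(W[idx[prev]])              # the model's prediction
        for j in range(V):
            target = 1.0 if j == idx[nxt] else 0.0  # the correct word
            W[idx[prev]][j] += lr * (target - probs[j])  # adjust parameters

def predict(word):
    """Return the most probable next word after `word`."""
    probs = softmax(W[idx[word]])
    return vocab[max(range(V), key=lambda j: probs[j])]

print(predict("sat"))  # in this corpus, "sat" is always followed by "on"
```

Scaled up by many orders of magnitude, in data, parameters and architecture, this is the same learning signal that shaped Bloom over several months on Jean Zay.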

"The creation of the Bloom model and the success of the BigScience research collaboration show that there is another way to create, study and share innovations in artificial intelligence, by bringing together industry, academia and non-profits around an international, interdisciplinary and open-access project. I am happy that Hugging Face was able to find in France the support needed for this approach, which is unprecedented on a global scale."

Bloom differs from other language models in that it is trained simultaneously on 46 languages, drawn from sources as diverse as literature, scientific articles and sports reports, and covers many languages rarely taken into account, in particular about twenty African languages. The training set even contains computer code! Altogether, this represents the equivalent of several million books. The greater the variety of approaches and sources, the greater the model's ability to perform diverse tasks. The data is also deliberately not separated by language because, perhaps counter-intuitively, Bloom learns better this way: mixing content in different languages produces an effective model for every language considered, one that often outperforms monolingual models. Another feature: Bloom's architecture, the list of data used and its full training history will be available in open access, to facilitate research on language models. Finally, Bloom is distributed free of charge under a Responsible AI License that expressly prohibits malicious uses of the model.

Languages used in Bloom's training.
The "Indic family" includes around fifteen languages from the Indian subcontinent (Hindi, Tamil, Urdu, etc.) and the "Niger-Congo family" around twenty languages from sub-Saharan Africa (Swahili, Yoruba, Wolof, etc.). 10.8% of the data consisted of computer code, in 13 programming languages.
Source: Hugging Face

Antoine Petit, Chairman and CEO of the French National Centre for Scientific Research (CNRS), adds:

"We are delighted with this original public-private partnership, which demonstrates how much the pooling of skills and resources, such as the power of the Jean Zay supercomputer, is necessary to meet a challenge as important and topical as research in artificial intelligence. Beyond the scientific advances, we salute the involvement of the IDRIS teams who made training on this supercomputer possible, and the essential role played by the CNRS in mobilizing the entire AI community."
