October 17, 2024

Generative AI in Italy

Insights from our latest Masterclass

Generative artificial intelligence has captured widespread attention in recent years, thanks to its capacity to surprise and its impressive pace of progress. In our latest Masterclass, "Generative AI in Italy: New Models and Benchmark Evaluations," our experts Enrico Bertino, Aldo Cocco, and Matteo Muffo explored the milestones that led to modern language models, with a particular focus on advancements in Italy.

What is Generative Artificial Intelligence?

Generative AI refers to the use of AI to create new content, such as text, images, audio, or video. In recent years, we've witnessed incredible advances in each of these areas. In image generation, for instance, we've gone from early systems like DALL-E (paired with CLIP, which ranks candidate outputs rather than generating them), which could produce simple images like "a chair made of avocado," to advanced models like Stable Diffusion and Midjourney, whose outputs make it difficult to distinguish reality from fiction.

In this article, we will focus on language models, which are models that generate text based on an input, commonly referred to as a prompt.

Milestones in Language Model Development

GPT-3: The Beginning of a New Era

Released in 2020 by OpenAI, GPT-3 marked a turning point in the field of language models. With 175 billion parameters, it was among the first models to demonstrate surprising zero-shot and few-shot capabilities, that is, the ability to perform tasks given no examples or only a handful.
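Few-shot learning is easiest to see in the shape of the prompt itself. The sketch below assembles a minimal few-shot classification prompt; the reviews, labels, and task wording are invented for illustration, not taken from any real benchmark:

```python
# A few-shot prompt embeds a handful of worked examples before the new
# input, so the model can infer the task from the pattern alone.
examples = [
    ("Il film era fantastico!", "positive"),
    ("Che delusione, non lo consiglio.", "negative"),
]

def few_shot_prompt(examples, new_input):
    """Assemble a few-shot sentiment-classification prompt
    from (text, label) pairs plus one unlabeled input."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {new_input}")
    lines.append("Sentiment:")  # the model completes from here
    return "\n".join(lines)

print(few_shot_prompt(examples, "Un capolavoro assoluto."))
```

With zero examples in the list, the same template becomes a zero-shot prompt: the model must rely on the task description alone.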

ChatGPT and Ethical Alignment

Later, with the introduction of ChatGPT, based on GPT-3.5, the emphasis shifted to aligning models with ethical values and producing coherent responses, thanks to Reinforcement Learning from Human Feedback (RLHF), a technique that fine-tunes the model on human preference judgments. This allowed the models to give answers that are both accurate and ethically appropriate.

GPT-4 and Multimodal Training

In 2023, OpenAI released GPT-4, introducing multimodal training. This enabled the models to handle text and images, further expanding their capabilities and making them more versatile. GPT-4 demonstrated notable improvements in reasoning, problem-solving, and contextual understanding.

o1: A New Chapter

Recently, the o1 model was introduced, opening a new chapter in language model development. o1 shifts part of the computational effort from training to the generation phase: before answering, the model reasons through detailed logical steps. This allows it to solve complex logic problems, approaching or even surpassing human performance in specific domains.

Language Models and Language Handling

A crucial aspect of language models is how they manage different languages. These models are trained on vast amounts of text data, but the majority of that data is in English—often more than 90%. This creates significant challenges for less represented languages like Italian.

Tokenization and Language Challenges

Tokenization is the process where text is broken down into "tokens," units of information that can be entire words or parts of them. Common words in training data, typically in English, are represented with fewer tokens, making the model more efficient in that language. In contrast, Italian words often require more tokens, reducing the model’s efficiency and accuracy in our language.

For example, an English text might be tokenized into 342 tokens, while the same text translated into Italian might require 448. This impacts computational efficiency and can also affect costs when using services based on the number of tokens processed.
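The mechanism behind this gap can be illustrated with a toy greedy tokenizer. The vocabulary below is invented and deliberately skewed toward English, mimicking what happens when a real BPE vocabulary is learned from mostly-English data: frequent English words become single tokens, while Italian words must be assembled from smaller pieces.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization: at each position, consume the
    longest vocabulary entry that matches; any character not covered by
    the vocabulary becomes a single-character token."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try longest pieces first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# Toy vocabulary: English words are whole tokens, Italian words are not.
vocab = {"the", " chair", " is", " made", " of", " avocado",
         "la", " se", "dia", " è", " fat", "ta", " di"}

en = tokenize("the chair is made of avocado", vocab)
it = tokenize("la sedia è fatta di avocado", vocab)
print(len(en), en)  # 6 tokens
print(len(it), it)  # 8 tokens for the same sentence in Italian
```

The Italian sentence consumes more tokens for the same meaning, which is exactly the efficiency and cost penalty described above, just at miniature scale.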

Language Models in Italy

Despite the challenges, the Italian community has started developing language models specifically for our language. Here are some of the most significant projects.

Minerva

Minerva is a model developed by Sapienza University of Rome, which starts from pre-trained multilingual models and continues training on Italian data. This approach aims to combine the benefits of multilingual models with specialization in our language. Trained entirely in Italy on a dataset of over 500 billion words, it relies on Cineca's Leonardo supercomputer for hardware.

Modello Italia

Developed by iGenius and Cineca, a consortium of 70 Italian universities, Modello Italia is a large language model designed for automating public administration in Italy and Europe. Like Minerva, Modello Italia also leverages the Leonardo supercomputer.

Multilingual Models vs. Vertical Italian Models

A fundamental question is whether it’s better to use powerful multilingual models or vertical models specific to Italian.

Model Evaluation

To answer, it’s essential to evaluate model performance. Evaluating language models is complex and can follow different approaches.

LLM as Judge

Using a language model to evaluate another model’s responses. This method can introduce bias, especially if the evaluating model is not more powerful than the one being evaluated.
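The core of an LLM-as-judge setup is the comparison prompt sent to the evaluating model. The sketch below shows one plausible shape for it; the wording, the example question, and the omitted model call are all illustrative, not a specific product's API:

```python
def judge_prompt(question, answer_a, answer_b):
    """Build a pairwise-comparison prompt for a judge model.
    In practice the A/B order is randomised per comparison,
    since judge models show a measurable position bias."""
    return (
        "You are an impartial judge. Compare the two answers below "
        "and reply with 'A', 'B', or 'tie'.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Verdict:"
    )

prompt = judge_prompt("Qual è la capitale d'Italia?", "Roma.", "Milano.")
# The prompt would then be sent to a judge model, ideally one stronger
# than either candidate; its one-token verdict feeds the evaluation.
print(prompt)
```

This also makes the bias risk concrete: the verdict is only as reliable as the judge model reading this prompt.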

Direct Inspection

Involving human experts in evaluating responses. This approach is accurate but not scalable for large amounts of data.

Static Benchmarks

Using predefined datasets to test performance. However, these can quickly become obsolete and may not reflect the latest capabilities of the models.

Chatbot Arena

An interesting project is Chatbot Arena, developed by the LMSYS Org, a group of researchers from UC Berkeley and other universities. This platform lets users submit a prompt and compare the responses of two anonymous models side by side. Users vote for the better response, contributing to a dynamic, continuously updated ranking.

The Results for the Italian Language

Which are the best language models for the Italian language?

To answer this question, several language models were tested on Italian tasks drawn from the INVALSI tests. The results show that the differences between models depend mainly on their size and training data. At the top of the ranking is Anthropic's Claude 3.5 Sonnet, which stands out with a score of 92.2, surpassing OpenAI's models. Close behind are Claude 3 Opus and MistralAI's Mistral-Large-Instruct-2407. Unfortunately, the first Italian model in the ranking, LLaMAntino, scores 56.6, highlighting a significant gap compared to the leaders.


Challenges and Future Prospects

Current Challenges

The performance gap is due to various factors. First, the amount of training data plays a crucial role. Multilingual models are trained on enormous amounts of data, often using terabytes of text from various sources, allowing them to acquire a deeper and broader understanding of language.

Another critical aspect is model architecture. Models like GPT-4 have more advanced architectures and larger sizes, with hundreds of billions of parameters. This complexity allows them to capture linguistic nuances and complex semantic structures that smaller, less advanced models cannot replicate.

Moreover, transfer learning significantly contributes to the performance of multilingual models. The ability to transfer knowledge from one language to another benefits these models, especially in tasks with common semantic structures. This means that learning in one language can improve performance in another, expanding the model’s capabilities without the need for specific data for each language.

The Role of Synthetic Data

One potential solution to bridge the gap is using synthetic data generated artificially. This approach could increase the amount of training data available for Italian, leveraging existing models to create new datasets.

Data augmentation techniques, such as expanding existing datasets through paraphrasing, back-translation, and controlled generation, can enrich the Italian data corpus.

Another strategy is self-training, which involves using pre-trained models to annotate unlabeled data, thereby creating new examples for training. These techniques can significantly improve model performance in Italian, providing them with more data to learn from and refine their capabilities.
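The self-training loop described above can be sketched in a few lines. The `predict` function here is a stub standing in for a real pre-trained model, and the 0.9 confidence threshold is an illustrative choice; the point is the filtering step, which keeps only pseudo-labels the model is confident about:

```python
def predict(text):
    """Stub for a pre-trained classifier returning (label, confidence).
    This toy version just flags exclamation marks as 'positive';
    a real system would call an actual model here."""
    if "!" in text:
        return "positive", 0.95
    return "negative", 0.60

def self_train_labels(unlabeled, threshold=0.9):
    """Pseudo-label unlabeled Italian text and keep only the
    high-confidence predictions as new training examples."""
    new_examples = []
    for text in unlabeled:
        label, confidence = predict(text)
        if confidence >= threshold:
            new_examples.append((text, label))
    return new_examples

corpus = ["Ottimo servizio!", "Non saprei dire."]
print(self_train_labels(corpus))  # [('Ottimo servizio!', 'positive')]
```

The threshold is the crucial design choice: set too low, the model trains on its own mistakes; set too high, too little new Italian data survives the filter.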

Collaborations and Computational Resources

To compete with international giants, it’s essential to increase public and private investments, boosting funds for research and development in AI. Academic and industrial collaborations can create synergies between universities, research institutes, and AI tech companies, accelerating progress in the field. Additionally, access to advanced computational resources is essential. Using supercomputers like Cineca's Leonardo to train large-scale models can position Italian researchers to develop cutting-edge models capable of competing effectively on a global scale.

The competition between multilingual models and vertical Italian models is still open. While multilingual models currently lead in performance, Italian research is making significant strides. Only by training models of comparable size and capacity to the international giants will we learn whether vertical Italian models can compete effectively.

It is crucial to continue investing in collecting quality data, developing new architectures, and optimizing computational resources. The path of generative AI in Italy is challenging, but with everyone's contribution, we can achieve significant milestones.

FAQ

What are the challenges in training language models in Italian?

The main challenges include the limited availability of training data in Italian, as most of it is in English. This leads to inefficient tokenization, with Italian words often requiring more tokens than English, reducing efficiency and increasing computational costs. Additionally, limited computational resources pose a significant obstacle, as training large models requires advanced infrastructure like supercomputers, which are not always easily accessible.

What are the main language models developed in Italy, and what are their goals?

Among the most significant models developed in Italy are Minerva and Modello Italia. Minerva, developed by Sapienza University of Rome, aims to combine the advantages of multilingual models with specialization in the Italian language, using Cineca's Leonardo supercomputer. Modello Italia, created by iGenius and Cineca, is designed to automate processes in Italian and European public administration, also leveraging the capabilities of the Leonardo supercomputer.

How are language model performances evaluated, and what methods are used?

Language model performance is evaluated through several methods. One is using a language model to evaluate another’s responses, although this can introduce bias, especially if the evaluating model is not more powerful than the evaluated one. Direct inspection involves human experts for an accurate evaluation but is not scalable for large amounts of data. Static benchmarks use predefined datasets to test performance, but these can quickly become obsolete. Platforms like Chatbot Arena allow dynamic comparisons between models through user feedback, offering an updated ranking of performances.

Don't take our word for it
Try indigo.ai, in a few minutes, without installing anything and see if it lives up to our promises
Try a Demo