2023 saw the rise of generative AI. At the time, GPT-4 dominated benchmarks, open-source models were lagging behind proprietary ones, and AI safety was becoming a mainstream concern. Within a year, the field has evolved considerably, at an intense rhythm of important news and progress.
In this two-part article series, we will explore the key GenAI trends of 2024 and share our predictions for the year ahead.
This first part looks back at the most significant advancements in terms of research and innovation.
Research & Emerging Trends
Models’ Performances
Overall model performance continues to increase
In 2024, research in artificial intelligence saw remarkable advancements. Even though GPT-4o did not outperform GPT-4 by as wide a margin as GPT-4 outperformed GPT-3.5, progress is still very clear for other LLM families such as Llama, as well as for VLMs [0]. Besides, LLMs now perform far better on some historically difficult tasks, such as reasoning: OpenAI’s o3 model, for instance, achieves 25% accuracy on the FrontierMath benchmark, against 3% for its predecessor o1. While OpenAI still leads on many benchmarks, many new models have been released, each often surpassing the previous one, or at least matching its performance with lighter models (a centralized and up-to-date leaderboard is available in [1]).
Performance gap between proprietary and open-source models decreases
Open-source models also had their moment, especially when Meta released Llama 3.1, whose biggest version outperforms other state-of-the-art proprietary models across a wide range of benchmarks, thus closing the gap with proprietary alternatives (see fig. 1 below). Yet, Llama models are arguably not truly open-source since, for instance, their training data is not public.
Gen AI Challenges Ahead
Despite these gains, the overall performance of AI models seems to be reaching a plateau and some challenges must be addressed to drive further improvements.
Maximum amount of available data almost reached
As seen in fig. 1, performance gains have been declining over time, and improvements based on the sheer amount of training data are diminishing as the data available for model training nears its limit (see fig. 2).
The rise of AI-generated data
Another important aspect is the growing proportion of AI-generated data on the internet (blog posts, images, etc.). By distorting the true underlying data distribution (roughly speaking, adding data that is not coherent with real data), AI-generated data bring limited training value and can even degrade model performance [2]. Consequently, basing GenAI performance mainly on the amount of data used is becoming less and less relevant.
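To build intuition for this degradation, here is a toy simulation in the spirit of the model-collapse results in [2] (not the paper’s actual experiment): each “generation” fits a simple Gaussian model to the previous generation’s outputs, and finite-sample estimation errors compound, so the distribution drifts away from the real one.

```python
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=200)  # "human" data

data = real_data
stds = []
for generation in range(10):
    mu, sigma = data.mean(), data.std()   # "train" a Gaussian model
    stds.append(sigma)
    # The next generation trains only on the previous model's outputs:
    data = rng.normal(loc=mu, scale=sigma, size=200)

# The fitted scale drifts away from the true value of 1.0 over generations
print([round(s, 2) for s in stds])
```

With only 200 samples per generation, the estimated parameters random-walk away from the originals; mixing real data back in at each step dampens the drift, which is one argument for curating human-generated data.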
The increasing importance of data quality
As seen above, the quality of training data is now more critical than ever for achieving further performance gains. This concern has led to various research efforts in the field, such as the release of a high-quality dataset by HuggingFace [3].
HuggingFace also shared in their blog post [4] their estimate of the evolution of contaminated data in Common Crawl (a corpus of data scraped from the internet) over the last few years, highlighting that it has increased significantly since ChatGPT’s release (see fig. 3).
Most notably, they have shown how cleaner data leads to better results, by comparing the performance of language models trained on FineWeb (a high-quality dataset) against those trained on other popular but lower-quality datasets like C4, The Pile, and RedPajama. Models trained on FineWeb consistently outperformed their counterparts across various benchmark tasks.
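As an illustration of what such curation involves, here is a minimal sketch of a heuristic quality filter in the spirit of web-scale pipelines like FineWeb’s (the thresholds and rules are illustrative, not the actual FineWeb filters):

```python
def passes_quality_filter(doc: str, min_words: int = 20,
                          max_dup_ratio: float = 0.3) -> bool:
    """Keep a document only if it is long enough and not too repetitive."""
    words = doc.split()
    if len(words) < min_words:
        return False                       # drop near-empty pages
    dup_ratio = 1 - len(set(words)) / len(words)
    return dup_ratio <= max_dup_ratio      # drop boilerplate-heavy pages

# A repetitive, SEO-spam-like page is filtered out; a normal paragraph passes:
spam = "best cheap deals " * 20
article = ("The study compares language models trained on several datasets "
           "and finds that careful filtering of web data improves accuracy "
           "on downstream benchmarks without extra compute.")
print(passes_quality_filter(spam), passes_quality_filter(article))
```

Real pipelines stack many such heuristics (language detection, deduplication, URL blocklists) and validate each one by training small models on the filtered output, as the FineWeb team did.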
With less new data available to train models and a remaining margin for improvement in terms of data quality, the latter is likely to be a focus for building efficient language models in the coming year.
Increased Efforts on Efficiency, Models’ Size & Inference Speed
Efficiency and model size are new major goals
Accuracy remains the main metric on most benchmarks, but aspects such as efficiency and model size are increasingly prominent goals. These metrics are closely monitored as they indicate how effective models are at scale and whether they can be deployed in resource-limited environments, enabling better privacy, on-device inference, frugality, and cost-efficiency.
Notably, a lot of progress has been made in this direction this past year as some models or families of models were designed to be small yet efficient such as HuggingFace’s SmolLM, Microsoft’s Phi-3.5, and Google’s Gemma family.
The growing interest in model quantization
Another very important lever for resource efficiency and size is model quantization. This technique reduces the memory footprint of models, yielding more efficient computation without significant performance trade-offs.
Even if quantization was adopted quite early for LLMs [5], it remains an active area of research, with promising advancements in 2024. Notably, Microsoft has explored the potential of 1-bit quantization (1.58 bits actually, with ternary parameters), pushing the technique to a new extreme while keeping the performance loss reasonable [6].
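As a rough sketch of the idea (using the “absmean” rounding described for 1.58-bit models; real implementations quantize per layer and during training):

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize a weight matrix to ternary values {-1, 0, 1}.

    Scales by the mean absolute weight, then rounds and clips, so each
    parameter needs only log2(3) ~ 1.58 bits instead of 16 or 32.
    """
    scale = np.mean(np.abs(w)) + 1e-8          # avoid division by zero
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q.astype(np.int8), float(scale)

def dequantize(w_q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original weights."""
    return w_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 8)).astype(np.float32)
w_q, scale = ternary_quantize(w)
print(np.unique(w_q))        # only values from {-1, 0, 1}
```

Beyond the smaller memory footprint, ternary weights replace most multiplications in a matrix product with additions and subtractions, which is where much of the inference speed-up comes from.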
The emergence of new models’ architectures
Hybrid models combining Transformer-based large language models (LLMs) with State-Space Models (SSMs) — dynamical systems using state variables and equations to model their evolution over time — are emerging as a solution to achieve state-of-the-art performance with higher throughput efficiency.
Jamba 1.5 is an example of such a model, claimed to be 2.5x faster and to handle longer contexts than leading competitors, while maintaining similar performance [7].
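To make the SSM idea concrete, here is a minimal sketch of a discrete linear state-space recurrence (a simplification: models like Jamba use selective, learned SSM layers interleaved with attention):

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal discrete linear state-space model.

    State update:  x_t = A @ x_{t-1} + B @ u_t
    Output:        y_t = C @ x_t
    Runs in O(sequence_length), unlike self-attention's O(length^2).
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:                    # sequential scan over the sequence
        x = A @ x + B @ u_t
        ys.append(C @ x)
    return np.stack(ys)

# Toy example: 1-D input sequence, 2-D hidden state
A = np.array([[0.9, 0.0], [0.1, 0.8]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, 1.0]])
u = np.ones((5, 1))                  # sequence of length 5
y = ssm_scan(u, A, B, C)
print(y.shape)                       # (5, 1)
```

Because the state is a fixed-size vector carried along the sequence, memory does not grow with context length, which is what enables the longer-context, higher-throughput claims.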
Developing Context-Awareness & Reasoning
In addition to model size and efficiency, methods like agentic workflows and Retrieval-Augmented Generation (RAG) are also expanding, enabling increased performance and a broader range of use cases.
Agentic workflows
Agentic AI refers to systems composed of models able to interact with each other or with third-party products. As these interactions between models can be very diverse — planning actions, reviewing outputs, simplifying inputs, or setting new goals — it is now possible to create complex chains of thought spanning multiple systems, hence developing reasoning.
Outstanding applications were presented during the year, such as a video game coded solely by AI agents for less than one dollar [8], highlighting the potential of this new paradigm. Supporting this trend, software vendors are increasingly integrating GenAI models into their products. One striking example is Salesforce’s Agentforce: a fully integrated GenAI agent within Salesforce’s CRM system [9].
Agentic AI is expected to be a major trend for 2025, as it brings action and reasoning capacities to the table, serving as a true enabler of automation.
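As a minimal sketch of such a workflow — here a hypothetical planner/reviewer loop where `llm` stands in for any chat-completion call (the names and prompts are illustrative, not a specific framework’s API):

```python
from typing import Callable

def make_agent(role: str, llm: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an LLM call with a fixed role prompt."""
    return lambda task: llm(f"You are a {role}. Task: {task}")

def run_workflow(goal: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Planner proposes, reviewer critiques, until approval or round limit."""
    planner = make_agent("planner that outputs a step-by-step plan", llm)
    reviewer = make_agent("reviewer that replies APPROVED or a critique", llm)
    draft = planner(goal)
    for _ in range(max_rounds):
        feedback = reviewer(draft)
        if "APPROVED" in feedback:
            break
        draft = planner(f"{goal}\nRevise using this feedback: {feedback}")
    return draft

# Usage with a trivial stub LLM (a real system would call a model API):
stub = lambda prompt: "APPROVED" if "reviewer" in prompt else f"plan for: {prompt[-40:]}"
result = run_workflow("ship a small game demo", stub)
print(result)
```

Real agentic frameworks add tool calling, memory, and guardrails around this basic propose/critique loop, but the control flow is essentially the same.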
Retrieval Augmented Generation (RAG)
By indexing internal data and making it available to an LLM, RAG enables context-aware inference, making the model’s answers specific to that context. This method addresses one of the biggest weaknesses of LLMs when it comes to their use in companies. That is why LLMs combined with RAG have become one of the most widely adopted use cases in companies, further boosting the integration of LLMs in professional environments. Perplexity is one example of a start-up leveraging these two techniques for online search.
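The core mechanism can be sketched in a few lines — here with a toy bag-of-words similarity in place of the neural embeddings and vector database a production RAG system would use:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use neural encoders."""
    return Counter(w.strip(".,?!") for w in text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble the augmented prompt an LLM would receive."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days.",
    "The office cafeteria opens at 8am.",
    "Refunds are issued to the original payment method.",
]
prompt = build_prompt("What is the refund policy?", docs)
print(prompt)
```

The LLM then answers from the retrieved passages rather than from its parameters alone, which is what keeps responses grounded in company-specific data.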
Multimodal Models
Another significant trend this past year was the rise of multimodal models which are systems that integrate text, image, video and audio inputs and/or outputs.
While most industrial advancements have focused on the interaction between image and text, OpenAI’s Sora created a sensation by generating one-minute-long coherent videos from a prompt, hinting at major innovations in this area in the upcoming months. Some competitors have already emerged, like Google’s Veo 2, which offers state-of-the-art generation capabilities according to human evaluations [10].
China’s Place In The Race
As underlined by Nathan Benaich’s state of AI report [11], China has also made notable strides in AI, particularly with language models. Despite sanctions and restrictions on access to cutting-edge hardware, Chinese AI research continues to make gains, particularly in NLP and model development, challenging the dominance of Western-developed models.
Recently, Qwen2.5-Coder outranked GPT-4o and o1-preview in LiveBench’s coding category, and Qwen2-VL-72B currently leads the OpenVLM leaderboard, showing how competitive Chinese models can be.
A Broader Scientific Perspective
AI for fundamental research
In the broader scientific and technical landscape, AI is making substantial contributions beyond classical areas such as NLP or image processing. Indeed, a 2024 study on AI’s impact on scientific discovery revealed that AI-assisted researchers discovered 44% more materials, and that this led to a 39% increase in patent filings [12, 13]. According to the same study, this would also increase downstream innovation by 17%.
Another good example is DeepMind’s FunSearch, “a method to search for new solutions in mathematics and computer science” [14]. Based on LLMs, it has unlocked new discoveries in the mathematical sciences.
AI for robotics
Robotics, another area benefiting from AI research, is progressing rapidly. For instance, Google DeepMind’s AutoRT system uses a Vision-Language Model (VLM) to understand the robot’s environment and an LLM to suggest consecutive tasks [15]. Robotics has also become more accessible with open-source tools like HuggingFace’s LeRobot, which hosts pretrained models and datasets of human-collected demonstrations designed for real-world robotics [16].
2024 Trends Summary
The combination of these advancements — ranging from open-source accessibility and multimodal capabilities to innovations in quantization and scientific applications — reflects a year of diversified progress in AI research.
Notably, 2024 saw the first Nobel Prize awarded in AI, highlighting the field’s growing impact on scientific and societal advancement. The Nobel Prize in Physics was indeed awarded to Geoffrey Hinton and John Hopfield for their pioneering work in artificial neural networks, which laid the foundation for modern machine learning and AI technologies.
Our Predictions For 2025
Agentic Development Will Be The Hottest Topic
As seen previously, 2024 brought great advancements in AI agents, demonstrating their feasibility and, more importantly, their high relevance in many real-world contexts.
This trend will accelerate significantly in 2025 due to two main factors.
- The development and improvement of multimodal models: Being able to receive and generate multiple types of inputs and outputs, these models will widen the range of interactions with tools, systems, and applications. As an example, the newly released Gemini 2.0 can process and generate text, images, audio, and video. This capability enables efficient user interactions across different platforms such as Google TV, where users can ask out loud about the latest movies or general topics, and Google TV responds with relevant content from sources like YouTube [17].
- The development of frameworks to create AI agents seamlessly: The newly released HuggingFace SmolAgents library [18] highlights this trend by simplifying code-agent creation with features like search tools and dynamic code execution.
Even more recently, at CES 2025, Nvidia introduced agentic AI blueprints to help developers build AI agents to solve complex tasks such as searching and summarizing videos [19].
Having such a major actor release this at one of the most important tech events of the year highlights the importance of the area.
New criteria and tests will be used for performance evaluation
Previously, benchmarking mainly consisted of computing performance on a few very popular tests such as MMLU for general tasks or HumanEval for coding tasks. As top models approach 90% accuracy with narrowing performance gaps [0] [1], new tests are starting to be adopted as standards, and new criteria will then be assessed.
One important new criterion is reasoning. The idea of creating intelligence with true reasoning is no longer just an ambitious dream, as models like OpenAI’s o1 and, more recently, o3 are making significant progress in this direction. To assess these capabilities, new tests are being promoted, such as ARC-AGI [20] (a test measuring the efficiency of AI skill acquisition on unknown tasks) and the GPQA Diamond benchmark [21] (a test of PhD-level science questions). In fact, these tests were the main indicators of o3’s capabilities during its announcement at the “12 Days of OpenAI” event [22].
As seen previously, agentic models are becoming increasingly important in the GenAI landscape. Consequently, assessing their agentic capabilities is a growing concern. That is why METR (Model Evaluation & Threat Research), a nonprofit organization dedicated to assessing the autonomous capabilities and potential risks of advanced AI systems, was founded in 2023.
To address this issue, METR [23] offers tests like its Autonomy Evaluation Resources, designed to assess AI systems’ ability to complete complex, multi-hour tasks without human intervention. These tests have already been used to evaluate Claude 3.5 Sonnet, and such methods are likely to become standard.
GenAI will accelerate robotics development
As seen previously, multimodal models will enable a wide range of new use cases, with robotics being one of the most significantly impacted fields. Indeed, such models make features increasingly human-like, able to receive multiple types of inputs (audio, images, videos) and to generate multiple types of outputs (mainly movements, audio, and text).
On top of that, major actors such as Nvidia have led great innovations in GenAI for robotics. One recent example is Cosmos, their open-source family of world foundation models [24], which enables the generation of synthetic data for robotics [25].
This trend can already be observed with robotics pioneer Boston Dynamics, which has implemented GenAI in its robots using Nvidia’s Isaac platform [26].
Conclusion
In this first part of our 2024 GenAI retrospective, we focused on the main innovations in the field. Over a year full of innovation, models have become not only more performant and efficient, but also more adaptable to various contexts thanks to multimodality and related techniques like RAG. This makes GenAI models even more relevant in various environments, from companies to fundamental research.
This suggests that 2025 will be the year of agentic models, requiring new tests to evaluate their capabilities. Additionally, with the rise of multimodal models, certain fields — most notably robotics — will experience significant transformation.
Such innovations will inevitably have a profound impact on our society across different levels, including economic, political, and safety aspects. To gain a clearer understanding of the current and future impact, we invite you to read our second article dedicated to these topics.
Credits: Théophile Loiseau, Louri Charpentier
References
[0] LLM Benchmarks: Overview, limits and model comparison. (2024). Retrieved from https://www.vellum.ai/blog/llm-benchmarks-overview-limits-and-model-comparison
[1] Chatbot Arena (formerly LMSYS): free AI chat to compare and test the best AI chatbots. (n.d.).
[9] AgentForce: Create powerful AI agents. (n.d.-b).
[10] Veo 2. (2024, December 17).
[11] Benaich, N. (2024, October 10). State of AI report.
[13] Aidan Toner-Rodgers (2024). Artificial Intelligence, Scientific Discovery, and Product Innovation.
[15] Shaping the future of advanced robotics. (2024, December 17).
[16] lerobot (LeRobot). (2024, July 16)
[22] OpenAI o3 breakthrough high score on ARC-AGI-Pub. (n.d.).
[23] Metr. (n.d.).
[25] Liu, M. (2025, January 11). Cosmos World Foundation models openly available to physical AI developers | NVIDIA blog. Retrieved from https://blogs.nvidia.com/blog/cosmos-world-foundation-models/
[26] Andrews, G. (2024). Following the prompts: Generative AI Powers Smarter Robots with Nvidia Isaac platform. Retrieved from https://blogs.nvidia.com/blog/generative-ai-robotics-isaac/