GenAI made a big splash when ChatGPT launched on November 30, 2022. Since then, the drive for more powerful models has changed the way we approach hardware, data centers, and energy usage. The development of foundation models is still moving at a fast pace. One of the challenges in high-performance computing (HPC) and technical computing is figuring out where GenAI fits and, more importantly, what it means for future discoveries.
The major strain on resources so far has come from developing and training large AI models. The inference market, which involves deploying trained models, will likely require different hardware and is expected to be much larger than the training market.
When it comes to HPC, there are some big questions. For example:
- How can HPC benefit from GenAI?
- How does GenAI fit in with traditional HPC tools and applications?
- Can GenAI write code for HPC applications?
- Can GenAI understand science and technology?
The answers to these questions are still being worked out, and many organizations, including the Trillion Parameter Consortium (TPC), are exploring the role of GenAI in science and engineering.
One challenge with large language models (LLMs) is that they sometimes provide incorrect or misleading answers, often referred to as “hallucinations.” For example, when asked a basic chemistry question like “Will water freeze at 27 degrees F?” one model gave an answer that was clearly wrong (water freezes at 32 degrees F, so the correct answer is yes). If GenAI is going to be used in science and technology, the models need to be improved.
So, what about using more data? The “intelligence” of early LLMs was improved by feeding them more data, which made the models larger and required more resources. Some benchmarks suggest these models have gotten smarter, but there is an issue: scaling models means finding more data. Much of the internet has already been scraped for training data, and the success of LLMs has also produced a flood of AI-generated content, such as news articles, summaries, and social media posts. Estimates suggest that about 10–15% of the internet’s textual content is now AI-generated, and by 2030 AI-generated content could make up more than half of it.
However, there’s a risk with this approach. When LLMs are trained on data generated by other AI models, their performance can degrade over time, a phenomenon known as “model collapse.” This could lead to a cycle where AI-generated content is continually used as input for future models, creating a feedback loop of poor-quality data.
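To see why this is a concern, consider a deliberately simple toy sketch, not any real LLM training pipeline: a histogram “model” is fit to data, the next generation is trained only on samples drawn from that model, and the process repeats. The sample sizes, bin choices, and metrics below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
bins = np.linspace(-5.0, 5.0, 101)   # fixed support for the histogram "model"

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(size=5_000)

for generation in range(20):
    # "Train" the model: estimate the distribution with a simple histogram.
    counts, _ = np.histogram(data, bins=bins)
    probs = counts / counts.sum()

    # Track how much of the original spread survives in this generation.
    print(f"gen {generation:2d}: std={data.std():.3f}, "
          f"occupied bins={np.count_nonzero(counts)}")

    # The next generation sees only synthetic samples from the current model,
    # mimicking a web increasingly filled with AI-generated content.
    centers = (bins[:-1] + bins[1:]) / 2
    data = rng.choice(centers, size=5_000, p=probs)
```

Running the loop shows the number of occupied bins shrinking generation after generation: rare values disappear first and never come back, a crude analogue of a model losing the tails of its original training distribution.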
Recently, tools like OpenAI’s Deep Research and Google’s Gemini Deep Research have made it easier for researchers to create reports by suggesting topics and conducting research. These tools gather information from the web and generate reports, which then become part of the data used to train future models.
What about the data generated by HPC? HPC already produces massive amounts of data. Traditional HPC focuses on crunching numbers for mathematical models, and this data is typically unique, clean, and accurate. It is also highly tunable: simulation inputs can be swept to produce data shaped to a specific need, so the possibilities for generating training data are almost limitless.
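As a minimal sketch of what that looks like, the snippet below uses a toy damped-cosine formula as a stand-in for a real simulation code and sweeps its physical parameters to manufacture a labeled training set. The function, parameter ranges, and file name are all illustrative assumptions.

```python
import numpy as np

def damped_oscillator(omega, zeta, t):
    """Toy "simulation": a damped cosine. A real HPC code would go here."""
    omega_d = omega * np.sqrt(1.0 - zeta**2)
    return np.exp(-zeta * omega * t) * np.cos(omega_d * t)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 200)

# Sweep the physical parameters to produce as many labeled examples as needed:
# inputs are (omega, zeta) pairs, targets are the resulting trajectories.
n_samples = 10_000
omegas = rng.uniform(0.5, 3.0, n_samples)
zetas = rng.uniform(0.05, 0.5, n_samples)
inputs = np.stack([omegas, zetas], axis=1)                    # shape (10000, 2)
targets = np.array([damped_oscillator(w, z, t)
                    for w, z in zip(omegas, zetas)])          # shape (10000, 200)

np.savez("oscillator_training_set.npz", inputs=inputs, targets=targets, t=t)
```

Unlike scraped web text, every one of these examples is exactly labeled, because the ground truth comes from the numerical model itself.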
A large-scale example of this approach is Microsoft’s Aurora, an AI weather model (not to be confused with Argonne’s Aurora supercomputer) trained on more than a million hours of diverse meteorological and climate data, with training datasets ranging from hundreds of terabytes to a petabyte in size. Because a trained model is simply evaluated rather than numerically integrated like a traditional forecast, Aurora delivers a massive increase in computational speed, and the work shows that training on a diverse range of data improves accuracy.
In science and engineering we already work with numbers, vectors, and matrices, so the goal is not to predict words as LLMs do, but to predict numbers, using Large Quantitative Models (LQMs). Building an LQM is more complex than building an LLM: it requires a deep understanding of the system being modeled, access to large datasets, and sophisticated computational tools. LQMs can be used across industries such as life sciences, energy, and finance to simulate different scenarios and predict outcomes far more quickly than traditional models.
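As a heavily simplified sketch of the number-predicting idea, and not a description of how any particular LQM is actually built, the snippet below trains a small neural-network regressor on outputs of the same toy oscillator used above, so that evaluating the learned model can stand in for rerunning the simulation. The network size and parameter ranges are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def simulate(omega, zeta, t=5.0):
    """Toy stand-in for an expensive simulation: a damped cosine at time t."""
    omega_d = omega * np.sqrt(1.0 - zeta**2)
    return np.exp(-zeta * omega * t) * np.cos(omega_d * t)

rng = np.random.default_rng(2)
omega = rng.uniform(0.5, 3.0, 5_000)
zeta = rng.uniform(0.05, 0.5, 5_000)
X = np.stack([omega, zeta], axis=1)   # inputs: physical parameters
y = simulate(omega, zeta)             # targets: numbers, not words

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small "quantitative model": a neural network mapping parameters to outcomes.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X_train, y_train)

# Once trained, evaluating the surrogate is far cheaper than rerunning the simulation.
print("R^2 on held-out parameters:", surrogate.score(X_test, y_test))
```

Replacing an expensive numerical kernel with a cheap learned approximation is the pattern that lets quantitative models explore scenarios far faster than the original simulations.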
Data management remains a challenge. While building an AI model may be computationally feasible without GPUs, it is nearly impossible without proper storage and data management, and a large portion of the time spent on data science projects goes into managing and processing data rather than running the models themselves.
In AI, there is a concept known as the “Virtuous Cycle,” described by Andrew Ng: AI companies use data generated by user activity to improve their models, better models attract more users, and more users generate even more data. This creates a self-reinforcing loop that accelerates progress.
A similar cycle exists in scientific and technical computing, where HPC, big data, and AI reinforce one another. Scientific research generates vast amounts of data, which feeds AI models that find patterns and make predictions; those predictions then guide new simulations and experiments, and the cycle continues.
As the Virtuous Cycle accelerates, it is driving advances in science and technology, but there are challenges to consider. The increasing demand for data and computational power raises questions about resource sustainability. There is also the risk that the cycle could eventually “eat its own tail,” with models increasingly trained on their own synthetic output rather than on fresh, high-quality data.
The new era of HPC will likely be built on LLMs, LQMs, and other AI tools trained on data derived from both numerical simulation and real-world measurement. As the cycle accelerates, big data and quantum computing will become even more important for training the next generation of models.
Despite the progress, questions and challenges remain. The growing demand for resources will keep pressure on sustainability, and the effectiveness of the Virtuous Cycle will depend on whether it can continue to generate value without limiting itself.