Artificial intelligence is driving significant changes in how public sector entities and academia address everything from cybersecurity to healthcare to disaster response. Predictive analytics for fleet maintenance, global supply chain automation, some of the world’s most complicated transportation logistics, battlefield readiness for every conceivable combat environment, accelerated vaccine research and development (as demonstrated during the COVID-19 pandemic): if you can imagine an aspect of computing the federal government engages in, chances are it will be touched by AI, if not completely transformed.
However promising these advances in AI and machine learning (AI+ML), edge computing, and high-performance computing (HPC) may be, realizing their full potential also requires changes at the level of the infrastructure that supports them. Traditional data center infrastructures built on fixed, inflexible server architectures simply cannot perform at the level these applications require; their performance and capacity are essentially set at the point of sale. With workloads that demand real-time responsiveness from hardware resources, static architectures hit a wall: resources cannot be pooled, and the ability to scale them via software is severely limited.
New levels of utilization and operational efficiency can drive down the cost of IT modernization while providing the kind of dynamic performance required across compute scenarios. IT organizations increasingly recognize that what seemed like the future just a couple of years ago is an urgent present-day need. The industry as a whole is stepping up to develop a computing ecosystem that enables public (and private) entities to keep pace with the promise of AI-driven transformation.
Rapid improvements in GPU performance, along with new strategies for efficiently deploying persistent memory, NVMe storage, FPGAs, and other accelerators, are helping to address ubiquitous data demand in the AI-everywhere era. Breakthroughs like the Compute Express Link (CXL) protocol, efficiencies introduced by virtualization and containerization, and new approaches to data center architecture such as composable disaggregated infrastructure (CDI) provide software-defined ways to manage hardware up and down the stack, all the way to bare metal.
For example, dynamic GPU performance is essential to the construction of cutting-edge visual augmentation systems like those being pioneered by the U.S. Army. Such systems provide 360-degree situational awareness tools that gather data from high-resolution sensors in the field in order to provide tactical and strategic advantage for combat troops. Conditions on the battlefield can change from one second to the next, and GPU resources must be continually reconfigured to provide actionable visual intelligence in the fog of war. That is impossible with traditional hardware that must be physically reconfigured.
To solve the problem, software-defined solutions like CDI are being deployed. Software can pool GPUs and deploy them in tandem with other accelerators, in exactly the amounts required, across smart fiber and intelligent networks, then readjust them as requirements change in the moment.
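To make the idea concrete, here is a minimal sketch in Python of what composing and resizing a GPU-backed node in software might look like. It assumes a purely hypothetical CDI fabric-manager REST API; the endpoint, payload fields, and resource names are illustrative, not a specific product's interface.

```python
# Hypothetical sketch: composing a logical node from pooled accelerators via a
# CDI fabric manager's REST API, then resizing it without touching hardware.
# The endpoint, payload fields, and names below are assumptions for illustration.
import requests

FABRIC_MANAGER = "https://fabric-manager.example.local/api/v1"  # hypothetical endpoint


def compose_node(gpus: int, fpgas: int, nvme_tb: int) -> str:
    """Request a logical server assembled from pooled GPU, FPGA, and NVMe resources."""
    spec = {
        "name": "edge-visual-intel",
        "resources": {
            "gpu": gpus,         # drawn from the shared GPU pool over the fabric
            "fpga": fpgas,       # other accelerators attach the same way
            "nvme_tb": nvme_tb,  # disaggregated NVMe capacity
        },
    }
    resp = requests.post(f"{FABRIC_MANAGER}/compositions", json=spec, timeout=30)
    resp.raise_for_status()
    return resp.json()["composition_id"]


def recompose_node(composition_id: str, gpus: int) -> None:
    """Grow or shrink the GPU allocation as workload demands change."""
    resp = requests.patch(
        f"{FABRIC_MANAGER}/compositions/{composition_id}",
        json={"resources": {"gpu": gpus}},
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    node_id = compose_node(gpus=4, fpgas=1, nvme_tb=8)
    # Later, when battlefield conditions change, scale GPU capacity in software.
    recompose_node(node_id, gpus=8)
```

The point of the sketch is the workflow, not the API details: allocation and reallocation become software calls rather than physical reconfiguration.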
This strategy not only enables previously ‘impossible’ configurations of accelerator technologies to scale for unprecedented real-time performance; it also maximizes hardware utilization within the smallest possible physical footprint. The resulting balance of performance and efficiency lets IT teams deliver real-time operational intelligence to troops in active combat using minimal hardware, optimized for mobility and power usage. For harsh environments in which soldiers are constantly on the move, that level of portability and infrastructure flexibility has the potential to save lives.
Innovation in next-gen compute architectures is by no means limited to military applications. The Texas Advanced Computing Center (TACC), located at The University of Texas at Austin, designs and operates some of the world's most powerful computing resources. The institution is in the process of designing and deploying its Horizon HPC system, funded entirely by the National Science Foundation (NSF), to support scientific computing across domains.
Built into the core of the Horizon project is a mandate to develop a more sustainable compute model than the typical refresh cycle, in which one system is abandoned for entirely new hardware every few years in order to reap performance gains. The system must also perform 10x faster than TACC’s current HPC production environment.
To begin, the team surveyed the IT community about the applications it believes will be core to scientific computing over the next several years, and it is designing systems that can support those applications.
A wide variety of hardware and software resources are under consideration for their ability to deliver on the performance requirements while providing a far more sustainable, flexible, disaggregated growth model: one that can incorporate vendor-neutral technologies as they emerge, without refreshing the system as a whole to reap evolving performance gains. AI not only runs on the platform; it also improves efficiency at the bare-metal level by observing data over time and making ongoing adjustments to optimize operations.
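The sketch below illustrates, in Python, the kind of telemetry-driven feedback loop that sentence describes: observe utilization over time, then nudge allocations up or down. The metric, thresholds, and the adjust_allocation() hook are assumptions for illustration, not a description of Horizon's actual tooling.

```python
# Illustrative feedback loop: observe GPU utilization telemetry, then adjust the
# allocation to keep the composed system efficient. Thresholds, the telemetry
# source, and the adjust_allocation() hook are hypothetical.
from statistics import mean
from typing import Callable


def tune_allocation(
    read_gpu_utilization: Callable[[], list[float]],  # per-GPU utilization samples, 0-100
    adjust_allocation: Callable[[int], None],          # hypothetical hook into a fabric manager
    allocated_gpus: int,
) -> int:
    """One pass of an ongoing optimization loop; returns the new allocation."""
    samples = read_gpu_utilization()
    avg = mean(samples) if samples else 0.0

    if avg < 30 and allocated_gpus > 1:
        allocated_gpus -= 1   # underused: return a GPU to the shared pool
    elif avg > 85:
        allocated_gpus += 1   # saturated: draw another GPU from the pool

    adjust_allocation(allocated_gpus)
    return allocated_gpus
```

In practice a loop like this would run continuously against real telemetry; the value is that optimization happens in software, on live data, rather than through periodic manual retuning.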
Texas A&M University is in the process of deploying its Accelerating Computing for Emerging Sciences (ACES) supercomputer for Phase One use by researchers. The system combines open-source tools for managing Slurm and Kubernetes scheduling on a foundation of composable software and fabrics.
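As a rough illustration of what that looks like from a researcher's side, the Python sketch below requests GPU capacity through the Kubernetes Python client (the Slurm path would use an analogous --gres=gpu:N request). The pod name, container image, and namespace are illustrative assumptions; the nvidia.com/gpu extended-resource key is the standard way GPUs are exposed to Kubernetes schedulers.

```python
# Minimal sketch: submitting a GPU-backed workload through the Kubernetes
# Python client. Names, image, and namespace are hypothetical examples.
from kubernetes import client, config


def submit_gpu_job() -> None:
    config.load_kube_config()  # assumes a kubeconfig pointing at the cluster

    container = client.V1Container(
        name="genomics-task",
        image="example.org/research/genomics:latest",  # hypothetical image
        command=["python", "run_analysis.py"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "2"}  # request two GPUs from the pool
        ),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="genomics-task"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="research", body=pod)


if __name__ == "__main__":
    submit_gpu_job()
```

The scheduler, not the researcher, decides where those GPUs physically live, which is exactly the decoupling a composable foundation is meant to enable.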
The system will provide the resource flexibility to significantly advance research and time-to-value for a wide variety of critical AI-driven use cases, including climate and seismic simulation, genomic study, natural language processing, molecular dynamics, and other compute-intensive activities important to academia and government. Because the infrastructure allows for disaggregated footprint expansion, the high-performance, highly flexible architecture enables users to match resources with workloads dynamically and at high utilization.
As AI has gone from sci-fi to table stakes, it is forcing new levels of utilization and operational efficiency across compute scenarios. A software-defined, flexible IT ecosystem of higher-order applications running on adaptive infrastructure and disaggregated hardware is quickly emerging to support evolving requirements for data-hungry, next-gen high-performance computing. These newer, more sustainable infrastructure solutions help ensure operations are mission-ready, with software that enables adaptive computing across a wide variety of environments and that can grow seamlessly over time.
Immediately, and perhaps more importantly over time, IT pros in academia and government can better leverage new and existing investments, reduce power and cooling requirements, and scale accelerator resources independently for better overall infrastructure utilization. IT can unlock cloud-like data center agility at any scale and achieve new levels of capital and operational efficiency.
As demand for scalable, agile compute continues to increase with AI’s proliferation, such software-defined systems provide the flexibility to meet unpredictable workload requirements while realizing a more sustainable growth model, one that keeps up with demand, drives innovation, and maintains global leadership in public sector computing.