
Air cooling has not been quietly getting worse; it has been hitting a physical wall. The processors behind today’s AI-training workloads and HPC clusters generate heat in such a concentrated area that no realistic number of fans or cold aisles can remove it, and the gap between what air can deliver and what modern high-performance silicon demands widens with every hardware generation.
The physics problem that marketing can’t solve
Heat transfer is not something you can decide on a whim. It strictly adheres to the laws of thermodynamics, and thermodynamics is not open to compromise.
For instance, the specific heat capacity of air is approximately 1.005 kJ/kg·K, while for water it’s 4.18 kJ/kg·K. These aren’t trivial details – they indicate that water can absorb more than four times the heat per unit of mass for the same temperature increase. If you consider density in the equation, water’s volumetric heat capacity is around 3,400 times greater than that of air under standard conditions. Thermal conductivity provides a similar comparison: water transmits heat about 24 times more effectively than air.
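As a rough check on the ratios quoted above, the short sketch below recomputes them. The specific heats are the figures from the text; the densities and thermal conductivities are assumed textbook values for roughly room temperature, and the function name is purely illustrative.

```python
# Approximate room-temperature properties. The specific heats are the figures
# quoted in the text; the densities and conductivities are assumed textbook values.
AIR = {"cp_kj_per_kg_k": 1.005, "rho_kg_per_m3": 1.2, "k_w_per_m_k": 0.026}
WATER = {"cp_kj_per_kg_k": 4.18, "rho_kg_per_m3": 997.0, "k_w_per_m_k": 0.60}


def property_ratios(water, air):
    """Return (per-mass, per-volume, conductivity) ratios of water to air."""
    per_mass = water["cp_kj_per_kg_k"] / air["cp_kj_per_kg_k"]
    # Volumetric heat capacity is specific heat multiplied by density.
    per_volume = (water["cp_kj_per_kg_k"] * water["rho_kg_per_m3"]) / (
        air["cp_kj_per_kg_k"] * air["rho_kg_per_m3"]
    )
    conductivity = water["k_w_per_m_k"] / air["k_w_per_m_k"]
    return per_mass, per_volume, conductivity


per_mass, per_volume, conductivity = property_ratios(WATER, AIR)
print(f"heat capacity per kg:  {per_mass:.1f}x")
print(f"heat capacity per m^3: {per_volume:.0f}x")
print(f"thermal conductivity:  {conductivity:.0f}x")
```

With these assumed inputs the ratios come out at roughly 4.2, 3,500, and 23. The exact values shift with temperature and pressure, which is why quoted figures vary slightly from source to source.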
In practice, this means moving enormous volumes of air to cool a high-density server. Server fans consume a notable share of the server’s total power, much of it spent simply overcoming the aerodynamic resistance of the chassis. Multiply that across a full rack, then a whole room of racks, and the energy cost becomes significant even before you count the Computer Room Air Conditioner (CRAC) units running continuously to keep all of that airflow cold.
That is the part that gets lost when data center operators point out that cooling as a percentage of IT load has stayed relatively flat over the past decade. Server fan power is typically counted as part of the IT load, so the energy spent moving air inside the server is part of the problem, not just the room’s CRAC units.
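To put a number on how much air has to move, the sketch below applies the basic energy balance Q = ρ · V · cp · ΔT and solves for the volumetric flow V. The 20 kW rack load and the 10 K coolant temperature rise are hypothetical inputs chosen for illustration, and the fluid properties are assumed room-temperature values.

```python
def required_flow_m3_per_s(heat_kw, rho_kg_per_m3, cp_kj_per_kg_k, delta_t_k):
    """Volumetric flow needed to carry heat_kw with a coolant temperature rise of delta_t_k.

    From Q = rho * V * cp * dT, so V = Q / (rho * cp * dT).
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_kw / (rho_kg_per_m3 * cp_kj_per_kg_k * delta_t_k)


RACK_HEAT_KW = 20.0  # hypothetical rack load
DELTA_T_K = 10.0     # assumed temperature rise, taken to be the same for both fluids

air_flow = required_flow_m3_per_s(RACK_HEAT_KW, 1.2, 1.005, DELTA_T_K)
water_flow = required_flow_m3_per_s(RACK_HEAT_KW, 997.0, 4.18, DELTA_T_K)

print(f"air:   {air_flow:.2f} m^3/s")
print(f"water: {water_flow * 1000:.2f} L/s")
```

Under these assumptions the rack needs on the order of 1.7 cubic meters of air every second, against roughly half a liter of water per second for the same heat load. Real systems run different temperature rises for air and liquid, so treat this as an order-of-magnitude comparison only.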
Where air cooling actually breaks down
Traditional air-cooled data centers can work well up to roughly 15 kW to 20 kW per rack, and reaching the higher end of that range takes optimal conditions and supporting infrastructure such as raised floors, precision cooling units, and hot aisle containment. Cabling, pathways, and airflow management inside the cabinets also have to be impeccable, and in practice they rarely are. Higher rack densities bring other operational complications too, such as managing cooling inrush and diversity.
For a number of years now, however, the recommended ceiling on rack density has been much higher for liquid-cooled systems than for air-cooled ones. Liquid is very efficient at pulling heat out of the processors; the key part is the second step. Once you have pulled the heat away, how do you get rid of it? As a medium, liquid coolants have a number of innate advantages there as well.
Direct-to-chip and immersion: two distinct architectures
Liquid cooling comes in more than one form. This section weighs direct-to-chip (cold plate) cooling against immersion cooling and looks at where in your data center each serves best.
The first and more straightforward approach is direct-to-chip, or cold plate, cooling. The most common of today’s liquid cooling designs, it mounts metal cold plates directly on the processor and other high-heat components inside the server. Coolant is pumped from a Coolant Distribution Unit (CDU) through the cold plates, where it absorbs heat through the plate without touching the component itself. The heated liquid is then pumped out of the server to a heat exchanger, where the heat is rejected to a secondary cooling loop, typically a chilled water loop or dry cooler.
The server chassis remains a standard server chassis, which makes for a relatively straightforward retrofit into existing rack designs with minimal impact on the mechanical infrastructure. Operators looking to implement these architectures at scale should work with experienced providers of liquid cooling data centre solutions to navigate the mechanical, electrical, and operational integration requirements before committing to a particular approach.
Immersion cooling takes a fundamentally different approach. Hardware is submerged directly in a non-conductive dielectric fluid. Single-phase immersion uses a fluid that remains liquid throughout, with heat exchangers rejecting the thermal load without phase change. Two-phase immersion uses a fluid with a low boiling point; the fluid vaporizes at the component surface and condenses on cooled coils above the bath, creating a passive convection loop with no pumping required at the component level.
One advantage of liquid cooling is that it can be introduced into most existing data centers as IT equipment is progressively refreshed, typically as rack-based, close-coupled direct-to-chip systems, of which there are many, especially in the HPC and AI sectors.
Two-phase liquid cooling is more efficient than chilled air at removing heat, and therefore minimizes the energy and the water required for cooling. However, liquid-cooled systems add mechanical and electrical components that demand more operational scrutiny than traditional air-cooled hardware.
The retrofitting path for existing facilities
Many data center operators do not get to start from a new build. They have existing raised-floor or slab-floor rooms full of air-cooled racks that represent a live capital investment and come with operational and recertification constraints. The question for these operators is how to migrate to liquid cooling incrementally, as they refresh or expand, without a complete rebuild.
Coolant Distribution Units are the key technology that makes such a hybrid approach possible. A CDU handles the liquid-to-liquid interface between the facility-side coolant loop, where heat is rejected to the environment (typically via building water or cooling towers), and the server-level cold plates. This isolation allows facility and IT thermal hydraulics to be engineered and operated independently.
For the first time, every leading OEM offers liquid-cooled products, mostly based on an open industry-standard design and business model. This open standardisation is the other key reason the hybrid liquid cooling approach is becoming a ‘no-regrets’ choice. With a liquid-to-liquid CDU, facility-side coolant can be provisioned for any rack, even one that has no liquid cooling at the rack level yet: the two sides of the CDU are simply left unconnected.
This makes an incremental approach to liquid cooling straightforward within the current planning period. New high-density racks get direct-to-chip cooling from day one. Racks that are cooled adequately by air, or are too low in density to justify liquid cooling, either wait until the next planning cycle or never get liquid cooling at all. The CDU is compatible with existing racks, whether designed in-house or outsourced, provided they follow the 19-inch standard for mounting holes.
The water consumption argument inverted
One of the main concerns raised about liquid cooling is its use of water. Water scarcity is a real issue, but the fact that liquid cooling uses water is not enough on its own to draw a conclusion.
Air-cooled data centers use large amounts of water in cooling towers that expel heat by evaporation. These open-loop systems depend on evaporating water as their primary way of rejecting heat to the atmosphere, and some facilities consume millions of liters of water per year this way, all of it leaving the facility as vapor.
In a closed-loop liquid cooling system, the coolant (either a water-glycol mix or a dielectric fluid) stays contained in the loop. Water is consumed only at the facility’s heat rejection stage, and newer dry cooler and adiabatic cooler designs can reduce evaporative losses to almost nothing.
In general terms, a closed-loop liquid-cooled facility can have lower overall water consumption than an air-cooled facility with the same IT load.
Regulatory direction and the PUE reckoning
Data centers have relied on Power Usage Effectiveness (PUE) as the primary efficiency metric for roughly the last dozen years. The concept is simple: divide total facility power by IT equipment power. A figure of 1.0 would indicate that every watt entering the facility reaches the IT equipment, with none spent on cooling or other overhead.
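The PUE calculation can be sketched in a few lines. The function name and the metered breakdown below are hypothetical, chosen only to show how overhead loads feed into the ratio.

```python
def pue(it_kw, overhead_kw):
    """Power Usage Effectiveness: total facility power divided by IT equipment power.

    overhead_kw maps each non-IT load (cooling, distribution losses, ...) to its power in kW.
    """
    if it_kw <= 0:
        raise ValueError("it_kw must be positive")
    total_facility_kw = it_kw + sum(overhead_kw.values())
    return total_facility_kw / it_kw


# Hypothetical facility: 1,000 kW of IT load plus 500 kW of overhead.
overhead = {"cooling": 400.0, "power_distribution": 80.0, "lighting_and_other": 20.0}
print(pue(1000.0, overhead))  # 1.5
```

Note that anything drawn inside the server, fans included, lands in the IT term, so a lower PUE does not by itself show that less energy is being spent moving air.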
Conventional air-cooled facilities generally run somewhere in the 1.4 to 1.8 PUE range. Anything over 1.0 represents cooling and power distribution overhead on top of the IT load. Well-tuned sites have pushed this down to around 1.3, but the physics of air distribution makes further gains progressively harder and more expensive.
Liquid cooling changes what is achievable. Direct-to-chip deployments with efficient heat rejection infrastructure are achieving PUEs below 1.2. Immersion cooling installations have reported PUEs between 1.03 and 1.05 in climates where ambient air temperatures stay mostly below what the IT cooling loop requires.
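To see what those PUE figures mean in energy terms, the sketch below converts a PUE into annual overhead energy for a fixed IT load. The 1 MW IT load is hypothetical, and the three PUE values are representative points picked from the ranges quoted above rather than measurements from any specific site.

```python
HOURS_PER_YEAR = 8760


def annual_overhead_mwh(it_load_kw, pue):
    """Energy per year spent on everything except the IT load, in MWh."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    overhead_kw = it_load_kw * (pue - 1.0)
    return overhead_kw * HOURS_PER_YEAR / 1000.0


IT_LOAD_KW = 1000.0  # hypothetical 1 MW IT load, assumed constant all year

scenarios = [
    ("air-cooled, mid-range", 1.6),
    ("direct-to-chip", 1.2),
    ("immersion", 1.04),
]
for label, value in scenarios:
    print(f"{label:<22} PUE {value:.2f}: {annual_overhead_mwh(IT_LOAD_KW, value):,.0f} MWh/year overhead")
```

Under these assumptions, the overhead falls from roughly 5,300 MWh per year at a PUE of 1.6 to about 1,750 MWh at 1.2 and about 350 MWh at 1.04.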
Regulatory frameworks are starting to codify this. The EU Energy Efficiency Directive includes obligations that, in practical terms, mandate both improved PUE performance and, in some geographies, heat reuse, where waste heat helps heat homes or businesses. Liquid cooling delivers heat at a temperature that is more economical to capture for reuse, whereas warm exhaust air is low-grade heat that is much harder to put to use.
What this all means is that leaving liquid cooling off your strategic roadmap is now a regulatory compliance decision, not just a technical one.
Hardware lifespan and the thermal cycling problem
Liquid cooling has a reliability argument that flies under the radar compared with the energy and performance stories, but it shouldn’t. We have data showing a liquid-cooled system is at least an order of magnitude more reliable than an air-cooled system over a three- to five-year hardware lifespan. And it’s easy to understand why: thermal stress is a killer.
Temperature is certainly a contributing factor in failure rates; higher temperatures generally mean more failures. But when you look at what is actually happening at the hardware level, failure is often driven by thermal cycling. The machine ramps up for an HPC run of eight, 10, or 14 hours and then cools off, the fans speed up and slow down, and ambient conditions fluctuate through the day. All of that puts mechanical stress on the solder joints, the substrate materials, and the interfaces between components, and over time it leads to hardware failure.
Liquid cooling lets you hold all of the components at a much more consistent temperature. Direct-to-chip cold plates give very even flow distribution, so you are not stressing one core more than another. More broadly, the processor package is not running 5 or 10 degrees hotter because the ambient temperature shifted slightly, or because the fans had to be pushed harder to cool a particular memory bank, since the cooling is well distributed.
So you get not only cooler components and fewer hotspots, which are one of the main contributors to thermal-stress failures, but also a more uniform temperature across the package, because you are no longer thermally cycling one core harder than another just because it sits in the path of the fan.
















