World’s fastest performance in number of deep learning models trained per unit of time for CosmoFlow, a key machine learning processing benchmark
Fujitsu and RIKEN announced that the supercomputer Fugaku took first place in the CosmoFlow training application benchmark, one of the key MLPerf HPC benchmarks for large-scale machine learning tasks that require the capabilities of a supercomputer. Fujitsu and RIKEN used approximately half of Fugaku’s resources to achieve this result, demonstrating the world’s fastest performance in this key benchmark.
MLPerf HPC measures how many deep learning models can be trained per unit of time (throughput performance). Software technology that further refines Fugaku’s parallel processing performance achieved a processing speed approximately 1.77 times faster than that of other systems, demonstrating the world’s highest level of performance in the field of large-scale scientific and technological calculations using machine learning.
These results were announced as MLPerf HPC version 1.0 on November 17th (November 18th Japan time) at the SC21 High-Performance Computing Conference, which is currently being held as a hybrid event.
Fugaku Claims World’s Highest Level of Performance in the Field of Large-scale Scientific and Technological Calculations Using Machine Learning
MLPerf HPC is a performance competition composed of three separate benchmark programs: CosmoFlow, which predicts cosmological parameters, one of the indicators used in the study of the evolution and structure of the universe; DeepCAM, which identifies abnormal weather phenomena; and Open Catalyst, which estimates how molecules react on a catalyst surface.
For CosmoFlow, Fujitsu and RIKEN used approximately half of the Fugaku system’s entire computing resources to train multiple deep learning models to a specified level of prediction accuracy. To evaluate throughput performance, they measured the elapsed time from the start of the first model to begin training to the end of the last model to complete training. To further enhance the parallel processing performance of Fugaku, Fujitsu and RIKEN applied technology to the programs used on the system that reduces the mutual interference of communication between CPUs, which occurs when multiple learning models are processed in parallel, and that also optimizes the amount of data communication between CPU and storage. As a result, the system trained 637 deep learning models in 8 hours and 16 minutes, a rate of about 1.29 deep learning models per minute.
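The throughput measurement described above can be illustrated with a minimal Python sketch. It is not the MLPerf HPC harness or any Fujitsu or RIKEN code: the function names `measurement_window` and `models_per_minute` and the three sample timestamps are hypothetical, chosen only to show the arithmetic of counting completed models over the window from the earliest start to the latest end.

```python
from datetime import datetime, timedelta


def measurement_window(runs):
    """Return the elapsed time from the earliest start to the latest end.

    `runs` is a list of (start, end) datetime pairs, one per trained model.
    """
    if not runs:
        raise ValueError("at least one training run is required")
    first_start = min(start for start, _ in runs)
    last_end = max(end for _, end in runs)
    return last_end - first_start


def models_per_minute(model_count, elapsed):
    """Throughput: number of completed models divided by elapsed minutes."""
    minutes = elapsed.total_seconds() / 60
    if minutes <= 0:
        raise ValueError("elapsed time must be positive")
    return model_count / minutes


# Hypothetical timestamps for three models trained in parallel.
t0 = datetime(2021, 1, 1, 0, 0)
runs = [
    (t0, t0 + timedelta(minutes=50)),
    (t0 + timedelta(minutes=2), t0 + timedelta(minutes=60)),
    (t0 + timedelta(minutes=1), t0 + timedelta(minutes=55)),
]
window = measurement_window(runs)  # 60 minutes
print(models_per_minute(len(runs), window))  # 0.05

# The same arithmetic applied to the rounded figures quoted in the text.
reported = models_per_minute(637, timedelta(hours=8, minutes=16))
print(round(reported, 2))  # 1.28
```

With the rounded duration of 8 hours and 16 minutes, this arithmetic gives roughly 1.28 models per minute, slightly below the announced figure of about 1.29. The gap may simply reflect rounding in the published duration, which is an assumption rather than something the announcement states.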
Fugaku’s measured result took first place among all systems in the CosmoFlow training application benchmark category, with a rate approximately 1.77 times faster than other systems. This result demonstrates that Fugaku has the world’s highest level of performance in the field of large-scale scientific and technological calculations using machine learning.
Going forward, Fujitsu and RIKEN will make publicly available the software stacks developed for this measurement, such as libraries and AI frameworks that accelerate large-scale machine learning processing. Widely sharing the knowledge of large-scale machine learning processing on supercomputers gained through this exercise will allow users to leverage world-leading systems for the analysis of simulation results, leading to potential new discoveries in astrophysics and other scientific and technological fields. These resources will also be applied to other large-scale machine learning calculations, such as natural language processing models used in machine translation services, to accelerate technological innovation and contribute to solving societal and scientific problems.