How national laboratories move more than a petabyte per week

Imagine what it would take to ship more than a petabyte per week from one data center to another. Time and cost are only part of the problem; assembling and managing transfers that large has traditionally required heavy manual intervention.

Massive data transfers have always been a challenge for national laboratories, especially since large-scale scientific computing problems are often spread across facilities or moved to wherever unique HPC resources are available. With the advent of the exascale era and continually growing data volumes, however, moving data between labs has become a priority.

Currently, the 100 Gb/s ESnet backbone has handled the multi-terabyte data shuttle between labs over direct connections, but for larger simulations in cosmology and other data-intensive HPC areas, even this network would be stretched. A collaboration called the Petascale DTN Project, led by Lawrence Berkeley, Argonne, and Oak Ridge National Laboratories along with the National Center for Supercomputing Applications (NCSA), has emerged to meet a petabyte-per-week goal.

Basically, the collaborators note that moving a petabyte per week requires about 13.2 Gb/s of throughput sustained for the entire week. Their starting point was to achieve a persistent 15 Gb/s in production without heavy manual intervention. The key to hitting this mark was specially configured Data Transfer Nodes (DTNs) within the larger site networks, along with careful attention to how those nodes interface with parallel file systems.
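The arithmetic behind those numbers is simple to check. The sketch below is ours, purely for illustration (the constant names and the `required_gbps` helper are not from the project):

```python
# Back-of-the-envelope check of the Petascale DTN target rates.
# Constants follow the article's figures (decimal petabyte).

PETABYTE_BITS = 8 * 10**15        # 1 PB expressed in bits
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 seconds

def required_gbps(petabytes_per_week: float) -> float:
    """Sustained throughput (Gb/s) needed to move the given volume in one week."""
    return petabytes_per_week * PETABYTE_BITS / SECONDS_PER_WEEK / 10**9

print(f"{required_gbps(1.0):.1f} Gb/s")  # ~13.2 Gb/s for 1 PB/week
```

At the 15 Gb/s production target, the same arithmetic gives a bit over 1.1 PB per week, which is why that figure was chosen as the persistent goal.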

“For large HPC facilities, it is extremely valuable to deploy a cluster of DTNs in a way that allows a single scalable data transfer tool to spread transfers across multiple DTNs in the cluster. By explicitly incorporating parallelism (for example, the simultaneous use of multiple DTNs for a single task) into the design, the external interface to the HPC facility’s storage system can be expanded as needed by adding more DTNs, without changing the underlying tool. This combination of a scalable tool and a scalable architecture is essential as the HPC community enters the exascale era and the number of research projects requiring large-scale data analysis continues to grow.”

[Figure: Performance between DTN clusters at the end of the project. The composition of the benchmark dataset shows the large number of files and the wide variation in file sizes.]

What is particularly interesting about the labs’ work is that it is generalizable. The same approach can be carried over to smaller centers to enable faster, less labor-intensive large-scale data transfers and, in addition, to give those centers a closer connection to the larger national laboratories.

DTNs are the central point through which data transfers flow (in other words, data does not move directly to or from the file system over the WAN). These servers are the external interface to the file system: they read and write files and move them to and from the network at high speed. The DTNs attach to the storage fabric that holds the larger objects on the parallel file system.

When doing huge transfers at the target rate of one petabyte per week, the only manual work happens on the DTNs, which run the data mover application and transfer data between facilities. This keeps traffic from flowing straight through the parallel file system’s compute environment, and no HPC compute node needs to be tuned or optimized for transfers. The DTNs handle everything and can be configured per user or per workload.
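The article does not describe the data mover’s internal scheduling, but the idea of spreading one logical transfer across a DTN cluster can be sketched in a few lines. Everything below (the `assign_to_dtns` helper, the node names) is a hypothetical illustration, not the project’s actual tooling:

```python
# Illustrative sketch: balance one transfer job across a pool of DTNs by
# assigning each file to the currently least-loaded node, largest files first.
# All names are hypothetical; real deployments use dedicated transfer tools.
import heapq

def assign_to_dtns(files: dict[str, int], dtns: list[str]) -> dict[str, list[str]]:
    """Map each file (name -> size in bytes) to a DTN, greedily by load."""
    heap = [(0, name) for name in dtns]          # (bytes assigned, DTN name)
    heapq.heapify(heap)
    plan: dict[str, list[str]] = {d: [] for d in dtns}
    for fname, size in sorted(files.items(), key=lambda kv: -kv[1]):
        load, dtn = heapq.heappop(heap)          # least-loaded DTN so far
        plan[dtn].append(fname)
        heapq.heappush(heap, (load + size, dtn))
    return plan

print(assign_to_dtns({"a": 100, "b": 60, "c": 50}, ["dtn1", "dtn2"]))
```

A production transfer tool also handles checksumming, retries, and striping of individual large files, but this greedy assignment captures why adding DTNs scales the external interface without changing the underlying tool.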

“Tuning of individual components (e.g., file system mount performance, WAN performance) is necessary but not sufficient. Ultimately, everything has to work together – file system, DTN, network, and data transfer tool – in a way that consistently achieves high performance for the user community, in production operation, without constant troubleshooting,” explains the Petascale DTN team.

It sounds simple when put that way, but it took the labs a long time to achieve transparent high-speed transfers of large data sets. Manual tuning and configuration for differing environments – bridging Ethernet and InfiniBand, for example – is the comparatively easy part. The real trick was making the DTNs work well with the file systems, which is where most of the Petascale DTN project’s real value lies. Many sites run different parallel file systems on top of a mix of Ethernet and InfiniBand. Some manual tuning is still required, but it is far from the tedious job of reconfiguring for each new transfer.

“While it is essential to have a system design capable of scaling to the levels required to fulfill the scientific mission, human time and attention are precious resources – the tools presented to users must make them more productive, not less. The ability to easily initiate a transfer and then let the tools handle it without direct human intervention is incredibly valuable, and it is a key contributor to the scale at which scientists will be able to analyze ever-larger data sets.”

More details on the Petascale DTN effort, along with references and experimental transfers, can be found here.


About Mariel Baker
