The Daniel K. Inouye Solar Telescope in Hawaii. Credit: DKIST/NSO/AURA/NSF
Jarrett Haley • May 15, 2025
Big data can be transformative, but it can also be challenging to manage. How do researchers access, share, and work with datasets that measure in the terabytes or even petabytes? Curt Dodds and colleagues at the University of Hawaii's Institute for Astronomy are now tackling this challenge with the power of the National Science Foundation (NSF)-funded National Data Platform (NDP).
Dodds leads a team of IT engineers who support the Institute, which regularly collects enormous amounts of data from telescopes and other instruments on the islands. His interest in facilitating open access to data, especially to advance AI and machine learning applications, led him to the communities around the NDP, the National Research Platform (NRP), and the Open Science Data Federation (OSDF) with its software layer, Pelican, all of which are also NSF-funded.
“It’s like Netflix for science data,” says Dodds, describing the distributed system to which these platforms all contribute. “The goal is no friction when you want to access data. You can find the data you want through the catalog, and you can use it without having to download these huge files locally. You work with that data on the cloud, wherever the best computing resources are available.”
Data gets big quickly when you’re working with the biggest thing in our solar system. Dodds is a member of the Critical Early DKIST Science: Spectropolarimetric Inversion in Four Dimensions with Deep Learning (SPIn4D, NSF #2008344) research group, which is developing AI tools to understand the solar atmosphere. This involves measuring the photonic activity of the Sun to determine its atmospheric properties, such as magnetic field, temperature, and pressure. Accurate models of the Sun's atmosphere can improve the prediction of space weather, potentially mitigating the impact of solar events that disrupt power grids, satellite function, and cellular communications. The largest solar event in recorded history, the Carrington Event of 1859, caused major disruptions even to the limited technology of that time. A similar event in the modern age would cause widespread disturbances to systems now vital to our way of life.
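As a rough illustration of the inversion idea (not the SPIn4D architecture itself, which the team describes in its own publications), a neural network can be trained to map simulated Stokes spectra to the atmospheric parameters that produced them. The sketch below uses PyTorch with made-up dimensions and placeholder data.

```python
# Toy sketch of a spectropolarimetric "inversion" network: it maps Stokes
# profiles to physical parameters (e.g., temperature, field strength,
# pressure). Dimensions and data are made up; this illustrates the general
# approach, not the SPIn4D model.
import torch
import torch.nn as nn

N_WAVELENGTHS = 56   # assumed spectral samples per Stokes component
N_STOKES = 4         # I, Q, U, V
N_PARAMS = 3         # e.g., temperature, magnetic field strength, pressure

class ToyInversionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_STOKES * N_WAVELENGTHS, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, N_PARAMS),   # regress the atmospheric parameters
        )

    def forward(self, stokes):
        # stokes: (batch, N_STOKES, N_WAVELENGTHS) -> flatten per example
        return self.net(stokes.flatten(start_dim=1))

model = ToyInversionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# In practice the training targets come from radiative-MHD simulations; here
# they are random placeholders just to show a single training step.
fake_spectra = torch.randn(32, N_STOKES, N_WAVELENGTHS)
fake_params = torch.randn(32, N_PARAMS)
loss = nn.functional.mse_loss(model(fake_spectra), fake_params)
loss.backward()
optimizer.step()
```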
Solar storm with plasma ejection captured in ultraviolet. Credit: DKIST/NSO
From 2021 to 2023, the SPIn4D team modeled and ran solar simulations using 10 million CPU hours on the NSF’s Cheyenne supercomputer, producing a massive dataset of 110 terabytes. Typically, such large datasets overwhelm local processing power, and researchers must work with small subsets of the data to accommodate the computing resources at their disposal.
But in the age of AI, using more data can produce more accurate models—provided that these hurdles of access and computing power can be overcome. If machine learning systems are able to efficiently access large datasets and utilize available processing power regardless of physical location, the scale of science can expand tremendously, unencumbered by the limitations of local resources.
The National Data Platform is designed to increase access to and usage of such large datasets, enabling broader collaboration and sharing of resources for scientific research. With the OSDF as a content distribution system and the NRP providing processing power, the platforms work together to make datasets discoverable and available to use remotely on the cloud. Once a user discovers a dataset and moves it to an NDP workspace, NDP endpoint services use the OSDF network to move the desired data to an NRP node or another NDP endpoint where an appropriate supercomputing or cloud resource is located. This distributed system, coordinated via the NDP endpoints, is vital for overcoming geographical limitations and optimizing both data transfer and computational resources.
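For readers who want to try this from a notebook, one route to OSDF-hosted data from Python is the Pelican Platform's fsspec client, pelicanfs. The sketch below assumes that client; the federation URL is the public OSDF entry point, while the namespace and object paths are placeholders, since the real locations for any dataset come from its NDP catalog entry.

```python
# Minimal sketch of streaming OSDF-hosted data from Python with the Pelican
# fsspec client (pip install pelicanfs). The namespace and object path are
# placeholders; real paths come from a dataset's NDP catalog entry.
from pelicanfs.core import PelicanFileSystem

fs = PelicanFileSystem("pelican://osg-htc.org")   # OSDF federation endpoint

# Browse a namespace, then read only the first megabyte of one object
# instead of downloading an entire multi-terabyte dataset.
print(fs.ls("/example/namespace/"))               # placeholder namespace
with fs.open("/example/namespace/sample.bin", "rb") as f:
    first_chunk = f.read(1024 * 1024)
print(f"{len(first_chunk)} bytes read")
```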
“The value I see in NDP and OSDF is facilitating others’ use of our data, and vice versa,” says Dodds. “Our telescopes here in Hawaii collect very unique data, but how do we get 100 terabytes of important data from Hawaii to a researcher in Chicago? We need to send and receive data both ways and be able to work with it together—traditional methods don't do that efficiently.”
To serve as an example of how large datasets can be handled, the SPIn4D team published a 13 terabyte dataset of their solar simulations to the NDP. Dodds has also contributed a Jupyter notebook that outlines how to explore the data, and he hopes to use the NDP to deliver online, interactive course content built on this data to introduce students to solar spectropolarimetry. The team also plans to release their deep learning models as tools for others to use.
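A notebook cell in that spirit might stream just one slice of a simulation cube rather than fetching all 13 terabytes. The file path, HDF5 layout, and variable names below are placeholders, not the actual SPIn4D schema, which the team's notebook documents.

```python
# Hypothetical notebook cell: pull a single 2-D slice out of a large
# simulation cube served over OSDF. The path, HDF5 layout, and dataset
# names are placeholders, not the real SPIn4D file structure.
import h5py
import matplotlib.pyplot as plt
from pelicanfs.core import PelicanFileSystem

fs = PelicanFileSystem("pelican://osg-htc.org")

with fs.open("/example/spin4d/simulation_cube.h5", "rb") as remote:  # placeholder path
    with h5py.File(remote, "r") as cube:
        # h5py reads through the remote file-like object, so only the
        # blocks needed for this slice are transferred.
        temperature_slice = cube["temperature"][0, :, :]   # placeholder dataset name

plt.imshow(temperature_slice, origin="lower")
plt.title("Temperature slice (placeholder layout)")
plt.colorbar(label="K")
plt.show()
```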
Applications of this efficient approach to big data are by no means limited to the Sun. Dodds has also worked on applying AI to satellite-based observations of the Earth, specifically monitoring deforestation rates in the Amazon rainforest. That project used radar data to see through the region's perpetual rain clouds, and AI was used to generate visual simulations by training on paired radar and camera imagery. That translation task is conceptually similar to modeling the solar atmosphere from photonic measurements. This data is also available via the OSDF.
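The article does not detail the project's model, but the general pattern, paired image-to-image translation, can be sketched as a small encoder-decoder trained to reproduce the optical image that corresponds to each radar patch. Every shape and name below is illustrative rather than the project's actual architecture or data.

```python
# Toy sketch of paired radar-to-optical translation: a small encoder-decoder
# takes a single-channel radar patch and is trained to reproduce the
# co-registered RGB optical patch. Illustrative only; not the project's model.
import torch
import torch.nn as nn

class RadarToOptical(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, radar):
        return self.decoder(self.encoder(radar))

model = RadarToOptical()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# Placeholder batch of co-registered patches: radar (1 channel), optical (RGB).
radar_batch = torch.rand(8, 1, 128, 128)
optical_batch = torch.rand(8, 3, 128, 128)

loss = nn.functional.l1_loss(model(radar_batch), optical_batch)  # pixel-wise loss
loss.backward()
optimizer.step()
```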
Whatever the application or end product, the use of AI in scientific research can be greatly accelerated when the NDP, OSDF, and NRP work together to provide unprecedented data access and optimized computing to researchers throughout the nation. With the National Data Platform managing the user workflow and digital infrastructure, scientists can focus on discovery and apply machine learning to accelerate both analysis and action.
Contact - ndp@sdsc.edu
The National Data Platform was funded by NSF award #2333609 under the CI, CISE Research Resources programs. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funders.