Interactive Storage Layout Visualization of HDF5 Object Placement
Scientists across the world rely on self-describing data formats such as HDF5 and NetCDF to store scientific data and exchange results. As data generation capabilities grow, driven by high-resolution scientific instruments (telescopes, sensors in particle accelerators, and remote-sensing satellites) as well as supercomputer simulations, scientific discovery is increasingly limited by the ability to store and access large volumes of data. Data access performance depends heavily on how data in these self-describing formats is structured and organized. A key challenge in achieving good storage performance is therefore to map and align these data structures with the underlying technical systems. Because scientific users and supporting software libraries employ a wide range of optimization strategies, the resulting mappings from logical subsets of data to a concrete on-storage serialization are often hard to conceptualize intuitively. Here, visualizations from an I/O perspective can help demystify how data is laid out in detail.
HDF5 also provides rich metadata about the in-file position and size of internal objects such as groups, datasets, and chunks. While this metadata is sometimes consulted when optimizing I/O performance, an interactive exploration tool that is easily accessible to users and developers, for example through Jupyter notebooks, is missing.
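To make the kind of placement metadata mentioned above concrete, the following is a minimal sketch of how it might be read and drawn as a simple text map. It is not the tool described here. It assumes the h5py package built against HDF5 1.10.5 or newer (needed for the chunk queries), and the names `collect_extents` and `render_layout` are hypothetical helpers introduced only for illustration. Only the raw data of datasets and chunks is covered; group object headers are left out.

```python
from typing import List, Tuple

# (label, byte offset in the file, size in bytes)
Extent = Tuple[str, int, int]


def collect_extents(path: str) -> List[Extent]:
    """Gather offset and size of the raw data of every dataset in an HDF5 file."""
    import h5py  # imported lazily so render_layout below works without h5py

    extents: List[Extent] = []

    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return  # skip groups and other objects
        if obj.chunks is None:
            # Contiguous (or compact) layout: get_offset() returns None
            # when no separate raw-data block has been allocated.
            offset = obj.id.get_offset()
            if offset is not None:
                extents.append((name, offset, obj.id.get_storage_size()))
        else:
            # Chunked layout: one extent per allocated chunk.
            for i in range(obj.id.get_num_chunks()):
                info = obj.id.get_chunk_info(i)
                label = f"{name}{list(info.chunk_offset)}"
                extents.append((label, info.byte_offset, info.size))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sorted(extents, key=lambda e: e[1])  # order by position in the file


def render_layout(extents: List[Extent], file_size: int, width: int = 40) -> List[str]:
    """Return one text row per extent; '#' marks the byte range it occupies."""
    rows = []
    for label, offset, size in extents:
        start = offset * width // file_size
        end = min(width, max(start + 1, (offset + size) * width // file_size))
        bar = "." * start + "#" * (end - start) + "." * (width - end)
        rows.append(f"{bar} {label} @ {offset} (+{size} B)")
    return rows
```

In a Jupyter notebook, one might call `print("\n".join(render_layout(collect_extents(path), os.path.getsize(path))))` for some HDF5 file `path`. An interactive tool would replace the text rows with a zoomable plot, but the underlying offset and size queries would likely be similar.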