Domain-Specific Compression Using Auto-Encoders
Scientific data comes in many shapes and forms, but many computational sciences rely on gridded or structured data. This is the case, for example, for simulations of physical phenomena in numerical weather prediction, the climate sciences, and many other engineering disciplines based on fluid dynamics. Similarly, many instruments use sensors arranged as arrays, giving rise to regular 2D and 3D grids very similar to images from a camera. But while sophisticated data reduction strategies and compression algorithms optimized for photography are widely adopted, special-purpose, domain-specific compression schemes remain rare because of the engineering effort involved, even though they are highly sought after.
Recent advances in machine learning, together with newly affordable acceleration hardware, bring tailor-made compression within reach of a broader audience of users. More specifically, scientific data often features local continuity and other structural characteristics that allow very compact representations to be found. The machine learning community has developed several approaches for finding such representations automatically, for example so-called auto-encoders. An auto-encoder consists of two neural networks that are trained together, an encoder and a decoder. The encoder maps the input to a code of limited capacity (significantly smaller than the input), and the decoder has to reconstruct the input from that code alone, so the network is forced to "memorize" only what is essential for the reconstruction.
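To make the encoder/decoder idea concrete, here is a minimal sketch of a linear auto-encoder written in plain Python, so that it runs without any framework. Everything specific in it is an assumption made for illustration: the synthetic sine/cosine signals, the input size of 8, the code size of 2, the learning rate, and the helper names (`make_sample`, `encode`, `decode`, `train`). A thesis implementation would more likely use non-linear layers in one of the frameworks mentioned in this description.

```python
import math
import random

random.seed(0)

N_IN, N_LATENT = 8, 2  # assumption: 8 input values squeezed into a 2-value code


def make_sample():
    """Smooth synthetic signal: a random mix of one sine and one cosine wave."""
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    return [a * math.sin(2 * math.pi * i / N_IN) + b * math.cos(2 * math.pi * i / N_IN)
            for i in range(N_IN)]


data = [make_sample() for _ in range(64)]

# Encoder and decoder are each a single linear layer with small random weights.
enc = [[random.gauss(0, 0.1) for _ in range(N_IN)] for _ in range(N_LATENT)]
dec = [[random.gauss(0, 0.1) for _ in range(N_LATENT)] for _ in range(N_IN)]


def encode(x):
    """Map an input of length N_IN to a code of length N_LATENT."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in enc]


def decode(z):
    """Reconstruct an input of length N_IN from its code."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in dec]


def mse(samples):
    """Mean squared reconstruction error over a list of samples."""
    total = 0.0
    for x in samples:
        x_hat = decode(encode(x))
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / N_IN
    return total / len(samples)


def train(epochs=200, lr=0.05):
    """Stochastic gradient descent on the mean squared reconstruction error."""
    for _ in range(epochs):
        for x in data:
            z = encode(x)
            x_hat = decode(z)
            # Gradient of the error with respect to each reconstructed value.
            err = [2 * (xh - xi) / N_IN for xh, xi in zip(x_hat, x)]
            # Back-propagate through the decoder before its weights change.
            grad_z = [sum(dec[i][j] * err[i] for i in range(N_IN))
                      for j in range(N_LATENT)]
            for i in range(N_IN):
                for j in range(N_LATENT):
                    dec[i][j] -= lr * err[i] * z[j]
            for j in range(N_LATENT):
                for i in range(N_IN):
                    enc[j][i] -= lr * grad_z[j] * x[i]


loss_before = mse(data)
train()
loss_after = mse(data)
print(f"reconstruction MSE before training: {loss_before:.4f}, after: {loss_after:.6f}")
```

In this sketch the code returned by `encode` plays the role of the compressed representation: only 2 numbers are stored per sample instead of 8, and `decode` recovers an approximation of the original. The signals were deliberately constructed so that two numbers suffice; real scientific data will not be this convenient, which is why the trade-off between code size and reconstruction error has to be measured.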
As part of this thesis, you will implement an auto-encoder architecture to compress real-world data (climate or other scientific data) as well as synthetic data (for example, random or predictable data). You will systematically evaluate performance characteristics such as compression ratio, information loss, and compression/decompression speed in comparison to established compression algorithms. In this thesis project, you will learn best practices for setting up machine learning tasks with state-of-the-art frameworks such as PyTorch, Keras, or TensorFlow while working with actual scientific data (for example, as used in climate research).
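As an illustration of what such an evaluation could look like for the established baselines, the following sketch measures compression ratio and compression/decompression time on two synthetic datasets. The choice of codecs (`zlib`, `bz2`, and `lzma` from the Python standard library), the sawtooth and uniform-noise test data, the 32-bit float layout, and the helper names `to_bytes` and `benchmark` are all assumptions made for this example; the project itself does not prescribe them.

```python
import bz2
import lzma
import random
import struct
import time
import zlib


def to_bytes(values):
    """Pack floats as little-endian 32-bit values (an assumed storage layout)."""
    return struct.pack(f"<{len(values)}f", *values)


random.seed(0)
n = 20_000
datasets = {
    # Sawtooth that repeats exactly every 64 samples: highly predictable.
    "predictable": to_bytes([(i % 64) / 64.0 for i in range(n)]),
    # Uniform noise: very little structure for a compressor to exploit.
    "random": to_bytes([random.random() for _ in range(n)]),
}
codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(raw, compress, decompress):
    """Time one compress/decompress round trip and report the key metrics."""
    t0 = time.perf_counter()
    packed = compress(raw)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    return {
        "ratio": len(raw) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
        "lossless": restored == raw,
    }


results = {
    (data_name, codec_name): benchmark(raw, comp, decomp)
    for data_name, raw in datasets.items()
    for codec_name, (comp, decomp) in codecs.items()
}

for (data_name, codec_name), r in results.items():
    print(f"{data_name:12s} {codec_name:5s} ratio={r['ratio']:8.2f} "
          f"compress={r['compress_s'] * 1e3:7.2f} ms "
          f"decompress={r['decompress_s'] * 1e3:7.2f} ms "
          f"lossless={r['lossless']}")
```

These baseline codecs are lossless, so their information loss is zero and the `lossless` flag only confirms the round trip. An auto-encoder is generally lossy, so a comparison would additionally report a reconstruction error (for example, the mean squared error between input and output) next to the ratio and the timings. Single-run timings like these are noisy; repeating each measurement and averaging would give more reliable numbers.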