
Valid HDF5 file and XDMF file after simulation abort

For problem description and initial discussion, see #3237.

Overview

This MR introduces a new parameter for the PRJ file. It is mandatory within the optional hdf section: false disables fast write (the default), true enables it.

<time_loop>
    <processes> ...
    </processes>
    <output>
        ...
        <hdf>
            <number_of_files>1</number_of_files>
            <fast_write>0</fast_write>
        </hdf>
    </output>
</time_loop>

Usage

The new parameter is necessary because file I/O performance and data integrity are competing goals here.

Disable fast write

Choose false (fast write disabled) if:

  • You need readable result files (HDF5/XDMF) in case of
    • unintended simulation termination (crash), and you want to analyse the output data
    • intended simulation abort (SIGINT). Hint: stop the simulation right after an output step; be aware that termination in the middle of an I/O operation can destroy all data
  • You need readable data while the simulation is running (in situ)
  • You don't care about file I/O performance during the simulation run (but you need good performance in your post-processing)

Enable fast write

Choose true (fast write enabled) if:

  • You need the best file I/O performance (both during the simulation and in post-processing)

Implementation

HDF5 files are flushed after each time step (flush is a collective method and therefore acts as a synchronisation barrier). Note that the simulation continues as soon as the OS takes over the writing, not when all data has actually reached the disk.
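A minimal sketch of this behaviour around the HDF5 C API (not the actual OGS code; the function and parameter names are invented for illustration):

#include <hdf5.h>

// Hypothetical sketch: write one time step, then flush unless fast write
// is enabled. In parallel HDF5, H5Fflush is collective, so all MPI ranks
// must reach this call (synchronisation barrier). The call returns once
// the OS has taken over the buffers; the data is not guaranteed to be on
// disk yet.
void finishTimeStep(hid_t file_id, bool fast_write)
{
    // ... write this time step's datasets ...
    if (!fast_write)
    {
        H5Fflush(file_id, H5F_SCOPE_GLOBAL);
    }
}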

HDF5 SWMR (single writer / multiple readers):

Main idea: we could (ab)use the data-consistency model provided by SWMR.

Test: A test with HDF5 1.12.1 on a minimal example (TobiasMeisel/minimal_examples!2, closed) was conducted. The routine spent almost all of its time writing, so a SIGINT was sent during a write. The resulting h5 file was corrupted. It was possible to recover it with h5clear -s <file_name>, but the file was then completely empty. By adding a new data group (with data), it was tested whether only the group/dataset being written at that moment is affected; however, this still resulted in an empty file.
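For reference, a sketch of how SWMR mode is typically enabled with the HDF5 C API (the linked minimal example may differ in detail; the file name and structure here are assumptions):

#include <hdf5.h>

int main()
{
    // SWMR mode requires the latest file-format bounds.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // Groups and datasets must be created before switching to SWMR mode;
    // afterwards, existing datasets can only be extended and written.
    H5Fstart_swmr_write(file);

    // ... long-running writes; a SIGINT arriving here corrupted the file
    // in the test described above ...

    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}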

Conclusion: SWMR is not a proper solution to this problem. HDF5 is simply not suitable for "streaming" where the stream can break at any time. If we want recoverable data, then we need a new file (at each time step).

XDMF files are rewritten completely after each time step (that's fine because they are in the kB range, so writing takes on the order of microseconds).
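Since the XDMF file holds only metadata, rewriting it from scratch keeps a consistent file on disk at negligible cost. A minimal sketch (hypothetical helper, not the actual OGS code):

#include <fstream>
#include <string>

// Hypothetical helper: truncate and rewrite the whole XDMF file. Because
// the file is tiny, the rewrite takes on the order of microseconds, so
// the window in which a crash could leave a partial file is negligible.
void writeXdmf(std::string const& path, std::string const& content)
{
    std::ofstream out(path, std::ios::trunc);
    out << content;
}  // stream is flushed and closed on destruction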

Discussion

The option of closing/reopening the file carries the same risks as flushing (invalid data when the process is terminated during the writing procedure). Possible recovery tools have been tested (https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Clear), and no easy, generally applicable solution was found. A long discussion on recovering some of the data can be found in the HDF forum (e.g. https://forum.hdfgroup.org/t/file-state-after-flush-and-crash/3481/4). It cannot be recommended!
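For illustration, a sketch of the rejected close/reopen alternative (hypothetical names, not the actual OGS code):

#include <hdf5.h>

// Closing the file after every time step and reopening it before the
// next one does not help: a termination while the datasets are being
// written, i.e. before the close, still leaves an invalid file, so this
// is no safer than flushing.
void stepWithReopen(char const* filename)
{
    hid_t file = H5Fopen(filename, H5F_ACC_RDWR, H5P_DEFAULT);
    // ... write this time step's datasets (crash window is here) ...
    H5Fclose(file);
}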

Most likely this MR is not the final solution, but hopefully a good first step. If it turns out that this solution is not sufficient, alternatives (such as writing a new file at each time step, see the SWMR conclusion above) will have to be considered.

  1. Feature description was added to the changelog
  2. Tests covering your feature were added?
  3. Any new feature or behavior change was documented?

Closes #3237

