
Valid HDF5 file and XDMF file after simulation abort

For problem description and initial discussion, see #3237.

Overview

This MR introduces a new parameter for the PRJ file. It is mandatory within the optional hdf section: false disables fast write (the default), true enables it.

<time_loop>
    <processes> ...
    </processes>
    <output>
        ...
        <hdf>
            <number_of_files>1</number_of_files>
            <fast_write>0</fast_write>
        </hdf>
    </output>
</time_loop>

Usage

The new parameter is necessary because file I/O performance and data integrity are competing goals here.

Disable fast write

Choose false (fast write disabled) if:

  • You need readable result files (HDF5/XDMF) in case of
    • unintended simulation termination (crash), and you want to analyse the output data
    • intended simulation abort (SIGINT). Hint: stop the simulation right after an output step; be aware that termination in the middle of an I/O operation can destroy all data
  • You need readable data while the simulation is running (in situ)
  • You don't care about file I/O performance during the simulation run (but you need good performance in your post-processing)

Enable fast write

Choose true (fast write enabled) if:

  • You need the best file I/O performance (both during the simulation and in post-processing)

Implementation

HDF5 files are flushed after each time step (flush is a collective method and therefore acts as a synchronisation barrier). Note that the simulation continues as soon as the OS takes over the writing, not when all data has actually reached the disk.
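A minimal sketch of this behaviour around the HDF5 C API (not the actual OGS code; the function and parameter names are invented for illustration):

#include <hdf5.h>

// Hypothetical sketch: write one time step, then flush unless fast write
// is enabled. In parallel HDF5, H5Fflush is collective, so all MPI ranks
// must reach this call (synchronisation barrier). The call returns once
// the OS has taken over the buffers; the data is not guaranteed to be on
// disk yet.
void finishTimeStep(hid_t file_id, bool fast_write)
{
    // ... write this time step's datasets ...
    if (!fast_write)
    {
        H5Fflush(file_id, H5F_SCOPE_GLOBAL);
    }
}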

HDF5 SWMR (single writer / multiple readers):

Main idea: we could (ab)use the data-consistency model provided by SWMR.

Test: A test with HDF5 1.12.1 on a minimal example (TobiasMeisel/minimal_examples!2, closed) was conducted. The routine spent almost all of its time writing, so a SIGINT was sent during a write. The resulting h5 file was corrupted. It was possible to recover it with h5clear -s <file_name>, but the file was then completely empty. By adding a new data group (with data), it was tested whether only the group/dataset being written at that moment is affected; however, this still resulted in an empty file.
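For reference, a sketch of how SWMR mode is typically enabled with the HDF5 C API (the linked minimal example may differ in detail; the file name and structure here are assumptions):

#include <hdf5.h>

int main()
{
    // SWMR mode requires the latest file-format bounds.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // Groups and datasets must be created before switching to SWMR mode;
    // afterwards, existing datasets can only be extended and written.
    H5Fstart_swmr_write(file);

    // ... long-running writes; a SIGINT arriving here corrupted the file
    // in the test described above ...

    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}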

Conclusion: SWMR is not a proper solution to this problem. HDF5 is simply not suitable for "streaming" where the stream can break at any time. If we want recoverable data, then we need a new file (at each time step).

XDMF files are rewritten completely after each time step (that's fine because they are in the kB range, so writing takes on the order of microseconds).
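Since the XDMF file holds only metadata, rewriting it from scratch keeps a consistent file on disk at negligible cost. A minimal sketch (hypothetical helper, not the actual OGS code):

#include <fstream>
#include <string>

// Hypothetical helper: truncate and rewrite the whole XDMF file. Because
// the file is tiny, the rewrite takes on the order of microseconds, so
// the window in which a crash could leave a partial file is negligible.
void writeXdmf(std::string const& path, std::string const& content)
{
    std::ofstream out(path, std::ios::trunc);
    out << content;
}  // stream is flushed and closed on destruction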

Discussion

The option of closing/reopening the file carries the same risks as flushing (invalid data when the process is terminated during the writing procedure). Possible recovery tools have been tested (https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Clear), and no easy, generally applicable solution was found. A long discussion on recovering some of the data can be found in the HDF forum (e.g. https://forum.hdfgroup.org/t/file-state-after-flush-and-crash/3481/4). It cannot be recommended!
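For illustration, a sketch of the rejected close/reopen alternative (hypothetical names, not the actual OGS code):

#include <hdf5.h>

// Closing the file after every time step and reopening it before the
// next one does not help: a termination while the datasets are being
// written, i.e. before the close, still leaves an invalid file, so this
// is no safer than flushing.
void stepWithReopen(char const* filename)
{
    hid_t file = H5Fopen(filename, H5F_ACC_RDWR, H5P_DEFAULT);
    // ... write this time step's datasets (crash window is here) ...
    H5Fclose(file);
}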

Most likely this MR is not the final solution, but hopefully a good first step. If it turns out that this solution is not sufficient, alternatives (such as writing a new file at each time step, see the SWMR conclusion above) will have to be considered.

  1. Feature description was added to the changelog
  2. Tests covering your feature were added?
  3. Any new feature or behavior change was documented?

Closes #3237

