Valid HDF5 file and XDMF file after simulation abort
For problem description and initial discussion, see #3237.
Overview
This MR introduces a new parameter for the PRJ file. It is obligatory within the optional `<hdf>` section: `false` (0) disables fast write (default), `true` (1) enables it.
```xml
<time_loop>
    <processes>
        ...
    </processes>
    <output>
        ...
        <hdf>
            <number_of_files>1</number_of_files>
            <fast_write>0</fast_write>
        </hdf>
    </output>
</time_loop>
```
Usage
The new parameter is necessary because file-I/O performance and data integrity are conflicting goals here.
Disable fast write
Choose `false` (fast write disabled) if:
- You need readable result files (HDF/XDMF) in case of
  - unintended simulation termination (crash), and you want to analyse the output data
  - intended simulation abort (SIGINT). Hint: stop the simulation right after an output step. Be aware that termination in the middle of the I/O process can destroy all data
- You need readable data while the simulation is running (in-situ analysis)
- You don't care about file-I/O performance during the simulation run (but you need good performance in your postprocessing)
Enable fast write
Choose `true` (fast write enabled) if:
- You need the best file-I/O performance (during simulation and postprocessing)
Implementation
HDF5 files are flushed after each time step (flush is a collective method → synchronisation barrier). Note that the flush call returns once the OS takes over the writing, not once all data has actually been written to disk!
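The per-time-step flush can be sketched with the Python h5py bindings (a sketch only; the file and dataset names are made up, and OGS itself uses the HDF5 C/C++ API):

```python
import h5py
import numpy as np

# Hypothetical output file and dataset names, for illustration only.
with h5py.File("results.h5", "w") as f:
    # Extendable dataset: one row per time step.
    dset = f.create_dataset(
        "pressure", shape=(0, 100), maxshape=(None, 100), dtype="f8"
    )
    for step in range(5):
        dset.resize(step + 1, axis=0)
        dset[step] = np.random.rand(100)  # write this time step's data
        # In parallel HDF5 this is collective: all ranks must call it
        # (synchronisation barrier). It returns once the OS has taken
        # over the buffers, NOT once the data has reached the disk.
        f.flush()
```

A crash between two flushes can therefore still leave the file in an invalid state; the flush only narrows the window.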
HDF5 SWMR:
Main idea: we could (ab)use the data-consistency model provided by SWMR (single-writer/multiple-reader).
Test:
A test with HDF5 1.12.1 on a minimal example (TobiasMeisel/minimal_examples!2, closed) was conducted. The routine spent almost all of its time in the write process, so a SIGINT was sent during writing. The resulting h5 file was corrupted. It was possible to recover it with `h5clear -s <file_name>`,
but the whole file was then empty. By adding a new data group (with data) it was tested whether only the group/dataset being written at that moment is affected; this, too, results in an empty file.
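For reference, SWMR mode as used in such a test can be enabled like this via h5py (a sketch under the assumption of HDF5 ≥ 1.10; file and dataset names are illustrative):

```python
import h5py

# SWMR requires the newest file-format version.
f = h5py.File("swmr_test.h5", "w", libver="latest")
# Datasets must be created BEFORE entering SWMR mode.
dset = f.create_dataset("data", shape=(0,), maxshape=(None,), dtype="f8")
f.swmr_mode = True  # from here on: single writer, readers may poll concurrently

for step in range(3):
    dset.resize((step + 1,))
    dset[step] = float(step)
    dset.flush()  # make this write visible to SWMR readers

f.close()
```

SWMR guarantees that concurrent *readers* see a consistent state; as the test shows, it does not make a file interrupted mid-write recoverable.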
Conclusion: SWMR is not a solution that properly addresses this problem. HDF5 is simply not suitable for "streaming" where the stream can break at any time. If we want recoverable data, we need a new file (at each time step).
XDMF files are completely rewritten after each time step (that's fine because they are in the kB range → µs).
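Because the XDMF file is rewritten in full each time step, a write-to-temporary-then-rename pattern would make even a crash mid-write harmless (a sketch of the general technique, not necessarily what this MR implements; the function name is made up):

```python
import os
import tempfile

def write_xdmf_atomically(path: str, content: str) -> None:
    """Write the full XDMF content, replacing the old file atomically."""
    # Create the temporary file in the target directory so the final
    # rename stays on the same filesystem (a requirement for atomicity).
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        # Atomic on POSIX: a crash during the write above leaves the
        # previous XDMF file untouched.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern the reader always sees either the previous complete file or the new complete file, never a partially written one.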
Discussion
The option of closing/reopening the file carries the same risks as flushing (invalid data when the run is terminated during the writing procedure). Possible tools have been tested (https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Clear), and no easy, generally applicable solution was found. A long discussion about recovering some data can be found in the HDF forum (e.g. https://forum.hdfgroup.org/t/file-state-after-flush-and-crash/3481/4). This approach cannot be recommended!
Most likely this MR is not the final solution, but hopefully a good first step. If this solution turns out to be insufficient, alternatives are:
- Write one file per time step
- https://portal.hdfgroup.org/display/HDF5/Design+HDF5+-+SWMR+Functions?preview=%2F50892963%2F50892964%2FDesign-HDF5-SWMR-functions.pdf
- [ ] Feature description was added to the changelog
- [ ] Tests covering your feature were added?
- [ ] Any new feature or behavior change was documented?
Closes #3237