Summary
Prior to HDF5 Release 1.10.0, data chunks on the edge of a dataset
were stored with the same size as other chunks,
even if the logical size of the chunk was smaller.
HDF5 Release 1.10.0 adds an option that controls
whether filters are applied to partial edge chunks.
Background and the New Option
Data chunking can, in many cases, greatly improve dataset I/O performance.
In some cases, however, chunking can result in performance degradation.
Consider an extensible dataset that will be opened, extended,
and closed many times. If the dataset size is large, compression and
chunked storage can yield substantial file size and I/O performance
benefits.
However, since the number of elements per extension may vary, it is
unlikely that the dataset size will always be a multiple of the chunk size,
and partial edge chunks will be present.
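The arithmetic behind this observation can be sketched for a single dimension. The helper names below (chunks_along, last_chunk_elems, has_partial_edge) are hypothetical and are not part of the HDF5 API; the sketch only assumes that chunks tile the dimension starting at element 0.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover `extent` elements along one
 * dimension (ceiling division). */
size_t chunks_along(size_t extent, size_t chunk)
{
    assert(chunk > 0);
    return (extent + chunk - 1) / chunk;
}

/* Logical number of elements in the last chunk along one dimension.
 * Equals `chunk` when the extent is an exact multiple of the chunk
 * size; otherwise the last chunk is a partial edge chunk.
 * Returns 0 for an empty dimension. */
size_t last_chunk_elems(size_t extent, size_t chunk)
{
    assert(chunk > 0);
    if (extent == 0)
        return 0;
    size_t rem = extent % chunk;
    return rem == 0 ? chunk : rem;
}

/* 1 if the dimension ends in a partial edge chunk, 0 otherwise. */
int has_partial_edge(size_t extent, size_t chunk)
{
    assert(chunk > 0);
    return extent % chunk != 0;
}
```

For example, a dimension of 1000 elements with a chunk size of 256 needs four chunks, and the last one logically holds only 232 elements, so it is a partial edge chunk.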
Compression of the partial edge chunks in this usage model may
introduce a substantial, and sometimes unacceptable, performance penalty
each time the dataset is extended. The penalty occurs not only because
the filter must be applied twice to all edge chunks after each extension
(the original compressed partial chunk is first uncompressed,
new data is then added to the chunk,
and the extended chunk is recompressed),
but also because the compressed size of the edge chunks changes as the
dataset grows, requiring new placement of the chunks in the file.
The movement of chunks in the file degrades write performance and
can also cause fragmentation, which adds wasted space in the file.
Without this option, extending such datasets as quickly as possible
has required storing the entire dataset uncompressed.
A new option to control the filtering of partial edge chunks
overcomes the performance degradation described above.
With this option, partial edge chunks are stored without compression.
If the dataset is subsequently extended, any partial edge chunk that
becomes a complete chunk will then be compressed for storage and
new partial edge chunks will remain uncompressed in storage.
The double filtering of partial edge chunks is eliminated.
By disabling filters on partial edge chunks, this option reduces
filtering overhead and also reduces the fragmentation caused by
moving chunks when a dataset is extended, while still allowing
complete chunks in the dataset to be compressed.
When a dataset expands or shrinks, it is possible that
one or more chunks will go from partial edge to complete,
or complete to partial edge.
When filters are disabled for partial edge chunks in a dataset
and a chunk in that dataset undergoes a change of classification,
the HDF5 Library will reallocate storage for the chunk and apply or
disable filters depending on the final classification of the chunk.
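The change of classification described above can be modeled for a one-dimensional dataset. This is an illustrative sketch only, not the library's internal logic; the type chunk_class and the functions classify_chunk and changes_classification are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Classification of one chunk relative to the current dataset extent. */
typedef enum { CHUNK_ABSENT, CHUNK_PARTIAL, CHUNK_COMPLETE } chunk_class;

/* Classify chunk number `idx` (0-based) of a 1-D dataset holding
 * `extent` elements, with `chunk` elements per chunk. */
chunk_class classify_chunk(size_t idx, size_t extent, size_t chunk)
{
    assert(chunk > 0);
    size_t start = idx * chunk;
    if (start >= extent)
        return CHUNK_ABSENT;            /* chunk lies beyond the dataset */
    return (extent - start >= chunk) ? CHUNK_COMPLETE : CHUNK_PARTIAL;
}

/* 1 if resizing from old_extent to new_extent moves the chunk between
 * the partial edge and complete classifications, 0 otherwise. */
int changes_classification(size_t idx, size_t old_extent,
                           size_t new_extent, size_t chunk)
{
    chunk_class before = classify_chunk(idx, old_extent, chunk);
    chunk_class after  = classify_chunk(idx, new_extent, chunk);
    return (before == CHUNK_PARTIAL  && after == CHUNK_COMPLETE) ||
           (before == CHUNK_COMPLETE && after == CHUNK_PARTIAL);
}
```

With a chunk size of 256, extending a dataset from 1000 to 1300 elements turns chunk 3 from partial edge to complete, while shrinking from 1024 to 900 elements turns the same chunk from complete to partial edge. In both cases the model reports a change of classification, which corresponds to the situation in which storage is reallocated and filters are applied or disabled.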
This option is controlled by a bit flag in a function parameter.
The parameter is manipulated by two API functions,
H5Pset_chunk_opts and
H5Pget_chunk_opts,
which act on a dataset creation property list.
Compatibility Issue
Disabling filters for partial edge chunks was not available
in HDF5 releases prior to 1.10.0 and its implementation
requires a modification to the HDF5 file format specification.
Therefore, datasets created with this option
will not be accessible using earlier HDF5 releases.