Error-detecting Codes for HDF5
Quincey Koziol & Mike Folk
December 11, 2002
Error-detecting codes (EDC) provide a way to identify data that has been corrupted during storage or transmission. Checksums and cyclic redundancy codes (CRC) are examples of classes of EDCs. Some HDF5 users have expressed a desire to have HDF5 use EDCs to detect when metadata and/or raw data stored in a file has been corrupted. Among the requirements for such codes, speed and efficient partial updates appear to be the most urgent. We list possible use cases for employing EDCs in HDF5 and recommend an initial implementation that uses Fletcher’s checksum, a version of the internet checksum algorithm. A workplan is proposed in which checksums are first implemented for raw data and later for metadata.
We are indebted to several HDF5 users for submitting many very
useful comments in response to earlier versions of this document.
An error-detecting code (EDC) is a computed value that depends on the contents of a block of data and that is transmitted or stored along with the data in order to detect corruption of the data. A receiving system re-computes the value based upon the received data and compares this value with the one stored with the data. If the two values are the same, the receiver has some confidence that the data was received correctly or was not corrupted during storage or transmission.
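To illustrate the pattern in code, the small C sketch below computes a checksum when data is written and re-computes it when the data is read back; the trivial byte-sum algorithm and the function names are ours, chosen only to show the store/recompute/compare idea, and are not anything proposed for HDF5.

    #include <stddef.h>
    #include <stdint.h>

    /* Toy EDC: a plain byte sum.  A real EDC (Fletcher's, a CRC, ...) would
     * replace this function; the surrounding pattern stays the same. */
    static uint32_t toy_edc(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* On write: stored_edc = toy_edc(data, nbytes) is saved with the data.
     * On read: re-compute and compare; a mismatch signals corruption. */
    static int edc_matches(const unsigned char *data, size_t nbytes,
                           uint32_t stored_edc)
    {
        return toy_edc(data, nbytes) == stored_edc;
    }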
Characteristics that distinguish error detection methods include (1) fast computation, (2) security, (3) the ability to detect permuted elements, (4) support for making partial updates to a data stream, and (5) error correction.
No single EDC has all of
these characteristics. For instance, some checksums
efficiently support partial updates but may not
detect errors when data is simply re-ordered. On the
other hand, a CRC (cyclic
redundancy check) may provide more security than a checksum, but can
take longer to compute.
HDF5 users have expressed a desire for some method to verify
when metadata and/or raw data stored in a file is corrupted. Of the
requirements described above (fast, secure, etc.), only “fast” and “efficient partial
updates” have been requested, but it is easy to imagine that the other features
would also be valuable.
Corruption can occur in either the raw data or the metadata components of an HDF5 file. The term “raw data” here refers to the stored elements of the array associated with a dataset. Metadata occurs in dataset headers, in the data structures used to store groups, in file headers, and in a user block.
Since raw data and metadata are organized and treated differently by the HDF5 library, their EDCs will also need to be treated differently. Also, since raw data typically constitutes many more bytes than metadata, different algorithms may be advisable for each.
EDCs in HDF5 will also be handled differently for chunked datasets vs. contiguous datasets. It is relatively easy to apply EDCs to individual chunks, and partial reads and writes can be handled efficiently when chunking is used – only the chunks involved in a read or write need invoke any EDC calculations.
In contrast, when a part of a contiguous (i.e. non-chunked) dataset is read, its EDC can only be checked by reading the entire dataset. When part of a contiguous dataset is written (changed), then the efficiency of re-computing the dataset’s EDC depends on the EDC algorithm being used. Some EDCs can easily and quickly be re-computed when making partial changes, without having to re-compute the EDC for the entire dataset. Others cannot.
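To make the partial-update property concrete, the sketch below (ours, for illustration; not HDF5 code) shows how a simple additive checksum can be adjusted when a byte range of a contiguous dataset is overwritten, using only the old and new contents of that range; checksums in the Fletcher/Adler family permit a similar, slightly more involved adjustment, while other EDCs force a full re-computation.

    #include <stddef.h>
    #include <stdint.h>

    /* Simple additive checksum, used only to illustrate incremental update. */
    static uint32_t byte_sum(const unsigned char *buf, size_t len)
    {
        uint32_t s = 0;
        for (size_t i = 0; i < len; i++)
            s += buf[i];
        return s;
    }

    /* When a range of len bytes in a dataset is overwritten, the new
     * checksum follows from the old total plus the old and new contents of
     * just that range -- the rest of the dataset never has to be read. */
    static uint32_t updated_sum(uint32_t old_total,
                                const unsigned char *old_bytes,
                                const unsigned char *new_bytes,
                                size_t len)
    {
        return old_total - byte_sum(old_bytes, len) + byte_sum(new_bytes, len);
    }

Note that even this favorable case requires reading back the old contents of the overwritten range, which is consistent with use case 7 below.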
Our goal with this initial implementation is just to provide a simple, raw capability designed to support large, one-time writes of data. It must work in parallel, but can require the use of chunking. It does not have to work efficiently for applications that make partial changes to problem-sized data. Table 1 lists the uses of EDCs that have been identified so far as desirable.
Table 1. Use cases identified so far.

Use case 1: Basic use of EDCs. An application writes HDF5 datasets to a file, and this application or a later application must determine whether corruption occurred when writing. Corruption could occur in either the raw data or a metadata object.
Requirement: An EDC is computed on each object before it is written to the HDF5 file. When any object is read later, the EDC is used to determine the validity of the data.
Comment: In version 1.

Use case 2: An application queries whether an EDC exists and, if it does, queries the value of that EDC.
Requirement: A routine is available to ascertain whether or not an EDC exists. Another HDF5 library routine is available to query the value of an EDC.
Comment: Version 1.

Use case 3: An application writes data but chooses not to have an EDC computed.
Requirement: An HDF5 library routine is available to turn off the EDC computation for a dataset. If it is invoked, the dataset (or chunk) has no EDC. An EDC will always be computed for metadata.
Comment: Version 1. Should this be extended to containers such as groups or files?

Use case 4: An application writes a dataset that is stored in chunks, writing one chunk at a time, with error detection.
Requirement: Each chunk has its own EDC, so that its validity can be checked whenever the chunk is read back in.
Comment: Version 1.

Use case 5: An application does a partial write to a dataset that is chunked and has error detection, and the partial write changes the data in several chunks.
Requirement: A new EDC is computed for every chunk that is altered during the partial write.
Comment: Version 1.

Use case 6: An application writes a compressed dataset with error detection.
Requirement: An EDC is computed on the uncompressed data, before compression occurs.
Comment: Version 1.

Use case 7: An application writes to part of a dataset that is stored contiguously and has error detection.
Requirement: If partial updates are supported, a new EDC is computed by reading back the portion of the dataset to be replaced, computing the new EDC, and writing the new partial dataset. If partial updates are not supported, a new EDC is computed by reading back the entire dataset, re-computing the EDC, and then re-writing the dataset.
Comment: Version 1.

Use case 8: An application needs to compute the EDC of an existing dataset.
Requirement: The entire dataset must be read into memory in order to do this. An HDF5 library routine is available to do this.
Comment: Version 1.

Use case 9: A parallel application using MPI-IO creates a file with a dataset. (This is the same as use case 1, but in parallel.)
Requirement: EDCs are created for all data and metadata.
Comment: Version 1.

Use case 10: A parallel application writes to a dataset that is stored in chunks, with each chunk being written by a single processor. No two processors write to the same chunk.
Requirement: An EDC is computed for each chunk that is accessed, similar to use cases 2 and 3.
Comment: Version 1.

Use case 11: It is desired to use different EDC algorithms, including algorithms supplied by applications. One reason for this is to allow evolution to newer, better EDC algorithms that may appear in the future. Another is that different EDC algorithms have different strengths.
Requirement: Put hooks in the library and API to support the use of different EDC algorithms.
Comment: Version 1 should contain the “hooks”, but will not provide alternative algorithms.

Use case 12: Part of a contiguously stored array is read. It is desired to know whether the data that has been read is valid.
Requirement: Maintain and compute EDCs for parts of a contiguously stored dataset.
Comment: Not planned.
We propose that EDCs be used always for metadata, but be optional for data. We recommend EDCs for all metadata because it would make the coding (and code maintenance) simpler, and because metadata objects are small enough that the extra CPU cost is quite small compared to the actual I/O operations that are performed.
Raw data, in contrast, can be very large, and it seems likely that EDC calculations could appreciably increase the cost of I/O operations, particularly partial I/O operations on contiguous datasets. As with metadata, some performance testing will need to be done before we can be sure when the cost of EDC use becomes significant.
The array of available methods for error detection and correction is broad and deep. We have reviewed a number of possibilities and the materials discussing them.[*] Specifically, we have identified MD5[†], Fletcher’s checksum[‡], Adler-32[§], and CRC methods[**]. Since the EDC features most critical for our use cases are speed and efficient partial updates, we have narrowed the possibilities down to Fletcher’s checksum and Adler-32. Fletcher’s and Adler-32 are similar, but Fletcher’s is slightly simpler to implement. Fletcher’s is also an internet standard.
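For concreteness, here is a minimal sketch of a Fletcher-style checksum over 16-bit words; the word size, modulus, block length, and treatment of odd-length buffers shown here are illustrative choices by us, not decisions about the eventual HDF5 implementation.

    #include <stddef.h>
    #include <stdint.h>

    /* Fletcher-style checksum over an array of 16-bit words, with both
     * running sums reduced modulo 65535.  Padding any odd trailing byte
     * up to a full word is left to the caller. */
    static uint32_t fletcher32(const uint16_t *data, size_t words)
    {
        uint32_t sum1 = 0, sum2 = 0;

        while (words > 0) {
            /* Process in blocks small enough that the 32-bit accumulators
             * cannot overflow before the modulo reduction. */
            size_t block = (words > 359) ? 359 : words;
            words -= block;
            do {
                sum1 += *data++;
                sum2 += sum1;
            } while (--block > 0);
            sum1 %= 65535;
            sum2 %= 65535;
        }
        return (sum2 << 16) | sum1;
    }

The second running sum is what gives Fletcher’s checksum its sensitivity to the order of the data, which a single sum lacks.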
We note that Fletcher’s checksum algorithm does not meet certain other criteria (such as
good security) that may be deemed important later, and hence it is important
that the implementation design accommodate the use of different EDC
algorithms, selectable by applications.
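One way such flexibility could be accommodated, sketched below purely as an illustration (the type and function names are hypothetical, not part of the HDF5 API), is to route all EDC computation through a registered callback, with the library falling back to its built-in algorithm when no callback is supplied.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical callback type: any algorithm reducible to
     * "bytes in, fixed-size code out" could be plugged in this way. */
    typedef uint32_t (*edc_func_t)(const void *buf, size_t nbytes);

    /* The library's built-in algorithm (e.g. Fletcher's). */
    extern uint32_t default_edc(const void *buf, size_t nbytes);

    /* Hypothetical registration point for an application-supplied EDC. */
    static edc_func_t registered_edc = NULL;

    static void register_edc(edc_func_t f)
    {
        registered_edc = f;
    }

    /* All internal EDC computation would go through this wrapper. */
    static uint32_t compute_edc(const void *buf, size_t nbytes)
    {
        return registered_edc ? registered_edc(buf, nbytes)
                              : default_edc(buf, nbytes);
    }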
Unless there is a compelling reason to do otherwise, we recommend initially choosing the same EDC algorithm for both data and metadata. Since the implementation of EDCs for data will precede that for metadata,[††] there is time to change this recommendation. In any case, later implementations may use different algorithms for the two, and the implementation design should take this into account.
Characteristics of the HDF5 library and format make it more
difficult to support error detection for metadata than for
raw data. Raw data error detection involves
less substantial changes to the file format, and could probably be implemented
more quickly. Therefore, we propose a
two-phase process. In Phase 1, we
implement checksumming for dataset arrays, and in Phase 2 we implement error
detection (probably using the same algorithm) for HDF5 metadata.
Here is a tentative plan of action to checksum raw data.
· Research appropriate algorithm(s) for error detection for large amounts of information. Choice of an algorithm should be made based on the speed of the algorithm, the ability to handle very large data arrays, and the ability to compute EDCs on partial data.
· Propose format of new "EDC" object header message for storing in HDF5 object headers of datasets whose raw data is to be checked.
· Propose and agree on API extensions for enabling error detection on a per-object basis in the file. This will most likely be done with a new dataset creation property (a hypothetical sketch of such an interface follows this list).
· Enhance internal library functions to create/update EDC when raw data is written to an object.
· Propose and agree on additional API functions for verifying an EDC on the raw data for an object, etc.
· Implement new API functions.
· Write regression tests for new features.
· Write documentation, tutorial material, etc.
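The API extensions mentioned above might look roughly like the following; every H5*edc* name in this sketch is hypothetical and is shown only to convey the flavor of the interface, while the property-list and dataset calls around them follow the existing HDF5 C API.

    #include "hdf5.h"

    /* The H5Pset_edc_enable, H5Dedc_exists, and H5Dedc_verify routines
     * below are hypothetical; they sketch the proposed extensions only. */
    void edc_example(hid_t file, hid_t space, const hsize_t chunk_dims[2],
                     const int *buf)
    {
        /* Writer: request EDC computation via a dataset creation property. */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk_dims);     /* chunking required at first */
        H5Pset_edc_enable(dcpl, 1);            /* hypothetical */

        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, space, dcpl);
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        /* Reader: ask whether an EDC exists, then verify the data against it. */
        if (H5Dedc_exists(dset) > 0)           /* hypothetical */
            H5Dedc_verify(dset);               /* hypothetical */

        H5Dclose(dset);
        H5Pclose(dcpl);
    }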
Here is a tentative plan of action to detect metadata errors.
· Research appropriate algorithm(s) for detecting errors in small objects. Choice of an algorithm will be influenced by the algorithm chosen for raw data error detection, but that algorithm may not be appropriate for the smaller amounts of information contained in the file's metadata.
· Propose a format change to incorporate an EDC field into the different types of metadata stored in a file (an illustrative sketch follows this list).
· Enhance internal library functions to create/update EDC when metadata is written to a file.
· Write regression tests for new features.
· Write documentation, etc.
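As a purely illustrative sketch of the kind of format change involved (the layout below is ours and is not the actual HDF5 object header format), a piece of metadata could carry a trailing EDC computed over the bytes that precede it and checked whenever that metadata is read back.

    #include <stddef.h>
    #include <stdint.h>

    /* Whatever algorithm is finally chosen (e.g. Fletcher's) stands behind
     * this declaration. */
    extern uint32_t edc_of(const void *buf, size_t nbytes);

    /* Hypothetical layout: a metadata block followed by its EDC. */
    struct meta_block {
        unsigned char body[64];   /* encoded metadata; size is illustrative */
        uint32_t      edc;        /* computed over 'body' just before writing */
    };

    /* On read, a mismatch means the metadata was corrupted on disk. */
    static int meta_block_valid(const struct meta_block *m)
    {
        return edc_of(m->body, sizeof m->body) == m->edc;
    }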
[*] “FITS Checksum Proposal” by Seaman, Pence and Rots is
a good overview of the issues from the perspective of the need for checksums in
a scientific data format. See http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/checksum23may02/checksum23may02.html.
[‡] http://rfc.sunsite.dk/rfc/rfc1071.html. See also RFC 1141 and 1624.
[**] For a
discussion comparing CRC with Fletcher’s and Adler-32 checksums, see http://www.netzmafia.de/rfc/internet-drafts/draft-cavanna-iscsi-crc-vs-cksum-00.txt.
[††] Doing the raw data implementation first will also give us an opportunity to get an idea about speeds, to help us determine whether the cost of computing checksums is high enough to merit making it selectable for metadata.