Error-detecting Codes for HDF5
Quincey Koziol & Mike Folk
December 11, 2002
Error-detecting codes (EDC) provide a way to identify data that has been corrupted during storage or transmission. Checksums and cyclic redundancy codes (CRC) are examples of classes of EDCs. Some HDF5 users have expressed a desire to have HDF5 use EDCs to detect when metadata and/or raw data stored in a file has been corrupted. Among the requirements for such codes, speed and efficient partial updates appear to be the most urgent. We list possible use cases for employing EDCs in HDF5 and recommend an initial implementation that uses Fletcher’s checksum, a version of the internet checksum algorithm. A workplan is proposed in which checksums are first implemented for raw data and later for metadata.
We are indebted to several HDF5 users for submitting many very
useful comments in response to earlier versions of this document.
An error-detecting code (EDC) is a computed value that depends on the contents of a block of data and that is transmitted or stored along with the data in order to detect corruption of the data. A receiving system re-computes the value based upon the received data and compares this value with the one stored with the data. If the two values are the same, the receiver has some confidence that the data was received correctly or was not corrupted during storage or transmission.
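To illustrate the pattern in code, the small C sketch below computes a checksum when data is written and re-computes it when the data is read back; the trivial byte-sum algorithm and the function names are ours, chosen only to show the store/recompute/compare idea, and are not anything proposed for HDF5.

    #include <stddef.h>
    #include <stdint.h>

    /* Toy EDC: a plain byte sum.  A real EDC (Fletcher's, a CRC, ...) would
     * replace this function; the surrounding pattern stays the same. */
    static uint32_t toy_edc(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* On write: stored_edc = toy_edc(data, nbytes) is saved with the data.
     * On read: re-compute and compare; a mismatch signals corruption. */
    static int edc_matches(const unsigned char *data, size_t nbytes,
                           uint32_t stored_edc)
    {
        return toy_edc(data, nbytes) == stored_edc;
    }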
Characteristics that distinguish error detection methods include (1) fast computation, (2) security, (3) the ability to detect permuted elements, (4) support for making partial updates to a data stream, and (5) error correction.
No single EDC has all of
these characteristics. For instance, some checksums
efficiently support partial updates but may not
detect errors when data is simply re-ordered. On the
other hand, a CRC (cyclic
redundancy check) may provide more security than a checksum, but can
take longer to compute.
HDF5 users have expressed a desire for some method to verify
when metadata and/or raw data stored in a file is corrupted. Of the
requirements described above (fast, secure, etc.), only “fast” and “efficient partial
updates” have been requested, but it is easy to imagine that the other features
would also be valuable.
Corruption can occur in either the raw data or the metadata components of an HDF5 file. The term “raw data” here refers to the stored elements of the array associated with a dataset. Metadata occurs in dataset headers, in the data structures used to store groups, in file headers, and in a user block.
Since raw data and metadata are organized and treated differently by the HDF5 library, their EDCs will also need to be treated differently. Also, since raw data typically constitutes many more bytes than metadata, different algorithms may be advisable for each.
EDCs in HDF5 will also be handled differently for chunked datasets vs. contiguous datasets. It is relatively easy to apply EDCs to individual chunks, and partial reads and writes can be handled efficiently when chunking is used – only the chunks involved in a read or write need invoke any EDC calculations.
In contrast, when a part of a contiguous (i.e. non-chunked) dataset is read, its EDC can only be checked by reading the entire dataset. When part of a contiguous dataset is written (changed), then the efficiency of re-computing the dataset’s EDC depends on the EDC algorithm being used. Some EDCs can easily and quickly be re-computed when making partial changes, without having to re-compute the EDC for the entire dataset. Others cannot.
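To make the partial-update property concrete, the sketch below (ours, for illustration; not HDF5 code) shows how a simple additive checksum can be adjusted when a byte range of a contiguous dataset is overwritten, using only the old and new contents of that range; checksums in the Fletcher/Adler family permit a similar, slightly more involved adjustment, while other EDCs force a full re-computation.

    #include <stddef.h>
    #include <stdint.h>

    /* Simple additive checksum, used only to illustrate incremental update. */
    static uint32_t byte_sum(const unsigned char *buf, size_t len)
    {
        uint32_t s = 0;
        for (size_t i = 0; i < len; i++)
            s += buf[i];
        return s;
    }

    /* When a range of len bytes in a dataset is overwritten, the new
     * checksum follows from the old total plus the old and new contents of
     * just that range -- the rest of the dataset never has to be read. */
    static uint32_t updated_sum(uint32_t old_total,
                                const unsigned char *old_bytes,
                                const unsigned char *new_bytes,
                                size_t len)
    {
        return old_total - byte_sum(old_bytes, len) + byte_sum(new_bytes, len);
    }

Note that even this favorable case requires reading back the old contents of the overwritten range, which is consistent with use case 7 below.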
Our goal with this initial implementation is just to provide a simple, raw capability designed to support large, one-time writes of data. It must work in parallel, but can require the use of chunking. It does not have to work efficiently for applications that make partial changes to problem-sized data. Table 1 lists the uses of EDCs that have been identified so far as desirable.
Table 1. Use cases identified so far.

Use case 1: Basic use of EDCs. An application writes HDF5 datasets to a file, and this application or a later application must determine whether corruption occurred when writing. Corruption could occur in either the raw data or a metadata object.
Requirement: An EDC is computed on each object before it is written to the HDF5 file. When any object is read later, the EDC is used to determine the validity of the data.
Comment: In version 1.

Use case 2: An application queries whether an EDC exists and, if it does, queries the value of that EDC.
Requirement: A routine is available to ascertain whether or not an EDC exists. Another HDF5 library routine is available to query the value of an EDC.
Comment: Version 1.

Use case 3: An application writes data but chooses not to have an EDC computed.
Requirement: An HDF5 library routine is available to turn off the EDC computation for a dataset. If it is invoked, the dataset (or chunk) has no EDC. An EDC will always be computed for metadata.
Comment: Version 1. Should this be extended to containers such as groups or files?

Use case 4: An application writes a dataset that is stored in chunks, writing one chunk at a time, with error detection.
Requirement: Each chunk has its own EDC, so that its validity can be checked whenever the chunk is read back in.
Comment: Version 1.

Use case 5: An application does a partial write to a dataset that is chunked and has error detection, and the partial write changes the data in several chunks.
Requirement: A new EDC is computed for every chunk that is altered during the partial write.
Comment: Version 1.

Use case 6: An application writes a compressed dataset with error detection.
Requirement: An EDC is computed on the uncompressed data, before compression occurs.
Comment: Version 1.

Use case 7: An application writes to part of a dataset that is stored contiguously and has error detection.
Requirement: If partial updates are supported, a new EDC is computed by reading back the portion of the dataset to be replaced, computing the new EDC, and writing the new partial dataset. If partial updates are not supported, a new EDC is computed by reading back the entire dataset, re-computing the EDC, and then re-writing the dataset.
Comment: Version 1.

Use case 8: An application needs to compute the EDC of an existing dataset.
Requirement: The entire dataset must be read into memory in order to do this. An HDF5 library routine is available to do this.
Comment: Version 1.

Use case 9: A parallel application using MPI-IO creates a file with a dataset. (This is the same as use case 1, but in parallel.)
Requirement: EDCs are created for all data and metadata.
Comment: Version 1.

Use case 10: A parallel application writes to a dataset that is stored in chunks, with each chunk being written by a single processor. No two processors write to the same chunk.
Requirement: An EDC is computed for each chunk that is accessed, similar to use cases 2 and 3.
Comment: Version 1.

Use case 11: It is desired to use different EDC algorithms, including algorithms supplied by applications. One reason for this is to allow evolution to newer, better EDC algorithms that may appear in the future. Another is that different EDC algorithms have different strengths.
Requirement: Put hooks in the library and API to support the use of different EDC algorithms.
Comment: Version 1 should contain the “hooks”, but will not provide alternative algorithms.

Use case 12: Part of a contiguously stored array is read. It is desired to know whether the data that has been read is valid.
Requirement: Maintain and compute EDCs for parts of a contiguously stored dataset.
Comment: Not planned.
We propose that EDCs be used always for metadata, but be optional for data. We recommend EDCs for all metadata because it would make the coding (and code maintenance) simpler, and because metadata objects are small enough that the extra CPU cost is quite small compared to the actual I/O operations that are performed.
Raw data, in contrast, can be very large, and it seems likely that EDC calculations could appreciably increase the cost of I/O operations, particularly partial I/O operations on contiguous datasets. As with metadata, some performance testing will need to be done before we can be sure when the cost of EDC use becomes significant.
The array of available methods for error detection and correction is broad and deep. We have reviewed a number of possibilities and the materials discussing them.[*] Specifically, we have identified MD5[†], Fletcher’s checksum[‡], Adler-32[§], and CRC methods[**]. Since the EDC features most critical for our use cases are speed and efficient partial updates, we have narrowed the possibilities down to Fletcher’s checksum and Adler-32. Fletcher’s and Adler-32 are similar, but Fletcher’s is slightly simpler to implement. Fletcher’s is also an internet standard.
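For concreteness, here is a minimal sketch of a Fletcher-style checksum over 16-bit words; the word size, modulus, block length, and treatment of odd-length buffers shown here are illustrative choices by us, not decisions about the eventual HDF5 implementation.

    #include <stddef.h>
    #include <stdint.h>

    /* Fletcher-style checksum over an array of 16-bit words, with both
     * running sums reduced modulo 65535.  Padding any odd trailing byte
     * up to a full word is left to the caller. */
    static uint32_t fletcher32(const uint16_t *data, size_t words)
    {
        uint32_t sum1 = 0, sum2 = 0;

        while (words > 0) {
            /* Process in blocks small enough that the 32-bit accumulators
             * cannot overflow before the modulo reduction. */
            size_t block = (words > 359) ? 359 : words;
            words -= block;
            do {
                sum1 += *data++;
                sum2 += sum1;
            } while (--block > 0);
            sum1 %= 65535;
            sum2 %= 65535;
        }
        return (sum2 << 16) | sum1;
    }

The second running sum is what gives Fletcher’s checksum its sensitivity to the order of the data, which a single sum lacks.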
We note that Fletcher’s checksum algorithm does not meet certain other criteria (such as
good security) that may be deemed important later, and hence it is important
that the implementation design accommodate the use of different EDC
algorithms, selectable by applications.
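One way such flexibility could be accommodated, sketched below purely as an illustration (the type and function names are hypothetical, not part of the HDF5 API), is to route all EDC computation through a registered callback, with the library falling back to its built-in algorithm when no callback is supplied.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical callback type: any algorithm reducible to
     * "bytes in, fixed-size code out" could be plugged in this way. */
    typedef uint32_t (*edc_func_t)(const void *buf, size_t nbytes);

    /* The library's built-in algorithm (e.g. Fletcher's). */
    extern uint32_t default_edc(const void *buf, size_t nbytes);

    /* Hypothetical registration point for an application-supplied EDC. */
    static edc_func_t registered_edc = NULL;

    static void register_edc(edc_func_t f)
    {
        registered_edc = f;
    }

    /* All internal EDC computation would go through this wrapper. */
    static uint32_t compute_edc(const void *buf, size_t nbytes)
    {
        return registered_edc ? registered_edc(buf, nbytes)
                              : default_edc(buf, nbytes);
    }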
Unless there is a compelling reason to do otherwise, we recommend initially choosing the same EDC algorithm for both data and metadata. Since the implementation of EDCs for data will precede that for metadata,[††] there is time to change this recommendation. In any case, later implementations may use different algorithms for the two, and the implementation design should take this into account.
Characteristics of the HDF5 library and format make it more
difficult to support error detection for metadata than for
raw data. Raw data error detection involves
less substantial changes to the file format, and could probably be implemented
more quickly. Therefore, we propose a
two-phase process. In Phase 1, we
implement checksumming for dataset arrays, and in Phase 2 we implement error
detection (probably using the same algorithm) for HDF5 metadata.
Here is a tentative plan of action to checksum raw data.
· Research appropriate algorithm(s) for error detection for large amounts of information. Choice of an algorithm should be made based on the speed of the algorithm, the ability to handle very large data arrays, and the ability to compute EDCs on partial data.
· Propose format of new "EDC" object header message for storing in HDF5 object headers of datasets whose raw data is to be checked.
· Propose and agree on API extensions for enabling error detection on a per-object basis in the file. This will most likely be done with a new dataset creation property (a hypothetical sketch of such an interface follows this list).
· Enhance internal library functions to create/update EDC when raw data is written to an object.
· Propose and agree on additional API functions for verifying an EDC on the raw data for an object, etc.
· Implement new API functions.
· Write regression tests for new features.
· Write documentation, tutorial material, etc.
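The API extensions mentioned above might look roughly like the following; every H5*edc* name in this sketch is hypothetical and is shown only to convey the flavor of the interface, while the property-list and dataset calls around them follow the existing HDF5 C API.

    #include "hdf5.h"

    /* The H5Pset_edc_enable, H5Dedc_exists, and H5Dedc_verify routines
     * below are hypothetical; they sketch the proposed extensions only. */
    void edc_example(hid_t file, hid_t space, const hsize_t chunk_dims[2],
                     const int *buf)
    {
        /* Writer: request EDC computation via a dataset creation property. */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk_dims);     /* chunking required at first */
        H5Pset_edc_enable(dcpl, 1);            /* hypothetical */

        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, space, dcpl);
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        /* Reader: ask whether an EDC exists, then verify the data against it. */
        if (H5Dedc_exists(dset) > 0)           /* hypothetical */
            H5Dedc_verify(dset);               /* hypothetical */

        H5Dclose(dset);
        H5Pclose(dcpl);
    }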
Here is a tentative plan of action to detect metadata errors.
· Research appropriate algorithm(s) for detecting errors in small objects. Choice of an algorithm will be influenced by the algorithm chosen for raw data error detection, but that algorithm may not be appropriate for the smaller amounts of information contained in the file's metadata.
· Propose a format change to incorporate an EDC field into the different types of metadata stored in a file (an illustrative sketch follows this list).
· Enhance internal library functions to create/update EDC when metadata is written to a file.
· Write regression tests for new features.
· Write documentation, etc.
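As a purely illustrative sketch of the kind of format change involved (the layout below is ours and is not the actual HDF5 object header format), a piece of metadata could carry a trailing EDC computed over the bytes that precede it and checked whenever that metadata is read back.

    #include <stddef.h>
    #include <stdint.h>

    /* Whatever algorithm is finally chosen (e.g. Fletcher's) stands behind
     * this declaration. */
    extern uint32_t edc_of(const void *buf, size_t nbytes);

    /* Hypothetical layout: a metadata block followed by its EDC. */
    struct meta_block {
        unsigned char body[64];   /* encoded metadata; size is illustrative */
        uint32_t      edc;        /* computed over 'body' just before writing */
    };

    /* On read, a mismatch means the metadata was corrupted on disk. */
    static int meta_block_valid(const struct meta_block *m)
    {
        return edc_of(m->body, sizeof m->body) == m->edc;
    }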
[*] “FITS Checksum Proposal” by Seaman, Pence and Rots is
a good overview of the issues from the perspective of the need for checksums in
a scientific data format. See http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/checksum23may02/checksum23may02.html.
[‡] http://rfc.sunsite.dk/rfc/rfc1071.html. See also RFC 1141 and 1624.
[**] For a
discussion comparing CRC with Fletcher’s and Adler-32 checksums, see http://www.netzmafia.de/rfc/internet-drafts/draft-cavanna-iscsi-crc-vs-cksum-00.txt.
[††] Doing the raw data implementation first will also give us an opportunity to get an idea about speeds, to help us determine whether the cost of computing checksums is high enough to merit making it selectable for metadata.