The NetCDF team would like HDF5 to support strings with UTF-8 Unicode character encoding.
Currently HDF5 officially supports only strings encoded in standard US ASCII. However, the HDF5 File Format Specification and other documentation are ambiguous on this point, and the library does not check the encoding of strings.
Joel Spolsky has written an introduction to character sets and Unicode entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.”
Briefly: standard ASCII defines characters for byte values between 0 and 127; values 128 through 255 are not defined by standard ASCII. UTF-8 byte values between 0 and 127 represent the same characters as ASCII--standard ASCII is a subset of UTF-8. In extended ASCII, byte values 128-255 represent different characters depending on which "code page" is currently loaded; UTF-8 instead uses these values to encode characters outside of unaccented American English.
The convenient side effect of this is that any UTF-8 string can be handled as though it were an ASCII string, although not necessarily one with the same meaning. Any string consisting only of standard ASCII characters (unaccented American English characters) is identical in the ASCII and UTF-8 encodings.
UTF-8 does store some characters as multiple bytes (up to four); every byte of a multibyte character has a value in the range 128-255. NULL termination, space padding, and the C string routines therefore all operate on UTF-8 strings exactly as they do on ASCII strings. Because UTF-8 has multibyte characters, the number of bytes in a string is not necessarily the number of characters in that string, but the number of bytes is usually the important factor for storing and manipulating the string.
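As an illustration (a self-contained sketch; utf8_strlen is a hypothetical helper, not part of any library), counting characters means skipping UTF-8 continuation bytes, while strlen still reports the byte count that storage actually cares about:

    #include <stdio.h>
    #include <string.h>

    /* Count UTF-8 characters by skipping continuation bytes, which
     * always have the bit pattern 10xxxxxx (values 128-191). */
    static size_t utf8_strlen(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        const char *s = "r\xC3\xA9sum\xC3\xA9";  /* "résumé": each é is 2 bytes */
        printf("bytes = %zu, characters = %zu\n", strlen(s), utf8_strlen(s));
        /* prints: bytes = 8, characters = 6 */
        return 0;
    }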
ASCII and UTF-8 strings must be displayed differently for characters outside of the standard ASCII set, but this is the responsibility of the displaying software, not of HDF5. ASCII and UTF-8 strings can be stored and manipulated identically by HDF5.
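As a concrete sketch of what this looks like at the API level, HDF5 releases from 1.8 on let a string datatype declare its encoding through H5Tset_cset; the library records the character set but does not validate the bytes:

    #include "hdf5.h"

    /* Minimal sketch: build a fixed-length, NULL-terminated C-string
     * datatype and declare its encoding as UTF-8.  H5T_CSET_UTF8 is
     * the character-set value defined in HDF5 1.8 and later. */
    hid_t make_utf8_string_type(size_t nbytes)
    {
        hid_t type = H5Tcopy(H5T_C_S1);        /* 1-byte C string base type */
        H5Tset_size(type, nbytes);             /* size in bytes, not characters */
        H5Tset_strpad(type, H5T_STR_NULLTERM); /* same padding rules as ASCII */
        H5Tset_cset(type, H5T_CSET_UTF8);      /* declare the encoding only */
        return type;
    }

An ASCII string type is built identically except that it would pass H5T_CSET_ASCII, which is the default for H5T_C_S1.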
Write tests to ensure that UTF-8 characters can be used in the library. The tests would ensure that strings including non-ASCII characters don't break any functionality and can be returned to the user unaltered. This would need to be tested everywhere the library uses character strings. Table 1 lists where strings are used in the HDF5 API; a sketch of such a round-trip test follows the table.
Table 1. Strings in the HDF5 API

| Object | Uses | Comments |
|--------|------|----------|
| Data (i.e., contents of an attribute or dataset that is an array of strings) | define datatype, get datatype | User data should conform to the encoding specified by the datatype, but the library doesn't check this. |
| Object names and paths (links to group, dataset, named datatype) | create object, open object, iterate group, get name from ID | |
| Reference to object or region | create reference | Should be the same as path names. |
| Soft link linkval | set/get value | Should be the same as path names. |
| Attribute names | create, open, get name | |
| Compound datatype field names | define datatype, get datatype, select fields | |
| ENUM type names | define datatype, retrieved from nameof, getmember | |
| Opaque datatype tag | define datatype, retrieved from gettag | |
| Error strings | define error strings, push messages, retrieve stack | Details changed in 1.8; waiting for documentation. |
| Property class name, filter name | register, get name, get by name | Predefined names do not need a UTF-8 option; probably do not need to support UTF-8 for these. |
| Property name | create property, set/get property value | |
| File names | create, open, get name, is_hdf5, mount/unmount | Depends on the file system? |
| File names: external file, multi file, split file | set, get file names | Depends on the file system? |
| Comment on filters | set/get | No need for UTF-8. |
| Comment on groups | set/get | No need for UTF-8. |
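A sketch of one such round-trip test for object names, using the 1.8-style H5Gcreate2 call and an arbitrary file name ("utf8_test.h5" is a placeholder); the point is only that the bytes handed to the library come back unaltered:

    #include <assert.h>
    #include <string.h>
    #include "hdf5.h"

    /* Sketch of a round-trip test: create a group whose name contains
     * non-ASCII UTF-8 bytes, then read the name back and verify it is
     * returned byte-for-byte. */
    int main(void)
    {
        const char *name = "/caf\xC3\xA9";  /* "/café" */
        char        back[64];

        hid_t file  = H5Fcreate("utf8_test.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t group = H5Gcreate2(file, name, H5P_DEFAULT,
                                 H5P_DEFAULT, H5P_DEFAULT);

        H5Iget_name(group, back, sizeof(back));
        assert(strcmp(back, name) == 0);    /* must come back unaltered */

        H5Gclose(group);
        H5Fclose(file);
        return 0;
    }

Analogous tests would exercise each row of Table 1: attribute names, compound field names, link values, file names, and so on.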