This document assumes a working familiarity with UTF-8 and Unicode.
Any reader who is unfamiliar with UTF-8 encoding should read the Wikipedia UTF-8 article (https://en.wikipedia.org/wiki/UTF-8) before proceeding; it provides an excellent primer.
For our context, the most important UTF-8 concepts are:
More specific technical details will only become important if they affect the specifics of your application design or implementation.
H5Pset_char_encoding, which sets the character encoding used for object and attribute names.
For example, the following call sequence could be used to create a dataset with its name encoded with the UTF-8 character set:
    lcpl_id = H5Pcreate(H5P_LINK_CREATE);
    error = H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8);
    dataset_id = H5Dcreate2(group_id, "datos_ñ", datatype_id, dataspace_id,
                            lcpl_id, H5P_DEFAULT, H5P_DEFAULT);
If the character encoding of an attribute name is unknown, the combination of an H5Aget_create_plist call and an H5Pget_char_encoding call will reveal that information.
If the character encoding of an object name is unknown, the information can be accessed through the object’s H5L_info_t structure, which can be obtained using H5Lvisit or H5Lget_info_by_idx calls.
H5Tset_cset, which sets the character encoding to be used in building a character datatype.
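For example, the following commands could be used to create an 8-byte, UTF-8 encoded string datatype for use in either an attribute or dataset. H5Tset_size takes the size in bytes as an integer, so this datatype holds at most 8 characters, and fewer when any character needs more than one byte:
    datatype_id = H5Tcopy(H5T_C_S1);
    error = H5Tset_cset(datatype_id, H5T_CSET_UTF8);
    error = H5Tset_size(datatype_id, 8);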
If a character or string datatype’s character encoding is unknown, the combination of an H5Aget_type or H5Dget_type call and an H5Tget_cset call can be used to determine it.
Be aware, however, of system or application limitations once data or other information has been extracted from an HDF5 file. The application or system must be designed to accommodate UTF-8 encodings if the information is then used elsewhere in the application or system environment.
Data from a UTF-8 encoded HDF5 datatype, in either a dataset or an attribute, that has been established within an HDF5 application should “just work” within the HDF5 portions of the application.
Linux and Mac OS systems normally handle UTF-8 encoded filenames correctly while Windows systems generally do not.
When working with Unicode text, one can no longer assume a 1:1 correspondence between the number of characters and the data storage requirement.
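To make the difference between character count and storage size concrete, here is a small sketch that counts Unicode code points in a UTF-8 string. The function name utf8_code_points is hypothetical, and the code assumes its input is already valid UTF-8; it performs no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the code points in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (continuation bytes have the bit pattern 10xxxxxx).
 * Assumes valid UTF-8 input; malformed input is not detected. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the name "datos_ñ", this returns 7 while the string occupies 8 bytes, because ñ is encoded as two bytes. Sizes passed to HDF5, such as the argument to H5Tset_size, are byte counts, so buffers should be sized from the byte length rather than the character count. Code points are also not always the same as user-perceived characters, since combining marks are counted separately.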
Mac OS systems generally handle UTF-8 encodings correctly.
Windows systems use a different Unicode encoding, UCS-2 (discussed in the UTF-16 article), at the system level. Within an HDF5 file and application on a Windows system, UTF-8 encoding should work correctly and as expected. Problems may arise, however, when that UTF-8 encoding is exposed directly to the Windows system. For example:
Anytime an HDF5 command-line utility (h5ls or h5dump, for instance) emits text output, the Windows system must interpret the character encodings. If that output is UTF-8 encoded, Windows will correctly interpret only those characters in the ASCII subset of UTF-8.
For object and attribute names: H5Pset_char_encoding, H5Pget_char_encoding
For dataset and attribute datatypes: H5Tset_cset, H5Tget_cset
UTF-8 article on Wikipedia (https://en.wikipedia.org/wiki/UTF-8)
Describes HDF5 Release 1.8.20, November 2017.
Copyright by The HDF Group and the Board of Trustees of the University of Illinois