Linkage disequilibrium (LD) analysis in population genetics looks for non-random associations between alleles at different chromosome loci. The complexity of an LD calculation is O(mn^2), where m is the size of the population being studied and n is the number of loci being considered. Because of the large number of SNPs, genome-level LD analysis is extremely time consuming. On a single-processor workstation this computation could take months, whereas on a large supercomputer it could be done in a matter of days. Our experiments showed that, for chromosome 22, a complete LD analysis could be done within 4 minutes on a 32-node cluster.
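To make the O(mn^2) cost concrete, here is a minimal sketch of a pairwise r^2 computation in Python. It is an illustration, not the implementation used in this work. It assumes biallelic loci encoded as 0/1 alleles over m phased haplotypes, and the function names `r_squared` and `ld_matrix` are hypothetical.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic loci, given one 0/1 allele per haplotype."""
    m = len(hap_a)
    p_a = sum(hap_a) / m                                  # frequency of allele 1 at locus A
    p_b = sum(hap_b) / m                                  # frequency of allele 1 at locus B
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / m   # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # assumption: a monomorphic locus is reported as 0 rather than undefined
    d = p_ab - p_a * p_b                                  # disequilibrium coefficient D
    return d * d / denom


def ld_matrix(loci):
    """Symmetric n x n matrix of r^2 values: about n^2/2 pairs, each costing O(m)."""
    n = len(loci)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i depends only on the input data, so rows could be handed to
        # different processors independently.
        for j in range(i, n):
            value = r_squared(loci[i], loci[j])
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix


# Three loci observed on four haplotypes (toy data).
toy_loci = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
toy_ld = ld_matrix(toy_loci)
```

Because no row of the matrix depends on any other, the outer loop is a natural unit for distributing work across processors; whether this matches the decomposition used in the experiments is not stated here.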
The storage requirements for the linkage array are also very high: O(n^2). To store, exchange and visualize this data, we need a storage technology that was designed to handle large quantities of data, such as HDF5. With HDF5 and compression enabled, the main LD dataset fits into 4.5 MB, and the HDF5 data structures allow us to easily store and retrieve lower-resolution data for visualization with minimal impact on the overall file size (5.5 MB total).
The main contributions of this work are:
- Proposing a parallelization of the LD algorithm so that the results can be generated quickly on large supercomputers.
- Proposing compression and chunking for storing the entire array.
- Proposing a hierarchy of images of decreasing resolution to allow for efficient interactive visualization.
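The storage ideas in the list above can be sketched with the h5py and NumPy packages. This is a rough illustration under stated assumptions, not the code that produced the example files: the dataset names `full` and `reduced_x<factor>`, the 256 x 256 chunk size, the gzip level, the block-averaging scheme and the reduction factors are all hypothetical choices.

```python
import h5py
import numpy as np


def downsample(matrix, factor):
    """Average non-overlapping factor x factor blocks; ragged edges are trimmed."""
    rows = matrix.shape[0] // factor * factor
    cols = matrix.shape[1] // factor * factor
    trimmed = matrix[:rows, :cols]
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))


def write_ld_file(path, ld, factors=(4, 16)):
    """Write the full matrix plus reduced-resolution copies, chunked and gzip-compressed."""
    with h5py.File(path, "w") as f:
        # A chunk may not be larger than the dataset itself, hence the min().
        chunk_shape = (min(256, ld.shape[0]), min(256, ld.shape[1]))
        f.create_dataset("full", data=ld, chunks=chunk_shape,
                         compression="gzip", compression_opts=6)
        for factor in factors:
            # chunks=True lets h5py pick a chunk shape for the smaller arrays.
            f.create_dataset(f"reduced_x{factor}", data=downsample(ld, factor),
                             chunks=True, compression="gzip")
```

Chunking lets a reader decompress only the tiles that overlap a requested region, and the reduced copies add little to the file because each one holds a factor-squared fraction of the elements of the full matrix.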
Examples of LD data in HDF5 files
The LD_22.h5 file contains the LD values, calculated using the r^2 metric, for chromosome 22. The file has three datasets. The Chromosome22 dataset contains the entire LD matrix, and the other two contain the matrix at lower resolutions. In the future, HDFView could be extended so that scientists could make selections in the lower-resolution datasets and zoom directly into the higher-resolution datasets. The LD_19.h5 file uses the same structure for chromosome 19.
You can inspect the files with HDFView after downloading them to your computer.
CAUTION: Attempting to open any dataset except the lowest-resolution one may crash the viewer. This is a known problem: HDFView has trouble opening extremely large datasets. It will be addressed in a future version of HDFView.
As of now, the only way to access the higher-resolution images is to use the "Open As" functionality and manually subset the data. Although compression keeps the stored size of the dataset small, the actual number of elements is still very high. DO NOT CLICK on the Chromosome22 dataset to view the data after opening the file with HDFView. Instead, right-click the dataset and choose "Open As" from the menu. A dialog window with a small image will appear; you may wish to choose "Image" to display the dataset. Use the left mouse button to select a small region.
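The same kind of manual subsetting can also be done outside HDFView. The sketch below uses the h5py package to read only a small rectangular region of a dataset. The helper name `read_block` is hypothetical, and the commented usage line assumes that the Chromosome22 dataset sits at the root of LD_22.h5, which should be checked against the actual file.

```python
import h5py


def read_block(path, dataset, row_start, row_stop, col_start, col_stop):
    """Read one rectangular region of a 2-D dataset without loading the rest."""
    with h5py.File(path, "r") as f:
        # Slicing an h5py dataset reads only the requested elements from disk
        # and returns them as a NumPy array.
        return f[dataset][row_start:row_stop, col_start:col_stop]


# Example call (assumes LD_22.h5 has been downloaded to the current directory):
# block = read_block("LD_22.h5", "Chromosome22", 0, 500, 0, 500)
```

Keeping the requested region small keeps memory use modest even though the full matrix has a very large number of elements.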
Last modified: 15 September 2016