Interacting with HDF5 files

cait uses HDF5 files to store data and intermediate results for later use. This can be voltage traces of triggered events, timestamps of testpulses, calculcated recoil energies and many more. The DataHandler class makes interacting with these files very convenient and safe. You should never be in need of opening the HDF5 file yourself – a practice that we heavily discourage as it can lead to corrupted files. In the following we explore the core features of the DataHandler class.

We start by importing cait

import cait as ai

Mock data and DataHandler object

The HDF5 file used in this notebook can be created using the following commands. If you already have an HDF5 file and a corresponding DataHandler object, you can skip this step.

test_data = ai.data.TestData(filepath='./mock_001', duration=10000)
test_data.generate()
dh = ai.DataHandler(channels=[0,1])
dh.convert_dataset(path_rdt='./', fname='mock_001', path_h5='./')
dh.set_filepath(path_h5='./', fname='mock_001')

dh.calc_mp()
dh.calc_mp(type='testpulses')
dh.calc_mp(type='noise')

dh.calc_additional_mp(type='events', no_of=True)
dh.calc_additional_mp(type='testpulses', no_of=True)
dh.calc_additional_mp(type='noise', no_of=True)

Finally, we create the DataHandler object:

# create DataHandler instance
dh = ai.DataHandler(channels=[0,1])

# set the path to the desired HDF5 file
dh.set_filepath(path_h5='../testdata', fname='mock_001', appendix=False)

DataHandler Instance created.

Inspecting File

A quick way to get a summary of a DataHandler’s properties is to print it.

print(dh)

DataHandler linked to HDF5 file '../testdata/mock_001.h5'
HDF5 file size on disk: 278.7 MiB
Groups in file: ['events', 'noise', 'testpulses'].

Time between first and last testpulse: 2.77 h
First testpulse on/at: 2020-10-16 20:22:06+00:00 (UTC)
Last testpulse on/at: 2020-10-16 23:08:36+00:00 (UTC)

Depending on the available groups/datasets in the HDF5 file, this summary is more or less informative (e.g. the testpulse stats can of course only be printed if available in the file). The filename/path/directory of the HDF5 file can be accessed via the following methods which support both relative and absolute paths. We highly recommend to use those methods when referencing the HDF5 file and to not directly use properties of the DataHandler.

dh.get_filepath() # ../testdata/mock_001.h5
dh.get_filename() # mock_001
dh.get_filedirectory() # ../testdata
dh.get_filedirectory(absolute=True) #/Users/.../testdata

We can get a more detailed view into the contents of the HDF5 file with the content-method which takes the group of interest as an argument. If no group is specified, the datasets of all groups are listed. This method can be used to inspect the datasets’ shapes and dtypes. If you want to get an explanation on how to interpret the output, set print_info=True.

dh.content("events", print_info=False)

events
  add_mainpar                 (2, 444, 16)     float64
  |array_max                  (2, 444)
  |array_min                  (2, 444)
  |var_first_eight            (2, 444)
  |mean_first_eight           (2, 444)
  |var_last_eight             (2, 444)
  |mean_last_eight            (2, 444)
  |var                        (2, 444)
  |mean                       (2, 444)
  |skewness                   (2, 444)
  |max_derivative             (2, 444)
  |ind_max_derivative         (2, 444)
  |min_derivative             (2, 444)
  |ind_min_derivative         (2, 444)
  |max_filtered               (2, 444)
  |ind_max_filtered           (2, 444)
  |skewness_filtered_peak     (2, 444)
  dac_output                  (444,)           float64
  event                       (2, 444, 16384)  float32
  hours                       (444,)           float64
  mainpar                     (2, 444, 10)     float64
  |pulse_height               (2, 444)
  |onset                      (2, 444)
  |rise_time                  (2, 444)
  |decay_time                 (2, 444)
  |slope                      (2, 444)
  time_mus                    (444,)           int32
  time_s                      (444,)           int32

Access Data in the HDF5

The groups and datasets in the HDF5 file underlying the DataHandler are accessed through the get method. To read out the full content of a dataset you specify the group and the dataset name. Alternatively, you can also slice the dataset already in the call to get. E.g. you can access only the first channel of the data, or provide a boolean flag (or a list of indices). The get method supports up to 3 indices corresponding to the (up to) 3 dimensions of the data, and they are understood in increasing order. Therefore, to index only the first and third dimension for example, you have to provide None for the second dimension. Notice that due to a h5py limitation, only one dimension at a time can be indexed with boolean flags/index lists.

# all pulse heights of all channel; shape: (2,444)
dh.get("events", "pulse_height", 0)
# voltage trace of the 37th event in first channel; shape: (16384,)
dh.get("events", "event", 0, 37) 

# voltage traces of all events in first channel with pulse height < 1; shape: (151, 16384)
flag = dh.get("events", "pulse_height", 0) < 1
dh.get("events", "event", 0, flag)

# pulse heights for all channels where pulse height of first channel <1; shape: (2,151)
dh.get("events", "pulse_height", None, flag) 

Note that the following two calls to get yield the same values yet the second one is more memory-efficient because the slicing is done on the HDF5 file. The first option loads the entire dataset into memory and performs the slicing afterwards.

dh.get("events", "event")[0,flag]  # slicing of numpy.ndarray
dh.get("events", "event", 0, flag) # slicing of HDF5 file

Write Data to the HDF5 File

To write data to the HDF5 file, you use the set method for which there are two main use cases:

writing an array into the file as is (i.e. we keep the shape and dtype the same)
creating a multi-channel array and writing only to one channel (e.g. you calculate some property for the phonon and light channel separately but want to write them to the same HDF5 dataset)

The two use cases are illustrated below. Notice that you can write multiple datasets at once as long as they are in the same group. Dataset names and their content are parsed as keyword arguments. If you want the datasets to have a specific dtype, you can hand it to set as well.

import numpy as np

data1, data2 = np.random.rand(1,37,100), np.random.rand(1,37,100)

# Include 'data1' and 'data2' as datasets 'new_ds1' and 'new_ds2' in group 'new_group' 
# ('new_ds1' and 'new_ds2' do not yet exist)
dh.set(group="new_group", new_ds1=data1, new_ds2=data2)

# Include 'data1' and 'data2' as datasets 'ds1_mc' and 'ds2_mc' in group 'new_group2' 
# ('data1' and 'data2' are 1-dimensional but we want to create 2-dimensional 
# datasets (for different channels e.g.) and write the data into the 0-th channel. This also
# works for writing single channels to already existing multi-channel datasets.)
dh.set(group="new_group2", n_channels=2, channel=0, ds1_mc=data1, ds2_mc=data2)

If the dataset(s) already exist in the given group, you have to explicitly allow set to change them:

# Include 'data1' and 'data2' as datasets 'ds1' and 'ds2' in group 'new_group' 
# (either or both of 'ds1' and 'ds2' already exist and have correct shape/dtype for new
# data)
dh.set(group="new_group", ds1=data1, ds2=data2, change_existing=True) 

# Include 'data1' and 'data2' as datasets 'ds1' and 'ds2' in group 'new_group' 
# (either or both of 'ds1' and 'ds2' already exist and have incorrect shape/dtype for new
# data, but we want to force the new dtype/shape)
dh.set(group="new_group", ds1=data1, ds2=data2, overwrite_existing=True)

By calling content again, we can look at the newly created datasets.

Renaming and Deleting Datasets/Groups, Repackaging

You can rename and delete datasets/groups using the methods rename and drop as follows. Just like with set, you can do this for multiple datasets simultaneously as long as they are within the same group. The syntax once more uses keyword arguments.

Disclaimer: Due to the possibility of changing both the group name and dataset names with a single method, renaming groups called group is obviously not possible.

# Rename groups 'new_group' and 'new_group2' to 'new_group_renamed' and 'new_group2_renamed'
dh.rename(new_group='new_group_renamed', new_group2='new_group2_renamed')

# Rename datasets 'ds1' and 'ds2' in group 'new_group_renamed' to 'new_ds1' and 'new_ds2'
dh.rename(group='new_group_renamed', ds1='ds1_renamed', ds2='ds2_renamed')

Groups/datasets are deleted as follows:

# Drop group 'new_group_renamed'
dh.drop("new_group_renamed")

# Drop dataset 'ds1' from group 'new_group_renamed'
dh.drop("new_group2_renamed", "ds1_mc")

Important note: Deleting groups/datasets does not decrease the HDF5 file’s size due to the HDF5 file’s data structure. To decrease the size, repackaging has to be performed for which we provide the method repackage. Alternatively, this method can automatically be called when deleting a dataset/group using the repackage=True argument of the drop method.

dh.repackage()

Successfully repackaged '../testdata/mock_001.h5'. Memory saved: 194.4 KiB

Functionality for Virtual Datasets

The HDF5 library allows to link separate files into a single file without copying the data. The resulting file looks identical to a regular HDF5 file. To avoid confusion in such cases, a number of quality-of-life features are included in DataHandler. The need for virtual datasets arises for example when analysing more than one file at a time. Below we show how files can be combined (see also dedicated tutorial for merging files) and how to interact with them.

import shutil # just used to create a quick copy of our already existing mock data file
import os     # just used to create a quick copy of our already existing mock data file
import cait.versatile as vai

shutil.copy(dh.get_filepath(), os.path.join(dh.get_filedirectory(),'to_merge1.h5'))
shutil.copy(dh.get_filepath(), os.path.join(dh.get_filedirectory(),'to_merge2.h5'))

vai.file.combine(fname="merged_file",
                 files=["to_merge1", "to_merge2"], 
                 src_dir="../testdata",
                 out_dir="../testdata", 
                 groups_combine=["testpulses","noise"], 
                 groups_include=[])

Successfully combined files ['to_merge1', 'to_merge2'] into '../testdata/merged_file.h5' (16.8 KiB).

As you can see, the resulting file size is only a few KiB due to no data being copied. We can now create a DataHandler to the combined file and inspect it with our usual methods:

dh_combined = ai.DataHandler(channels=[0,1])
dh_combined.set_filepath(path_h5='../testdata', fname='merged_file', appendix=False)

print(dh_combined)

DataHandler linked to HDF5 file '../testdata/merged_file.h5'
HDF5 file size on disk: 16.8 KiB
Groups in file: ['noise', 'testpulses'].

The HDF5 file contains virtual datasets linked to the following files: {'../testdata/to_merge2.h5', '../testdata/to_merge1.h5'}
All of the external sources are currently available.

Time between first and last testpulse: 2.77 h
First testpulse on/at: 2020-10-16 20:22:06+00:00 (UTC)
Last testpulse on/at: 2020-10-16 23:08:36+00:00 (UTC)

dh_combined.content("testpulses")

The HDF5 file contains the following groups and datasets, which can be accessed through get(group, dataset). If present, some contents of the mainpar and add_mainpar datasets are displayed as well. For convenience, they can also be accessed through get(), even though they are not separate datasets in the HDF5 file.
Datasets marked with (v) are virtual datasets, i.e. they are stored in another (or multiple other) HDF5 file(s). They are treated like regular datasets but be aware that writing to such datasets actually writes to the respective original files.

testpulses
  add_mainpar             (v) (2, 2666, 16)     float64
  |array_max                  (2, 2666)
  |array_min                  (2, 2666)
  |var_first_eight            (2, 2666)
  |mean_first_eight           (2, 2666)
  |var_last_eight             (2, 2666)
  |mean_last_eight            (2, 2666)
  |var                        (2, 2666)
  |mean                       (2, 2666)
  |skewness                   (2, 2666)
  |max_derivative             (2, 2666)
  |ind_max_derivative         (2, 2666)
  |min_derivative             (2, 2666)
  |ind_min_derivative         (2, 2666)
  |max_filtered               (2, 2666)
  |ind_max_filtered           (2, 2666)
  |skewness_filtered_peak     (2, 2666)
  dac_output              (v) (2666,)           float64
  event                   (v) (2, 2666, 16384)  float32
  hours                   (v) (2666,)           float64
  mainpar                 (v) (2, 2666, 10)     float64
  |pulse_height               (2, 2666)
  |onset                      (2, 2666)
  |rise_time                  (2, 2666)
  |decay_time                 (2, 2666)
  |slope                      (2, 2666)
  testpulseamplitude      (v) (2666,)           float64
  time_mus                (v) (2666,)           int32
  time_s                  (v) (2666,)           int32

We see that print(dh_combined) tells us which files are linked to the HDF5 file of our DataHandler object. In principle it could happend that some of these source files get deleted. In this case, you could still create the DataHandler but subsequent calculations can have unexpected behaviour. This is why print(dh_combined) also tells you whether all sources are available or not.

Using content now also changed slightly as it shows you an indicator next to datasets which are virtual.

In principle, this is all the difference there is. You can do analysis with dh_combined in the same way you would do with dh. The only thing worth mentioning is that writing to virtual datasets is special. Depending on your use case you could either want to change the data in all source files (which could lead to confusion and unexpected results), or you could want to detach the virtual dataset and write to a regular dataset of that name. In most cases, the latter will probably be the desired behaviour. As a safety feature, the set method checks whether you are attempting to write to a virtual dataset or not and you have to make this decision upon calling it.

# Include 'new_data' as dataset 'hours' in group 'noise' 
# ('hours' already exists and is a virtual dataset with matching shape but dtype 'float64'.
# We want to write to the original data in the respective source files.)
old_data = dh_combined.get("noise","hours")
new_data = old_data + 2 # shift in time-axis
dh_combined.set(group="noise", hours=new_data, dtype='float64', write_to_virtual=True)

# Include 'new_data' as dataset 'hours' in group 'noise' 
# ('hours' already exists and is a virtual dataset. We want to overwrite it and create a 
# non-virtual dataset instead)
dh_combined.set(group="noise", hours=new_data, write_to_virtual=False)