DataHandler
- class cait.DataHandler(record_length: int = 16384, sample_frequency: int = 25000, channels: Optional[list] = None, nmbr_channels: Optional[int] = None, run: Optional[str] = None, module: Optional[str] = None)[source]
Bases:
cait.mixins._data_handler_simulate.SimulateMixin, cait.mixins._data_handler_rdt.RdtMixin, cait.mixins._data_handler_plot.PlotMixin, cait.mixins._data_handler_features.FeaturesMixin, cait.mixins._data_handler_analysis.AnalysisMixin, cait.mixins._data_handler_fit.FitMixin, cait.mixins._data_handler_csmpl.CsmplMixin, cait.mixins._data_handler_ml.MachineLearningMixin, cait.mixins._data_handler_bin.BinMixin
A class for the processing of raw data events.
The DataHandler class is one of the core parts of the cait package. An instance of the class is bound to an HDF5 file and stores all data from the recorded binary files (*.rdt, …), as well as the calculated features (main parameters, standard events, …) in the file.
- Parameters
record_length (int) – The number of samples in one record window. To ensure performance of all features, this should be a power of 2.
sample_frequency (int) – The sampling frequency of the recording.
channels (list of integers or None) – The channels in the *.rdt file that belong to the detector module. Attention - the channel number written in the *.par file starts counting from 1, while Cait, CCS and other common software frameworks start counting from 0.
nmbr_channels (int or None) – The total number of channels.
run (string or None) – The number of the measurement run. This is an optional argument that uniquely identifies a measurement with a given module. Providing this argument has no effect, but it can help you stay organized if you start multiple DataHandlers at once.
module (string or None) – The naming of the detector module. Optional argument, for unique identification of the physics data. Providing this argument has no effect, but might be useful in case you start multiple DataHandlers at once, to stay organized.
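The channel-numbering offset and the power-of-two recommendation from the parameter list can be checked before a handler is created. The following is a minimal sketch in plain Python; the helper names par_to_cait_channels and is_power_of_two are hypothetical and not part of cait.

```python
def par_to_cait_channels(par_channels):
    """Convert 1-based channel numbers (as in the *.par file) to 0-based ones."""
    if any(ch < 1 for ch in par_channels):
        raise ValueError("channel numbers in the *.par file start at 1")
    return [ch - 1 for ch in par_channels]


def is_power_of_two(n):
    """Return True if n is a positive power of two (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


channels = par_to_cait_channels([1, 2])   # hypothetical *.par channel numbers
record_length = 16384                     # documented default of DataHandler
print(channels, is_power_of_two(record_length))
```

The resulting list is what would be passed as the channels argument, under the assumption that the channel numbers were read from a *.par file.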
Example for the generation of an HDF5 file for events from a test *.rdt file:
>>> import cait as ai
>>> test_data = ai.data.TestData(filepath='test_001')
>>> test_data.generate(source='hw')
Rdt file written.
Con file written.
Par file written.
>>> dh = ai.DataHandler()
DataHandler instance created.
>>> dh.convert_dataset(path_rdt='./', fname='test_001', path_h5='./')
Start converting.
READ EVENTS FROM RDT FILE.
Total Records in File: 40
Event Counts: 19
WORKING ON EVENTS WITH TPA = 0.
CREATE DATASET WITH EVENTS.
CALCULATE MAIN PARAMETERS.
WORKING ON EVENTS WITH TPA = -1.
CREATE DATASET WITH NOISE.
WORKING ON EVENTS WITH TPA > 0.
CREATE DATASET WITH TESTPULSES.
CALCULATE MP.
Hdf5 dataset created in ./
Filepath and -name saved.
Most of the methods are included via parent mixin classes (see folder cait/mixins).
- content(group: Optional[str] = None, print_info: bool = False)[source]
Print the whole content of the HDF5 and all derived properties. The shape of the datasets as well as their datatypes are also given.
- Parameters
group (string or None) – The name of a group in the HDF5 file of which we print the content. If None, all groups are printed.
print_info (bool) – Print an explanation of the content output.
- downsample_raw_data(type: str = 'events', down: int = 16, dtype: str = 'float32', name_appendix: str = '', delete_old: bool = True, batch_size: int = 1000, repackage: bool = False)[source]
Downsample the dataset “event” from a specified group in the HDF5 file.
In large-scale analyses with limited server space, the converted HDF5 datasets can exceed the available storage. In this scenario, the raw data events can be downsampled by a given factor. Downsampling to sample frequencies below 1 kHz is in many situations sufficient for viewing events and for most feature calculations.
Attention: Without repackaging, this method does NOT decrease the HDF5 file’s size! See cait.DataHandler.repackage() for details.
- Parameters
type (string) – The group in the HDF5 set from which the events are downsampled, typically ‘events’, ‘testpulses’ or ‘noise’.
down (int) – The factor by which the data is downsampled. This should be a power of 2.
dtype (string) – The data type of the new stored events, typically you want this to be float32.
name_appendix (string) – An appendix to the dataset event in order to keep the old and the new events.
delete_old (bool) – If true, the old events are deleted. Deactivate only if a unique name_appendix is chosen.
batch_size (int) – The batch size for the copy, reduce if you face memory problems.
repackage (bool) – If set to True, the HDF5 file will be repackaged.
>>> dh.downsample_raw_data()
Old Dataset Event deleted from group events.
New Dataset Event with downsample rate 16 created in group events.
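To see what a downsampling factor does to the record length and the sample frequency, the following sketch averages blocks of consecutive samples. Block averaging is an assumption made for illustration; the description of downsample_raw_data does not state how cait combines samples. The function downsample is hypothetical.

```python
def downsample(event, down):
    """Average every `down` consecutive samples (block averaging, an assumption)."""
    if len(event) % down != 0:
        raise ValueError("record length must be divisible by the factor")
    return [sum(event[i:i + down]) / down for i in range(0, len(event), down)]


record_length, sample_frequency, down = 16384, 25000, 16   # documented defaults
event = [float(i % 4) for i in range(record_length)]       # synthetic trace

reduced = downsample(event, down)
new_sample_frequency = sample_frequency / down

print(len(reduced))            # 1024 samples per record window
print(new_sample_frequency)    # 1562.5 Hz
```

With the default values, one record window shrinks from 16384 to 1024 samples, which is where the storage saving comes from once the file has been repackaged.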
- drop(group: str, dataset: Optional[str] = None, repackage: bool = False)[source]
Delete a dataset from a specified group in the HDF5 file. If no dataset is provided, the entire group is deleted.
Attention: Without repackaging, this method does NOT decrease the HDF5 file’s size! See cait.DataHandler.repackage() for details.
- Parameters
group (string) – The name of the group in the HDF5 file.
dataset (string or None) – The name of the dataset in the HDF5 file. If None, the whole group is deleted.
repackage (bool) – If set to True, the HDF5 file will be repackaged.
- drop_raw_data(type: str = 'events', repackage: bool = False)[source]
Delete the dataset “event” from a specified group in the HDF5 file.
Attention: Without repackaging, this method does NOT decrease the HDF5 file’s size! See cait.DataHandler.repackage() for details.
- Parameters
type (string) – The group in the HDF5 set from which the events are deleted, typically ‘events’, ‘testpulses’ or ‘noise’.
repackage (bool) – If set to True, the HDF5 file will be repackaged.
>>> dh.drop_raw_data()
Dataset Event deleted from group events.
- generate_startstop()[source]
Generate a startstop data set in the metainfo group from the testpulses time stamps.
- get(group: str, dataset: str, idx0: Optional[Union[int, List[Union[int, bool]]]] = None, idx1: Optional[Union[int, List[Union[int, bool]]]] = None, idx2: Optional[Union[int, List[Union[int, bool]]]] = None)[source]
Get a dataset from the HDF5 file with safe closing of the file stream. The additional indices idx0, idx1 and idx2 can be integers, lists of integers or boolean arrays, and are used where appropriate. For example, a 3-dimensional dataset accepts all three indices, while a 2-dimensional dataset ignores the last one. If boolean arrays are used, their shape has to match the data’s shape along the respective dimension.
- Parameters
group (string) – The name of the group in the HDF5 set.
dataset (string) – The name of the dataset in the HDF5 set. There are special keywords for properties calculated from the main parameters, namely ‘pulse_height’, ‘onset’, ‘rise_time’, ‘decay_time’ and ‘slope’. These are consistent with those used in the cut when generating a standard event.
idx0 (int) – An index passed to the data set inside the HDF5 file as the first index, before it is converted to a numpy array. If left at None value, the slice : operator is passed instead.
idx1 (int) – An index passed to the data set inside the HDF5 file as the second index, before it is converted to a numpy array. If left at None value, the slice : operator is passed instead.
idx2 (int) – An index passed to the data set inside the HDF5 file as the third index, before it is converted to a numpy array. If left at None value, the slice : operator is passed instead.
- Returns
The dataset from the HDF5 file
- Return type
numpy array
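The handling of idx0, idx1 and idx2 described for get() can be pictured as building an index tuple in which every None becomes the full slice and surplus indices are dropped. This is a sketch of that rule only, not cait's implementation; build_index is a hypothetical helper.

```python
def build_index(ndim, idx0=None, idx1=None, idx2=None):
    """Return the index tuple for a dataset with `ndim` dimensions (ndim <= 3)."""
    full = tuple(slice(None) if idx is None else idx for idx in (idx0, idx1, idx2))
    return full[:ndim]   # surplus indices are ignored for lower-dimensional data


# 3-dimensional dataset, e.g. (channels, events, samples): all indices are used
print(build_index(3, idx0=0, idx1=[1, 4]))
# 2-dimensional dataset: idx2 is ignored
print(build_index(2, idx0=0, idx2=7))
```

A tuple built this way could be applied to an array-like object as data[index]; whether cait applies the indices in one step or one dimension at a time is not stated in the text.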
- get_event_iterator(group: str, channel: Optional[int] = None, flag: Optional[List[bool]] = None, batch_size: Optional[int] = None)[source]
Returns an EventIterator object that can be used to iterate over the events of a given group and channel. When used within a with statement, the corresponding HDF5 file is kept open for faster access.
- Parameters
group (string) – The name of the group in the HDF5 file.
channel (int or None) – The channel to use. Defaults to None, which means “all channels”.
flag (list of bool) – A boolean flag that selects the events to include in the iterator.
- Returns
EventIterator
- Return type
Context Manager / Iterator
>>> # Usage as regular iterator (HDF5 file is separately opened/closed for each event)
>>> ev_it = dh.get_event_iterator("events", 0)
>>> for ev in ev_it:
...     print(np.max(ev))
>>> # Usage as context manager (HDF5 file is kept open)
>>> with dh.get_event_iterator("events", 0) as ev_it:
...     for ev in ev_it:
...         print(np.max(ev))
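The difference between the two usage modes can be illustrated with a toy class that is both an iterator and a context manager. This is a sketch of the general pattern, not the EventIterator implementation; SketchEventIterator and its open_count attribute are hypothetical, and the "file" is only a counter.

```python
class SketchEventIterator:
    """Toy iterator that can optionally keep its (pretend) file open."""

    def __init__(self, events):
        self._events = events
        self._keep_open = False   # True only inside a `with` block
        self.open_count = 0       # how often the pretend file was opened

    def __enter__(self):
        self._keep_open = True
        self.open_count += 1      # open once for the whole block
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._keep_open = False   # close again when leaving the block
        return False              # do not suppress exceptions

    def __iter__(self):
        for event in self._events:
            if not self._keep_open:
                self.open_count += 1   # open/close separately per event
            yield event


events = [[0.0, 1.0, 0.5], [0.2, 3.0, 0.1]]

plain = SketchEventIterator(events)
print([max(ev) for ev in plain], plain.open_count)        # one open per event

with SketchEventIterator(events) as managed:
    print([max(ev) for ev in managed], managed.open_count)  # a single open
```

For many events, the single open in the with block is what makes the context-manager form faster.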
- get_filedirectory(absolute: bool = False)[source]
Get the relative path to the directory where the HDF5 file assigned to this instance of DataHandler is stored.
- Parameters
absolute (bool) – If true, the absolute path is returned instead.
- Returns
Path to the directory of the HDF5 file.
- Return type
str
- get_filehandle(path: Optional[str] = None, mode: str = 'r+')[source]
Get the opened filestream to the HDF5 file.
This is usually needed for individual feature calculations, plots or cuts that are not implemented out of the box.
- Parameters
path (string or None) – An alternative full path to the HDF5 file for which the file stream is opened.
- Returns
The opened file stream. Please look into the h5py Python library for details about the file stream.
- Return type
h5py file stream
>>> with dh.get_filehandle() as f:
...     f.keys()
...
<KeysViewHDF5 ['events', 'noise', 'testpulses']>
- get_filename()[source]
Get the name of the HDF5 file assigned to this instance of DataHandler.
- Parameters
absolute (bool) – If true, the absolute path is returned instead.
- Returns
Name of the HDF5 file (without *.h5 extension).
- Return type
str
- get_filepath(absolute: bool = False)[source]
Get the relative path to the HDF5 file assigned to this instance of DataHandler.
- Parameters
absolute (bool) – If true, the absolute path is returned instead.
- Raises
Exception – If filepath has not yet been set (using DataHandler.set_filepath()).
FileNotFoundError – If the HDF5 file corresponding to the set filepath does not exist. In such a case, an empty HDF5 file can be created using DataHandler.init_empty().
- Returns
Path to the file connected to this DataHandler.
- Return type
str
- import_labels(path_labels: str, type: str = 'events', path_h5=None)[source]
Include the *.csv file with the labels into the HDF5 File.
- Parameters
path_labels (string) – Path to the folder that contains the csv file. E.g. “data/” looks for labels in “data/labels_bck_0XX_<type>”.
type (string) – The group name in the HDF5 file of the events, typically “events” or “testpulses”.
path_h5 (string or None) – Provide an alternative full path to the HDF5 file to include the labels, e.g. “data/hdf5s/bck_001[…].h5”.
>>> ei = ai.EventInterface()
Event Interface Instance created.
>>> ei.load_h5(path='./',fname='test_001',channels=[0,1])
Nmbr triggered events: 4
Nmbr testpulses: 11
Nmbr noise: 4
Bck File loaded.
>>> ei.create_labels_csv(path='./')
>>> dh.import_labels(path_labels='./')
Added Labels.
- import_predictions(model: str, path_predictions: str, type: str = 'events', only_channel: Optional[int] = None, path_h5: Optional[str] = None)[source]
Include the *.csv file with the predictions from a machine learning model into the HDF5 File.
- Parameters
model (string) – The naming for the type of model, e.g. Random Forest –> “RF”.
path_predictions (string) – Path to the folder that contains the csv file. E.g. “data/” –> look for predictions in “data/<model>_predictions_<self.fname>_<type>”. If the argument only_channel is not None, then “_Ch<only_channel>” is additionally appended to the name of the file that is looked for.
type (string) – The name of the group in the HDF5 file, typically “events” or “testpulses”.
only_channel (int or None) – If the predictions are only for a specific channel, define here which channel that is.
path_h5 (string or None) – Provide an alternative (full) path to the HDF5 file, e.g. “data/hdf5s/bck_001[…].h5”.
>>> dh.import_predictions(model='RF', path_predictions='./')
Added RF Predictions.
- include_iterator(group: str, dataset: str, it: Type[cait.versatile.iterators._IteratorBaseClass], event_axis: int = 1)[source]
Includes the events returned by an iterator into a specified group/dataset. Note that this method does not support iterators that return events in batches.
- Parameters
group (str) – The target group in the HDF5 file.
dataset (str) – The target dataset in the HDF5 file.
it (Type[IteratorBaseClass]) – The iterator whose events we want to include.
event_axis (int) – The axis along which you want to stack the events. If you include event voltage traces, you most likely want the final dataset to have shape (n_channels, n_events, record_length). In this case, the event_axis is 1. An event_axis of 0 would result in a dataset of shape (n_events, n_channels, record_length). Defaults to 1.
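The effect of event_axis on the final dataset shape can be worked out without touching any file. The sketch below computes the stacked shape for both choices described for include_iterator; stacked_shape is a hypothetical helper, and the numbers are example values.

```python
def stacked_shape(event_shape, n_events, event_axis=1):
    """Shape obtained by stacking n_events arrays of event_shape along event_axis."""
    shape = list(event_shape)
    shape.insert(event_axis, n_events)   # new axis of length n_events
    return tuple(shape)


event_shape = (2, 16384)   # (n_channels, record_length) of a single event
print(stacked_shape(event_shape, n_events=100, event_axis=1))  # (2, 100, 16384)
print(stacked_shape(event_shape, n_events=100, event_axis=0))  # (100, 2, 16384)
```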
- keys(group: Optional[str] = None)[source]
Print the keys of the HDF5 file or a group within it.
- Parameters
group (string or None) – The name of a group in the HDF5 file whose keys are printed.
- record_window(ms=True)[source]
Get the t array corresponding to a typical record window.
- Parameters
ms (bool) – If true, the time is in ms. Otherwise in s.
- Returns
The time array.
- Return type
1D numpy array
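A time axis for one record window follows from the record length and the sample frequency. The sketch below assumes that the axis starts at zero; record_window() may place t = 0 elsewhere in the window, which the text does not specify. The function time_axis is hypothetical and returns a plain list rather than a numpy array.

```python
def time_axis(record_length=16384, sample_frequency=25000, ms=True):
    """Sample times of one record window, in ms (default) or in s."""
    scale = 1000.0 if ms else 1.0
    return [scale * i / sample_frequency for i in range(record_length)]


t = time_axis()
print(len(t))            # one time value per sample
print(t[1] - t[0])       # sample spacing, about 0.04 ms
print(16384 / 25000)     # window duration in s, about 0.655
```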
- rename(group: Optional[str] = None, **kwargs: str)[source]
Rename groups or datasets in the HDF5 file. Names to change are passed as keyword arguments.
By default, group is set to None. In this case, **kwargs are interpreted as HDF5 group names to change.
If group is set (e.g. to ‘events’ or ‘noise’), **kwargs are interpreted as HDF5 dataset names within that group.
Note that renaming virtual datasets, or groups that contain virtual datasets, is not allowed, as this could lead to confusion (it is best practice to keep the dataset names consistent between the ‘master file’ and the source files).
- Parameters
group (str, Default: None) – The group within which we want to rename datasets. If set to None, groups themselves will be renamed.
kwargs (str) – Groups/datasets to be renamed. Pass a keyword argument of the form old_name=new_name for every dataset/group that you want to rename in the HDF5 file.
>>> # Rename groups 'old_group1' and 'old_group2' to 'new_group1' and 'new_group2'
>>> dh.rename(old_group1='new_group1', old_group2='new_group2')
>>> # Rename datasets 'old_ds1' and 'old_ds2' in group 'noise' to 'new_ds1' and 'new_ds2'
>>> dh.rename(group='noise', old_ds1='new_ds1', old_ds2='new_ds2')
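The old_name=new_name keyword pattern used by rename() can be sketched on a nested dictionary that stands in for an HDF5 file. The helper rename_keys is hypothetical, and its error handling (raising KeyError for missing or clashing names) is an assumption, not documented cait behaviour.

```python
def rename_keys(container, **kwargs):
    """Rename keys of a dict in place; kwargs have the form old_name=new_name."""
    for old_name, new_name in kwargs.items():
        if old_name not in container:
            raise KeyError(f"'{old_name}' does not exist")
        if new_name in container:
            raise KeyError(f"'{new_name}' already exists")
        container[new_name] = container.pop(old_name)


# Stand-in for an HDF5 file: group names map to dicts of datasets
h5_like = {"noise": {"old_ds1": [1, 2], "old_ds2": [3, 4]}, "old_group1": {}}

rename_keys(h5_like, old_group1="new_group1")                         # a group
rename_keys(h5_like["noise"], old_ds1="new_ds1", old_ds2="new_ds2")   # datasets
print(sorted(h5_like), sorted(h5_like["noise"]))
```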
- repackage()[source]
Repackage the HDF5 file of DataHandler to reduce its file size in case datasets were deleted previously.
In large-scale analyses with limited server space, the converted HDF5 datasets can exceed the available storage. In this scenario, the raw data events can be deleted after all useful features have been calculated. The events can be included again at a later point if needed. Similarly, temporary datasets that were included earlier can be deleted later to avoid clutter. Unwanted datasets can be dropped using cait.DataHandler.drop() and cait.DataHandler.drop_raw_data(). However, this does not reduce the HDF5 file’s size, due to the tree structure of the HDF5 file. To reduce the file size, the HDF5 file has to be repacked with the h5repack tool of the HDF5 Tools, see https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Repack.
This method is equivalent to the following commands, which can also be run manually, e.g. on Ubuntu/Mac:
$ h5repack test_data/test_001.h5 test_data/test_001_copy.h5
$ rm test_data/test_001.h5
$ mv test_data/test_001_copy.h5 test_data/test_001.h5
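The same three steps can be driven from Python. The sketch below is not the body of repackage(); it only mirrors the commands shown here and assumes that the h5repack executable from the HDF5 tools is on the PATH. The helper names repack_command and repack_in_place are hypothetical.

```python
import os
import subprocess


def repack_command(path, tmp_suffix="_copy"):
    """Build the h5repack call and the name of the temporary output file."""
    root, ext = os.path.splitext(path)
    tmp_path = root + tmp_suffix + ext
    return ["h5repack", path, tmp_path], tmp_path


def repack_in_place(path):
    """Repack `path` into a temporary copy, then replace the original with it."""
    command, tmp_path = repack_command(path)
    subprocess.run(command, check=True)   # raises if h5repack fails or is missing
    os.replace(tmp_path, path)            # same effect as rm followed by mv


command, tmp_path = repack_command("test_data/test_001.h5")
print(command)
```

Only the command is built and printed here; calling repack_in_place() would actually rewrite the file.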
- set(group: str, n_channels: Optional[int] = None, channel: Optional[int] = None, change_existing: bool = False, overwrite_existing: bool = False, write_to_virtual: Optional[bool] = None, dtype: Optional[str] = None, **kwargs: List[Union[float, bool]])[source]
Include data into the HDF5 file. Datasets are passed as keyword arguments, and the keys are used as names for the datasets. E.g. set("events", pulse_heights=data) creates a dataset “pulse_heights” in the group “events”. The shape of the dataset matches the shape of data.
Alternatively, one-dimensional data can be written to a multi-dimensional array (as is often necessary for multiple channels). This is achieved by specifying the number of desired channels (n_channels) and the channel index (channel) to write to. E.g. set("events", n_channels=2, channel=0, pulse_heights=data) creates a “pulse_heights” dataset of shape (2, *data.shape) in the “events” group, and data is written into the 0-th channel. Note that in most cases you probably want data to be of shape (n, ). Otherwise, the result is likely an unexpectedly high-dimensional dataset.
- Parameters
group (string) – The name of the group in the HDF5 file. If it doesn’t exist yet, it will be created.
n_channels (int) – The number of channels that the data should have (first dimension of the dataset).
channel (int) – The channel that the data gets added to.
change_existing (bool) – If set to True, already existing datasets are overwritten. For that, the shape and dtype of the new dataset have to match the already existing one’s.
overwrite_existing (bool) – If set to True, already existing datasets are overwritten in case the new dtype and/or shape does not match the existing dtype/shape.
write_to_virtual (bool, Default: None) – If set to True and the target dataset is an already existing virtual dataset, the new data is written to the virtual dataset, i.e. it will end up in the source HDF5 files. This might be intended but in most cases, you will probably want this to be set to False to avoid unexpectedly changing remote files. Note that this parameter is None by default and has to be set to True or False when attempting to write to a virtual dataset. Note further, that if set to True, the shape and dtype of the new dataset must match the virtual dataset exactly. Otherwise, a non-remote dataset is created regardless.
dtype (string) – The desired dtype of the dataset in the HDF5 file. If none is specified, “bool” and “float32” are used for boolean and numeric arrays, respectively.
kwargs (List[Union[float, bool]]) – Datasets to include. Pass a keyword argument of the form dataset_name=dataset_data for every dataset that you want to include in the HDF5 file.
>>> # Include 'data1' and 'data2' as datasets 'new_ds1' and 'new_ds2' in group 'noise'
>>> # ('new_ds1' and 'new_ds2' do not yet exist)
>>> dh.set(group="noise", new_ds1=data1, new_ds2=data2)
>>> # Include 'data1' and 'data2' as datasets 'ds1' and 'ds2' in group 'noise'
>>> # (either or both of 'ds1' and 'ds2' already exist and have correct shape/dtype for new
>>> # data)
>>> dh.set(group="noise", ds1=data1, ds2=data2, change_existing=True)
>>> # Include 'data1' and 'data2' as datasets 'ds1' and 'ds2' in group 'noise'
>>> # (either or both of 'ds1' and 'ds2' already exist and have incorrect shape/dtype for new
>>> # data, but we want to force the new dtype/shape)
>>> dh.set(group="noise", ds1=data1, ds2=data2, overwrite_existing=True)
>>> # Include 'data1' and 'data2' as datasets 'ds1' and 'ds2' in group 'noise'
>>> # ('data1' and 'data2' are 1-dimensional but we want to create 2-dimensional
>>> # datasets (for different channels e.g.) and write the data into the 0-th channel. This also
>>> # works for writing single channels to already existing multi-channel datasets.)
>>> dh.set(group="noise", n_channels=2, channel=0, ds1=data1, ds2=data2)
>>> # Include 'data1' as dataset 'ds1' in group 'noise'
>>> # ('ds1' already exists and is a virtual dataset with matching shape but dtype 'float64'.
>>> # We want to write to the original data in the respective source files.)
>>> dh.set(group="noise", ds1=data1, dtype='float64', write_to_virtual=True)
>>> # Include 'data1' as dataset 'ds1' in group 'noise'
>>> # ('ds1' already exists and is a virtual dataset. We want to overwrite it and create a
>>> # non-virtual dataset instead)
>>> dh.set(group="noise", ds1=data1, write_to_virtual=False)
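The n_channels/channel mechanism and the default dtype rule described for set() can be sketched with nested lists instead of an HDF5 dataset. Both helpers are hypothetical, and filling the untouched channels with 0.0 is an assumption; the text does not say what cait writes there.

```python
def place_in_channel(data, n_channels, channel, fill=0.0):
    """Embed 1-D data into a nested list of shape (n_channels, len(data))."""
    if not 0 <= channel < n_channels:
        raise IndexError("channel must be in range(n_channels)")
    full = [[fill] * len(data) for _ in range(n_channels)]
    full[channel] = list(data)
    return full


def default_dtype(data):
    """Documented default: 'bool' for boolean data, 'float32' for numeric data."""
    return "bool" if all(isinstance(x, bool) for x in data) else "float32"


pulse_heights = [0.5, 1.2, 0.9]
full = place_in_channel(pulse_heights, n_channels=2, channel=0)
print(len(full), len(full[0]))          # 2 3, i.e. shape (2, *data.shape)
print(default_dtype(pulse_heights))     # float32
print(default_dtype([True, False]))     # bool
```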
- set_filepath(path_h5: str, fname: str, appendix: bool = True, channels: Optional[list] = None)[source]
Set the path to the *.h5 file for further processing.
This function is usually called right after the initialization of a new object. If the instance has already done the conversion from *.rdt to *.h5, the path is already set automatically and the call is obsolete.
- Parameters
path_h5 (string) – The path to the directory that contains the H5 file, e.g. “data/” –> file name “data/bck_001-P_Ch01-L_Ch02.csv”.
fname (string) – The name of the H5 file.
appendix (bool) – If true, an appendix like “-P_ChX-[…]” is automatically appended to the path_h5 string.
channels (list of integers or None) – The channels in the *.rdt file that belong to the detector module. Attention - the channel number written in the *.par file starts counting from 1, while Cait, CCS and other common software frameworks start counting from 0.
>>> dh.set_filepath(path_h5='./', fname='test_001')
- truncate_raw_data(type: str, truncated_idx_low: int, truncated_idx_up: int, dtype: str = 'float32', name_appendix: str = '', delete_old: bool = True, repackage: bool = False)[source]
Truncate the record window of the dataset “event” from a specified group in the HDF5 file.
For measurements with a high event rate (above ground, …), a long record window might be counterproductive because more events pile up within the window. For this reason, you can truncate the length of the record window with this function.
Attention: Without repackaging, this method does NOT decrease the HDF5 file’s size! See cait.DataHandler.repackage() for details.
- Parameters
type (string) – The group in the HDF5 set from which the events are truncated, typically ‘events’, ‘testpulses’ or ‘noise’.
truncated_idx_low (int) – The lower index within the old record window, that becomes the first index in the truncated record window.
truncated_idx_up (int) – The upper index within the old record window, that becomes the last index in the truncated record window.
dtype (string) – The data type of the new stored events, typically you want this to be float32.
name_appendix (string) – An appendix to the dataset event in order to keep the old and the new events.
delete_old (bool) – If true, the old events are deleted. Deactivate only if a unique name_appendix is chosen.
repackage (bool) – If set to True, the HDF5 file will be repackaged.
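The effect of the two truncation indices on the record window can be sketched with a plain slice. The sketch assumes Python's half-open convention, in which the sample at truncated_idx_up is excluded; the parameter description does not confirm whether cait treats the upper index as inclusive. The function truncate and the index values are hypothetical.

```python
def truncate(event, truncated_idx_low, truncated_idx_up):
    """Keep samples truncated_idx_low .. truncated_idx_up - 1 of one event."""
    if not 0 <= truncated_idx_low < truncated_idx_up <= len(event):
        raise ValueError("indices must satisfy 0 <= low < up <= len(event)")
    return event[truncated_idx_low:truncated_idx_up]


record_length = 16384
event = list(range(record_length))       # sample values equal their index

short = truncate(event, 4096, 12288)     # example indices
print(len(short), short[0], short[-1])   # 8192 4096 12287
```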