cait.datasets

class cait.datasets.CryoDataModule(hdf5_path, type, keys, channel_indices, feature_indices=None, transform=None, nmbr_events=None, double=False)[source]

Bases: object

PyTorch Lightning DataModule for processing an HDF5 data set.

Parameters

hdf5_path (string) – Full path to the hdf5 data set.
type (string) – Either events, testpulses or noise; the group of the hdf5 data set that is read.
keys (list of strings) – The keys that are accessed in the hdf5 group.
channel_indices (list of lists or Nones) – Must have the same length as the keys list; the channel indices of the data sets in the group. If None, no index is set (i.e. the h5 data set does not belong to a specific channel).
feature_indices (list of lists or Nones) – Must have the same length as the keys list; the feature indices of the data sets in the group (third idx). If None, no index is set (i.e. there is no third index in the set or all features are chosen).
transform (pytorch transforms class) – Gets applied to every sample when getitem is called.
nmbr_events (int or None) – If set, this is the number of events in the data set; if not, it is extracted from the hdf5 file with len(f['events/event'][0]).
double (bool) – If true all events are cast to double before calculations.
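
The three list arguments are aligned position by position with keys. The following is a minimal sketch of that alignment in plain Python, without importing cait. The helper check_index_lists and the key names "event" and "hours" are hypothetical and chosen only for illustration; they are not part of the library.

```python
def check_index_lists(keys, channel_indices, feature_indices=None):
    """Pair every key with its channel and feature indices (hypothetical helper)."""
    if feature_indices is None:
        # No third index is used for any key.
        feature_indices = [None] * len(keys)
    if not (len(keys) == len(channel_indices) == len(feature_indices)):
        raise ValueError("keys, channel_indices and feature_indices must have equal length")
    return list(zip(keys, channel_indices, feature_indices))


# The key names below are assumptions, not read from a real file.
spec = check_index_lists(
    keys=["event", "hours"],
    channel_indices=[[0, 1], None],  # "hours" is assumed not to belong to a channel
)
for key, channels, features in spec:
    print(key, channels, features)
```

Each tuple describes one data set in the group: which channels are read, and which entries along the third index, if any.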

prepare_data(val_size, test_size, batch_size, nmbr_workers, load_to_memory=False, dataset_size=None, only_idx=None, shuffle_dataset=True, random_seed=None, feature_keys=[], label_keys=[], keys_one_hot=[])[source]

Called once to pass additional information about the data setup.

Parameters

val_size (float between 0 and 1) – The size of the validation set.
test_size (float between 0 and 1) – The size of the test set.
batch_size (int) – The batch size in the training process.
nmbr_workers (int) – The number of worker processes to run; the number of CPUs on the machine is a good choice. This might cause issues if load_to_memory is not activated.
load_to_memory (bool) – Deprecated! Not recommended! If set, the whole data set gets loaded into memory.
dataset_size (int or None) – The size of the whole dataset, gets overwritten if only_idx is set.
only_idx (list of ints or None) – Only these indices are then used from the initial dataset/h5 file.
shuffle_dataset (bool) – The train set gets shuffled after every epoch.
random_seed (int or None) – Seed for the random number generator; set it to reproduce the results.
feature_keys (list of strings) – Data from these keys is supposed to be input to the NN.
label_keys (list of strings) – Data from these keys is supposed to be labels for the NN training.
keys_one_hot (list of strings) – This data gets one-hot encoded.
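
As an illustration of what one-hot encoding of a label means, here is a minimal sketch in plain Python. The function one_hot and its nmbr_classes argument are assumptions made for this example; how the library determines the number of classes is not specified here.

```python
def one_hot(label, nmbr_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere (illustrative only)."""
    if not 0 <= label < nmbr_classes:
        raise ValueError("label must lie in [0, nmbr_classes)")
    encoded = [0.0] * nmbr_classes
    encoded[label] = 1.0
    return encoded


print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
```

A one-hot encoded key occupies nmbr_classes values instead of one, which is why the encoded keys matter for the size of the resulting sample.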

setup()[source]: Called on every worker before start of training, here creation of dataset and splits in samplers are done.

test_dataloader(batch_size=None)[source]

Return the test data loader.

Parameters: batch_size (int) – The batch size.
Returns: Instance of FastDataLoader, a child of the PyTorch DataLoader, developed from within the PyTorch community.
Return type: object

train_dataloader(batch_size=None)[source]

Return the training data loader.

Parameters: batch_size (int) – The batch size.
Returns: Instance of FastDataLoader, a child of the PyTorch DataLoader, developed from within the PyTorch community.
Return type: object

val_dataloader(batch_size=None)[source]

Return the validation data loader.

Parameters: batch_size (int) – The batch size.
Returns: Instance of FastDataLoader, a child of the PyTorch DataLoader, developed from within the PyTorch community.
Return type: object

class cait.datasets.DownSample(keys, down)[source]

Bases: object

Downsample all the time series.

Parameters: keys (list of strings) – The keys in each sample dict that we want to downsample.
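
A transform of this kind is a callable that receives a sample dict and returns it with the selected keys modified. The sketch below assumes that downsampling means averaging consecutive blocks of down values; the library may implement it differently. DownSampleSketch is a hypothetical stand-in, and plain Python lists take the place of NumPy arrays.

```python
def block_average(series, down):
    """Average consecutive blocks of `down` values (assumed downsampling rule)."""
    if len(series) % down != 0:
        raise ValueError("series length must be divisible by down")
    return [sum(series[i:i + down]) / down for i in range(0, len(series), down)]


class DownSampleSketch:
    """Hypothetical stand-in for a downsampling transform."""

    def __init__(self, keys, down):
        self.keys = keys
        self.down = down

    def __call__(self, sample):
        # Only the listed keys are touched; all other entries pass through.
        for key in self.keys:
            sample[key] = block_average(sample[key], self.down)
        return sample


sample = {"event": [1.0, 3.0, 5.0, 7.0], "label": 1}
out = DownSampleSketch(keys=["event"], down=2)(sample)
print(out["event"])  # [2.0, 6.0]
```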

class cait.datasets.H5CryoData(type, keys, channel_indices, feature_indices=None, keys_one_hot=[], hdf5_path=None, transform=None, nmbr_events=None, double=False)[source]

Bases: object

PyTorch Dataset for processing raw data from hdf5 files with a Cait-like file structure.

Parameters

type (string) – Either events, testpulses or noise; the group of the hdf5 data set that is read.
keys (list of strings) – The keys that are accessed in the hdf5 group.
channel_indices (list of lists or Nones) – Must have the same length as the keys list; the channel indices of the data sets in the group. If None, no index is set (i.e. the h5 data set does not belong to a specific channel).
feature_indices (list of lists or Nones) – Must have the same length as the keys list; the feature indices of the data sets in the group (third idx). If None, no index is set (i.e. there is no third index in the set or all features are chosen).
keys_one_hot (list of strings) – The keys that get one hot encoded - important for correct size.
hdf5_path (string or None) – Full path to the hdf5 data set; must be provided if no file handle is set.
transform (pytorch transforms class) – Gets applied to every sample when getitem is called.
nmbr_events (int or None) – If set, this is the number of events in the data set; if not, it is extracted from the hdf5 file with len(self.f['events/event'][0]).
double (bool) – If true all events are cast to double before calculations.

build_sample(f, idx)[source]

Build a sample that is handed to the pytorch model.

Parameters

f (hdf5 file stream) – The file handle to the HDF5 data set.
idx (int) – The index of the event to process from the HDF5 data set.

Returns

A dictionary with the data from the HDF5 data set.

Return type

dict

get_item_no_cache(idx)[source]

Returns the sample of the dataset at idx, without caching the file stream.

Parameters: idx (int) – The index at which we want to get the event.
Returns: Each element is a numpy array.
Return type: dict
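
To make the roles of keys, channel_indices and feature_indices concrete, here is a rough sketch of how a sample dict could be assembled for one event. It assumes the index order (channel, event, feature) suggested by len(f['events/event'][0]) and the "third idx" remark above. Nested Python lists stand in for the HDF5 group, and the function build_sample_sketch, the data set names, and the "_ch"/"_f" naming of the output keys are all hypothetical.

```python
def build_sample_sketch(group, idx, keys, channel_indices, feature_indices):
    """Collect the data of event `idx` into a flat dict (illustrative only)."""
    sample = {}
    for key, channels, features in zip(keys, channel_indices, feature_indices):
        data = group[key]
        if channels is None:
            # Data set without a channel index: the first index is the event.
            sample[key] = data[idx]
            continue
        for c in channels:
            if features is None:
                # Take everything stored for this channel and event.
                sample[f"{key}_ch{c}"] = data[c][idx]
            else:
                for fe in features:
                    sample[f"{key}_ch{c}_f{fe}"] = data[c][idx][fe]
    return sample


# Stand-in for an HDF5 group: 2 channels, 2 events.
group = {
    "event": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    "hours": [0.1, 0.2],
    "mainpar": [[[10, 11, 12], [13, 14, 15]], [[20, 21, 22], [23, 24, 25]]],
}
sample = build_sample_sketch(
    group,
    idx=1,
    keys=["event", "hours", "mainpar"],
    channel_indices=[[0, 1], None, [0]],
    feature_indices=[None, None, [0, 2]],
)
print(sample)
```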

class cait.datasets.Normalize(norm_vals, type='z')[source]

Bases: object

Normalize features with a given mean and std.

Parameters

norm_vals (dictionary) – Each key corresponds to a key in the sample; each value is a list of length two: [mean, std], or [min, max] if type = 'minmax'.
type (string) – 'z' for calculating Z-scores or 'minmax' for scaling from 0 to 1.
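
The two normalization types correspond to the usual formulas (x - mean) / std and (x - min) / (max - min). A minimal sketch with plain Python lists follows; NormalizeSketch is a hypothetical stand-in, and its argument is named norm_type rather than type only to avoid shadowing the Python builtin.

```python
class NormalizeSketch:
    """Hypothetical stand-in that normalizes selected keys of a sample dict."""

    def __init__(self, norm_vals, norm_type="z"):
        if norm_type not in ("z", "minmax"):
            raise ValueError("norm_type must be 'z' or 'minmax'")
        self.norm_vals = norm_vals
        self.norm_type = norm_type

    def __call__(self, sample):
        for key, (a, b) in self.norm_vals.items():
            if self.norm_type == "z":
                # a is the mean, b the standard deviation.
                sample[key] = [(v - a) / b for v in sample[key]]
            else:
                # a is the minimum, b the maximum.
                sample[key] = [(v - a) / (b - a) for v in sample[key]]
        return sample


z = NormalizeSketch({"event": [4.0, 2.0]}, "z")({"event": [2.0, 4.0, 6.0]})
mm = NormalizeSketch({"event": [2.0, 6.0]}, "minmax")({"event": [2.0, 4.0, 6.0]})
print(z["event"])   # [-1.0, 0.0, 1.0]
print(mm["event"])  # [0.0, 0.5, 1.0]
```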

class cait.datasets.PileUpDownSample(keys, down)[source]

Bases: object

A transform that downsamples samples with pile up events.

This differs from the usual downsample transform because pile-up events form a data set of shape (2, record_length), while usual events form a data set of shape (record_length).

Parameters

keys (list) – The keys of the sample (which is a dict) that are to be downsampled.
down (int) – The value by which we want to downsample.
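
The shape difference can be handled by downsampling each of the two rows separately. The sketch below assumes block averaging along the record axis, which is an assumption about the method rather than a statement about the library. The function downsample_rows is hypothetical, and nested lists stand in for a NumPy array of shape (2, record_length).

```python
def downsample_rows(rows, down):
    """Block-average every row of a 2D nested list along its last axis."""
    result = []
    for row in rows:
        if len(row) % down != 0:
            raise ValueError("record length must be divisible by down")
        result.append([sum(row[i:i + down]) / down for i in range(0, len(row), down)])
    return result


pile_up = [[1.0, 3.0, 5.0, 7.0], [0.0, 2.0, 4.0, 6.0]]  # shape (2, 4)
print(downsample_rows(pile_up, down=2))  # [[2.0, 6.0], [1.0, 5.0]]
```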

class cait.datasets.RemoveOffset(keys)[source]

Bases: object

Remove the offset from all events.

Parameters: keys (list of strings) – The keys in each sample dict from which we want to remove the offset.
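
How the offset is estimated is not stated here. A common choice, assumed in the sketch below, is the mean of the first part of the record, before the pulse starts. The function remove_offset and its baseline_fraction argument are hypothetical.

```python
def remove_offset(series, baseline_fraction=0.125):
    """Subtract the mean of the leading samples (assumed baseline estimate)."""
    n = max(1, int(len(series) * baseline_fraction))
    offset = sum(series[:n]) / n
    return [v - offset for v in series]


record = [2.0, 2.0, 2.0, 2.0, 5.0, 9.0, 4.0, 2.0]
print(remove_offset(record))  # [0.0, 0.0, 0.0, 0.0, 3.0, 7.0, 2.0, 0.0]
```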

class cait.datasets.SingleMinMaxNorm(keys)[source]

Bases: object

A transform that normalizes to the min-max range 0 to 1.

Parameters: keys (list) – The keys of the sample (which is a dict) that are to be normalized.
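
Unlike Normalize with type 'minmax', which takes its [min, max] from norm_vals, the name of this transform suggests that each sample is scaled by its own minimum and maximum. A short sketch under that assumption follows; single_min_max is a hypothetical function, and mapping a constant series to zeros is a choice made here, not documented behaviour.

```python
def single_min_max(series):
    """Scale one series to [0, 1] using its own minimum and maximum."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Constant series: avoid division by zero (assumed convention).
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


print(single_min_max([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```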

class cait.datasets.ToTensor[source]

Bases: object

Convert numpy arrays in sample to Tensors.

cait.datasets.get_random_samplers(test_size, val_size, dataset_size=None, only_idx=None, shuffle_dataset=True, random_seed=None)[source]

Chooses the indices for the training, validation and test splits of the dataset.

Parameters

test_size – float between 0 and 1, the size of the test set
val_size – float between 0 and 1, the size of the validation set
dataset_size – int, the size of the whole dataset
only_idx – list of ints or None, if set only these indices from the dataset are included
shuffle_dataset – when true, the dataset indices are shuffled before they are assigned to the splits
random_seed – set to a fixed value to always get the same datasets, for comparability

Returns

Indices for the training, validation and test sets.
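
The splitting logic described above can be sketched with the standard library alone. The function split_indices is a hypothetical illustration that returns plain index lists; the rounding of the split sizes and the order in which the splits are cut from the shuffled list are assumptions, not the library's documented behaviour.

```python
import random


def split_indices(test_size, val_size, dataset_size=None, only_idx=None,
                  shuffle_dataset=True, random_seed=None):
    """Return (train, val, test) index lists (illustrative only)."""
    if only_idx is not None:
        # only_idx takes precedence over dataset_size.
        indices = list(only_idx)
    elif dataset_size is not None:
        indices = list(range(dataset_size))
    else:
        raise ValueError("either dataset_size or only_idx must be given")

    if shuffle_dataset:
        # A dedicated generator keeps the global random state untouched.
        random.Random(random_seed).shuffle(indices)

    n_test = int(test_size * len(indices))
    n_val = int(val_size * len(indices))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train, val, test = split_indices(0.25, 0.25, dataset_size=8, random_seed=0)
print(len(train), len(val), len(test))  # 4 2 2
```

Calling the function twice with the same random_seed yields the same split, which is what makes results comparable between runs.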