cait.datasets

class cait.datasets.CryoDataModule(hdf5_path, type, keys, channel_indices, feature_indices=None, transform=None, nmbr_events=None, double=False)[source]

Bases: object

PyTorch Lightning DataModule for processing an HDF5 data set.

Parameters
  • hdf5_path (string) – Full path to the hdf5 data set.

  • type (string) – Either events, testpulses or noise: the group index of the HDF5 data set.

  • keys (list of strings) – The keys that are accessed in the hdf5 group.

  • channel_indices (list of lists or Nones) – Must have the same length as the keys list; the channel indices of the data sets in the group. If None, no index is set (i.e. the HDF5 data set does not belong to a specific channel).

  • feature_indices (list of lists or Nones) – Must have the same length as the keys list; the feature indices of the data sets in the group (third idx). If None, no index is set (i.e. there is no third index in the set or all features are chosen).

  • transform (pytorch transforms class) – Gets applied to every sample when getitem is called.

  • nmbr_events (int or None) – If set, this is the number of events in the data set; if not, it is extracted from the hdf5 file with len(f['events/event'][0]).

  • double (bool) – If true all events are cast to double before calculations.
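The relationship between keys, channel_indices and feature_indices can be illustrated with a small sketch. It does not use cait or h5py: the group is mimicked by a dictionary of NumPy arrays, and the helper name select as well as the array layout (first index channel, second index event, third index feature) are assumptions made for illustration only.

```python
import numpy as np

# Stand-in for one HDF5 group: 2 channels, 5 events, 8 values per event.
# (Hypothetical data; a real file would be opened with h5py instead.)
group = {
    "event": np.arange(2 * 5 * 8, dtype=float).reshape(2, 5, 8),
    "labels": np.arange(2 * 5).reshape(2, 5),
    "hours": np.linspace(0.0, 1.0, 5),  # not channel-specific
}

keys = ["event", "labels", "hours"]
channel_indices = [[0, 1], [0], None]  # one entry per key
feature_indices = [None, None, None]   # one entry per key


def select(group, key, channels, features, idx):
    """Pick event `idx` of `key`, optionally restricted by channel and feature."""
    data = group[key]
    if channels is None:
        out = data[idx]            # data set without a channel axis
    else:
        out = data[channels, idx]  # first axis: channel, second axis: event
    if features is not None:
        out = out[..., features]   # third axis: feature
    return np.asarray(out)


sample = {
    key: select(group, key, channels, features, idx=3)
    for key, channels, features in zip(keys, channel_indices, feature_indices)
}
```

The three lists are walked in parallel, which is why they need the same length.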

prepare_data(val_size, test_size, batch_size, nmbr_workers, load_to_memory=False, dataset_size=None, only_idx=None, shuffle_dataset=True, random_seed=None, feature_keys=[], label_keys=[], keys_one_hot=[])[source]

Called once to hand over additional information about the data setup.

Parameters
  • val_size (float between 0 and 1) – The size of the validation set.

  • test_size (float between 0 and 1) – The size of the test set.

  • batch_size (int) – The batch size in the training process.

  • nmbr_workers (int) – The number of worker processes to run; the number of CPUs on the machine is a good choice. This might cause issues if load_to_memory is not activated.

  • load_to_memory (bool) – Deprecated and not recommended. If set, the whole data set gets loaded into memory.

  • dataset_size (int or None) – The size of the whole dataset, gets overwritten if only_idx is set.

  • only_idx (list of ints or None) – Only these indices are then used from the initial dataset/h5 file.

  • shuffle_dataset (bool) – The train set gets shuffled after every epoch.

  • random_seed (int or None) – Random seed to use if the results should be reproducible.

  • feature_keys (list of strings) – Data from these keys is supposed to be input to the NN.

  • label_keys (list of strings) – Data from these keys is supposed to be labels for the NN training.

  • keys_one_hot (list of strings) – This data gets one-hot encoded.
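As an illustration of what one-hot encoding does to a label, here is a minimal sketch. The function name one_hot and the argument nmbr_classes are hypothetical and not part of cait; how the library determines the number of classes is not stated here.

```python
import numpy as np


def one_hot(label, nmbr_classes):
    """Return a vector of zeros with a single 1 at position `label`."""
    encoded = np.zeros(nmbr_classes, dtype=np.float32)
    encoded[int(label)] = 1.0
    return encoded


encoded_label = one_hot(2, nmbr_classes=4)
```

An encoded label is longer than the scalar it replaces, so the size of the sample changes with it.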

setup()[source]

Called on every worker before the start of training. The data set is created and split into samplers here.

test_dataloader(batch_size=None)[source]

Return the test data loader.

Parameters

batch_size (int) – The batch size.

Returns

Instance of FastDataLoader, a child of the PyTorch DataLoader, developed from within the PyTorch community.

Return type

object

train_dataloader(batch_size=None)[source]

Return the training data loader.

Parameters

batch_size (int) – The batch size.

Returns

Instance of FastDataLoader, a child of the PyTorch DataLoader, developed from within the PyTorch community.

Return type

object

val_dataloader(batch_size=None)[source]

Return the validation data loader.

Parameters

batch_size (int) – The batch size.

Returns

Instance of FastDataLoader, a child of the PyTorch DataLoader, developed from within the PyTorch community.

Return type

object

class cait.datasets.DownSample(keys, down)[source]

Bases: object

Downsample all the time series.

Parameters

keys (list of strings) – The keys in each sample-dict that we want to downsample.
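A transform of this kind can be sketched as a callable that acts on selected keys of a sample dictionary. The class below is a hypothetical stand-in, not the cait implementation: it assumes that downsampling means averaging consecutive blocks of down values and that any remainder at the end of the series is dropped.

```python
import numpy as np


class BlockMeanDownSample:
    """Hypothetical stand-in: average consecutive blocks of `down` values."""

    def __init__(self, keys, down):
        self.keys = keys
        self.down = down

    def __call__(self, sample):
        out = dict(sample)  # leave the input sample untouched
        for key in self.keys:
            series = np.asarray(sample[key], dtype=float)
            usable = len(series) // self.down * self.down  # drop the remainder
            out[key] = series[:usable].reshape(-1, self.down).mean(axis=1)
        return out


sample = {"event": np.arange(8.0), "label": 1}
reduced = BlockMeanDownSample(keys=["event"], down=2)(sample)
```

Keys that are not listed, such as label here, pass through unchanged.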

class cait.datasets.H5CryoData(type, keys, channel_indices, feature_indices=None, keys_one_hot=[], hdf5_path=None, transform=None, nmbr_events=None, double=False)[source]

Bases: object

PyTorch Dataset for processing raw data from HDF5 files with a Cait-like file structure.

Parameters
  • type (string) – Either events, testpulses or noise: the group index of the HDF5 data set.

  • keys (list of strings) – The keys that are accessed in the hdf5 group.

  • channel_indices (list of lists or Nones) – Must have the same length as the keys list; the channel indices of the data sets in the group. If None, no index is set (i.e. the HDF5 data set does not belong to a specific channel).

  • feature_indices (list of lists or Nones) – Must have the same length as the keys list; the feature indices of the data sets in the group (third idx). If None, no index is set (i.e. there is no third index in the set or all features are chosen).

  • keys_one_hot (list of strings) – The keys that get one-hot encoded; important for the correct size.

  • hdf5_path (string or None) – Full path to the hdf5 data set; must be provided if no file handle is set.

  • transform (pytorch transforms class) – Gets applied to every sample when getitem is called.

  • nmbr_events (int or None) – If set, this is the number of events in the data set; if not, it is extracted from the hdf5 file with len(self.f['events/event'][0]).

  • double (bool) – If true all events are cast to double before calculations.

build_sample(f, idx)[source]

Build a sample that is handed to the pytorch model.

Parameters
  • f (hdf5 file stream) – The file handle to the HDF5 data set.

  • idx (int) – The index of the event to process from the HDF5 data set.

Returns

A dictionary with the data from the HDF5 data set.

Return type

dict

get_item_no_cache(idx)[source]

Returns the sample of the dataset at idx, without caching the file stream.

Parameters

idx (int) – The index at which we want to get the event.

Returns

The sample as a dictionary; each element is a numpy array.

Return type

dict

class cait.datasets.Normalize(norm_vals, type='z')[source]

Bases: object

Normalize features to a given mean and standard deviation.

Parameters
  • norm_vals (dictionary) – Each key corresponds to a key in the sample; each value is a list of length two: [mean, std], or if type = 'minmax' then [min, max].

  • type (string) – 'z' for calculating Z-scores or 'minmax' for scaling from 0 to 1.
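The two normalization modes correspond to the usual formulas (x - mean) / std and (x - min) / (max - min). The sketch below applies them to a sample dictionary; the function name normalize and the argument name norm_type are hypothetical (the latter is renamed from type only to avoid shadowing the Python built-in), and the code is not taken from cait.

```python
import numpy as np


def normalize(sample, norm_vals, norm_type="z"):
    """Scale the listed keys of `sample` with fixed, externally given values."""
    out = dict(sample)
    for key, (first, second) in norm_vals.items():
        values = np.asarray(sample[key], dtype=float)
        if norm_type == "z":
            out[key] = (values - first) / second            # [mean, std]
        elif norm_type == "minmax":
            out[key] = (values - first) / (second - first)  # [min, max]
        else:
            raise ValueError(f"unknown normalization type: {norm_type!r}")
    return out


z_scored = normalize({"x": [2.0, 4.0]}, {"x": [2.0, 2.0]}, norm_type="z")
scaled = normalize({"x": [5.0]}, {"x": [0.0, 10.0]}, norm_type="minmax")
```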

class cait.datasets.PileUpDownSample(keys, down)[source]

Bases: object

A transform that downsamples samples with pile up events.

This differs from the usual downsample transform because pile-up events form a data set of shape (2, record_length), while usual events form a data set of shape (record_length).

Parameters
  • keys (list) – The keys of the sample (which is a dict) that are to be downsampled.

  • down (int) – The value by which we want to downsample.
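The shape difference matters because a flat reshape over a (2, record_length) array would merge the two rows. A minimal sketch that keeps the rows separate follows; the function name downsample_rows and the choice of block averaging are assumptions, not the cait implementation.

```python
import numpy as np


def downsample_rows(events, down):
    """Average blocks of `down` values along the last axis, row by row."""
    events = np.asarray(events, dtype=float)
    nmbr_rows, record_length = events.shape
    usable = record_length // down * down  # drop the remainder
    return events[:, :usable].reshape(nmbr_rows, -1, down).mean(axis=2)


pile_up = np.arange(16, dtype=float).reshape(2, 8)  # shape (2, record_length)
reduced_pile_up = downsample_rows(pile_up, down=4)
```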

class cait.datasets.RemoveOffset(keys)[source]

Bases: object

Remove the offset from all events.

Parameters

keys (list of strings) – The keys in each sample-dict from which we want to remove the offset.
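One common way to remove an offset is to subtract the mean of the first part of the record, before the pulse starts. The sketch below does this; the function name remove_offset and the parameter baseline_fraction are hypothetical, and how cait estimates the offset is not specified here.

```python
import numpy as np


def remove_offset(sample, keys, baseline_fraction=0.125):
    """Subtract the mean of the leading part of each listed time series."""
    out = dict(sample)
    for key in keys:
        event = np.asarray(sample[key], dtype=float)
        nmbr_baseline = max(1, int(event.shape[-1] * baseline_fraction))
        offset = event[..., :nmbr_baseline].mean(axis=-1, keepdims=True)
        out[key] = event - offset
    return out


shifted = remove_offset({"event": [5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]}, ["event"])
```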

class cait.datasets.SingleMinMaxNorm(keys)[source]

Bases: object

A transform that normalizes to the min-max range 0 to 1.

Parameters

keys (list) – The keys of the sample (which is a dict) that are to be normalized.
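In contrast to normalization with fixed values, a single-sample min-max normalization takes the minimum and maximum from the sample itself. A minimal sketch, assuming that a constant series is mapped to zeros (this edge-case handling and the function name single_min_max_norm are hypothetical):

```python
import numpy as np


def single_min_max_norm(sample, keys):
    """Scale each listed series to [0, 1] using its own minimum and maximum."""
    out = dict(sample)
    for key in keys:
        values = np.asarray(sample[key], dtype=float)
        low, high = values.min(), values.max()
        span = high - low
        # A constant series has no range; map it to zeros (assumed behaviour).
        out[key] = (values - low) / span if span > 0 else np.zeros_like(values)
    return out


normed = single_min_max_norm({"event": [2.0, 4.0, 6.0], "flat": [3.0, 3.0]}, ["event", "flat"])
```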

class cait.datasets.ToTensor[source]

Bases: object

Convert numpy arrays in sample to Tensors.

cait.datasets.get_random_samplers(test_size, val_size, dataset_size=None, only_idx=None, shuffle_dataset=True, random_seed=None)[source]

Choose the indices for the split data sets.

Parameters
  • test_size – float between 0 and 1, the size of the testset

  • val_size – float between 0 and 1, the size of the validation set

  • dataset_size – int, the size of the whole data set

  • only_idx – list of ints or None, if set only these indices from the dataset are included

  • shuffle_dataset – when true, the data set is shuffled before the indices are assigned

  • random_seed – set to some value to always get the same data sets, for comparability

Returns

indices for training, validation and test set
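The splitting logic described by these parameters can be sketched as follows. This is an illustration under assumptions, not the cait source: the function name split_indices is hypothetical, the test and validation sizes are rounded down, and the remaining indices form the training set.

```python
import numpy as np


def split_indices(test_size, val_size, dataset_size=None, only_idx=None,
                  shuffle_dataset=True, random_seed=None):
    """Return (train, validation, test) index arrays without overlap."""
    if only_idx is not None:
        indices = np.asarray(only_idx)  # restrict to the given subset
    else:
        indices = np.arange(dataset_size)
    if shuffle_dataset:
        rng = np.random.default_rng(random_seed)
        indices = rng.permutation(indices)
    nmbr_test = int(np.floor(test_size * len(indices)))
    nmbr_val = int(np.floor(val_size * len(indices)))
    test_idx = indices[:nmbr_test]
    val_idx = indices[nmbr_test:nmbr_test + nmbr_val]
    train_idx = indices[nmbr_test + nmbr_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(0.2, 0.1, dataset_size=100, random_seed=0)
```

Passing the same random_seed again yields the same three index sets, which keeps runs comparable.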