cait.datasets
- class cait.datasets.CryoDataModule(hdf5_path, type, keys, channel_indices, feature_indices=None, transform=None, nmbr_events=None, double=False)[source]
Bases: object
PyTorch Lightning DataModule for processing an HDF5 dataset.
- Parameters
hdf5_path (string) – Full path to the hdf5 data set.
type (string) – Either events, testpulses or noise; the group index of the hdf5 data set.
keys (list of strings) – The keys that are accessed in the hdf5 group.
channel_indices (list of lists or Nones) – Must have the same length as the keys list; the channel indices of the data sets in the group. If None, no index is set (i.e. the h5 data set does not belong to a specific channel).
feature_indices (list of lists or Nones) – Must have the same length as the keys list; the feature indices of the data sets in the group (third idx). If None, no index is set (i.e. there is no third index in the set, or all features are chosen).
transform (pytorch transforms class) – Applied to every sample when getitem is called.
nmbr_events (int or None) – If set, this is the number of events in the data set; if not, it is extracted from the hdf5 file with len(f['events/event'][0]).
double (bool) – If True, all events are cast to double before calculations.
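How keys, channel_indices and feature_indices line up can be shown with a minimal sketch. The helper check_selection and the key names "event" and "labels" are hypothetical and only illustrate the same-length requirement; they are not part of cait.

```python
def check_selection(keys, channel_indices, feature_indices=None):
    """Pair every key with its channel and feature indices (hypothetical helper)."""
    if feature_indices is None:
        # No third index for any key.
        feature_indices = [None] * len(keys)
    if not (len(keys) == len(channel_indices) == len(feature_indices)):
        raise ValueError(
            "keys, channel_indices and feature_indices must have the same length"
        )
    return {k: (c, f) for k, c, f in zip(keys, channel_indices, feature_indices)}


selection = check_selection(
    keys=["event", "labels"],       # illustrative key names
    channel_indices=[[0, 1], [0]],  # one entry per key
    feature_indices=[None, None],   # None: no third index is used
)
```

Each entry of the returned dict holds the channel and feature indices that belong to one key, which is the pairing the parameter descriptions above require.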
- prepare_data(val_size, test_size, batch_size, nmbr_workers, load_to_memory=False, dataset_size=None, only_idx=None, shuffle_dataset=True, random_seed=None, feature_keys=[], label_keys=[], keys_one_hot=[])[source]
Called once to pass additional information about the data setup.
- Parameters
val_size (float between 0 and 1) – The size of the validation set.
test_size (float between 0 and 1) – The size of the test set.
batch_size (int) – The batch size in the training process.
nmbr_workers (int) – The number of worker processes to run; the number of CPUs on the machine is usually the best choice. This might cause issues if load_to_memory is not activated.
load_to_memory (bool) – Deprecated and not recommended. If set, the whole data set is loaded into memory.
dataset_size (int or None) – The size of the whole dataset, gets overwritten if only_idx is set.
only_idx (list of ints or None) – Only these indices are then used from the initial dataset/h5 file.
shuffle_dataset (bool) – The train set gets shuffled after every epoch.
random_seed (int or None) – Random seed, set to make the results reproducible.
feature_keys (list of strings) – Data from these keys is used as input to the NN.
label_keys (list of strings) – Data from these keys is used as labels for the NN training.
keys_one_hot (list of strings) – This data gets one-hot encoded.
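As an illustration of what one-hot encoding means for the keys listed in keys_one_hot, here is a minimal sketch in plain Python. The function one_hot and its argument nmbr_classes are hypothetical; how cait determines the number of classes is not stated in this reference.

```python
def one_hot(label, nmbr_classes):
    """Return a one-hot vector for an integer class label (illustrative only)."""
    if not 0 <= label < nmbr_classes:
        raise ValueError("label must lie in [0, nmbr_classes)")
    vec = [0.0] * nmbr_classes
    vec[label] = 1.0
    return vec


encoded = one_hot(2, nmbr_classes=4)
```

The encoded label has length nmbr_classes instead of length one, which is why one-hot encoded keys affect the size of the data handed to the network.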
- setup()[source]
Called on every worker before the start of training; the dataset is created and split into samplers here.
- test_dataloader(batch_size=None)[source]
Return the test data loader.
- Parameters
batch_size (int) – The batch size.
- Returns
Instance of FastDataLoader, a child of the PyTorch DataLoader, developed from within the PyTorch community.
- Return type
object
- class cait.datasets.DownSample(keys, down)[source]
Bases: object
Downsample all the time series.
- Parameters
keys (list of strings) – The keys in each sample dict that we want to downsample.
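A downsampling transform of this kind can be sketched as follows. This assumes that downsampling by a factor down means averaging blocks of down consecutive values, which is an assumption and not confirmed by this reference; the function downsample and the key "event" are hypothetical.

```python
def downsample(series, down):
    """Average consecutive blocks of `down` values (assumed downsampling rule)."""
    if len(series) % down != 0:
        raise ValueError("record length must be divisible by down")
    return [sum(series[i:i + down]) / down for i in range(0, len(series), down)]


# A sample is a dict; only the listed keys are downsampled.
sample = {"event": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0], "label": 1}
for key in ["event"]:
    sample[key] = downsample(sample[key], down=4)
```

Keys that are not listed, such as "label" here, are left untouched.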
- class cait.datasets.H5CryoData(type, keys, channel_indices, feature_indices=None, keys_one_hot=[], hdf5_path=None, transform=None, nmbr_events=None, double=False)[source]
Bases: object
PyTorch Dataset for processing raw data from hdf5 files with a Cait-like file structure.
- Parameters
type (string) – Either events, testpulses or noise; the group index of the hdf5 data set.
keys (list of strings) – The keys that are accessed in the hdf5 group.
channel_indices (list of lists or Nones) – Must have the same length as the keys list; the channel indices of the data sets in the group. If None, no index is set (i.e. the h5 data set does not belong to a specific channel).
feature_indices (list of lists or Nones) – Must have the same length as the keys list; the feature indices of the data sets in the group (third idx). If None, no index is set (i.e. there is no third index in the set, or all features are chosen).
keys_one_hot (list of strings) – The keys that get one-hot encoded; important for the correct size.
hdf5_path (string or None) – Full path to the hdf5 data set; must be provided if no file handle is set.
transform (pytorch transforms class) – Applied to every sample when getitem is called.
nmbr_events (int or None) – If set, this is the number of events in the data set; if not, it is extracted from the hdf5 file with len(self.f['events/event'][0]).
double (bool) – If True, all events are cast to double before calculations.
- class cait.datasets.Normalize(norm_vals, type='z')[source]
Bases: object
Normalize features to a given mean and std.
- Parameters
norm_vals (dictionary) – Each key corresponds to a key in the sample; each value is a list of length two: [mean, std], or [min, max] if type = 'minmax'.
type (string) – 'z' for calculating Z-scores or 'minmax' for scaling from 0 to 1.
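The two normalization modes can be written out as a minimal sketch. The function normalize and the key "pulse_height" are hypothetical; the formulas are the standard Z-score, (x - mean) / std, and min-max scaling, (x - min) / (max - min), which is what the parameter descriptions suggest but is not confirmed here as the class's exact implementation.

```python
def normalize(sample, norm_vals, type="z"):
    """Normalize the listed keys of a sample dict (illustrative sketch).

    The argument name `type` mirrors the documented parameter.
    """
    out = dict(sample)
    for key, (a, b) in norm_vals.items():
        if type == "z":
            # a = mean, b = std
            out[key] = [(x - a) / b for x in sample[key]]
        elif type == "minmax":
            # a = min, b = max
            out[key] = [(x - a) / (b - a) for x in sample[key]]
        else:
            raise ValueError("type must be 'z' or 'minmax'")
    return out


sample = {"pulse_height": [1.0, 3.0]}
z_scored = normalize(sample, {"pulse_height": [2.0, 1.0]}, type="z")
scaled = normalize(sample, {"pulse_height": [0.0, 4.0]}, type="minmax")
```

With type='minmax', values inside [min, max] map to the range 0 to 1; values outside the given range fall outside it.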
- class cait.datasets.PileUpDownSample(keys, down)[source]
Bases: object
A transform that downsamples samples with pile-up events.
This differs from the usual downsample transform because the pile-up events form a data set of shape (2, record_length), while usual events form a data set of shape (record_length).
- Parameters
keys (list) – The keys of the sample (which is a dict) that are to be downsampled.
down (int) – The value by which we want to downsample.
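The shape difference can be made concrete with a small sketch that treats a pile-up sample as two rows of length record_length and reduces each row separately. The block-averaging rule and the function downsample_rows are assumptions for illustration, not the documented behaviour of the class.

```python
def downsample_rows(array_2d, down):
    """Average blocks of `down` values along each row of a (2, record_length) array."""
    for row in array_2d:
        if len(row) % down != 0:
            raise ValueError("record length must be divisible by down")
    return [
        [sum(row[i:i + down]) / down for i in range(0, len(row), down)]
        for row in array_2d
    ]


pile_up = [[1.0, 3.0, 5.0, 7.0], [2.0, 2.0, 4.0, 4.0]]  # shape (2, 4)
reduced = downsample_rows(pile_up, down=2)              # shape (2, 2)
```

The first dimension of size 2 is preserved; only the record length shrinks by the factor down.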
- class cait.datasets.RemoveOffset(keys)[source]
Bases: object
Remove the offset from all events.
- Parameters
keys (list of strings) – The keys in each sample dict from which we want to remove the offset.
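Offset removal can be sketched as subtracting a baseline estimate from the record. This reference does not say how the offset is estimated; the sketch assumes it is the mean of the first eighth of the record, and both remove_offset and baseline_fraction are hypothetical names.

```python
def remove_offset(series, baseline_fraction=0.125):
    """Subtract the mean of the leading part of the record (assumed offset estimate)."""
    n = max(1, int(len(series) * baseline_fraction))
    offset = sum(series[:n]) / n
    return [x - offset for x in series]


record = [2.0] * 8 + [5.0] * 8  # flat baseline at 2.0, then a step to 5.0
corrected = remove_offset(record)
```

After the transform the baseline sits at zero and the step height of 3.0 is unchanged.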
- class cait.datasets.SingleMinMaxNorm(keys)[source]
Bases: object
A transform that normalizes to the min-max range 0 to 1.
- Parameters
keys (list) – The keys of the sample (which is a dict) that are to be normalized.
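Unlike a normalization with fixed values, a per-sample min-max transform takes the minimum and maximum from the sample itself. A minimal sketch under that assumption follows; the function single_min_max_norm and the key "event" are hypothetical, and the handling of constant records is a choice made here, not taken from cait.

```python
def single_min_max_norm(sample, keys):
    """Scale each listed key to [0, 1] using that sample's own min and max."""
    out = dict(sample)
    for key in keys:
        lo, hi = min(sample[key]), max(sample[key])
        if hi == lo:
            # A constant record has no range to scale by.
            raise ValueError("cannot min-max normalize a constant record")
        out[key] = [(x - lo) / (hi - lo) for x in sample[key]]
    return out


normed = single_min_max_norm({"event": [2.0, 4.0, 6.0]}, keys=["event"])
```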
- cait.datasets.get_random_samplers(test_size, val_size, dataset_size=None, only_idx=None, shuffle_dataset=True, random_seed=None)[source]
Chooses the indices for the split datasets.
- Parameters
test_size – float between 0 and 1, the size of the test set
val_size – float between 0 and 1, the size of the validation set
dataset_size – int, the size of the whole dataset
only_idx – list of ints or None, if set only these indices from the dataset are included
shuffle_dataset – if True, the dataset indices are shuffled before they are assigned to the splits
random_seed – set to a fixed value to always get the same datasets, for comparability
- Returns
Indices for the training, validation and test sets.
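The index split described above can be sketched in plain Python. This is an illustration of the idea, not the source of get_random_samplers: the function split_indices, the order in which the splits are cut, and the rounding of the split sizes are all assumptions.

```python
import random


def split_indices(test_size, val_size, dataset_size=None, only_idx=None,
                  shuffle_dataset=True, random_seed=None):
    """Split dataset indices into train, validation and test lists (sketch)."""
    # only_idx, if given, replaces the full index range.
    indices = list(only_idx) if only_idx is not None else list(range(dataset_size))
    if shuffle_dataset:
        # A fixed seed reproduces the same shuffle, hence the same splits.
        random.Random(random_seed).shuffle(indices)
    n_test = int(test_size * len(indices))
    n_val = int(val_size * len(indices))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(
    test_size=0.2, val_size=0.1, dataset_size=100, random_seed=0
)
```

Calling the function twice with the same random_seed yields identical splits, which is the comparability the random_seed parameter is meant to provide.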