xenonpy.model.training package

Submodules

xenonpy.model.training.base module

class xenonpy.model.training.base.BaseExtension[source]

Bases: object

after_proc(trainer=None, is_training=True, *_dependence)[source]
Return type:

None

before_proc(trainer=None, is_training=True, *_dependence)[source]
Return type:

None

input_proc(x_in, y_in, *_dependence)[source]
Return type:

Tuple[Any, Any]

on_checkpoint(checkpoint, trainer=None, is_training=True, *_dependence)[source]
Return type:

None

on_reset(trainer=None, is_training=True, *_dependence)[source]
Return type:

None

output_proc(y_pred, y_true, trainer=None, is_training=True, *_dependence)[source]
Return type:

Tuple[Any, Any]

step_forward(step_info, trainer=None, is_training=True, *_dependence)[source]
Return type:

None
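
Concrete extensions override only the hooks they need; the base methods are no-ops. Below is a minimal sketch, assuming only the hook signatures listed above (treating step_info as a dict-like bundle of per-step metrics is an assumption made for illustration):

from xenonpy.model.training.base import BaseExtension


class PrintProgress(BaseExtension):
    """Toy extension: report procedure start and echo per-step information."""

    def before_proc(self, trainer=None, is_training=True, *_dependence):
        # Called once before the training/prediction procedure starts.
        print('procedure starting, training mode:', is_training)

    def step_forward(self, step_info, trainer=None, is_training=True, *_dependence):
        # ``step_info`` is assumed to carry per-step metrics such as losses.
        print(step_info)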

class xenonpy.model.training.base.BaseLRScheduler(lr_scheduler, **kwargs)[source]

Bases: object

__call__(optimizer)[source]
Parameters:

optimizer (Optimizer) – Wrapped optimizer.

Return type:

_LRScheduler
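
The wrapper stores a torch scheduler class together with its keyword arguments; calling it with an optimizer returns the instantiated scheduler, per the signature above. A minimal sketch, assuming the constructor simply forwards **kwargs to the wrapped class; the name StepLRWrapper is hypothetical:

import torch

from xenonpy.model.training.base import BaseLRScheduler


class StepLRWrapper(BaseLRScheduler):
    def __init__(self, *, step_size, gamma=0.1, **kwargs):
        # Hand the torch scheduler class and its hyper-parameters to the base.
        super().__init__(torch.optim.lr_scheduler.StepLR,
                         step_size=step_size, gamma=gamma, **kwargs)


model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = StepLRWrapper(step_size=30)(optimizer)  # a torch StepLR instance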

class xenonpy.model.training.base.BaseOptimizer(optimizer, **kwargs)[source]

Bases: object

__call__(params)[source]
Parameters:

params (Iterable) – iterable of parameters to optimize or dicts defining parameter groups

Returns:

optimizer

Return type:

Optimizer
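
BaseOptimizer follows the same pattern for optimizers: store the class and its hyper-parameters, then call the wrapper with the model parameters to obtain the built Optimizer. A minimal sketch; the name AdamWrapper is hypothetical:

import torch

from xenonpy.model.training.base import BaseOptimizer


class AdamWrapper(BaseOptimizer):
    def __init__(self, *, lr=1e-3, **kwargs):
        # Store torch.optim.Adam and its keyword arguments for later use.
        super().__init__(torch.optim.Adam, lr=lr, **kwargs)


model = torch.nn.Linear(10, 1)
optimizer = AdamWrapper(lr=1e-2)(model.parameters())  # a torch.optim.Adam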

class xenonpy.model.training.base.BaseRunner(cuda=False)[source]

Bases: BaseEstimator

static check_device(cuda)[source]
Return type:

device

extend(*extension)[source]

Add training extensions to trainer.

Parameters:

extension (BaseExtension) – Extension.

Return type:

BaseRunner

input_proc(x_in, y_in=None, **kwargs)[source]
output_proc(y_pred, y_true=None, **kwargs)[source]
remove_extension(*extension)[source]
T_Extension_Dict

alias of Dict[str, Tuple[BaseExtension, Dict[str, list]]]

property cuda
property device
property timer
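
A small usage sketch of the BaseRunner utilities; the behaviour of check_device for a False flag is an assumption based on the cuda parameter documented above:

from xenonpy.model.training.base import BaseRunner

# Resolve a device flag into a ``torch.device``; ``False`` is assumed to
# select the CPU, while ``True`` would pick the default CUDA device.
device = BaseRunner.check_device(False)
print(device)

# Concrete runners built on ``BaseRunner`` (e.g. a trainer) chain ``extend``
# to register extensions, roughly: ``runner.extend(PrintProgress())``.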

xenonpy.model.training.checker module

class xenonpy.model.training.checker.Checker(path=None, *, increment=False, device='cpu', default_handle=(joblib, '.pkl.z'))[source]

Bases: object

Checkpoint.

Parameters:
  • path (Union[Path, str]) – Directory path for data access; can be Path or str. A relative path will be resolved to an absolute path automatically.

  • increment (bool) – Set to True to prevent the potential risk of overwriting. Default False.

__call__(handle=None, **named_data)[source]

Save data with or without a name. Data with the same name will not be overwritten.

Parameters:
  • handle (Tuple[Callable, str]) –

  • named_data (dict) – Named data as key-value pairs.

classmethod load(model_path)[source]
set_checkpoint(**kwargs)[source]
property describe

Description for this model.

Returns:

Description.

Return type:

dict

property files
property final_state
property init_state
property model

Returns:

model – A PyTorch model.

Return type:

torch.nn.Module

property model_class
property model_name

Returns:

Model name.

Return type:

str

property model_params
property model_structure
property path
property trained_model
property training_info
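
A minimal usage sketch of Checker; the directory name and the stored keys below are made up for illustration:

from xenonpy.model.training.checker import Checker

# Create (or reuse) a checkpoint directory; ``increment=False`` is the default.
checker = Checker('trained_models/demo_model')

# ``__call__`` stores each keyword argument as a named item on disk.
checker(description={'author': 'me', 'note': 'toy example'})

# Later, point a Checker at the saved directory and inspect its contents.
restored = Checker.load('trained_models/demo_model')
print(restored.files)  # listing stored item names is assumed from ``files``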

xenonpy.model.training.clip_grad module

class xenonpy.model.training.clip_grad.ClipNorm(max_norm, norm_type=2)[source]

Bases: object

Clips gradient norm of an iterable of parameters.

The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.

Parameters:
  • max_norm (float or int) – max norm of the gradients

  • norm_type (float or int) – type of the used p-norm. Can be 'inf' for infinity norm.

Returns:

Total norm of the parameters (viewed as a single vector).

__call__(params)[source]

Call self as a function.
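
A short sketch of ClipNorm applied after backpropagation and before the optimizer step (model and data are arbitrary):

import torch

from xenonpy.model.training.clip_grad import ClipNorm

model = torch.nn.Linear(10, 1)
model(torch.randn(4, 10)).sum().backward()  # produce some gradients

clipper = ClipNorm(max_norm=1.0)  # cap the total gradient norm at 1.0
clipper(model.parameters())       # gradients are modified in place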

class xenonpy.model.training.clip_grad.ClipValue(clip_value)[source]

Bases: object

Clips gradient of an iterable of parameters at specified value.

Gradients are modified in-place.

Parameters:

clip_value (float or int) – maximum allowed value of the gradients. The gradients are clipped in the range \(\left[\text{-clip\_value}, \text{clip\_value}\right]\)

__call__(params)[source]

Call self as a function.
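
ClipValue is used the same way, but clips each gradient element individually:

import torch

from xenonpy.model.training.clip_grad import ClipValue

model = torch.nn.Linear(10, 1)
model(torch.randn(4, 10)).sum().backward()

# Every gradient element is clamped into [-0.5, 0.5] in place.
ClipValue(clip_value=0.5)(model.parameters())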

xenonpy.model.training.loss module

class xenonpy.model.training.loss.BCELoss(weight=None, size_average=None, reduce=None, reduction='mean')[source]

Bases: _WeightedLoss

Creates a criterion that measures the Binary Cross Entropy between the target and the input probabilities:

The unreduced (i.e. with reduction set to 'none') loss can be described as:

\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_n \left[ y_n \cdot \log x_n + (1 - y_n) \cdot \log (1 - x_n) \right],\]

where \(N\) is the batch size. If reduction is not 'none' (default 'mean'), then

\[\begin{split}\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

This is used for measuring the error of a reconstruction in, for example, an auto-encoder. Note that the targets \(y\) should be numbers between 0 and 1.

Notice that if \(x_n\) is either 0 or 1, one of the log terms would be mathematically undefined in the above loss equation. PyTorch chooses to set \(\log (0) = -\infty\), since \(\lim_{x\to 0} \log (x) = -\infty\). However, an infinite term in the loss equation is not desirable for several reasons.

For one, if either \(y_n = 0\) or \((1 - y_n) = 0\), then we would be multiplying 0 with infinity. Secondly, if we have an infinite loss value, then we would also have an infinite term in our gradient, since \(\lim_{x\to 0} \frac{d}{dx} \log (x) = \infty\). This would make BCELoss’s backward method nonlinear with respect to \(x_n\), and using it for things like linear regression would not be straight-forward.

Our solution is that BCELoss clamps its log function outputs to be greater than or equal to -100. This way, we can always have a finite loss value and a linear backward method.

Parameters:
  • weight (Tensor, optional) – a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size nbatch.

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

  • Output: scalar. If reduction is 'none', then \((*)\), same shape as input.

Examples:

>>> m = nn.Sigmoid()
>>> loss = nn.BCELoss()
>>> input = torch.randn(3, requires_grad=True)
>>> target = torch.empty(3).random_(2)
>>> output = loss(m(input), target)
>>> output.backward()

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.BCEWithLogitsLoss(weight=None, size_average=None, reduce=None, reduction='mean', pos_weight=None)[source]

Bases: _Loss

This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.

The unreduced (i.e. with reduction set to 'none') loss can be described as:

\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_n \left[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log (1 - \sigma(x_n)) \right],\]

where \(N\) is the batch size. If reduction is not 'none' (default 'mean'), then

\[\begin{split}\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

This is used for measuring the error of a reconstruction in, for example, an auto-encoder. Note that the targets t[i] should be numbers between 0 and 1.

It’s possible to trade off recall and precision by adding weights to positive examples. In the case of multi-label classification the loss can be described as:

\[\ell_c(x, y) = L_c = \{l_{1,c},\dots,l_{N,c}\}^\top, \quad l_{n,c} = - w_{n,c} \left[ p_c y_{n,c} \cdot \log \sigma(x_{n,c}) + (1 - y_{n,c}) \cdot \log (1 - \sigma(x_{n,c})) \right],\]

where \(c\) is the class number (\(c > 1\) for multi-label binary classification, \(c = 1\) for single-label binary classification), \(n\) is the number of the sample in the batch and \(p_c\) is the weight of the positive answer for the class \(c\).

\(p_c > 1\) increases the recall, \(p_c < 1\) increases the precision.

For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to \(\frac{300}{100}=3\). The loss would act as if the dataset contains \(3\times 100=300\) positive examples.

Examples:

>>> target = torch.ones([10, 64], dtype=torch.float32)  # 64 classes, batch size = 10
>>> output = torch.full([10, 64], 1.5)  # A prediction (logit)
>>> pos_weight = torch.ones([64])  # All weights are equal to 1
>>> criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
>>> criterion(output, target)  # -log(sigmoid(1.5))
tensor(0.20...)
Parameters:
  • weight (Tensor, optional) – a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size nbatch.

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

  • pos_weight (Tensor, optional) – a weight of positive examples. Must be a vector with length equal to the number of classes.

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

  • Output: scalar. If reduction is 'none', then \((*)\), same shape as input.

Examples:

>>> loss = nn.BCEWithLogitsLoss()
>>> input = torch.randn(3, requires_grad=True)
>>> target = torch.empty(3).random_(2)
>>> output = loss(input, target)
>>> output.backward()

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.CTCLoss(blank=0, reduction='mean', zero_infinity=False)[source]

Bases: _Loss

The Connectionist Temporal Classification loss.

Calculates loss between a continuous (unsegmented) time series and a target sequence. CTCLoss sums over the probability of possible alignments of input to target, producing a loss value which is differentiable with respect to each input node. The alignment of input to target is assumed to be “many-to-one”, which limits the length of the target sequence such that it must be \(\leq\) the input length.

Parameters:
  • blank (int, optional) – blank label. Default \(0\).

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: 'mean'

  • zero_infinity (bool, optional) – Whether to zero infinite losses and the associated gradients. Default: False Infinite losses mainly occur when the inputs are too short to be aligned to the targets.

Shape:
  • Log_probs: Tensor of size \((T, N, C)\) or \((T, C)\), where \(T = \text{input length}\), \(N = \text{batch size}\), and \(C = \text{number of classes (including blank)}\). The logarithmized probabilities of the outputs (e.g. obtained with torch.nn.functional.log_softmax()).

  • Targets: Tensor of size \((N, S)\) or \((\operatorname{sum}(\text{target\_lengths}))\), where \(N = \text{batch size}\) and \(S = \text{max target length, if shape is } (N, S)\). It represents the target sequences. Each element in the target sequence is a class index, and the target index cannot be blank (default=0). In the \((N, S)\) form, targets are padded to the length of the longest sequence and stacked. In the \((\operatorname{sum}(\text{target\_lengths}))\) form, the targets are assumed to be un-padded and concatenated within 1 dimension.

  • Input_lengths: Tuple or tensor of size \((N)\) or \(()\), where \(N = \text{batch size}\). It represents the lengths of the inputs (each must be \(\leq T\)). The lengths are specified for each sequence to achieve masking under the assumption that sequences are padded to equal lengths.

  • Target_lengths: Tuple or tensor of size \((N)\) or \(()\), where \(N = \text{batch size}\). It represents the lengths of the targets. Lengths are specified for each sequence to achieve masking under the assumption that sequences are padded to equal lengths. If the target shape is \((N,S)\), target_lengths are effectively the stop index \(s_n\) for each target sequence, such that target_n = targets[n,0:s_n] for each target in a batch. Lengths must each be \(\leq S\). If the targets are given as a 1d tensor that is the concatenation of individual targets, the target_lengths must add up to the total length of the tensor.

  • Output: scalar. If reduction is 'none', then \((N)\) if input is batched or \(()\) if input is unbatched, where \(N = \text{batch size}\).

Examples:

>>> # Target are to be padded
>>> T = 50      # Input sequence length
>>> C = 20      # Number of classes (including blank)
>>> N = 16      # Batch size
>>> S = 30      # Target sequence length of longest target in batch (padding length)
>>> S_min = 10  # Minimum target length, for demonstration purposes
>>>
>>> # Initialize random batch of input vectors, for *size = (T,N,C)
>>> input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
>>>
>>> # Initialize random batch of targets (0 = blank, 1:C = classes)
>>> target = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)
>>>
>>> input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
>>> target_lengths = torch.randint(low=S_min, high=S, size=(N,), dtype=torch.long)
>>> ctc_loss = nn.CTCLoss()
>>> loss = ctc_loss(input, target, input_lengths, target_lengths)
>>> loss.backward()
>>>
>>>
>>> # Target are to be un-padded
>>> T = 50      # Input sequence length
>>> C = 20      # Number of classes (including blank)
>>> N = 16      # Batch size
>>>
>>> # Initialize random batch of input vectors, for *size = (T,N,C)
>>> input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
>>> input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
>>>
>>> # Initialize random batch of targets (0 = blank, 1:C = classes)
>>> target_lengths = torch.randint(low=1, high=T, size=(N,), dtype=torch.long)
>>> target = torch.randint(low=1, high=C, size=(sum(target_lengths),), dtype=torch.long)
>>> ctc_loss = nn.CTCLoss()
>>> loss = ctc_loss(input, target, input_lengths, target_lengths)
>>> loss.backward()
>>>
>>>
>>> # Target are to be un-padded and unbatched (effectively N=1)
>>> T = 50      # Input sequence length
>>> C = 20      # Number of classes (including blank)
>>>
>>> # Initialize random batch of input vectors, for *size = (T,C)
>>> input = torch.randn(T, C).log_softmax(1).detach().requires_grad_()
>>> input_lengths = torch.tensor(T, dtype=torch.long)
>>>
>>> # Initialize random batch of targets (0 = blank, 1:C = classes)
>>> target_lengths = torch.randint(low=1, high=T, size=(), dtype=torch.long)
>>> target = torch.randint(low=1, high=C, size=(target_lengths,), dtype=torch.long)
>>> ctc_loss = nn.CTCLoss()
>>> loss = ctc_loss(input, target, input_lengths, target_lengths)
>>> loss.backward()
Reference:

A. Graves et al.: Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks: https://www.cs.toronto.edu/~graves/icml_2006.pdf

Note

In order to use CuDNN, the following must be satisfied: targets must be in concatenated format, all input_lengths must be \(T\), \(blank=0\), target_lengths \(\leq 256\), and the integer arguments must be of dtype torch.int32.

The regular implementation uses the (more common in PyTorch) torch.long dtype.

Note

In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting torch.backends.cudnn.deterministic = True. Please see the PyTorch notes on reproducibility for background.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(log_probs, targets, input_lengths, target_lengths)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

blank: int
zero_infinity: bool
class xenonpy.model.training.loss.CosineEmbeddingLoss(margin=0.0, size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Creates a criterion that measures the loss given input tensors \(x_1\), \(x_2\) and a Tensor label \(y\) with values 1 or -1. This is used for measuring whether two inputs are similar or dissimilar, using the cosine similarity, and is typically used for learning nonlinear embeddings or semi-supervised learning.

The loss function for each sample is:

\[\begin{split}\text{loss}(x, y) = \begin{cases} 1 - \cos(x_1, x_2), & \text{if } y = 1 \\ \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1 \end{cases}\end{split}\]
Parameters:
  • margin (float, optional) – Should be a number from \(-1\) to \(1\), \(0\) to \(0.5\) is suggested. If margin is missing, the default value is \(0\).

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input1: \((N, D)\) or \((D)\), where N is the batch size and D is the embedding dimension.

  • Input2: \((N, D)\) or \((D)\), same shape as Input1.

  • Target: \((N)\) or \(()\).

  • Output: If reduction is 'none', then \((N)\), otherwise scalar.
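
The upstream page gives no usage example for this loss; a minimal sketch in the style of the other losses (shapes and values are arbitrary):

Examples:

>>> import torch
>>> from torch import nn
>>> loss = nn.CosineEmbeddingLoss()
>>> input1 = torch.randn(3, 5, requires_grad=True)
>>> input2 = torch.randn(3, 5, requires_grad=True)
>>> target = torch.tensor([1., -1., 1.])
>>> output = loss(input1, input2, target)
>>> output.backward()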

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input1, input2, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

margin: float
class xenonpy.model.training.loss.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)[source]

Bases: _WeightedLoss

This criterion computes the cross entropy loss between input logits and target.

It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.

The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general). input has to be a Tensor of size \((C)\) for unbatched input, \((minibatch, C)\) or \((minibatch, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) for the K-dimensional case. The last being useful for higher dimension inputs, such as computing cross entropy loss per-pixel for 2D images.

The target that this criterion expects should contain either:

  • Class indices in the range \([0, C)\) where \(C\) is the number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range). The unreduced (i.e. with reduction set to 'none') loss for this case can be described as:

    \[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_{y_n} \log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^C \exp(x_{n,c})} \cdot \mathbb{1}\{y_n \not= \text{ignore\_index}\}\]

    where \(x\) is the input, \(y\) is the target, \(w\) is the weight, \(C\) is the number of classes, and \(N\) spans the minibatch dimension as well as \(d_1, ..., d_k\) for the K-dimensional case. If reduction is not 'none' (default 'mean'), then

    \[\begin{split}\ell(x, y) = \begin{cases} \sum_{n=1}^N \frac{1}{\sum_{n=1}^N w_{y_n} \cdot \mathbb{1}\{y_n \not= \text{ignore\_index}\}} l_n, & \text{if reduction} = \text{`mean';}\\ \sum_{n=1}^N l_n, & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

    Note that this case is equivalent to the combination of LogSoftmax and NLLLoss.

  • Probabilities for each class; useful when labels beyond a single class per minibatch item are required, such as for blended labels, label smoothing, etc. The unreduced (i.e. with reduction set to 'none') loss for this case can be described as:

    \[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - \sum_{c=1}^C w_c \log \frac{\exp(x_{n,c})}{\sum_{i=1}^C \exp(x_{n,i})} y_{n,c}\]

    where \(x\) is the input, \(y\) is the target, \(w\) is the weight, \(C\) is the number of classes, and \(N\) spans the minibatch dimension as well as \(d_1, ..., d_k\) for the K-dimensional case. If reduction is not 'none' (default 'mean'), then

    \[\begin{split}\ell(x, y) = \begin{cases} \frac{\sum_{n=1}^N l_n}{N}, & \text{if reduction} = \text{`mean';}\\ \sum_{n=1}^N l_n, & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

Note

The performance of this criterion is generally better when target contains class indices, as this allows for optimized computation. Consider providing target as class probabilities only when a single class label per minibatch item is too restrictive.

Parameters:
  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, has to be a Tensor of size C

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient. When size_average is True, the loss is averaged over non-ignored targets. Note that ignore_index is only applicable when the target contains class indices.

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the weighted mean of the output is taken, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

  • label_smoothing (float, optional) – A float in [0.0, 1.0]. Specifies the amount of smoothing when computing the loss, where 0.0 means no smoothing. The targets become a mixture of the original ground truth and a uniform distribution as described in Rethinking the Inception Architecture for Computer Vision. Default: \(0.0\).

Shape:
  • Input: Shape \((C)\), \((N, C)\) or \((N, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss.

  • Target: If containing class indices, shape \(()\), \((N)\) or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss where each value should be between \([0, C)\). If containing class probabilities, same shape as the input and each value should be between \([0, 1]\).

  • Output: If reduction is ‘none’, shape \(()\), \((N)\) or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss, depending on the shape of the input. Otherwise, scalar.

where:

\[\begin{split}\begin{aligned} C ={} & \text{number of classes} \\ N ={} & \text{batch size} \\ \end{aligned}\end{split}\]

Examples:

>>> # Example of target with class indices
>>> loss = nn.CrossEntropyLoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.empty(3, dtype=torch.long).random_(5)
>>> output = loss(input, target)
>>> output.backward()
>>>
>>> # Example of target with class probabilities
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5).softmax(dim=1)
>>> output = loss(input, target)
>>> output.backward()

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

ignore_index: int
label_smoothing: float
class xenonpy.model.training.loss.HingeEmbeddingLoss(margin=1.0, size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Measures the loss given an input tensor \(x\) and a labels tensor \(y\) (containing 1 or -1). This is usually used for measuring whether two inputs are similar or dissimilar, e.g. using the L1 pairwise distance as \(x\), and is typically used for learning nonlinear embeddings or semi-supervised learning.

The loss function for \(n\)-th sample in the mini-batch is

\[\begin{split}l_n = \begin{cases} x_n, & \text{if}\; y_n = 1,\\ \max \{0, \Delta - x_n\}, & \text{if}\; y_n = -1, \end{cases}\end{split}\]

and the total loss function is

\[\begin{split}\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

where \(L = \{l_1,\dots,l_N\}^\top\).

Parameters:
  • margin (float, optional) – Has a default value of 1.

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions. The sum operation operates over all the elements.

  • Target: \((*)\), same shape as the input

  • Output: scalar. If reduction is 'none', then same shape as the input
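
No usage example is given upstream; a minimal sketch in the style of the other losses (shapes and values are arbitrary):

Examples:

>>> import torch
>>> from torch import nn
>>> loss = nn.HingeEmbeddingLoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5).sign()  # entries in {-1, 1} (almost surely)
>>> output = loss(input, target)
>>> output.backward()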

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

margin: float
class xenonpy.model.training.loss.KLDivLoss(size_average=None, reduce=None, reduction='mean', log_target=False)[source]

Bases: _Loss

The Kullback-Leibler divergence loss.

For tensors of the same shape \(y_{\text{pred}},\ y_{\text{true}}\), where \(y_{\text{pred}}\) is the input and \(y_{\text{true}}\) is the target, we define the pointwise KL-divergence as

\[L(y_{\text{pred}},\ y_{\text{true}}) = y_{\text{true}} \cdot \log \frac{y_{\text{true}}}{y_{\text{pred}}} = y_{\text{true}} \cdot (\log y_{\text{true}} - \log y_{\text{pred}})\]

To avoid underflow issues when computing this quantity, this loss expects the argument input in the log-space. The argument target may also be provided in the log-space if log_target=True.

To summarise, this function is roughly equivalent to computing

if not log_target: # default
    loss_pointwise = target * (target.log() - input)
else:
    loss_pointwise = target.exp() * (target - input)

and then reducing this result depending on the argument reduction as

if reduction == "mean":  # default
    loss = loss_pointwise.mean()
elif reduction == "batchmean":  # mathematically correct
    loss = loss_pointwise.sum() / input.size(0)
elif reduction == "sum":
    loss = loss_pointwise.sum()
else:  # reduction == "none"
    loss = loss_pointwise

Note

As with all the other losses in PyTorch, this function expects the first argument, input, to be the output of the model (e.g. the neural network) and the second, target, to be the observations in the dataset. This differs from the standard mathematical notation \(KL(P\ ||\ Q)\) where \(P\) denotes the distribution of the observations and \(Q\) denotes the model.

Warning

reduction='mean' doesn't return the true KL divergence value; please use reduction='batchmean', which aligns with the mathematical definition. In a future release, 'mean' will be changed to be the same as 'batchmean'.

Parameters:
  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output. Default: “mean”

  • log_target (bool, optional) – Specifies whether target is in the log space. Default: False

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

  • Output: scalar by default. If reduction is ‘none’, then \((*)\), same shape as the input.

Examples:

>>> import torch.nn.functional as F
>>> kl_loss = nn.KLDivLoss(reduction="batchmean")
>>> # input should be a distribution in the log space
>>> input = F.log_softmax(torch.randn(3, 5, requires_grad=True), dim=1)
>>> # Sample a batch of distributions. Usually this would come from the dataset
>>> target = F.softmax(torch.rand(3, 5), dim=1)
>>> output = kl_loss(input, target)

>>> kl_loss = nn.KLDivLoss(reduction="batchmean", log_target=True)
>>> log_target = F.log_softmax(torch.rand(3, 5), dim=1)
>>> output = kl_loss(input, log_target)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.L1Loss(size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Creates a criterion that measures the mean absolute error (MAE) between each element in the input \(x\) and target \(y\).

The unreduced (i.e. with reduction set to 'none') loss can be described as:

\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left| x_n - y_n \right|,\]

where \(N\) is the batch size. If reduction is not 'none' (default 'mean'), then:

\[\begin{split}\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

\(x\) and \(y\) are tensors of arbitrary shapes with a total of \(n\) elements each.

The sum operation still operates over all the elements, and divides by \(n\).

The division by \(n\) can be avoided if one sets reduction = 'sum'.

Supports real-valued and complex-valued inputs.

Parameters:
  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

  • Output: scalar. If reduction is 'none', then \((*)\), same shape as the input.

Examples:

>>> loss = nn.L1Loss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5)
>>> output = loss(input, target)
>>> output.backward()

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.MSELoss(size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input \(x\) and target \(y\).

The unreduced (i.e. with reduction set to 'none') loss can be described as:

\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left( x_n - y_n \right)^2,\]

where \(N\) is the batch size. If reduction is not 'none' (default 'mean'), then:

\[\begin{split}\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

\(x\) and \(y\) are tensors of arbitrary shapes with a total of \(n\) elements each.

The mean operation still operates over all the elements, and divides by \(n\).

The division by \(n\) can be avoided if one sets reduction = 'sum'.

Parameters:
  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

Examples:

>>> loss = nn.MSELoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5)
>>> output = loss(input, target)
>>> output.backward()

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.MarginRankingLoss(margin=0.0, size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Creates a criterion that measures the loss given inputs \(x1\), \(x2\), two 1D mini-batch or 0D Tensors, and a label 1D mini-batch or 0D Tensor \(y\) (containing 1 or -1).

If \(y = 1\) then it is assumed that the first input should be ranked higher (have a larger value) than the second input, and vice-versa for \(y = -1\).

The loss function for each pair of samples in the mini-batch is:

\[\text{loss}(x1, x2, y) = \max(0, -y * (x1 - x2) + \text{margin})\]
Parameters:
  • margin (float, optional) – Has a default value of \(0\).

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input1: \((N)\) or \(()\) where N is the batch size.

  • Input2: \((N)\) or \(()\), same shape as the Input1.

  • Target: \((N)\) or \(()\), same shape as the inputs.

  • Output: scalar. If reduction is 'none' and Input size is not \(()\), then \((N)\).

Examples:

>>> loss = nn.MarginRankingLoss()
>>> input1 = torch.randn(3, requires_grad=True)
>>> input2 = torch.randn(3, requires_grad=True)
>>> target = torch.randn(3).sign()
>>> output = loss(input1, input2, target)
>>> output.backward()

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input1, input2, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

margin: float
class xenonpy.model.training.loss.MultiLabelMarginLoss(size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input \(x\) (a 2D mini-batch Tensor) and output \(y\) (which is a 2D Tensor of target class indices). For each sample in the mini-batch:

\[\text{loss}(x, y) = \sum_{ij}\frac{\max(0, 1 - (x[y[j]] - x[i]))}{\text{x.size}(0)}\]

where \(x \in \left\{0, \; \cdots , \; \text{x.size}(0) - 1\right\}\), \(y \in \left\{0, \; \cdots , \; \text{y.size}(0) - 1\right\}\), \(0 \leq y[j] \leq \text{x.size}(0)-1\), and \(i \neq y[j]\) for all \(i\) and \(j\).

\(y\) and \(x\) must have the same size.

The criterion only considers a contiguous block of non-negative targets that starts at the front.

This allows for different samples to have variable amounts of target classes.

Parameters:
  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((C)\) or \((N, C)\) where N is the batch size and C is the number of classes.

  • Target: \((C)\) or \((N, C)\), label targets padded by -1 ensuring same shape as the input.

  • Output: scalar. If reduction is 'none', then \((N)\).

Examples:

>>> loss = nn.MultiLabelMarginLoss()
>>> x = torch.FloatTensor([[0.1, 0.2, 0.4, 0.8]])
>>> # for target y, only consider labels 3 and 0, not after label -1
>>> y = torch.LongTensor([[3, 0, -1, 1]])
>>> # 0.25 * ((1-(0.1-0.2)) + (1-(0.1-0.4)) + (1-(0.8-0.2)) + (1-(0.8-0.4)))
>>> loss(x, y)
tensor(0.85...)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.MultiLabelSoftMarginLoss(weight=None, size_average=None, reduce=None, reduction='mean')[source]

Bases: _WeightedLoss

Creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input \(x\) and target \(y\) of size \((N, C)\). For each sample in the minibatch:

\[loss(x, y) = - \frac{1}{C} * \sum_i y[i] * \log((1 + \exp(-x[i]))^{-1}) + (1-y[i]) * \log\left(\frac{\exp(-x[i])}{(1 + \exp(-x[i]))}\right)\]

where \(i \in \left\{0, \; \cdots , \; \text{x.nElement}() - 1\right\}\), \(y[i] \in \left\{0, \; 1\right\}\).

Parameters:
  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size C. Otherwise, it is treated as if having all ones.

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((N, C)\) where N is the batch size and C is the number of classes.

  • Target: \((N, C)\), label targets padded by -1 ensuring same shape as the input.

  • Output: scalar. If reduction is 'none', then \((N)\).
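
No usage example is given upstream; a minimal sketch in the style of the other losses (shapes and values are arbitrary):

Examples:

>>> import torch
>>> from torch import nn
>>> loss = nn.MultiLabelSoftMarginLoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.empty(3, 5).random_(2)  # multi-hot labels in {0, 1}
>>> output = loss(input, target)
>>> output.backward()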

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.MultiMarginLoss(p=1, margin=1.0, weight=None, size_average=None, reduce=None, reduction='mean')[source]

Bases: _WeightedLoss

Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input \(x\) (a 2D mini-batch Tensor) and output \(y\) (which is a 1D tensor of target class indices, \(0 \leq y \leq \text{x.size}(1)-1\)):

For each mini-batch sample, the loss in terms of the 1D input \(x\) and scalar output \(y\) is:

\[\text{loss}(x, y) = \frac{\sum_i \max(0, \text{margin} - x[y] + x[i])^p}{\text{x.size}(0)}\]

where \(i \in \left\{0, \; \cdots , \; \text{x.size}(0) - 1\right\}\) and \(i \neq y\).

Optionally, you can give non-equal weighting on the classes by passing a 1D weight tensor into the constructor.

The loss function then becomes:

\[\text{loss}(x, y) = \frac{\sum_i \max(0, w[y] * (\text{margin} - x[y] + x[i]))^p}{\text{x.size}(0)}\]
Parameters:
  • p (int, optional) – Has a default value of \(1\). \(1\) and \(2\) are the only supported values.

  • margin (float, optional) – Has a default value of \(1\).

  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size C. Otherwise, it is treated as if having all ones.

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((N, C)\) or \((C)\), where \(N\) is the batch size and \(C\) is the number of classes.

  • Target: \((N)\) or \(()\), where each value is \(0 \leq \text{targets}[i] \leq C-1\).

  • Output: scalar. If reduction is 'none', then same shape as the target.

Examples:

>>> loss = nn.MultiMarginLoss()
>>> x = torch.tensor([[0.1, 0.2, 0.4, 0.8]])
>>> y = torch.tensor([3])
>>> # 0.25 * ((1-(0.8-0.1)) + (1-(0.8-0.2)) + (1-(0.8-0.4)))
>>> loss(x, y)
tensor(0.32...)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

margin: float
p: int
class xenonpy.model.training.loss.NLLLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')[source]

Bases: _WeightedLoss

The negative log likelihood loss. It is useful to train a classification problem with C classes.

If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.

The input given through a forward call is expected to contain log-probabilities of each class. input has to be a Tensor of size either \((minibatch, C)\) or \((minibatch, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) for the K-dimensional case. The latter is useful for higher dimension inputs, such as computing NLL loss per-pixel for 2D images.

Obtaining log-probabilities in a neural network is easily achieved by adding a LogSoftmax layer in the last layer of your network. You may use CrossEntropyLoss instead, if you prefer not to add an extra layer.

The target that this loss expects should be a class index in the range \([0, C-1]\) where C = number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range).

The unreduced (i.e. with reduction set to 'none') loss can be described as:

\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_{y_n} x_{n,y_n}, \quad w_{c} = \text{weight}[c] \cdot \mathbb{1}\{c \not= \text{ignore\_index}\},\]

where \(x\) is the input, \(y\) is the target, \(w\) is the weight, and \(N\) is the batch size. If reduction is not 'none' (default 'mean'), then

\[\begin{split}\ell(x, y) = \begin{cases} \sum_{n=1}^N \frac{1}{\sum_{n=1}^N w_{y_n}} l_n, & \text{if reduction} = \text{`mean';}\\ \sum_{n=1}^N l_n, & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]
Parameters:
  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size C. Otherwise, it is treated as if having all ones.

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: None

  • ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient. When size_average is True, the loss is averaged over non-ignored targets.

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: None

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the weighted mean of the output is taken, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((N, C)\) or \((C)\), where C = number of classes, or \((N, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss.

  • Target: \((N)\) or \(()\), where each value is \(0 \leq \text{targets}[i] \leq C-1\), or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss.

  • Output: If reduction is 'none', shape \((N)\) or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss. Otherwise, scalar.

Examples:

>>> m = nn.LogSoftmax(dim=1)
>>> loss = nn.NLLLoss()
>>> # input is of size N x C = 3 x 5
>>> input = torch.randn(3, 5, requires_grad=True)
>>> # each element in target has to have 0 <= value < C
>>> target = torch.tensor([1, 0, 4])
>>> output = loss(m(input), target)
>>> output.backward()
>>>
>>>
>>> # 2D loss example (used, for example, with image inputs)
>>> N, C = 5, 4
>>> loss = nn.NLLLoss()
>>> # input is of size N x C x height x width
>>> data = torch.randn(N, 16, 10, 10)
>>> conv = nn.Conv2d(16, C, (3, 3))
>>> m = nn.LogSoftmax(dim=1)
>>> # each element in target has to have 0 <= value < C
>>> target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
>>> output = loss(m(conv(data)), target)
>>> output.backward()
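
The weighted 'mean' reduction above divides by the sum of the selected class weights rather than by the batch size. As a minimal, illustrative check (the tensor values here are arbitrary), the snippet below compares the reduced loss against that formula:

>>> import torch
>>> from torch import nn
>>> weight = torch.tensor([1.0, 2.0, 0.5])
>>> log_probs = nn.LogSoftmax(dim=1)(torch.randn(4, 3))
>>> target = torch.tensor([0, 2, 1, 1])
>>> loss = nn.NLLLoss(weight=weight)(log_probs, target)
>>> # manual weighted mean: -sum(w[y_n] * x[n, y_n]) / sum(w[y_n])
>>> w = weight[target]
>>> manual = -(w * log_probs[torch.arange(4), target]).sum() / w.sum()
>>> torch.allclose(loss, manual)
True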

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

ignore_index: int
class xenonpy.model.training.loss.NLLLoss2d(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')[source]

Bases: NLLLoss

Initializes internal Module state, shared by both nn.Module and ScriptModule.

ignore_index: int
reduction: str
training: bool
weight: Optional[Tensor]
class xenonpy.model.training.loss.PoissonNLLLoss(log_input=True, full=False, size_average=None, eps=1e-08, reduce=None, reduction='mean')[source]

Bases: _Loss

Negative log likelihood loss with Poisson distribution of target.

The loss can be described as:

\[ \begin{align}\begin{aligned}\text{target} \sim \mathrm{Poisson}(\text{input})\\\text{loss}(\text{input}, \text{target}) = \text{input} - \text{target} * \log(\text{input}) + \log(\text{target!})\end{aligned}\end{align} \]

The last term can be omitted or approximated with Stirling's formula. The approximation is applied to target values greater than 1; for targets less than or equal to 1, zeros are added to the loss.

Parameters:
  • log_input (bool, optional) – if True the loss is computed as \(\exp(\text{input}) - \text{target}*\text{input}\), if False the loss is \(\text{input} - \text{target}*\log(\text{input}+\text{eps})\).

  • full (bool, optional) –

    whether to compute the full loss, i.e., to add the Stirling approximation term

    \[\text{target}*\log(\text{target}) - \text{target} + 0.5 * \log(2\pi\text{target}).\]

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • eps (float, optional) – Small value to avoid evaluation of \(\log(0)\) when log_input = False. Default: 1e-8

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Examples:

>>> loss = nn.PoissonNLLLoss()
>>> log_input = torch.randn(5, 2, requires_grad=True)
>>> target = torch.randn(5, 2)
>>> output = loss(log_input, target)
>>> output.backward()
Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

  • Output: scalar by default. If reduction is 'none', then \((*)\), the same shape as the input.
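
When a model outputs rates directly rather than log-rates, log_input=False applies the eps-stabilized form given above. A minimal, illustrative sketch (the tensor values are arbitrary, and torch.poisson is used only to produce a plausible integer-valued target):

>>> import torch
>>> from torch import nn
>>> loss = nn.PoissonNLLLoss(log_input=False, full=True)
>>> rate = torch.rand(5, 2, requires_grad=True) + 0.1   # strictly positive rates
>>> target = torch.poisson(torch.ones(5, 2) * 3.0)
>>> output = loss(rate, target)
>>> output.backward()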

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(log_input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

eps: float
full: bool
log_input: bool
class xenonpy.model.training.loss.SmoothL1Loss(size_average=None, reduce=None, reduction='mean', beta=1.0)[source]

Bases: _Loss

Creates a criterion that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise. It is less sensitive to outliers than torch.nn.MSELoss and in some cases prevents exploding gradients (e.g. see the paper Fast R-CNN by Ross Girshick).

For a batch of size \(N\), the unreduced loss can be described as:

\[\ell(x, y) = L = \{l_1, ..., l_N\}^T\]

with

\[\begin{split}l_n = \begin{cases} 0.5 (x_n - y_n)^2 / beta, & \text{if } |x_n - y_n| < beta \\ |x_n - y_n| - 0.5 * beta, & \text{otherwise } \end{cases}\end{split}\]

If reduction is not none, then:

\[\begin{split}\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

Note

Smooth L1 loss can be seen as exactly L1Loss, but with the \(|x - y| < beta\) portion replaced with a quadratic function such that its slope is 1 at \(|x - y| = beta\). The quadratic segment smooths the L1 loss near \(|x - y| = 0\).

Note

Smooth L1 loss is closely related to HuberLoss, being equivalent to \(huber(x, y) / beta\) (note that Smooth L1’s beta hyper-parameter is also known as delta for Huber). This leads to the following differences:

  • As beta -> 0, Smooth L1 loss converges to L1Loss, while HuberLoss converges to a constant 0 loss. When beta is 0, Smooth L1 loss is equivalent to L1 loss.

  • As beta -> \(+\infty\), Smooth L1 loss converges to a constant 0 loss, while HuberLoss converges to MSELoss.

  • For Smooth L1 loss, as beta varies, the L1 segment of the loss has a constant slope of 1. For HuberLoss, the slope of the L1 segment is beta.

Parameters:
  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

  • beta (float, optional) – Specifies the threshold at which to change between L1 and L2 loss. The value must be non-negative. Default: 1.0

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

  • Output: scalar. If reduction is 'none', then \((*)\), same shape as the input.
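
A minimal usage sketch (the tensor values and the beta choice are arbitrary):

>>> import torch
>>> from torch import nn
>>> loss = nn.SmoothL1Loss(beta=0.5)   # quadratic below |x - y| = 0.5, L1-like above it
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5)
>>> output = loss(input, target)
>>> output.backward()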

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.SoftMarginLoss(size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Creates a criterion that optimizes a two-class classification logistic loss between input tensor \(x\) and target tensor \(y\) (containing 1 or -1).

\[\text{loss}(x, y) = \sum_i \frac{\log(1 + \exp(-y[i]*x[i]))}{\text{x.nelement}()}\]
Parameters:
  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((*)\), where \(*\) means any number of dimensions.

  • Target: \((*)\), same shape as the input.

  • Output: scalar. If reduction is 'none', then \((*)\), same shape as input.
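
A minimal usage sketch (the tensor values are arbitrary; targets must be +1 or -1):

>>> import torch
>>> from torch import nn
>>> loss = nn.SoftMarginLoss()
>>> input = torch.randn(4, requires_grad=True)
>>> target = torch.tensor([1.0, -1.0, 1.0, -1.0])
>>> output = loss(input, target)
>>> output.backward()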

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reduction: str
class xenonpy.model.training.loss.TripletMarginLoss(margin=1.0, p=2.0, eps=1e-06, swap=False, size_average=None, reduce=None, reduction='mean')[source]

Bases: _Loss

Creates a criterion that measures the triplet loss given input tensors \(x1\), \(x2\), \(x3\) and a margin with a value greater than \(0\). This is used for measuring a relative similarity between samples. A triplet is composed of a, p and n (i.e., anchor, positive example and negative example, respectively). The shapes of all input tensors should be \((N, D)\).

The distance swap is described in detail in the paper Learning shallow convolutional feature descriptors with triplet losses by V. Balntas, E. Riba et al.

The loss function for each sample in the mini-batch is:

\[L(a, p, n) = \max \{d(a_i, p_i) - d(a_i, n_i) + {\rm margin}, 0\}\]

where

\[d(x_i, y_i) = \left\lVert {\bf x}_i - {\bf y}_i \right\rVert_p\]

See also TripletMarginWithDistanceLoss, which computes the triplet margin loss for input tensors using a custom distance function.

Parameters:
  • margin (float, optional) – Margin value; must be greater than \(0\). Default: \(1\).

  • p (int, optional) – The norm degree for pairwise distance. Default: \(2\).

  • swap (bool, optional) – The distance swap is described in detail in the paper Learning shallow convolutional feature descriptors with triplet losses by V. Balntas, E. Riba et al. Default: False.

  • size_average (bool, optional) – Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional) – Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

Shape:
  • Input: \((N, D)\) or \((D)\) where \(D\) is the vector dimension.

  • Output: A Tensor of shape \((N)\) if reduction is 'none' and input shape is \((N, D)\); a scalar otherwise.

Examples:

>>> triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
>>> anchor = torch.randn(100, 128, requires_grad=True)
>>> positive = torch.randn(100, 128, requires_grad=True)
>>> negative = torch.randn(100, 128, requires_grad=True)
>>> output = triplet_loss(anchor, positive, negative)
>>> output.backward()

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(anchor, positive, negative)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

eps: float
margin: float
p: float
swap: bool
class xenonpy.model.training.loss.TripletMarginWithDistanceLoss(*, distance_function=None, margin=1.0, swap=False, reduction='mean')[source]

Bases: _Loss

Creates a criterion that measures the triplet loss given input tensors \(a\), \(p\), and \(n\) (representing anchor, positive, and negative examples, respectively), and a nonnegative, real-valued function (“distance function”) used to compute the relationship between the anchor and positive example (“positive distance”) and the anchor and negative example (“negative distance”).

The unreduced loss (i.e., with reduction set to 'none') can be described as:

\[\ell(a, p, n) = L = \{l_1,\dots,l_N\}^\top, \quad l_i = \max \{d(a_i, p_i) - d(a_i, n_i) + {\rm margin}, 0\}\]

where \(N\) is the batch size; \(d\) is a nonnegative, real-valued function quantifying the closeness of two tensors, referred to as the distance_function; and \(margin\) is a nonnegative margin representing the minimum difference between the positive and negative distances that is required for the loss to be 0. The input tensors have \(N\) elements each and can be of any shape that the distance function can handle.

If reduction is not 'none' (default 'mean'), then:

\[\begin{split}\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}\end{split}\]

See also TripletMarginLoss, which computes the triplet loss for input tensors using the \(l_p\) distance as the distance function.

Parameters:
  • distance_function (Callable, optional) – A nonnegative, real-valued function that quantifies the closeness of two tensors. If not specified, nn.PairwiseDistance will be used. Default: None

  • margin (float, optional) – A nonnegative margin representing the minimum difference between the positive and negative distances required for the loss to be 0. Larger margins penalize cases where the negative examples are not distant enough from the anchors, relative to the positives. Default: \(1\).

  • swap (bool, optional) – Whether to use the distance swap described in the paper Learning shallow convolutional feature descriptors with triplet losses by V. Balntas, E. Riba et al. If True, and if the positive example is closer to the negative example than the anchor is, swaps the positive example and the anchor in the loss computation. Default: False.

  • reduction (str, optional) – Specifies the (optional) reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Shape:
  • Input: \((N, *)\) where \(*\) represents any number of additional dimensions as supported by the distance function.

  • Output: A Tensor of shape \((N)\) if reduction is 'none', or a scalar otherwise.

Examples:

>>> # Initialize embeddings
>>> embedding = nn.Embedding(1000, 128)
>>> anchor_ids = torch.randint(0, 1000, (1,))
>>> positive_ids = torch.randint(0, 1000, (1,))
>>> negative_ids = torch.randint(0, 1000, (1,))
>>> anchor = embedding(anchor_ids)
>>> positive = embedding(positive_ids)
>>> negative = embedding(negative_ids)
>>>
>>> # Built-in Distance Function
>>> triplet_loss = \
>>>     nn.TripletMarginWithDistanceLoss(distance_function=nn.PairwiseDistance())
>>> output = triplet_loss(anchor, positive, negative)
>>> output.backward()
>>>
>>> # Custom Distance Function
>>> def l_infinity(x1, x2):
>>>     return torch.max(torch.abs(x1 - x2), dim=1).values
>>>
>>> triplet_loss = (
>>>     nn.TripletMarginWithDistanceLoss(distance_function=l_infinity, margin=1.5))
>>> output = triplet_loss(anchor, positive, negative)
>>> output.backward()
>>>
>>> # Custom Distance Function (Lambda)
>>> triplet_loss = (
>>>     nn.TripletMarginWithDistanceLoss(
>>>         distance_function=lambda x, y: 1.0 - F.cosine_similarity(x, y)))
>>> output = triplet_loss(anchor, positive, negative)
>>> output.backward()
Reference:

V. Balntas, et al.: Learning shallow convolutional feature descriptors with triplet losses: http://www.bmva.org/bmvc/2016/papers/paper119/index.html

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(anchor, positive, negative)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

margin: float
swap: bool

xenonpy.model.training.lr_scheduler module

class xenonpy.model.training.lr_scheduler.CosineAnnealingLR(*, T_max, eta_min=0, last_epoch=-1)[source]

Bases: BaseLRScheduler

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

\[\begin{split}\eta_{t+1} = \eta_{min} + (\eta_t - \eta_{min})\frac{1 + \cos(\frac{T_{cur}+1}{T_{max}}\pi)}{1 + \cos(\frac{T_{cur}}{T_{max}}\pi)}, T_{cur} \neq (2k+1)T_{max};\\ \eta_{t+1} = \eta_{t} + (\eta_{max} - \eta_{min})\frac{1 - \cos(\frac{1}{T_{max}}\pi)}{2}, T_{cur} = (2k+1)T_{max}.\\\end{split}\]

When last_epoch=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

Parameters:
  • T_max (int) – Maximum number of iterations.

  • eta_min (float) – Minimum learning rate. Default: 0.

  • last_epoch (int) – The index of last epoch. Default: -1.
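
Because this class only configures the schedule, the optimizer is supplied later by calling the configured object (see BaseLRScheduler.__call__). A minimal, illustrative sketch; the Linear model and the Adam settings are placeholders, not recommendations:

>>> import torch
>>> from xenonpy.model.training.optimizer import Adam
>>> from xenonpy.model.training.lr_scheduler import CosineAnnealingLR
>>> model = torch.nn.Linear(10, 1)
>>> opt = Adam(lr=1e-3)(model.parameters())        # BaseOptimizer -> torch Optimizer
>>> scheduler = CosineAnnealingLR(T_max=50)(opt)   # BaseLRScheduler -> torch scheduler
>>> for epoch in range(50):
>>>     # ... run one epoch of training with opt ...
>>>     scheduler.step()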

class xenonpy.model.training.lr_scheduler.CyclicLR(*, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle', cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)[source]

Bases: BaseLRScheduler

Sets the learning rate of each parameter group according to cyclical learning rate policy (CLR). The policy cycles the learning rate between two boundaries with a constant frequency, as detailed in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled on a per-iteration or per-cycle basis.

Cyclical learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.

This class has three built-in policies, as put forth in the paper:

“triangular”:

A basic triangular cycle w/ no amplitude scaling.

“triangular2”:

A basic triangular cycle that scales initial amplitude by half each cycle.

“exp_range”:

A cycle that scales initial amplitude by gamma**(cycle iterations) at each cycle iteration.

This implementation was adapted from the github repo: bckenstler/CLR

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer. It is not passed to this constructor; supply it by calling the configured scheduler object (see BaseLRScheduler.__call__).

  • base_lr (float or list) – Initial learning rate which is the lower boundary in the cycle for each parameter group.

  • max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.

  • step_size_up (int) – Number of training iterations in the increasing half of a cycle. Default: 2000

  • step_size_down (int) – Number of training iterations in the decreasing half of a cycle. If step_size_down is None, it is set to step_size_up. Default: None

  • mode (str) – One of {triangular, triangular2, exp_range}. Values correspond to policies detailed above. If scale_fn is not None, this argument is ignored. Default: ‘triangular’

  • gamma (float) – Constant in ‘exp_range’ scaling function: gamma**(cycle iterations) Default: 1.0

  • scale_fn (function) – Custom scaling policy defined by a single argument lambda function, where 0 <= scale_fn(x) <= 1 for all x >= 0. If specified, then ‘mode’ is ignored. Default: None

  • scale_mode (str) – {‘cycle’, ‘iterations’}. Defines whether scale_fn is evaluated on cycle number or cycle iterations (training iterations since start of cycle). Default: ‘cycle’

  • cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True

  • base_momentum (float or list) – Initial momentum which is the lower boundary in the cycle for each parameter group. Default: 0.8

  • max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached depending on scaling function. Default: 0.9

  • last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1
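
Since the cyclical policy updates the learning rate after every batch, step() belongs inside the batch loop rather than the epoch loop. A minimal, illustrative sketch; the model, SGD settings, and loop length are placeholders:

>>> import torch
>>> from xenonpy.model.training.optimizer import SGD
>>> from xenonpy.model.training.lr_scheduler import CyclicLR
>>> model = torch.nn.Linear(10, 1)
>>> opt = SGD(lr=0.001, momentum=0.9)(model.parameters())
>>> scheduler = CyclicLR(base_lr=0.001, max_lr=0.01, step_size_up=200)(opt)
>>> for step in range(1000):
>>>     # ... train on one mini-batch with opt ...
>>>     scheduler.step()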

class xenonpy.model.training.lr_scheduler.ExponentialLR(*, gamma, last_epoch=-1)[source]

Bases: BaseLRScheduler

Decays the learning rate of each parameter group by gamma every epoch. When last_epoch=-1, sets initial lr as lr.

Parameters:
  • gamma (float) – Multiplicative factor of learning rate decay.

  • last_epoch (int) – The index of last epoch. Default: -1.
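
A minimal usage sketch following the same deferred-binding pattern as the other schedulers in this module (the model and optimizer choices are placeholders):

>>> import torch
>>> from xenonpy.model.training.optimizer import Adam
>>> from xenonpy.model.training.lr_scheduler import ExponentialLR
>>> model = torch.nn.Linear(10, 1)
>>> opt = Adam(lr=1e-2)(model.parameters())
>>> scheduler = ExponentialLR(gamma=0.95)(opt)   # lr is multiplied by 0.95 at each step()
>>> for epoch in range(10):
>>>     # ... train one epoch with opt ...
>>>     scheduler.step()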

class xenonpy.model.training.lr_scheduler.LambdaLR(*, lr_lambda, last_epoch=-1)[source]

Bases: BaseLRScheduler

Sets the learning rate of each parameter group to the initial lr times a given function. When last_epoch=-1, sets initial lr as lr.

Parameters:
  • lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.

  • last_epoch (int) – The index of last epoch. Default: -1.

Example

>>> # Assuming optimizer has two groups.
>>> lambda1 = lambda epoch: epoch // 30
>>> lambda2 = lambda epoch: 0.95 ** epoch
>>> scheduler = LambdaLR(lr_lambda=[lambda1, lambda2])(optimizer)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
class xenonpy.model.training.lr_scheduler.MultiStepLR(*, milestones, gamma=0.1, last_epoch=-1)[source]

Bases: BaseLRScheduler

Decays the learning rate of each parameter group by gamma once the number of epoch reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.

Parameters:
  • milestones (list) – List of epoch indices. Must be increasing.

  • gamma (float) – Multiplicative factor of learning rate decay. Default: 0.1.

  • last_epoch (int) – The index of last epoch. Default: -1.

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 80
>>> # lr = 0.0005   if epoch >= 80
>>> scheduler = MultiStepLR(milestones=[30, 80], gamma=0.1)(optimizer)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
class xenonpy.model.training.lr_scheduler.ReduceLROnPlateau(*, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)[source]

Bases: BaseLRScheduler

Reduce the learning rate when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metric quantity and, if no improvement is seen for a ‘patience’ number of epochs, reduces the learning rate.

Parameters:
  • mode (str) – One of min, max. In min mode, lr will be reduced when the quantity monitored has stopped decreasing; in max mode it will be reduced when the quantity monitored has stopped increasing. Default: ‘min’.

  • factor (float) – Factor by which the learning rate will be reduced. new_lr = lr * factor. Default: 0.1.

  • patience (int) – Number of epochs with no improvement after which learning rate will be reduced. For example, if patience = 2, then we will ignore the first 2 epochs with no improvement, and will only decrease the LR after the 3rd epoch if the loss still hasn’t improved then. Default: 10.

  • verbose (bool) – If True, prints a message to stdout for each update. Default: False.

  • threshold (float) – Threshold for measuring the new optimum, to only focus on significant changes. Default: 1e-4.

  • threshold_mode (str) – One of rel, abs. In rel mode, dynamic_threshold = best * ( 1 + threshold ) in ‘max’ mode or best * ( 1 - threshold ) in min mode. In abs mode, dynamic_threshold = best + threshold in max mode or best - threshold in min mode. Default: ‘rel’.

  • cooldown (int) – Number of epochs to wait before resuming normal operation after lr has been reduced. Default: 0.

  • min_lr (float or list) – A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. Default: 0.

  • eps (float) – Minimal decay applied to lr. If the difference between new and old lr is smaller than eps, the update is ignored. Default: 1e-8.
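
Unlike the epoch-indexed schedulers above, step() here expects the monitored metric. A minimal, illustrative sketch; the model, optimizer settings, and the random stand-in for a validation loss are placeholders:

>>> import torch
>>> from xenonpy.model.training.optimizer import Adam
>>> from xenonpy.model.training.lr_scheduler import ReduceLROnPlateau
>>> model = torch.nn.Linear(10, 1)
>>> opt = Adam(lr=1e-3)(model.parameters())
>>> scheduler = ReduceLROnPlateau(mode='min', factor=0.5, patience=5)(opt)
>>> for epoch in range(100):
>>>     val_loss = torch.rand(1).item()   # placeholder for a real validation loss
>>>     scheduler.step(val_loss)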

class xenonpy.model.training.lr_scheduler.StepLR(*, step_size, gamma=0.1, last_epoch=-1)[source]

Bases: BaseLRScheduler

Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.

Parameters:
  • step_size (int) – Period of learning rate decay.

  • gamma (float) – Multiplicative factor of learning rate decay. Default: 0.1.

  • last_epoch (int) – The index of last epoch. Default: -1.

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 60
>>> # lr = 0.0005   if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(step_size=30, gamma=0.1)(optimizer)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

xenonpy.model.training.optimizer module

class xenonpy.model.training.optimizer.ASGD(*, lr=0.002, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)[source]

Bases: BaseOptimizer

Implements Averaged Stochastic Gradient Descent.

It has been proposed in Acceleration of stochastic approximation by averaging.

Parameters:
  • lr (float, optional) – learning rate (default: 1e-2)

  • lambd (float, optional) – decay term (default: 1e-4)

  • alpha (float, optional) – power for eta update (default: 0.75)

  • t0 (float, optional) – point at which to start averaging (default: 1e6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

class xenonpy.model.training.optimizer.Adadelta(*, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)[source]

Bases: BaseOptimizer

Implements Adadelta algorithm.

It has been proposed in ADADELTA: An Adaptive Learning Rate Method.

Parameters:
  • rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)

  • lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

class xenonpy.model.training.optimizer.Adagrad(*, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)[source]

Bases: BaseOptimizer

Implements Adagrad algorithm.

It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

Parameters:
  • lr (float, optional) – learning rate (default: 1e-2)

  • lr_decay (float, optional) – learning rate decay (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

class xenonpy.model.training.optimizer.Adam(*, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)[source]

Bases: BaseOptimizer

Implements Adam algorithm.

It has been proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization.

Parameters:
  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
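
Like the other classes in this module, this is a configure-first wrapper: parameters are bound later through BaseOptimizer.__call__, which returns the underlying torch optimizer. A minimal, illustrative sketch; the Linear model, data shapes, and hyper-parameters are placeholders:

>>> import torch
>>> from xenonpy.model.training.optimizer import Adam
>>> model = torch.nn.Linear(10, 1)
>>> adam = Adam(lr=1e-3, weight_decay=1e-5)   # configure only; no parameters yet
>>> opt = adam(model.parameters())            # bind parameters, get a torch.optim.Optimizer
>>> x, y = torch.randn(8, 10), torch.randn(8, 1)
>>> loss = torch.nn.functional.mse_loss(model(x), y)
>>> opt.zero_grad()
>>> loss.backward()
>>> opt.step()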

class xenonpy.model.training.optimizer.Adamax(*, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]

Bases: BaseOptimizer

Implements Adamax algorithm (a variant of Adam based on infinity norm).

It has been proposed in Adam: A Method for Stochastic Optimization.

Parameters:
  • lr (float, optional) – learning rate (default: 2e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

class xenonpy.model.training.optimizer.LBFGS(*, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)[source]

Bases: BaseOptimizer

Implements L-BFGS algorithm.

Warning

This optimizer doesn’t support per-parameter options and parameter groups (there can be only one).

Warning

Right now all parameters have to be on a single device. This will be improved in the future.

Note

This is a very memory intensive optimizer (it requires additional param_bytes * (history_size + 1) bytes). If it doesn’t fit in memory try reducing the history size, or use a different algorithm.

Parameters:
  • lr (float) – learning rate (default: 1)

  • max_iter (int) – maximal number of iterations per optimization step (default: 20)

  • max_eval (int) – maximal number of function evaluations per optimization step (default: max_iter * 1.25).

  • tolerance_grad (float) – termination tolerance on first order optimality (default: 1e-5).

  • tolerance_change (float) – termination tolerance on function value/parameter changes (default: 1e-9).

  • history_size (int) – update history size (default: 100).

  • line_search_fn (str) – either 'strong_wolfe' or None (default: None).
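
Because L-BFGS re-evaluates the objective several times per step, the underlying torch optimizer's step() expects a closure that recomputes the loss. A minimal, illustrative sketch; the quadratic objective is a placeholder:

>>> import torch
>>> from xenonpy.model.training.optimizer import LBFGS
>>> w = torch.nn.Parameter(torch.randn(3))
>>> opt = LBFGS(lr=1.0, max_iter=20)([w])   # bind the parameter list via __call__
>>> def closure():
>>>     opt.zero_grad()
>>>     loss = ((w - torch.tensor([1.0, 2.0, 3.0])) ** 2).sum()
>>>     loss.backward()
>>>     return loss
>>> opt.step(closure)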

class xenonpy.model.training.optimizer.RMSprop(*, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)[source]

Bases: BaseOptimizer

Implements RMSprop algorithm.

Proposed by G. Hinton in his course.

The centered version first appears in Generating Sequences With Recurrent Neural Networks.

Parameters:
  • lr (float, optional) – learning rate (default: 1e-2)

  • momentum (float, optional) – momentum factor (default: 0)

  • alpha (float, optional) – smoothing constant (default: 0.99)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • centered (bool, optional) – if True, compute the centered RMSprop, in which the gradient is normalized by an estimate of its variance

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

class xenonpy.model.training.optimizer.Rprop(*, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))[source]

Bases: BaseOptimizer

Implements the resilient backpropagation algorithm.

Parameters:
  • lr (float, optional) – learning rate (default: 1e-2)

  • etas (Tuple[float, float], optional) – pair of (etaminus, etaplus), which are multiplicative increase and decrease factors (default: (0.5, 1.2))

  • step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))

class xenonpy.model.training.optimizer.SGD(*, lr=0.01, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]

Bases: BaseOptimizer

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters:
  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

Example

>>> optimizer = SGD(lr=0.1, momentum=0.9)
>>> opt = optimizer(model.parameters())
>>> opt.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> opt.step()

Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et. al. and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

\[\begin{split}v = \rho * v + g \\ p = p - lr * v\end{split}\]

where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et. al. and other frameworks which employ an update of the form

\[\begin{split}v = \rho * v + lr * g \\ p = p - v\end{split}\]

The Nesterov version is analogously modified.

class xenonpy.model.training.optimizer.SparseAdam(*, lr=0.001, betas=(0.9, 0.999), eps=1e-08)[source]

Bases: BaseOptimizer

Implements lazy version of Adam algorithm suitable for sparse tensors.

In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.

Parameters:
  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

xenonpy.model.training.trainer module

class xenonpy.model.training.trainer.Trainer(*, loss_func=None, optimizer=None, model=None, lr_scheduler=None, clip_grad=None, epochs=200, cuda=False, non_blocking=False)[source]

Bases: BaseRunner

NN model trainer.

Parameters:
checkpoint_tuple

alias of checkpoint

results_tuple

alias of results

__call__(x_train=None, y_train=None, x_val=None, y_val=None, *, training_dataset=None, validation_dataset=None, epochs=None, checkpoint=None, **model_params)[source]

Train the Neural Network model

Parameters:
  • x_train (Union[Tensor, Tuple[Tensor], None]) – Training data. Ignored when training_dataset is given.

  • y_train (Optional[Any]) – Training target data. Ignored when training_dataset is given.

  • training_dataset (DataLoader) – Torch DataLoader. If given, it will be used as the only training dataset. Iterating over it should yield a tuple containing x_train and y_train, in that order.

  • x_val (Union[Any, Tuple[Any]]) – Data for validation.

  • y_val (Any) – Data for validation.

  • validation_dataset (DataLoader) –

  • epochs (int) – Epochs. If not None, it will overwrite self.epochs temporarily.

  • checkpoint (Union[bool, int, Callable[[int], bool]]) – If True, model states are saved at every step. If an int, model states are saved every checkpoint steps. If a Callable, the function should take the current total_epochs as input and return a bool.

  • model_params (dict) – Other model parameters.

Yields:

namedtuple

early_stop(msg)[source]
fit(x_train=None, y_train=None, x_val=None, y_val=None, *, training_dataset=None, validation_dataset=None, epochs=None, checkpoint=None, progress_bar='auto', **model_params)[source]

Train the Neural Network model

Parameters:
  • x_train (Union[torch.Tensor, Tuple[torch.Tensor]]) – Training data. Ignored when training_dataset is given.

  • y_train (torch.Tensor) – Training target data. Ignored when training_dataset is given.

  • training_dataset (DataLoader) – Torch DataLoader. If given, it will be used as the only training dataset. Iterating over it should yield a tuple containing x_train and y_train, in that order.

  • x_val (Union[Any, Tuple[Any]]) – Data for validation.

  • y_val (Any) – Data for validation.

  • validation_dataset (DataLoader) –

  • epochs (int) – Epochs. If not None, it will overwrite self.epochs temporarily.

  • checkpoint (Union[bool, int, Callable[[int], bool]]) – If True, model states are saved at every step. If an int, model states are saved every checkpoint steps. If a Callable, the function should take the current total_epochs as input and return a bool.

  • progress_bar (Optional[str]) – Show a progress bar for training steps. Can be ‘auto’, ‘console’, or None. See https://pypi.org/project/tqdm/#ipython-jupyter-integration for more details.

  • model_params (dict) – Other model parameters.
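
A minimal, illustrative sketch of how the pieces documented in this package can be wired together; the toy regression model, tensor shapes, and hyper-parameters are placeholders, and real workflows may also attach extensions via extend():

>>> import torch
>>> from xenonpy.model.training.trainer import Trainer
>>> from xenonpy.model.training.optimizer import Adam
>>> from xenonpy.model.training.lr_scheduler import ExponentialLR
>>> from xenonpy.model.training.clip_grad import ClipNorm
>>> model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
>>> trainer = Trainer(model=model, loss_func=torch.nn.MSELoss(), optimizer=Adam(lr=1e-3),
>>>                   lr_scheduler=ExponentialLR(gamma=0.99), clip_grad=ClipNorm(max_norm=1.0),
>>>                   epochs=50)
>>> x_train, y_train = torch.randn(256, 10), torch.randn(256, 1)
>>> x_val, y_val = torch.randn(64, 10), torch.randn(64, 1)
>>> trainer.fit(x_train, y_train, x_val, y_val)
>>> y_pred = trainer.predict(x_val)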

classmethod from_checker(checker, *, loss_func=None, optimizer=None, lr_scheduler=None, clip_grad=None, epochs=200, cuda=False, non_blocking=False)[source]

Load a model from a local path or a xenonpy.model.training.Checker.

Parameters:
Return type:

Trainer

get_checkpoint(checkpoint=None)[source]
classmethod load(cls, from_, *, loss_func=None, optimizer=None, lr_scheduler=None, clip_grad=None, epochs=200, cuda=False, non_blocking=False)[source]
Return type:

Trainer

predict(x_in=None, y_true=None, *, dataset=None, checkpoint=None, **model_params)[source]

Predict from x input. This is just a simple wrapper for predict().

Parameters:
Returns:

Prediction results. If y_true is not None, y_true is returned as the second element.

Return type:

ret

reset(*, to=None, remove_checkpoints=True)[source]

Reset trainer. This will reset all trainer states and drop all training step information.

Parameters:
  • to (Union[bool, Module]) – Bind the trainer to the given model, or reset the current model to its initial state.

  • remove_checkpoints (bool) – Set to True to remove all checkpoints when resetting. Default: True.

set_checkpoint(id_=None)[source]
to_namedtuple()[source]
property checkpoints
property clip_grad
property device
property epochs
property loss_func
property loss_type
property lr_scheduler
property model
property non_blocking
property optimizer
property timer
property total_epochs
property total_iterations
property training_info
property validate_dataset
property x_val
property y_val

Module contents