neuralop.training.Trainer

class neuralop.training.Trainer(*, model: Module, n_epochs: int, wandb_log: bool = False, device: str = 'cpu', mixed_precision: bool = False, data_processor: Module = None, eval_interval: int = 1, log_output: bool = False, use_distributed: bool = False, verbose: bool = False)[source]

A general Trainer class to train neural operators on given datasets.

Note

Our Trainer expects datasets to provide batches as key-value dictionaries, e.g. {'x': x, 'y': y}, keyed to the arguments expected by models and losses. For specifics and an example, see neuralop.data.datasets.DarcyDataset.
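
A minimal sketch of such a dataset (the class name DictBatchDataset, shapes, and sizes are illustrative, not part of the library):

    import torch
    from torch.utils.data import Dataset, DataLoader

    class DictBatchDataset(Dataset):
        # Illustrative dataset: each sample is a dict keyed to the
        # model input 'x' and target 'y', as the Trainer expects.
        def __init__(self, n_samples=100, resolution=16):
            self.x = torch.randn(n_samples, 1, resolution, resolution)
            self.y = torch.randn(n_samples, 1, resolution, resolution)

        def __len__(self):
            return len(self.x)

        def __getitem__(self, idx):
            return {'x': self.x[idx], 'y': self.y[idx]}

    train_loader = DataLoader(DictBatchDataset(), batch_size=8)
    test_loader = DataLoader(DictBatchDataset(n_samples=20), batch_size=8)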

Parameters:
model : nn.Module
n_epochs : int
wandb_log : bool, default is False

whether to log results to wandb

device : torch.device, or str 'cpu' or 'cuda'
mixed_precision : bool, default is False

whether to use torch.autocast to compute mixed precision

data_processor : DataProcessor class to transform data, default is None

if not None, data from the loaders is transformed first with data_processor.preprocess; then, after getting an output from the model, that output is transformed with data_processor.postprocess.

eval_interval : int, default is 1

how frequently to evaluate the model and log training stats

log_output : bool, default is False

if True, and if wandb_log is also True, log output images to wandb

use_distributed : bool, default is False

whether to use DDP

verbose : bool, default is False
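
A minimal construction sketch, assuming an FNO model from neuralop.models (the hyperparameters are placeholders):

    from neuralop.models import FNO
    from neuralop.training import Trainer

    # Small 2D FNO; any nn.Module with a matching forward signature works.
    model = FNO(n_modes=(16, 16), hidden_channels=32,
                in_channels=1, out_channels=1)

    trainer = Trainer(
        model=model,
        n_epochs=20,
        device='cuda',        # or 'cpu'
        eval_interval=1,
        wandb_log=False,
        verbose=True,
    )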

Methods

checkpoint(save_dir)

checkpoint saves current training state to a directory for resuming later.

eval_one_batch(sample, eval_losses[, ...])

eval_one_batch runs inference on one batch and returns eval_losses for that batch.

eval_one_batch_autoreg(sample, eval_losses)

eval_one_batch_autoreg autoregressively runs inference on one batch and returns eval_losses for that batch.

evaluate(loss_dict, data_loader[, ...])

Evaluates the model on a dictionary of losses

evaluate_all(epoch, eval_losses, ...[, ...])

evaluate_all iterates through the entire dict of test_loaders to perform evaluation on the whole dataset stored in each one.

log_eval(epoch, eval_metrics)

log_eval logs outputs from evaluation on all test loaders to stdout and wandb

log_training(epoch, time, avg_loss, train_err)

Basic method to log results from a single training epoch.

on_epoch_start(epoch)

on_epoch_start runs at the beginning of each training epoch.

resume_state_from_dir(save_dir)

Resume training from save_dir created by neuralop.training.save_training_state

train(train_loader, test_loaders, optimizer, ...)

Trains the given model on the given dataset.

train_one_batch(idx, sample, training_loss)

Run one batch of input through model

train_one_epoch(epoch, train_loader, ...)

train_one_epoch trains self.model on train_loader for one epoch and returns training metrics

train(train_loader, test_loaders, optimizer, scheduler, regularizer=None, training_loss=None, eval_losses=None, eval_modes=None, save_every: int = None, save_best: str = None, save_dir: str | Path = './ckpt', resume_from_dir: str | Path = None, max_autoregressive_steps: int = None)[source]

Trains the given model on the given dataset.

If a device is provided, the model and data processor are loaded to device here.

Parameters:
train_loader: torch.utils.data.DataLoader

training dataloader

test_loaders: dict[torch.utils.data.DataLoader]

testing dataloaders

optimizer: torch.optim.Optimizer

optimizer to use during training

scheduler: torch.optim.lr_scheduler

learning rate scheduler to use during training

training_loss: training.losses function

cost function to minimize

eval_losses: dict[Loss]

dict of losses to use in self.eval()

eval_modes: dict[str], optional

optional mapping from the name of each loader to its evaluation mode:

  • if 'single_step', predicts one input-output pair and evaluates the loss.

  • if 'autoregressive', autoregressively predicts the output, using the last step's output as input, for a number of steps defined by the temporal dimension of the batch. This requires specially batched data with a data processor whose .preprocess and .postprocess both take idx as an argument.

save_every: int, optional, default is None

if provided, interval at which to save checkpoints

save_best: str, optional, default is None

if provided, key of the metric f"{loader_name}_{loss_name}" to monitor; the model with the best eval result is saved. Overrides save_every and saves on eval_interval.

save_dir: str | Path, default “./ckpt”

directory at which to save training states if save_every and/or save_best is provided

resume_from_dir: str | Path, default None

if provided, resumes training state (model, optimizer, regularizer, scheduler) from state saved in resume_from_dir

max_autoregressive_steps: int, default None

if provided, and a dataloader is to be evaluated in autoregressive mode, limits the number of autoregressive steps performed in each rollout.

Returns:
all_metrics: dict

dictionary keyed f”{loader_name}_{loss_name}” of metric results for last validation epoch across all test_loaders
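
A sketch of a typical call, reusing the trainer and loader sketches above; 'test' is a hypothetical loader name, and the loss classes are assumed importable from neuralop.losses:

    import torch
    from neuralop.losses import LpLoss, H1Loss

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

    l2loss = LpLoss(d=2, p=2)
    h1loss = H1Loss(d=2)

    all_metrics = trainer.train(
        train_loader=train_loader,
        test_loaders={'test': test_loader},   # 'test' is a hypothetical name
        optimizer=optimizer,
        scheduler=scheduler,
        training_loss=h1loss,
        eval_losses={'l2': l2loss, 'h1': h1loss},
        save_best='test_l2',   # monitors the f"{loader_name}_{loss_name}" key
        save_dir='./ckpt',
    )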

train_one_epoch(epoch, train_loader, training_loss)[source]

train_one_epoch trains self.model on train_loader for one epoch and returns training metrics

Parameters:
epoch : int

epoch number

train_loader : torch.utils.data.DataLoader

data loader of train examples

training_loss : training.losses function

cost function to minimize

Returns:
training metrics for the epoch

evaluate_all(epoch, eval_losses, test_loaders, eval_modes, max_autoregressive_steps=None)[source]

evaluate_all iterates through the entire dict of test_loaders to perform evaluation on the whole dataset stored in each one.

Parameters:
epoch : int

current training epoch

eval_losses : dict[Loss]

keyed loss_name: loss_obj for each pair. Full set of losses to use in evaluation for each test loader.

test_loaders : dict[DataLoader]

keyed loader_name: loader for each test loader.

eval_modes : dict[str], optional

keyed loader_name: eval_mode for each test loader. If eval_modes.get(loader_name) does not return a value, the evaluation is automatically performed in single_step mode.

max_autoregressive_steps : int, optional

if provided, and one of the test loaders has eval_mode == "autoregressive", limits the number of autoregressive steps performed per rollout.

Returns:
all_metrics: dict

collected eval metrics for each loader.
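
For example, with hypothetical loaders named 'val' and 'rollout', a mapping like the following evaluates 'rollout' autoregressively, while 'val', absent from the dict, falls back to single_step (l2loss is reused from the sketch above):

    eval_modes = {'rollout': 'autoregressive'}   # 'val' omitted -> single_step

    metrics = trainer.evaluate_all(
        epoch=0,
        eval_losses={'l2': l2loss},
        test_loaders={'val': val_loader, 'rollout': rollout_loader},
        eval_modes=eval_modes,
        max_autoregressive_steps=10,
    )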

evaluate(loss_dict, data_loader, log_prefix='', epoch=None, mode='single_step', max_steps=None)[source]

Evaluates the model on a dictionary of losses

Parameters:
loss_dict : dict of functions

each function takes as input a tuple (prediction, ground_truth) and returns the corresponding loss

data_loader : torch.utils.data.DataLoader

data loader to evaluate on

log_prefix : str, default is ''

if not '', used as prefix in the output dictionary

epoch : int | None, default is None

current epoch. Used when logging both train and eval.

mode : Literal {'single_step', 'autoregression'}

if 'single_step', performs standard evaluation; if 'autoregression', loops through max_steps steps

max_steps : int, optional

max number of steps for autoregressive rollout. If None, runs the full rollout.

Returns:
errors : dict

dict[f'{log_prefix}_{loss_name}'] = loss for loss in loss_dict
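
Called directly, a sketch looks like this (reusing l2loss and test_loader from the sketches above):

    errors = trainer.evaluate(
        loss_dict={'l2': l2loss},
        data_loader=test_loader,
        log_prefix='test',
    )
    print(errors['test_l2'])   # keyed f'{log_prefix}_{loss_name}'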

on_epoch_start(epoch)[source]

on_epoch_start runs at the beginning of each training epoch. This method is a stub that can be overwritten in more complex cases.

Parameters:
epoch : int

index of epoch

Returns:
None
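
Because on_epoch_start is a stub, per-epoch hooks go in a subclass; a minimal sketch (the subclass name and behavior are hypothetical):

    from neuralop.training import Trainer

    class VerboseTrainer(Trainer):
        # Hypothetical subclass: print a marker at each epoch boundary.
        def on_epoch_start(self, epoch):
            print(f'starting epoch {epoch}')
            return None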
train_one_batch(idx, sample, training_loss)[source]

Run one batch of input through the model and return the training loss on its outputs.

Parameters:
idx : int

index of batch within train_loader

sample : dict

data dictionary holding one batch

Returns:
loss: float | Tensor

value of the training loss for the batch

eval_one_batch(sample: dict, eval_losses: dict, return_output: bool = False)[source]

eval_one_batch runs inference on one batch and returns eval_losses for that batch.

Parameters:
sample : dict

data batch dictionary

eval_losses : dict

dictionary of named eval metrics

return_output : bool

whether to return model outputs for plotting, by default False

Returns:
eval_step_losses : dict

keyed "loss_name": step_loss_value for each loss name

outputs : torch.Tensor | None

optionally returns batch outputs

eval_one_batch_autoreg(sample: dict, eval_losses: dict, return_output: bool = False, max_steps: int = None)[source]

eval_one_batch_autoreg autoregressively runs inference on one batch and returns eval_losses for that batch.

Parameters:
sample : dict

data batch dictionary

eval_losses : dict

dictionary of named eval metrics

return_output : bool

whether to return model outputs for plotting, by default False

max_steps : int

number of timesteps to roll out, typically the full trajectory length. If max_steps is None, runs the full rollout.

Note

If a value for max_steps is not provided, a data_processor must be provided to handle rollout logic.

Returns:
eval_step_losses : dict

keyed "loss_name": step_loss_value for each loss name

outputs : torch.Tensor | None

optionally returns batch outputs

log_training(epoch: int, time: float, avg_loss: float, train_err: float, avg_lasso_loss: float = None, lr: float = None)[source]

Basic method to log results from a single training epoch.

Parameters:
epoch: int
time: float

training time of epoch

avg_loss: float

average training loss per individual sample

train_err: float

train error for entire epoch

avg_lasso_loss: float

average lasso loss from regularizer, optional

lr: float

learning rate at current epoch

log_eval(epoch: int, eval_metrics: dict)[source]

log_eval logs outputs from evaluation on all test loaders to stdout and wandb

Parameters:
epoch : int

current training epoch

eval_metrics : dict

metrics collected during evaluation, keyed f"{test_loader_name}_{metric}" for each test_loader

resume_state_from_dir(save_dir)[source]

Resume training from save_dir created by neuralop.training.save_training_state

checkpoint(save_dir)[source]

checkpoint saves current training state to a directory for resuming later. Only saves training state on the first GPU. See neuralop.training.training_state

Parameters:
save_dir : str | Path

directory in which to save training state
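
Putting checkpointing and resumption together, a sketch of the round trip (arguments reuse the earlier sketches):

    # First run: save a full training state every 5 epochs.
    trainer.train(
        train_loader=train_loader,
        test_loaders={'test': test_loader},
        optimizer=optimizer,
        scheduler=scheduler,
        training_loss=h1loss,
        eval_losses={'l2': l2loss},
        save_every=5,
        save_dir='./ckpt',
    )

    # Later run: restore model, optimizer, scheduler (and regularizer,
    # if any) from the same directory and continue training.
    trainer.train(
        train_loader=train_loader,
        test_loaders={'test': test_loader},
        optimizer=optimizer,
        scheduler=scheduler,
        training_loss=h1loss,
        eval_losses={'l2': l2loss},
        resume_from_dir='./ckpt',
    )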