neuralop.training.Trainer

class neuralop.training.Trainer(*, model: Module, n_epochs: int, wandb_log: bool = False, device: str = 'cpu', mixed_precision: bool = False, data_processor: Module = None, eval_interval: int = 1, log_output: bool = False, use_distributed: bool = False, verbose: bool = False)[source]

A general Trainer class to train neural operators on given datasets.

Note

Our Trainer expects datasets to provide batches as key-value dictionaries, e.g. {'x': x, 'y': y}, keyed to the arguments expected by models and losses. For specifics and an example, see neuralop.data.datasets.DarcyDataset.
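
A minimal sketch of such a dataset (the class name DictBatchDataset, shapes, and sizes are illustrative, not part of the library):

    import torch
    from torch.utils.data import Dataset, DataLoader

    class DictBatchDataset(Dataset):
        # Illustrative dataset: each sample is a dict keyed to the
        # model input 'x' and target 'y', as the Trainer expects.
        def __init__(self, n_samples=100, resolution=16):
            self.x = torch.randn(n_samples, 1, resolution, resolution)
            self.y = torch.randn(n_samples, 1, resolution, resolution)

        def __len__(self):
            return len(self.x)

        def __getitem__(self, idx):
            return {'x': self.x[idx], 'y': self.y[idx]}

    train_loader = DataLoader(DictBatchDataset(), batch_size=8)
    test_loader = DataLoader(DictBatchDataset(n_samples=20), batch_size=8)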

Parameters:
model : nn.Module
n_epochs : int
wandb_log : bool, default is False

whether to log results to wandb

device : torch.device, or str 'cpu' or 'cuda'
mixed_precision : bool, default is False

whether to use torch.autocast to compute mixed precision

data_processor : DataProcessor class to transform data, default is None

if not None, data from the loaders is transformed first with data_processor.preprocess; then, after getting an output from the model, that output is transformed with data_processor.postprocess.

eval_interval : int, default is 1

how frequently to evaluate the model and log training stats

log_output : bool, default is False

if True, and if wandb_log is also True, log output images to wandb

use_distributed : bool, default is False

whether to use DDP

verbose : bool, default is False
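
A minimal construction sketch, assuming an FNO model from neuralop.models (the hyperparameters are placeholders):

    from neuralop.models import FNO
    from neuralop.training import Trainer

    # Small 2D FNO; any nn.Module with a matching forward signature works.
    model = FNO(n_modes=(16, 16), hidden_channels=32,
                in_channels=1, out_channels=1)

    trainer = Trainer(
        model=model,
        n_epochs=20,
        device='cuda',        # or 'cpu'
        eval_interval=1,
        wandb_log=False,
        verbose=True,
    )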

Methods

checkpoint(save_dir)

checkpoint saves current training state to a directory for resuming later.

eval_one_batch(sample, eval_losses[, ...])

eval_one_batch runs inference on one batch and returns eval_losses for that batch.

eval_one_batch_autoreg(sample, eval_losses)

eval_one_batch_autoreg autoregressively runs inference on one batch and returns eval_losses for that batch.

evaluate(loss_dict, data_loader[, ...])

Evaluates the model on a dictionary of losses

evaluate_all(epoch, eval_losses, ...[, ...])

evaluate_all iterates through the entire dict of test_loaders to perform evaluation on the whole dataset stored in each one.

log_eval(epoch, eval_metrics)

log_eval logs outputs from evaluation on all test loaders to stdout and wandb

log_training(epoch, time, avg_loss, train_err)

Basic method to log results from a single training epoch.

on_epoch_start(epoch)

on_epoch_start runs at the beginning of each training epoch.

resume_state_from_dir(save_dir)

Resume training from save_dir created by neuralop.training.save_training_state

train(train_loader, test_loaders, optimizer, ...)

Trains the given model on the given dataset.

train_one_batch(idx, sample, training_loss)

Run one batch of input through model

train_one_epoch(epoch, train_loader, ...)

train_one_epoch trains self.model on train_loader for one epoch and returns training metrics

train(train_loader, test_loaders, optimizer, scheduler, regularizer=None, training_loss=None, eval_losses=None, eval_modes=None, save_every: int = None, save_best: str = None, save_dir: str | Path = './ckpt', resume_from_dir: str | Path = None, max_autoregressive_steps: int = None)[source]

Trains the given model on the given dataset.

If a device is provided, the model and data processor are loaded to device here.

Parameters:
train_loader: torch.utils.data.DataLoader

training dataloader

test_loaders: dict[torch.utils.data.DataLoader]

testing dataloaders

optimizer: torch.optim.Optimizer

optimizer to use during training

scheduler: torch.optim.lr_scheduler

learning rate scheduler to use during training

training_loss: training.losses function

cost function to minimize

eval_losses: dict[Loss]

dict of losses to use in self.eval()

eval_modes: dict[str], optional

optional mapping from the name of each loader to its evaluation mode:

  • if 'single_step', predicts one input-output pair and evaluates the loss.

  • if 'autoregressive', autoregressively predicts the output, using the last step's output as input, for a number of steps defined by the temporal dimension of the batch. This requires specially batched data with a data processor whose .preprocess and .postprocess both take idx as an argument.

save_every: int, optional, default is None

if provided, interval at which to save checkpoints

save_best: str, optional, default is None

if provided, key of the metric f"{loader_name}_{loss_name}" to monitor; the model with the best eval result is saved. Overrides save_every and saves on eval_interval.

save_dir: str | Path, default “./ckpt”

directory at which to save training states if save_every and/or save_best is provided

resume_from_dir: str | Path, default None

if provided, resumes training state (model, optimizer, regularizer, scheduler) from state saved in resume_from_dir

max_autoregressive_steps: int, default None

if provided, and a dataloader is to be evaluated in autoregressive mode, limits the number of autoregressive steps performed in each rollout.

Returns:
all_metrics: dict

dictionary keyed f”{loader_name}_{loss_name}” of metric results for last validation epoch across all test_loaders
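
A sketch of a typical call, reusing the trainer and loader sketches above; 'test' is a hypothetical loader name, and the loss classes are assumed importable from neuralop.losses:

    import torch
    from neuralop.losses import LpLoss, H1Loss

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

    l2loss = LpLoss(d=2, p=2)
    h1loss = H1Loss(d=2)

    all_metrics = trainer.train(
        train_loader=train_loader,
        test_loaders={'test': test_loader},   # 'test' is a hypothetical name
        optimizer=optimizer,
        scheduler=scheduler,
        training_loss=h1loss,
        eval_losses={'l2': l2loss, 'h1': h1loss},
        save_best='test_l2',   # monitors the f"{loader_name}_{loss_name}" key
        save_dir='./ckpt',
    )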

train_one_epoch(epoch, train_loader, training_loss)[source]

train_one_epoch trains self.model on train_loader for one epoch and returns training metrics

Parameters:
epoch : int

epoch number

train_loader : torch.utils.data.DataLoader

data loader of train examples

training_loss : training.losses function

cost function to minimize

Returns:
training metrics for the epoch

evaluate_all(epoch, eval_losses, test_loaders, eval_modes, max_autoregressive_steps=None)[source]

evaluate_all iterates through the entire dict of test_loaders to perform evaluation on the whole dataset stored in each one.

Parameters:
epoch : int

current training epoch

eval_losses : dict[Loss]

keyed loss_name: loss_obj for each pair. Full set of losses to use in evaluation for each test loader.

test_loaders : dict[DataLoader]

keyed loader_name: loader for each test loader.

eval_modes : dict[str], optional

keyed loader_name: eval_mode for each test loader. If eval_modes.get(loader_name) does not return a value, the evaluation is automatically performed in single_step mode.

max_autoregressive_steps : int, optional

if provided, and one of the test loaders has eval_mode == "autoregressive", limits the number of autoregressive steps performed per rollout.

Returns:
all_metrics: dict

collected eval metrics for each loader.
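
For example, with hypothetical loaders named 'val' and 'rollout', a mapping like the following evaluates 'rollout' autoregressively, while 'val', absent from the dict, falls back to single_step (l2loss is reused from the sketch above):

    eval_modes = {'rollout': 'autoregressive'}   # 'val' omitted -> single_step

    metrics = trainer.evaluate_all(
        epoch=0,
        eval_losses={'l2': l2loss},
        test_loaders={'val': val_loader, 'rollout': rollout_loader},
        eval_modes=eval_modes,
        max_autoregressive_steps=10,
    )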

evaluate(loss_dict, data_loader, log_prefix='', epoch=None, mode='single_step', max_steps=None)[source]

Evaluates the model on a dictionary of losses

Parameters:
loss_dict : dict of functions

each function takes as input a tuple (prediction, ground_truth) and returns the corresponding loss

data_loader : torch.utils.data.DataLoader

data loader to evaluate on

log_prefix : str, default is ''

if not '', used as prefix in the output dictionary

epoch : int | None, default is None

current epoch. Used when logging both train and eval.

mode : Literal {'single_step', 'autoregression'}

if 'single_step', performs standard evaluation; if 'autoregression', loops through max_steps steps

max_steps : int, optional

max number of steps for autoregressive rollout. If None, runs the full rollout.

Returns:
errors : dict

dict[f'{log_prefix}_{loss_name}'] = loss for loss in loss_dict
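
Called directly, a sketch looks like this (reusing l2loss and test_loader from the sketches above):

    errors = trainer.evaluate(
        loss_dict={'l2': l2loss},
        data_loader=test_loader,
        log_prefix='test',
    )
    print(errors['test_l2'])   # keyed f'{log_prefix}_{loss_name}'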

on_epoch_start(epoch)[source]

on_epoch_start runs at the beginning of each training epoch. This method is a stub that can be overwritten in more complex cases.

Parameters:
epoch : int

index of epoch

Returns:
None
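
Because on_epoch_start is a stub, per-epoch hooks go in a subclass; a minimal sketch (the subclass name and behavior are hypothetical):

    from neuralop.training import Trainer

    class VerboseTrainer(Trainer):
        # Hypothetical subclass: print a marker at each epoch boundary.
        def on_epoch_start(self, epoch):
            print(f'starting epoch {epoch}')
            return None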
train_one_batch(idx, sample, training_loss)[source]

Run one batch of input through the model and return the training loss on its outputs.

Parameters:
idx : int

index of batch within train_loader

sample : dict

data dictionary holding one batch

Returns:
loss: float | Tensor

value of the training loss for the batch

eval_one_batch(sample: dict, eval_losses: dict, return_output: bool = False)[source]

eval_one_batch runs inference on one batch and returns eval_losses for that batch.

Parameters:
sample : dict

data batch dictionary

eval_losses : dict

dictionary of named eval metrics

return_output : bool

whether to return model outputs for plotting, by default False

Returns:
eval_step_losses : dict

keyed "loss_name": step_loss_value for each loss name

outputs : torch.Tensor | None

optionally returns batch outputs

eval_one_batch_autoreg(sample: dict, eval_losses: dict, return_output: bool = False, max_steps: int = None)[source]

eval_one_batch_autoreg autoregressively runs inference on one batch and returns eval_losses for that batch.

Parameters:
sample : dict

data batch dictionary

eval_losses : dict

dictionary of named eval metrics

return_output : bool

whether to return model outputs for plotting, by default False

max_steps : int

number of timesteps to roll out, typically the full trajectory length. If max_steps is None, runs the full rollout.

Note

If a value for max_steps is not provided, a data_processor must be provided to handle rollout logic.

Returns:
eval_step_losses : dict

keyed "loss_name": step_loss_value for each loss name

outputs : torch.Tensor | None

optionally returns batch outputs

log_training(epoch: int, time: float, avg_loss: float, train_err: float, avg_lasso_loss: float = None, lr: float = None)[source]

Basic method to log results from a single training epoch.

Parameters:
epoch: int
time: float

training time of epoch

avg_loss: float

average training loss per individual sample

train_err: float

train error for entire epoch

avg_lasso_loss: float

average lasso loss from regularizer, optional

lr: float

learning rate at current epoch

log_eval(epoch: int, eval_metrics: dict)[source]

log_eval logs outputs from evaluation on all test loaders to stdout and wandb

Parameters:
epoch : int

current training epoch

eval_metrics : dict

metrics collected during evaluation, keyed f"{test_loader_name}_{metric}" for each test_loader

resume_state_from_dir(save_dir)[source]

Resume training from save_dir created by neuralop.training.save_training_state

checkpoint(save_dir)[source]

checkpoint saves current training state to a directory for resuming later. Only saves training state on the first GPU. See neuralop.training.training_state

Parameters:
save_dir : str | Path

directory in which to save training state
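
Putting checkpointing and resumption together, a sketch of the round trip (arguments reuse the earlier sketches):

    # First run: save a full training state every 5 epochs.
    trainer.train(
        train_loader=train_loader,
        test_loaders={'test': test_loader},
        optimizer=optimizer,
        scheduler=scheduler,
        training_loss=h1loss,
        eval_losses={'l2': l2loss},
        save_every=5,
        save_dir='./ckpt',
    )

    # Later run: restore model, optimizer, scheduler (and regularizer,
    # if any) from the same directory and continue training.
    trainer.train(
        train_loader=train_loader,
        test_loaders={'test': test_loader},
        optimizer=optimizer,
        scheduler=scheduler,
        training_loss=h1loss,
        eval_losses={'l2': l2loss},
        resume_from_dir='./ckpt',
    )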