neuralop.training.Trainer

class neuralop.training.Trainer(*, model: Module, n_epochs: int, wandb_log: bool = False, device: str = 'cpu', mixed_precision: bool = False, data_processor: Module = None, eval_interval: int = 1, log_output: bool = False, use_distributed: bool = False, verbose: bool = False)

A general Trainer class to train neural-operators on given datasets

Methods

checkpoint(save_dir)

checkpoint saves current training state to a directory for resuming later.

eval_one_batch(sample, eval_losses[, ...])

eval_one_batch runs inference on one batch and returns eval_losses for that batch.

evaluate(loss_dict, data_loader[, ...])

Evaluates the model on a dictionary of losses

log_eval(epoch, eval_metrics)

log_eval logs outputs from evaluation on all test loaders to stdout and wandb

log_training(epoch, time, avg_loss, train_err)

Basic method to log results from a single training epoch.

on_epoch_start(epoch)

on_epoch_start runs at the beginning of each training epoch.

resume_state_from_dir(save_dir)

Resume training from save_dir created by neuralop.training.save_training_state

train(train_loader, test_loaders, optimizer, ...)

Trains the given model on the given dataset.

train_one_batch(idx, sample, training_loss)

Run one batch of input through model

train_one_epoch(epoch, train_loader, ...)

train_one_epoch trains self.model on train_loader for one epoch and returns training metrics

evaluate_all

train(train_loader, test_loaders, optimizer, scheduler, regularizer=None, training_loss=None, eval_losses=None, save_every: int = None, save_best: int = None, save_dir: str | Path = './ckpt', resume_from_dir: str | Path = None)

Trains the given model on the given dataset.

If a device is provided, the model and data processor are loaded to device here.

Parameters:
train_loader: torch.utils.data.DataLoader

training dataloader

test_loaders: dict[torch.utils.data.DataLoader]

testing dataloaders

optimizer: torch.optim.Optimizer

optimizer to use during training

scheduler: torch.optim.lr_scheduler

learning rate scheduler to use during training

training_loss: training.losses function

cost function to minimize

eval_losses: dict[Loss]

dict of losses to use in self.evaluate()

save_every: int, optional, default is None

if provided, interval at which to save checkpoints

save_best: str, optional, default is None

if provided, key of the metric f"{loader_name}_{loss_name}" to monitor; the model with the best eval result on that metric is saved. Overrides save_every and saves on eval_interval.

save_dir: str | Path, default './ckpt'

directory at which to save training states if save_every and/or save_best is provided

resume_from_dir: str | Path, default None

if provided, resumes training state (model, optimizer, regularizer, scheduler) from state saved in resume_from_dir

Returns:
all_metrics: dict

dictionary keyed f"{loader_name}_{loss_name}" of metric results for the last validation epoch across all test_loaders
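A minimal usage sketch of the train call documented above, under stated assumptions. The helper names run_training and metric_key, the choice of Adam and StepLR, and every hyperparameter value are illustrative and do not come from this page. Only the Trainer and train keyword arguments are taken from the signatures above. The torch and neuralop imports are deferred into the function body so that the block can be defined without either library installed.

```python
def metric_key(loader_name: str, loss_name: str) -> str:
    """Build a metric key in the documented f"{loader_name}_{loss_name}" form."""
    return f"{loader_name}_{loss_name}"


def run_training(model, train_loader, test_loaders, training_loss, eval_losses,
                 n_epochs=10, save_dir="./ckpt"):
    """Hypothetical helper: train `model` and save the best-scoring state."""
    # Deferred imports: defining this function does not require the libraries.
    import torch
    from neuralop.training import Trainer

    # Illustrative optimizer and scheduler; any torch.optim pair could be used.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

    trainer = Trainer(model=model, n_epochs=n_epochs, device="cpu",
                      eval_interval=1, verbose=True)

    # Monitor the first eval loss on the first test loader, e.g. "test_l2".
    monitored = metric_key(next(iter(test_loaders)), next(iter(eval_losses)))

    # Returns the dict of metrics for the last validation epoch.
    return trainer.train(
        train_loader=train_loader,
        test_loaders=test_loaders,
        optimizer=optimizer,
        scheduler=scheduler,
        training_loss=training_loss,
        eval_losses=eval_losses,
        save_best=monitored,
        save_dir=save_dir,
    )
```

With test_loaders = {"test": loader} and eval_losses = {"l2": loss_fn}, the monitored key in this sketch would be "test_l2", which is also the form of the keys in the returned all_metrics dictionary.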

train_one_epoch(epoch, train_loader, training_loss)

train_one_epoch trains self.model on train_loader for one epoch and returns training metrics

Parameters:
epoch: int

epoch number

train_loader: torch.utils.data.DataLoader

data loader of train examples

test_loaders: dict

dict of test torch.utils.data.DataLoader objects

Returns:
all_errors

dict of all eval metrics for the last epoch

evaluate(loss_dict, data_loader, log_prefix='', epoch=None)

Evaluates the model on a dictionary of losses

Parameters:
loss_dict: dict of functions

each function takes as input a tuple (prediction, ground_truth) and returns the corresponding loss

data_loader: data_loader to evaluate on

log_prefix: str, default is ''

if not '', used as prefix in output dictionary

epoch: int | None, default None

current epoch; used when logging both train and eval

Returns:
errors: dict

dict[f'{log_prefix}_{loss_name}'] = loss for loss in loss_dict
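The shape of the returned dictionary can be illustrated with a plain-Python sketch that does not use neuralop or torch. The function evaluate_sketch, the predict callable, the "x" and "y" sample keys, and the per-batch averaging are all assumptions made for illustration. Only the f'{log_prefix}_{loss_name}' key format and the (prediction, ground_truth) loss inputs come from the description above.

```python
def evaluate_sketch(predict, loss_dict, data_loader, log_prefix=""):
    """Hypothetical stand-in for evaluate: average each loss over all batches."""
    totals = {name: 0.0 for name in loss_dict}
    n_batches = 0
    for sample in data_loader:
        prediction = predict(sample["x"])       # assumed input key
        ground_truth = sample["y"]              # assumed target key
        for name, loss_fn in loss_dict.items():
            totals[name] += loss_fn(prediction, ground_truth)
        n_batches += 1
    # Keys follow the documented f'{log_prefix}_{loss_name}' format.
    return {f"{log_prefix}_{name}": total / max(n_batches, 1)
            for name, total in totals.items()}


# Two scalar "batches" and an identity model, purely for illustration.
batches = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 3.0}]
losses = {"abs": lambda p, y: abs(p - y), "sq": lambda p, y: (p - y) ** 2}
errors = evaluate_sketch(lambda x: x, losses, batches, log_prefix="val")
print(errors)  # {'val_abs': 0.5, 'val_sq': 0.5}
```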

on_epoch_start(epoch)

on_epoch_start runs at the beginning of each training epoch. This method is a stub that can be overwritten in more complex cases.

Parameters:
epoch: int

index of epoch

Returns:
None
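
Because on_epoch_start is a stub, the usual way to use it is to override it in a subclass. The sketch below shows that pattern with a hypothetical stand-in class, TrainerStandIn, so that it runs without neuralop installed. The stand-in's train loop is an assumption about where the hook is called, not a copy of the library's loop. With the real class, the subclass would derive from neuralop.training.Trainer instead.

```python
class TrainerStandIn:
    """Hypothetical stand-in for Trainer; only the epoch loop and hook are modeled."""

    def __init__(self, n_epochs: int):
        self.n_epochs = n_epochs

    def on_epoch_start(self, epoch: int) -> None:
        # Stub, like the documented method: does nothing by default.
        return None

    def train(self) -> None:
        for epoch in range(self.n_epochs):
            self.on_epoch_start(epoch)  # hook runs before the epoch's batches
            # ... one epoch of training would happen here ...


class RecordingTrainer(TrainerStandIn):
    """Overrides the hook to record which epochs were started."""

    def __init__(self, n_epochs: int):
        super().__init__(n_epochs)
        self.started_epochs = []

    def on_epoch_start(self, epoch: int) -> None:
        self.started_epochs.append(epoch)


trainer = RecordingTrainer(n_epochs=3)
trainer.train()
print(trainer.started_epochs)  # [0, 1, 2]
```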

train_one_batch(idx, sample, training_loss)

Run one batch of input through the model and return the training loss on the outputs.

Parameters:
idx: int

index of batch within train_loader

sample: dict

data dictionary holding one batch

Returns:
loss: float | Tensor

float value of training loss

eval_one_batch(sample: dict, eval_losses: dict, return_output: bool = False)

eval_one_batch runs inference on one batch and returns eval_losses for that batch.

Parameters:
sample: dict

data batch dictionary

eval_losses: dict

dictionary of named eval metrics

return_output: bool, default False

whether to return model outputs for plotting

Returns:
eval_step_losses: dict

keyed "loss_name": step_loss_value for each loss name

outputs: torch.Tensor | None

optionally returns batch outputs
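The return contract can be sketched in plain Python as follows. The function eval_one_batch_sketch, the predict callable, and the "x" and "y" sample keys are assumptions for illustration. Returning a pair whose second element is None when return_output is False is one reading of the documented torch.Tensor | None return type, not something this page states outright.

```python
def eval_one_batch_sketch(predict, sample, eval_losses, return_output=False):
    """Hypothetical stand-in: compute every named eval loss for one batch."""
    output = predict(sample["x"])  # assumed input key
    eval_step_losses = {
        name: loss_fn(output, sample["y"])  # assumed target key
        for name, loss_fn in eval_losses.items()
    }
    # Hand back the model output only when the caller asked for it.
    return eval_step_losses, (output if return_output else None)


batch = {"x": 1.0, "y": 3.0}
metrics = {"abs": lambda p, y: abs(p - y)}

step_losses, outputs = eval_one_batch_sketch(lambda x: x + 1.0, batch, metrics)
print(step_losses, outputs)  # {'abs': 1.0} None
```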

log_training(epoch: int, time: float, avg_loss: float, train_err: float, avg_lasso_loss: float = None, lr: float = None)

Basic method to log results from a single training epoch.

Parameters:
epoch: int
time: float

training time of epoch

avg_loss: float

average train_err per individual sample

train_err: float

train error for entire epoch

avg_lasso_loss: float

average lasso loss from regularizer, optional

lr: float

learning rate at current epoch

log_eval(epoch: int, eval_metrics: dict)

log_eval logs outputs from evaluation on all test loaders to stdout and wandb

Parameters:
epoch: int

current training epoch

eval_metrics: dict

metrics collected during evaluation keyed f"{test_loader_name}_{metric}" for each test_loader

resume_state_from_dir(save_dir)

Resume training from save_dir created by neuralop.training.save_training_state
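Resuming can also be requested through the resume_from_dir parameter of train documented above, which presumably reaches this method internally. The sketch below is a hypothetical wrapper, resume_training, that passes resume_from_dir only when the checkpoint directory already holds files. That emptiness check is an assumption of this sketch, not library behavior. The wrapper accepts any object exposing the documented train signature.

```python
from pathlib import Path


def resume_training(trainer, train_loader, test_loaders, optimizer, scheduler,
                    training_loss, eval_losses, ckpt_dir="./ckpt", save_every=1):
    """Hypothetical wrapper: continue from ckpt_dir if it already holds state."""
    ckpt_dir = Path(ckpt_dir)
    # Assumption: a non-empty directory means an earlier run saved state there.
    has_state = ckpt_dir.is_dir() and any(ckpt_dir.iterdir())
    return trainer.train(
        train_loader,
        test_loaders,
        optimizer,
        scheduler,
        training_loss=training_loss,
        eval_losses=eval_losses,
        save_every=save_every,      # keep writing checkpoints while training
        save_dir=ckpt_dir,
        resume_from_dir=ckpt_dir if has_state else None,
    )
```

On a first run the directory is missing or empty, so train starts from scratch and writes checkpoints to ckpt_dir. On later runs the same call would pick the saved state up.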

checkpoint(save_dir)

checkpoint saves current training state to a directory for resuming later. Only saves training state on the first GPU. See neuralop.training.training_state

Parameters:
save_dir: str | Path

directory in which to save training state