neuralop.training.Trainer
- class neuralop.training.Trainer(*, model: Module, n_epochs: int, wandb_log: bool = False, device: str = 'cpu', mixed_precision: bool = False, data_processor: Module = None, eval_interval: int = 1, log_output: bool = False, use_distributed: bool = False, verbose: bool = False)[source]
A general Trainer class to train neural-operators on given datasets
Methods
checkpoint(save_dir): checkpoint saves current training state to a directory for resuming later.
eval_one_batch(sample, eval_losses[, ...]): eval_one_batch runs inference on one batch and returns eval_losses for that batch.
evaluate(loss_dict, data_loader[, ...]): Evaluates the model on a dictionary of losses.
log_eval(epoch, eval_metrics): log_eval logs outputs from evaluation on all test loaders to stdout and wandb.
log_training(epoch, time, avg_loss, train_err): Basic method to log results from a single training epoch.
on_epoch_start(epoch): on_epoch_start runs at the beginning of each training epoch.
resume_state_from_dir(save_dir): Resume training from save_dir created by neuralop.training.save_training_state.
train(train_loader, test_loaders, optimizer, ...): Trains the given model on the given dataset.
train_one_batch(idx, sample, training_loss): Run one batch of input through the model.
train_one_epoch(epoch, train_loader, ...): train_one_epoch trains self.model on train_loader for one epoch and returns training metrics.
evaluate_all
- train(train_loader, test_loaders, optimizer, scheduler, regularizer=None, training_loss=None, eval_losses=None, save_every: int = None, save_best: int = None, save_dir: str | Path = './ckpt', resume_from_dir: str | Path = None)[source]
Trains the given model on the given dataset.
If a device is provided, the model and data processor are loaded to device here.
- Parameters:
- train_loader: torch.utils.data.DataLoader
training dataloader
- test_loaders: dict[torch.utils.data.DataLoader]
testing dataloaders
- optimizer: torch.optim.Optimizer
optimizer to use during training
- scheduler: torch.optim.lr_scheduler
learning rate scheduler to use during training
- training_loss: training.losses function
cost function to minimize
- eval_losses: dict[Loss]
dict of losses to use in self.evaluate()
- save_every: int, optional, default is None
if provided, interval at which to save checkpoints
- save_best: str, optional, default is None
if provided, key of the metric f"{loader_name}_{loss_name}" to monitor; the model with the best eval result is saved. Overrides save_every and saves on eval_interval.
- save_dir: str | Path, default "./ckpt"
directory at which to save training states if save_every and/or save_best is provided
- resume_from_dir: str | Path, default None
if provided, resumes training state (model, optimizer, regularizer, scheduler) from state saved in resume_from_dir
- Returns:
- all_metrics: dict
dictionary keyed f"{loader_name}_{loss_name}" of metric results for the last validation epoch across all test_loaders
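The key convention shared by save_best and all_metrics can be illustrated without the library. The following is a minimal sketch, not neuralop's implementation: the loader names, loss names, and metric values are invented for illustration, and it assumes that a lower metric value counts as better.

```python
# Hypothetical per-loader results (names and values are made up).
per_loader_results = {
    "test_16": {"l2": 0.12, "h1": 0.30},
    "test_32": {"l2": 0.15, "h1": 0.35},
}


def flatten_metrics(results):
    """Flatten {loader_name: {loss_name: value}} into f"{loader_name}_{loss_name}" keys."""
    return {
        f"{loader_name}_{loss_name}": value
        for loader_name, losses in results.items()
        for loss_name, value in losses.items()
    }


def is_new_best(metrics, key, best_so_far):
    """Return True when the monitored metric improved (assumes lower is better)."""
    return metrics[key] < best_so_far


all_metrics = flatten_metrics(per_loader_results)

# A save_best value has to match one of the flattened keys.
save_best = "test_16_l2"
print(save_best in all_metrics)
```

In this sketch, a save_best value such as "test_16_l2" combines a key of the test_loaders dict with a key of the eval_losses dict.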
- train_one_epoch(epoch, train_loader, training_loss)[source]
train_one_epoch trains self.model on train_loader for one epoch and returns training metrics
- Parameters:
- epoch: int
epoch number
- train_loader: torch.utils.data.DataLoader
data loader of train examples
- test_loaders: dict
dict of test torch.utils.data.DataLoader objects
- Returns:
- all_errors
dict of all eval metrics for the last epoch
- evaluate(loss_dict, data_loader, log_prefix='', epoch=None)[source]
Evaluates the model on a dictionary of losses
- Parameters:
- loss_dict: dict of functions
each function takes as input a tuple (prediction, ground_truth) and returns the corresponding loss
- data_loader: data loader to evaluate on
- log_prefix: str, default is ''
if not '', used as prefix in output dictionary
- epoch: int | None
current epoch. Used when logging both train and eval; default None
- Returns:
- errors: dict
dict[f'{log_prefix}_{loss_name}'] = loss for loss in loss_dict
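The shape of the returned dictionary can be sketched in plain Python. This is an illustrative stand-in, not the library's code: evaluate_sketch, the toy model, the toy losses, and the list used as a data loader are all hypothetical. Averaging per batch and dropping the underscore when log_prefix is empty are assumptions.

```python
def evaluate_sketch(model, loss_dict, data_loader, log_prefix=""):
    """Average every loss in loss_dict over data_loader (illustrative only)."""
    totals = {name: 0.0 for name in loss_dict}
    n_batches = 0
    for x, y in data_loader:
        prediction = model(x)
        for name, loss_fn in loss_dict.items():
            totals[name] += loss_fn(prediction, y)
        n_batches += 1
    errors = {}
    for name, total in totals.items():
        # Assumption: with an empty prefix, the bare loss name is used as the key.
        key = f"{log_prefix}_{name}" if log_prefix else name
        errors[key] = total / n_batches
    return errors


# Toy stand-ins: a "model" that doubles its input and two scalar losses.
toy_model = lambda x: 2 * x
toy_losses = {
    "abs": lambda pred, truth: abs(pred - truth),
    "sq": lambda pred, truth: (pred - truth) ** 2,
}
toy_loader = [(1.0, 2.0), (2.0, 4.5)]

errors = evaluate_sketch(toy_model, toy_losses, toy_loader, log_prefix="val")
print(errors)  # {'val_abs': 0.25, 'val_sq': 0.125}
```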
- on_epoch_start(epoch)[source]
on_epoch_start runs at the beginning of each training epoch. This method is a stub that can be overwritten in more complex cases.
- Parameters:
- epoch: int
index of epoch
- Returns:
- None
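Because on_epoch_start is a stub, the usual way to use it is to override it in a subclass. The pattern is sketched below with a hypothetical stand-in base class (BaseLoop) rather than the real Trainer, so that the example runs without the library; with neuralop installed, the same override would presumably go on a subclass of Trainer.

```python
class BaseLoop:
    """Hypothetical stand-in for a trainer that exposes an on_epoch_start hook."""

    def __init__(self, n_epochs):
        self.n_epochs = n_epochs

    def on_epoch_start(self, epoch):
        # Stub: does nothing by default and returns None.
        return None

    def train(self):
        for epoch in range(self.n_epochs):
            self.on_epoch_start(epoch)
            # One epoch of training would run here.


class RecordingLoop(BaseLoop):
    """Overrides the hook to record which epochs were started."""

    def __init__(self, n_epochs):
        super().__init__(n_epochs)
        self.started = []

    def on_epoch_start(self, epoch):
        self.started.append(epoch)
        return None


loop = RecordingLoop(n_epochs=3)
loop.train()
print(loop.started)  # [0, 1, 2]
```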
- train_one_batch(idx, sample, training_loss)[source]
Run one batch of input through the model and return the training loss on the outputs.
- Parameters:
- idx: int
index of batch within train_loader
- sample: dict
data dictionary holding one batch
data dictionary holding one batch
- Returns:
- loss: float | Tensor
float value of training loss
- eval_one_batch(sample: dict, eval_losses: dict, return_output: bool = False)[source]
eval_one_batch runs inference on one batch and returns eval_losses for that batch.
- Parameters:
- sample: dict
data batch dictionary
- eval_losses: dict
dictionary of named eval metrics
- return_output: bool
whether to return model outputs for plotting; default False
- Returns:
- eval_step_losses: dict
keyed "loss_name": step_loss_value for each loss name
- outputs: torch.Tensor | None
optionally returns batch outputs
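The two-part return value can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's implementation: eval_one_batch_sketch is hypothetical, the sample keys "x" and "y" are assumed names, and scalars stand in for tensors.

```python
def eval_one_batch_sketch(model, sample, eval_losses, return_output=False):
    """Compute every named loss for one batch; optionally return the outputs too."""
    # Assumption: the batch dictionary stores inputs under "x" and targets under "y".
    output = model(sample["x"])
    eval_step_losses = {
        name: loss_fn(output, sample["y"]) for name, loss_fn in eval_losses.items()
    }
    # The outputs slot is None unless the caller asked for the outputs.
    return eval_step_losses, (output if return_output else None)


toy_model = lambda x: x + 1.0
toy_sample = {"x": 1.0, "y": 2.5}
toy_eval_losses = {"abs": lambda pred, truth: abs(pred - truth)}

step_losses, outputs = eval_one_batch_sketch(
    toy_model, toy_sample, toy_eval_losses, return_output=True
)
print(step_losses, outputs)  # {'abs': 0.5} 2.0
```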
- log_training(epoch: int, time: float, avg_loss: float, train_err: float, avg_lasso_loss: float = None, lr: float = None)[source]
Basic method to log results from a single training epoch.
- Parameters:
- epoch: int
- time: float
training time of epoch
- avg_loss: float
average train_err per individual sample
- train_err: float
train error for entire epoch
- avg_lasso_loss: float
average lasso loss from regularizer, optional
- lr: float
learning rate at current epoch
- log_eval(epoch: int, eval_metrics: dict)[source]
log_eval logs outputs from evaluation on all test loaders to stdout and wandb
- Parameters:
- epoch: int
current training epoch
- eval_metrics: dict
metrics collected during evaluation keyed f"{test_loader_name}_{metric}" for each test_loader
- resume_state_from_dir(save_dir)[source]
Resume training from save_dir created by neuralop.training.save_training_state
- checkpoint(save_dir)[source]
checkpoint saves current training state to a directory for resuming later. Only saves training state on the first GPU. See neuralop.training.training_state
- Parameters:
- save_dir: str | Path
directory in which to save training state
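Saving only on the first GPU is a common guard in distributed training, and the idea can be sketched with the standard library alone. Everything here is hypothetical: checkpoint_sketch, the rank argument, the JSON format, and the file name training_state.json are illustrative choices, not what neuralop writes to disk.

```python
import json
import tempfile
from pathlib import Path


def checkpoint_sketch(state, save_dir, rank=0):
    """Write state to save_dir, but only from the first process (rank 0)."""
    if rank != 0:
        # Other ranks skip saving so that the file is written only once.
        return None
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    path = save_dir / "training_state.json"  # hypothetical file name
    path.write_text(json.dumps(state))
    return path


with tempfile.TemporaryDirectory() as tmp:
    skipped = checkpoint_sketch({"epoch": 3}, tmp, rank=1)
    saved = checkpoint_sketch({"epoch": 3}, tmp, rank=0)
    restored = json.loads(saved.read_text())

print(skipped, restored)  # None {'epoch': 3}
```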