Training neural operator models
Our library makes it easy for anyone with data drawn from a system governed by a PDE to train and test neural operator models. This page details the library’s Python interface for training and evaluating neural operators.
The Trainer class
Most users will train neural operator models on their own data in very similar ways, using a standard machine learning training loop. To speed up this process, we provide a Trainer class that automates much of this boilerplate logic: steps such as moving the model and data to the device, zeroing gradients, and computing most loss functions are handled for you. The Trainer is implemented in a modular fashion, so more domain-specific logic can easily be added on top. For full details, check the API reference.
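As a rough sketch, a typical training run looks like the following. The exact class names and keyword arguments shown here (FNO, H1Loss, LpLoss, and the arguments to Trainer and Trainer.train) reflect recent versions of the library and may differ in yours, so treat this as illustrative and consult the API reference for the authoritative signatures; train_loader and test_loaders stand in for PyTorch DataLoaders built from your own PDE data.

```python
# Minimal sketch of a Trainer-based training run. Names and arguments may
# vary between library versions; see the API reference for exact signatures.
import torch
from neuralop.models import FNO
from neuralop import Trainer
from neuralop.losses import LpLoss, H1Loss

device = "cuda" if torch.cuda.is_available() else "cpu"

# A small Fourier Neural Operator; any model from the library works the same way.
model = FNO(n_modes=(16, 16), hidden_channels=32,
            in_channels=1, out_channels=1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

train_loss = H1Loss(d=2)                                    # loss used for backprop
eval_losses = {"h1": H1Loss(d=2), "l2": LpLoss(d=2, p=2)}   # losses reported at eval time

# train_loader / test_loaders: DataLoaders built from your own data
# (test_loaders is typically a dict, e.g. keyed by evaluation resolution).
trainer = Trainer(model=model, n_epochs=30, device=device, verbose=True)
trainer.train(train_loader=train_loader,
              test_loaders=test_loaders,
              optimizer=optimizer,
              scheduler=scheduler,
              regularizer=False,
              training_loss=train_loss,
              eval_losses=eval_losses)
```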
Distributed Training
We also provide a simple way to use PyTorch’s DistributedDataParallel functionality to distribute training across multiple GPUs. We use PyTorch’s torchrun elastic launcher, so all you need to do on a multi-GPU system is the following:
```
torchrun --standalone --nproc_per_node <NUM_GPUS> script.py
```
You may need to adjust the batch size, model-parallel size, and world size to match your specific use case. See the torchrun documentation for more details.
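For orientation, the script launched this way is an ordinary PyTorch DistributedDataParallel program. The sketch below is generic PyTorch boilerplate rather than library code: the linear model and random tensors are toy placeholders, and in practice you would drop in your neural operator model and dataset (or the Trainer shown above) at the marked spots.

```python
# Generic per-process setup that torchrun expects inside script.py.
# Standard PyTorch DDP boilerplate; the model and data are toy placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for every process it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Toy model and data; replace with your neural operator and PDE dataset.
    model = torch.nn.Linear(64, 64).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(512, 64), torch.randn(512, 64))

    # DistributedSampler gives each rank a distinct shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()   # DDP averages gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```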