Training neural operator models

Our library makes it easy for anyone with data drawn from a system governed by a partial differential equation (PDE) to train and test neural operator (NO) models. This page describes the library’s Python interface for training and evaluating NOs.

The Trainer class

Most users train neural operator models on their own data in much the same way, using a standard machine learning training loop. To speed this up, we provide a Trainer class that automates much of the boilerplate: moving the model to the device, zeroing gradients, and computing most loss functions are all taken care of.
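For orientation, the kind of boilerplate loop described above can be sketched in plain PyTorch. This is a minimal illustration, not the Trainer's actual implementation: the function name train_one_epoch, the toy linear model, and the synthetic data are all assumptions made for the example.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, loss_fn, device):
    """Run one pass over `loader` and return the mean per-batch loss.

    Hypothetical helper for illustration; it is not part of the library.
    """
    model.to(device)  # move parameters to the target device
    model.train()
    total, n_batches = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()        # clear gradients from the previous step
        loss = loss_fn(model(x), y)  # forward pass and loss
        loss.backward()              # backpropagate
        optimizer.step()             # update parameters
        total += loss.item()
        n_batches += 1
    return total / max(n_batches, 1)


# Toy stand-ins for a real neural operator and PDE dataset (assumptions).
torch.manual_seed(0)
xs = torch.randn(64, 8)
ys = xs.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(xs, ys), batch_size=16)
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

losses = [
    train_one_epoch(model, loader, optimizer, nn.MSELoss(), torch.device("cpu"))
    for _ in range(20)
]
```

Each of the commented steps in the loop body is the sort of logic the Trainer is described as handling for you.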

The Trainer implements training in a modular fashion, so more domain-specific logic can easily be added. For detailed documentation, see the API reference.

Distributed Training

We also provide a simple way to use PyTorch’s DistributedDataParallel functionality to distribute training across multiple GPUs. Launching goes through PyTorch’s torchrun elastic launcher, so on a multi-GPU system all you need to run is the following:

torchrun --standalone --nproc_per_node <NUM_GPUS> script.py

You may need to adjust the batch size, model parallel size and world size in accordance with your specific use case. See the torchrun documentation for more details.
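As a rough guide to what a script launched this way has to do, here is a minimal sketch of a script.py that works both under torchrun and as a plain single-process run. It uses only standard PyTorch calls and does not show how the library itself wires up distributed training; the setup and main functions, the placeholder linear model, and the choice to split a global batch size evenly across processes are assumptions made for the example.

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def setup():
    """Return (local_rank, world_size), initializing the process group under torchrun."""
    if "RANK" not in os.environ:  # plain `python script.py`: single process
        return 0, 1
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for us.
    dist.init_process_group(backend=backend)
    return int(os.environ["LOCAL_RANK"]), dist.get_world_size()


def main(global_batch_size=32):
    local_rank, world_size = setup()
    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    model = nn.Linear(8, 1).to(device)  # placeholder for a neural operator
    if dist.is_initialized():
        model = DDP(model, device_ids=[local_rank] if use_cuda else None)

    # Assumed convention: each process handles an equal share of the global batch.
    per_process_batch = global_batch_size // world_size
    x = torch.randn(per_process_batch, 8, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()  # under DDP, gradients are averaged across processes here

    if dist.is_initialized():
        dist.destroy_process_group()
    return per_process_batch


if __name__ == "__main__":
    main()
```

Dividing the global batch size by the world size keeps the effective batch size constant as GPUs are added; whether that is the right convention depends on your use case.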