Now introducing the BOLT (Big Ol’ Layer Training) Engine: Orders of magnitude faster neural network training on any CPU (Intel, AMD, ARM).
On desktops, the BOLT Engine can be an order of magnitude faster than GPU acceleration. Training a 1.6-billion-parameter model on refurbished, older CPUs can be 2x faster than on an A100.
As explained in our other article, larger models need larger batch sizes for both speed and generalization. GPUs, however, have limited memory and cannot scale to the needs of a billion-parameter network.
We observe that a top-of-the-line A100 GPU with 40 GB of memory can barely accommodate a batch size of 256 for our 1.6B-parameter network. To run batch sizes of 2048, we need to distribute the model and training across four A100 GPUs (on a machine that costs around $70,000). Even two A100s cannot accommodate the training.
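A rough back-of-envelope estimate shows why memory, not compute, is the bottleneck. The byte counts below are standard fp32/Adam assumptions for illustration, not measured numbers from our training runs:

```python
# Rough memory estimate for training a 1.6B-parameter network in fp32.
# All figures are back-of-envelope assumptions, not measurements.
params = 1.6e9
bytes_per_float = 4

weights_gb = params * bytes_per_float / 1e9   # model weights:    ~6.4 GB
gradients_gb = weights_gb                     # one gradient per weight
adam_states_gb = 2 * weights_gb               # Adam: momentum + variance

static_gb = weights_gb + gradients_gb + adam_states_gb
print(f"static memory before activations: {static_gb:.1f} GB")  # ~25.6 GB
# Activation memory then grows linearly with batch size, so only the
# remaining headroom on the GPU is available for the batch itself.
```

With roughly 25 GB consumed before a single activation is stored, the leftover GPU memory is what caps the batch size, which is consistent with a single GPU barely fitting a batch of 256.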
How to Incorporate the BOLT Engine in Your Pipeline?
We package the BOLT Engine in a simple Python API that lets the user specify the network structure easily. The network also exposes simple methods for retrieving the trained parameters of each layer as NumPy arrays, so the trained model can be ported to TensorFlow or PyTorch; we provide helper functions that perform this conversion automatically. We also offer drop-in functionality: a user can specify a TensorFlow model and pass it to our script, which trains it with the BOLT Engine for acceleration and returns it, all with a two-line code change.
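To illustrate the porting path, here is a minimal sketch of what working with exported layer parameters looks like. The array names and shapes below are assumptions for illustration, not the actual BOLT API; the exported arrays would come from the retrieval methods described above:

```python
import numpy as np

# Stand-ins for one trained dense layer's exported parameters
# (hypothetical shapes; in practice these come from the BOLT API).
rng = np.random.default_rng(0)
weights = rng.standard_normal((128, 784)).astype(np.float32)  # (out, in)
biases = np.zeros(128, dtype=np.float32)                      # (out,)

# Porting to another framework is just copying these arrays into the
# matching layer, e.g. for PyTorch's nn.Linear(784, 128):
#   linear.weight.data = torch.from_numpy(weights)
#   linear.bias.data   = torch.from_numpy(biases)

# Verify the forward pass with plain NumPy (ReLU dense layer):
x = rng.standard_normal((32, 784)).astype(np.float32)
out = np.maximum(x @ weights.T + biases, 0.0)
print(out.shape)  # (32, 128)
```

Because the parameters are plain NumPy arrays, the same arrays drop into TensorFlow via `layer.set_weights`, which is what makes the automated conversion functions straightforward.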