Exploring the Possibilities of Transformer Models with Optical Neural Networks

What happens if you run a transformer model with an optical neural network?

The exponentially increasing scale of deep-learning models is both a force for advancing the state of the art and a growing source of concern over energy consumption, speed and, therefore, the feasibility of massive-scale learning. Researchers from Cornell recently discussed Transformer architectures and how their performance improves dramatically when they are scaled to billions or even trillions of parameters. This scaling has led to an exponential increase in the compute used for deep learning. Large-scale Transformers have become a popular but expensive solution to many tasks, because the energy efficiency of digital hardware has not kept pace with the increasing FLOP requirements of cutting-edge deep-learning models. These large-scale Transformers also perform impressively in many other domains, such as graphs and multimodal settings.

They also exhibit transfer learning abilities, which allow them to generalize quickly to certain tasks, sometimes in zero-shot settings without any additional training. The cost of these models and their general machine-learning abilities are the main driving forces behind the development of hardware accelerators for fast, efficient inference. Digital-electronic hardware for deep learning, such as GPUs, FPGAs and mobile accelerator chips, has been developed extensively. Analog computing is also gaining attention as an alternative, and optical neural networks in particular have been proposed as offering better efficiency and lower latency than their digital counterparts.

Although these analog systems are susceptible to noise and error, optical neural networks can often perform the underlying operations at a much lower cost. The main cost is typically the electrical overhead of loading the weights and the data, which is amortized over large linear operations. This makes the acceleration of large-scale models, such as Transformers, particularly promising: in theory, the energy per multiply-accumulate (MAC) operation scales more favorably than in digital systems. The researchers show how Transformers take advantage of this scaling. They ran a sample of operations from a Transformer language model on an experimental system based on spatial light modulators, then used the results to build a calibrated simulation of a Transformer running optically. The aim was to demonstrate that Transformers can run on these systems despite their noise and error characteristics.
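To make the amortization argument concrete, here is a minimal toy model in Python. It is a sketch, not the researchers' energy model: the function name `energy_per_mac`, the constants `e_io` and `e_opt`, and the assumption that a d-by-d weight matrix is loaded once and reused across a batch of input vectors are all hypothetical choices made for illustration, in arbitrary energy units.

```python
def energy_per_mac(d, batch, e_io=1.0, e_opt=0.001):
    """Toy estimate of energy per MAC for a d x d matrix applied to `batch` vectors.

    Assumptions (hypothetical, arbitrary units):
      - e_io:  electrical energy to load or read out one number
      - e_opt: optical energy spent per multiply-accumulate
      - the weight matrix is loaded once and reused for every vector in the batch
    """
    macs = batch * d * d
    # Electrical overhead: d*d weights loaded once, plus d inputs loaded
    # and d outputs read out for each of the `batch` vectors.
    electrical = e_io * (d * d + 2 * batch * d)
    optical = e_opt * macs
    return (electrical + optical) / macs


if __name__ == "__main__":
    # Larger operations spread the fixed electrical overhead over more MACs.
    for d in (64, 512, 4096):
        print(d, energy_per_mac(d, batch=d))
```

In this toy model the per-MAC cost works out to e_io * (1/batch + 2/d) + e_opt, so the electrical term shrinks as the matrix dimension and the batch grow, which is the sense in which larger linear operations amortize the loading overhead.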

