Pipeline parallelism splits a model “vertically” by layer. It is also possible to split certain operations “horizontally” within a layer, which is usually called tensor parallelism. For many modern models (such as the Transformer), the computation bottleneck is multiplying an activation batch matrix by a large weight matrix. Matrix multiplication can be thought of as dot products between pairs of rows and columns; it is possible to compute independent dot products on different GPUs, or to compute parts of each dot product on different GPUs and sum the results. With either strategy, we can split the weight matrix into evenly sized “chunks”, host each chunk on a different GPU, and use that chunk to compute the relevant part of the overall matrix product before communicating to combine the results.
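As a rough illustration, the sketch below simulates both strategies on a single device with PyTorch (the dimensions and chunk count are arbitrary); on real hardware each chunk would live on its own GPU, and the final concatenation or sum would be a collective communication step.

```python
import torch

# Minimal sketch of the two sharding strategies (chunks simulated on one
# device; in practice each chunk would live on its own GPU).
batch, d_in, d_out, n_chunks = 8, 16, 32, 4
x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)

# Column parallelism: each worker holds a slice of W's output columns and
# computes an independent piece of X @ W; results are concatenated.
col_chunks = torch.chunk(w, n_chunks, dim=1)
col_parallel = torch.cat([x @ chunk for chunk in col_chunks], dim=1)

# Row parallelism: each worker holds a slice of W's input rows (and the
# matching slice of X); partial products are summed (an all-reduce on GPUs).
row_chunks = torch.chunk(w, n_chunks, dim=0)
x_chunks = torch.chunk(x, n_chunks, dim=1)
row_parallel = sum(xs @ ws for xs, ws in zip(x_chunks, row_chunks))

assert torch.allclose(col_parallel, x @ w, atol=1e-5)
assert torch.allclose(row_parallel, x @ w, atol=1e-5)
```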
An example is Megatron-LM, which parallelizes the matrix multiplications within the Transformer’s MLP and self-attention layers. PTD-P uses tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-consecutive layers to each device, reducing bubble overhead at the cost of more network communication.
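The following sketch illustrates the idea behind a Megatron-style MLP split (it is a simplified illustration, not the Megatron-LM API): the first projection is split by columns and the second by rows, so each worker can apply the nonlinearity locally and only one combining step is needed at the end.

```python
import torch
import torch.nn.functional as F

# Sketch of a Megatron-style parallel MLP (chunks simulated on one device).
# The first projection is split by columns and the second by rows, so the
# workers only need to combine their outputs once, at the very end.
d_model, d_hidden, n_chunks = 16, 64, 4
x = torch.randn(8, d_model)
w1 = torch.randn(d_model, d_hidden)
w2 = torch.randn(d_hidden, d_model)

w1_chunks = torch.chunk(w1, n_chunks, dim=1)   # column-parallel
w2_chunks = torch.chunk(w2, n_chunks, dim=0)   # row-parallel

# Each worker applies the nonlinearity locally; one sum (all-reduce) combines.
partials = [F.gelu(x @ a) @ b for a, b in zip(w1_chunks, w2_chunks)]
y = sum(partials)

assert torch.allclose(y, F.gelu(x @ w1) @ w2, atol=1e-3)
```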
Sometimes the input to the network can be parallelized across a dimension that involves a high degree of parallel computation relative to cross-communication. Sequence parallelism is one such idea, in which an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing the computation to proceed on more granularly sized examples.
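Here is a minimal sketch for a per-token operation, where splitting along the sequence dimension is exact; the chunk count and the layer used are placeholders chosen for illustration.

```python
import torch

# Sketch of sequence parallelism for a per-token operation: the sequence is
# split along time, each chunk is processed independently (here on one
# device; in practice on separate workers), and the outputs are re-joined.
seq_len, d_model, n_chunks = 1024, 16, 4
x = torch.randn(seq_len, d_model)
layer = torch.nn.Sequential(
    torch.nn.LayerNorm(d_model),
    torch.nn.Linear(d_model, d_model),
)

chunks = torch.chunk(x, n_chunks, dim=0)           # split across time
y = torch.cat([layer(c) for c in chunks], dim=0)   # peak activation memory
                                                   # per worker is ~1/n_chunks

assert torch.allclose(y, layer(x), atol=1e-5)
```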
As the technologies of computer vision, robotics, and machine learning continue to evolve, the ability to create and train large neural networks has become an important priority. At Ikaroa, a full-stack tech company, we are committed to helping bring the latest research in these technologies to a wider audience, and for this reason we have explored the various techniques for training large neural networks.
Generally, when training a neural network, the goal is for the model to learn its own representations from existing data so that its subsequent predictions become more accurate. To facilitate effective training of large neural networks, here are a few techniques that we consider pertinent.
The first technique we consider is mini-batching: dividing the training set into mini-batches, or small batches of training data. This reduces the processing time and memory required per update, since it is not necessary to run the entire training set through the model at once. Furthermore, the small batches can be processed in parallel across devices, which is beneficial when training on large datasets.
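As a simple illustration, the following sketch trains a small model on synthetic data one mini-batch at a time; the dataset, model, and batch size are placeholders rather than recommendations.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Minimal sketch of mini-batch training on synthetic data.
inputs, targets = torch.randn(1000, 20), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=32, shuffle=True)

model = torch.nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:                 # one small batch at a time
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()                  # parameters update once per batch
```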
Another technique that can be employed is regularization. This involves introducing a penalty or constraint into the model, such as weight decay or dropout, to help the model remain generalizable. Regularization reduces the risk of overfitting by discouraging the model from memorizing idiosyncrasies of the training data, so that it performs better on unseen data.
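In a framework such as PyTorch, both regularizers are straightforward to add; the sketch below uses an illustrative dropout rate and weight-decay coefficient, not tuned values.

```python
import torch

# Sketch of two common regularizers: dropout inside the model and weight
# decay (an L2 penalty) applied through the optimizer.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 64),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),      # randomly zeroes activations during training
    torch.nn.Linear(64, 1),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

model.train()   # dropout active during training
# ... training loop ...
model.eval()    # dropout disabled for evaluation/inference
```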
Finally, another important technique is to keep the size of the model small. A smaller model can be trained with fewer resources and is easier to deploy. Techniques such as pruning or quantization can be used to reduce the model size while largely preserving its accuracy.
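The sketch below shows the basic idea behind both, applied to a single weight matrix; the pruning threshold and the quantization scheme are illustrative rather than production-ready.

```python
import torch

# Sketch of two ways to shrink a trained weight matrix: magnitude pruning
# (zero out the smallest weights) and naive symmetric int8 quantization.
w = torch.randn(64, 64)

# Magnitude pruning: drop roughly the 50% of weights with smallest magnitude.
threshold = w.abs().median()
pruned = torch.where(w.abs() >= threshold, w, torch.zeros_like(w))

# Symmetric int8 quantization: store weights as 8-bit integers plus a scale.
scale = w.abs().max() / 127.0
quantized = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
dequantized = quantized.float() * scale   # approximate reconstruction
```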
At Ikaroa, we constantly strive to keep our models accurate, efficient, and generalizable, and these techniques have proven beneficial in achieving that goal. With them, we are able to train our large-scale neural networks effectively and to keep improving our technologies for future applications.