Deep neural networks have enabled technological wonders ranging from speech recognition to machine translation to protein engineering, but their design and application are nonetheless notoriously unprincipled. The development of tools and methods to guide this process is one of the grand challenges of deep learning theory. In Reverse Engineering the Neural Tangent Kernel, we propose a paradigm for bringing some principle to the art of architecture design using recent theoretical advances: first design a good kernel function, often a much easier task, and then "reverse-engineer" the net-kernel equivalence to translate the chosen kernel into a neural network. Our main theoretical result enables the design of activation functions from first principles, and we use it to create one activation function that mimics the performance of a deep $\textrm{ReLU}$ network with only one hidden layer and another that soundly outperforms deep $\textrm{ReLU}$ networks on a synthetic task.
Fig 1. Kernels back to networks. Foundational works derived formulas that map from wide neural networks to their corresponding kernels. We obtain the inverse mapping, allowing us to start from a desired kernel and turn it back into a network architecture.
Kernels of neural networks
The field of deep learning theory has recently been transformed by the realization that deep neural networks often become analytically tractable to study in the infinite-width limit. Take the limit in the right way, and the network in fact converges to an ordinary kernel method using either the architecture's "neural tangent kernel" (NTK) or, if only the last layer is trained (as in random feature models), its "neural network Gaussian process" (NNGP) kernel. Like the central limit theorem, these wide-network limits are often surprisingly good approximations even far from infinite width (often holding true at widths in the hundreds or thousands), giving a remarkable analytical handle on the mysteries of deep learning.
From networks to kernels and back again
The original works exploring the net-kernel correspondence gave formulas that go from architecture to kernel: given a description of an architecture (e.g. depth and activation function), they give you the network's two kernels. This has permitted great insights into the optimization and generalization of various architectures of interest. However, if our goal is not merely to understand existing architectures but to design new ones, then we might rather have the mapping in the reverse direction: given a kernel we want, can we find an architecture that gives it to us? In this work, we derive this inverse mapping for fully-connected networks (FCNs), allowing us to design simple networks in a principled manner by (a) positing a desired kernel and (b) designing an activation function that gives it.
To see why this makes sense, let's first visualize an NTK. Consider a wide FCN's NTK $K(x_1, x_2)$ on two input vectors $x_1$ and $x_2$ (which, for simplicity, we will assume are normalized to the same length). For an FCN, this kernel is rotation-invariant in the sense that $K(x_1, x_2) = K(c)$, where $c$ is the cosine of the angle between the inputs. Since $K(c)$ is a scalar function of a scalar argument, we can simply plot it. Figure 2 shows the NTK of a four-hidden-layer (4HL) $\textrm{ReLU}$ FCN.
Fig 2. The NTK of a 4HL $\textrm{ReLU}$ FCN as a function of the cosine between two input vectors $x_1$ and $x_2$.
This plot actually contains much information about the learning behavior of the corresponding wide network! The monotonic increase means that this kernel expects closer points to have more correlated function values. The steep increase at the end tells us that the correlation length is not too large, and the network can fit complicated functions. The diverging derivative at $c=1$ tells us about the smoothness of the function we expect to get. Importantly, none of these facts is apparent from looking at a plot of $\textrm{ReLU}(z)$! We claim that, if we want to understand the effect of choosing an activation function $\phi$, then the resulting NTK is actually more informative than $\phi$ itself. It thus might make sense to try to design architectures in "kernel space," and then translate them back into the typical hyperparameters.
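For concreteness, here is a minimal numpy sketch (our own illustration, not the paper's code) of the standard infinite-width NTK recursion for a bias-free $\textrm{ReLU}$ FCN with unit-norm inputs. It produces a curve qualitatively like Figure 2, though exact conventions (biases, weight variances) may shift the curve slightly.

```python
import numpy as np
import matplotlib.pyplot as plt

def relu_fcn_ntk(c, n_hidden_layers=4):
    """Infinite-width NTK of a bias-free ReLU FCN, as a function of the
    cosine c between two unit-norm inputs (standard arc-cosine recursion)."""
    c = np.clip(c, -1.0, 1.0)
    sigma = c   # NNGP kernel after the current layer
    theta = c   # NTK accumulated so far
    for _ in range(n_hidden_layers):
        # E[phi'(u) phi'(v)] and E[phi(u) phi(v)] for correlated unit Gaussians,
        # using the variance-preserving ReLU normalization.
        kappa0 = (np.pi - np.arccos(sigma)) / np.pi
        kappa1 = (np.sqrt(1 - sigma**2) + (np.pi - np.arccos(sigma)) * sigma) / np.pi
        theta = kappa1 + theta * kappa0
        sigma = np.clip(kappa1, -1.0, 1.0)
    return theta

c = np.linspace(-1, 1, 500)
plt.plot(c, relu_fcn_ntk(c))
plt.xlabel("cosine $c$ between inputs")
plt.ylabel("$K(c)$")
plt.show()
```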
An activation function for every kernel
Our main result is a “reverse engineering theorem” which states the following:
Theorem 1: For any kernel $K(c)$, we can construct an activation function $\tilde{\phi}$ such that, when inserted into a single-hidden-layer FCN, its infinite-width NTK or NNGP kernel is $K(c)$.
We give an explicit formula for $\tilde{\phi}$ in terms of Hermite polynomials (though we use a different functional form in practice for trainability reasons). Our proposed use of this result is that, in problems with some known structure, it will sometimes be possible to write down a good kernel and reverse-engineer it into a trainable network with various advantages over pure kernel regression, like computational efficiency and the ability to learn features. As a proof of concept, we test this idea out on the synthetic parity problem (i.e., given a bitstring, is the sum even or odd?), immediately generating an activation function that dramatically outperforms $\textrm{ReLU}$ on the task.
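To give a flavor of how a Hermite construction can work (this is our own sketch of the NNGP side with made-up coefficients, not the paper's formula): if $\tilde{\phi}(z) = \sum_n a_n \, \mathrm{He}_n(z)/\sqrt{n!}$ in the probabilists' Hermite basis, and weights are scaled so preactivations are unit-variance Gaussians, then a 1HL network's NNGP kernel on unit-norm inputs is $\sum_n a_n^2 c^n$. A target kernel with nonnegative power-series coefficients $k_n$ can therefore be matched by setting $a_n = \sqrt{k_n}$.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

def activation_from_kernel_coeffs(k):
    """Given nonnegative coefficients k[n] of a target kernel
    K(c) = sum_n k[n] * c**n, return an activation phi whose 1HL NNGP
    kernel (for unit-norm inputs) is K(c). Sketch only, not the paper's formula."""
    a = np.sqrt(np.asarray(k, dtype=float))                      # Hermite coefficients a_n
    coefs = a / np.sqrt([factorial(n) for n in range(len(a))])   # convert to plain He_n basis
    return lambda z: He.hermeval(z, coefs)

# Hypothetical target kernel K(c) = 0.5c + 0.3c^2 + 0.2c^3.
k = [0.0, 0.5, 0.3, 0.2]
phi = activation_from_kernel_coeffs(k)

# Monte-Carlo check: E[phi(u) phi(v)] over unit Gaussians with correlation c.
rng = np.random.default_rng(0)
c = 0.7
u = rng.standard_normal(200_000)
v = c * u + np.sqrt(1 - c**2) * rng.standard_normal(200_000)
print(np.mean(phi(u) * phi(v)), sum(kn * c**n for n, kn in enumerate(k)))
```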
Is one hidden layer all you need?
Here is another surprising use of our result. The kernel curve above is for a 4HL $\textrm{ReLU}$ FCN, but we claimed that we can achieve any kernel, including that one, with only one hidden layer. This implies that we can come up with some new activation function $\tilde{\phi}$ that gives this "deep" NTK in a shallow network! Figure 3 illustrates this experiment.
Fig 3. Shallowification of a deep $\textrm{ReLU}$ FCN into a 1HL FCN with an engineered activation function $\tilde{\phi}$.
Surprisingly, this "shallowification" actually works. The left subplot of Figure 4 below shows a "mimic" activation function $\tilde{\phi}$ that gives virtually the same NTK as a deep $\textrm{ReLU}$ FCN. The right plots show train + test loss and accuracy traces for three FCNs on a standard tabular problem from the UCI dataset. Note that, while the shallow and deep $\textrm{ReLU}$ networks have very different behaviors, our engineered shallow mimic network tracks the deep network almost exactly!
Fig 4. Left panel: our engineered "mimic" activation function, plotted with ReLU for comparison. Right panels: performance traces for 1HL ReLU, 4HL ReLU, and 1HL mimic FCNs trained on a UCI dataset. Note the close match between the 4HL ReLU and 1HL mimic networks.
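As a hedged aside on why this is even possible at the kernel level: under a standard NTK parameterization of a bias-free 1HL FCN with unit-norm inputs, an activation with Hermite coefficients $a_n$ gives the infinite-width NTK $\sum_n (n+1)\, a_n^2 c^n$, so a rich enough family of activations can in principle reproduce a deep network's kernel. The snippet below (our own sanity check with arbitrary illustrative coefficients and a hypothetical parameterization, not the paper's construction) verifies this formula against the empirical NTK of a wide finite network.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

rng = np.random.default_rng(0)

# Arbitrary illustrative Hermite coefficients a_n of an activation phi.
a = np.array([0.0, 0.8, 0.4, 0.2, 0.1])
coefs = a / np.sqrt([factorial(n) for n in range(len(a))])
phi = lambda z: He.hermeval(z, coefs)
dphi = lambda z: He.hermeval(z, He.hermeder(coefs))   # phi'

# Two inputs of norm sqrt(d) with cosine c = 0.7.
d, N, c = 50, 100_000, 0.7
x1 = rng.standard_normal(d); x1 *= np.sqrt(d) / np.linalg.norm(x1)
g = rng.standard_normal(d); g -= (g @ x1) / (x1 @ x1) * x1
x2 = c * x1 + np.sqrt(1 - c**2) * np.sqrt(d) / np.linalg.norm(g) * g

# Wide 1HL network in NTK parameterization: f(x) = v . phi(W x / sqrt(d)) / sqrt(N).
W = rng.standard_normal((N, d))
v = rng.standard_normal(N)
u1, u2 = W @ x1 / np.sqrt(d), W @ x2 / np.sqrt(d)

# Empirical NTK = <grad_v f(x1), grad_v f(x2)> + <grad_W f(x1), grad_W f(x2)>.
ntk_emp = phi(u1) @ phi(u2) / N + (v**2 * dphi(u1) * dphi(u2)).sum() / N * (x1 @ x2 / d)

# Infinite-width prediction: sum_n (n+1) a_n^2 c^n.
ntk_inf = sum((n + 1) * a[n]**2 * c**n for n in range(len(a)))
print(ntk_emp, ntk_inf)   # should roughly agree for large N
```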
This is interesting from an engineering perspective because the shallow network uses fewer parameters than the deep network to achieve the same performance. It is also interesting from a theoretical perspective because it raises fundamental questions about the value of depth. A common belief in deep learning lore is that deeper is not only better but qualitatively different: that deep networks will efficiently learn functions that shallow networks simply cannot. Our shallowification result suggests that, at least for FCNs, this is not true: if we know what we're doing, then depth doesn't buy us anything.
Conclusion
This work comes with many caveats. Most importantly, our result only applies to FCNs, which alone are rarely state-of-the-art. However, work on convolutional NTKs is fast progressing, and we believe this paradigm of designing networks by designing kernels is ripe for extension in some form to these structured architectures.
Theoretical work has thus far furnished relatively few tools for practical deep learning practitioners. We intend this to be a modest step in that direction. Even without a science to guide their design, neural networks have already enabled wonders. Just imagine what we'll be able to do with them once we finally have one.
This post is based on the paper "Reverse Engineering the Neural Tangent Kernel," which is joint work with Sajant Anand and Mike DeWeese. We provide code to reproduce all our results, and we'd be happy to field your questions or comments.
At Ikaroa, we understand the need for groundbreaking innovations in the artificial intelligence (AI) research space. In recent years, the field of AI has seen tremendous advancements, and the Berkeley Artificial Intelligence Research group at the University of California, Berkeley is at the forefront of cutting-edge technology. In their blog, the Berkeley Artificial Intelligence Research group discusses the concept of "towards first-principles architecture design," which is set to bring dramatic changes to the way AI models are designed.
The basic idea behind this approach is to design an AI architecture rooted in the first principles of computation and engineering. Instead of relying on an existing architecture such as those used in deep neural networks, architects can use fundamental computational principles to create entirely new architectures. This could potentially lead to smarter, more efficient AI designs that are tailored to specific applications.
At Ikaroa, we believe in the power of first-principles architecture design. We are actively exploring new ways to drive the development of AI by taking a holistic view of designing architectures that integrate the knowledge and data of countless AI technologies. We proudly stand alongside the Berkeley AI Research group in their efforts to show that AI designs can be made highly customized, efficient, and smart without relying on existing architectures.
The Berkeley Artificial Intelligence Research group's blog is a great resource for learning more about first-principles architecture design. It covers topics such as the design of algorithms, models, and layers, their effects and trade-offs, and how they can be used to create and optimize AI systems. We at Ikaroa highly encourage anyone interested in the development of AI to explore this resource and read the blog articles to gain a better understanding of this important concept.