KAN: Kolmogorov-Arnold Networks
Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark

Abstract: Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parameterized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.

https://arxiv.org/abs/2404.19756
Before we dig into the paper and the implementations, let's first unpack MLPs, the current dominant form of AI models.
The Universal Approximation Theorem and Multi-Layer Perceptrons
The MLP (Multi-Layer Perceptron) is useful because of the Universal Approximation Theorem, which states that for any continuous function $$f : \mathbb{R}^n \to \mathbb{R}^m$$
there exists a neural network with one hidden layer that can approximate \( f \) on a compact subset of \(\mathbb{R}^n\) to arbitrary accuracy, controlled by a tolerance parameter \( \epsilon \). The paper writes out the formula:
$$ f({\bf x}) \approx \sum_{i=1}^{N(\epsilon)} a_i \, \sigma({\bf w}_i \cdot {\bf x} + b_i)$$
Here \(N(\epsilon)\) is the number of hidden neurons (terms in the sum) required to get within \( \epsilon \) of the true function \(f\). The \({\bf w}_i\) are the weights and the \(b_i\) are the biases, both learned during training, while the \(a_i\) weight each neuron's contribution to the output. The learning algorithm used is called backpropagation. If you are interested in how backprop works, check out this excellent post by Chris Olah from 2015, which shows how backprop follows from the chain rule in calculus.
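To make the formula concrete, here is a minimal NumPy sketch of a one-hidden-layer network evaluating exactly that sum. Everything here (the names, the hidden width N = 64, the random parameter values) is illustrative rather than taken from the paper; in a real MLP the parameters would be learned by backpropagation.

```python
import numpy as np

# A minimal sketch of the universal-approximation formula:
#   f(x) ≈ sum_i a_i * sigma(w_i . x + b_i)
# Names and values are illustrative; in practice W, b, a are learned.

def sigma(z):
    # a fixed sigmoidal nonlinearity; tanh is one common choice
    return np.tanh(z)

def one_hidden_layer_mlp(x, W, b, a):
    """x: (n,) input, W: (N, n) hidden weights, b: (N,) biases, a: (N,) output weights."""
    return a @ sigma(W @ x + b)

# Example: N = 64 hidden neurons on a 2-dimensional input
rng = np.random.default_rng(0)
N, n = 64, 2
W = rng.normal(size=(N, n))
b = rng.normal(size=N)
a = rng.normal(size=N)

x = np.array([0.3, -1.2])
print(one_hidden_layer_mlp(x, W, b, a))   # scalar output approximating f(x)
```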
This result is the theoretical underpinning of today's neural networks. But what is this Kolmogorov-Arnold representation theorem?
The Kolmogorov-Arnold representation theorem and KANs
The alternative architecture proposed in the arXiv paper is the KAN (Kolmogorov-Arnold Network), which is built on the Kolmogorov-Arnold representation theorem. Unlike the Universal Approximation Theorem, this representation is exact. Here is the statement: for any continuous function \( f : [0,1]^n \to \mathbb{R} \) (a multivariate output just applies the result to each output coordinate), there exist continuous univariate functions \( \Phi_q \) and \( \phi_{q,p} \) such that:
$$ f({\bf x}) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^n \phi_{q,p}(x_p) \right)$$
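To get a feel for what this says, here is a small hand-built example (mine, not the paper's): multiplication of two positive numbers can be written entirely with sums and univariate functions,

$$ x_1 x_2 = \exp\big( \ln x_1 + \ln x_2 \big), \qquad x_1, x_2 > 0,$$

with \( \Phi = \exp \) and \( \phi_{1} = \phi_{2} = \ln \). The theorem guarantees that a decomposition of this general shape exists for every continuous multivariate function, using at most \( 2n+1 \) outer functions.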
Other than being exact, the AI-relevant feature of KANs is where the activation functions live. In an MLP, each neuron holds a number, and the value of that number can trigger an "activation", analogous to a neuron firing. These activation functions are fixed ahead of time; popular choices include ReLU, tanh, and various sigmoids.
KANs instead put learnable activation functions on the edges, and each node simply sums its incoming values. This means there are no linear weight matrices at all!
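To see what "functions on edges, sums on nodes" looks like in code, here is a rough sketch of a single KAN-style layer. The paper parameterizes each edge function as a B-spline (plus a base activation); to keep this sketch self-contained I use a simpler learnable basis of fixed Gaussian bumps, so the class name, the grid, and the coefficients are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

# A simplified KAN-style layer: one learnable univariate function per edge,
# plain summation on each output node. The edge functions here are linear
# combinations of fixed Gaussian bumps instead of the paper's B-splines.

class KANLayer:
    def __init__(self, in_dim, out_dim, grid_size=8, rng=None):
        rng = rng or np.random.default_rng(0)
        self.grid = np.linspace(-1, 1, grid_size)          # fixed basis centers
        # one set of basis coefficients per edge (out_dim x in_dim edges)
        self.coeffs = rng.normal(scale=0.1, size=(out_dim, in_dim, grid_size))

    def edge_basis(self, x):
        # evaluate the Gaussian bumps at each input coordinate: (in_dim, grid_size)
        return np.exp(-((x[:, None] - self.grid[None, :]) ** 2) / 0.1)

    def __call__(self, x):
        B = self.edge_basis(x)                              # (in_dim, grid_size)
        # phi_{q,p}(x_p) = sum_k coeffs[q, p, k] * basis_k(x_p)
        edge_outputs = np.einsum('qpk,pk->qp', self.coeffs, B)
        return edge_outputs.sum(axis=1)                     # sum over incoming edges

layer = KANLayer(in_dim=2, out_dim=3)
print(layer(np.array([0.3, -0.7])))   # (3,) output; coeffs would be trained by backprop
```

Conceptually, stacking such layers and training the edge-function coefficients by gradient descent is all a KAN is; the paper's version adds B-spline details such as grid extension and sparsity regularization on top.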
One of the motivations for KANs is that they are more interpretable than MLPs. Interpreting the internal structure of a model is a huge open problem, so that alone is a good reason to learn their strengths and weaknesses. KANs also promise better accuracy with less training compute.
In the next post we will dig into how to write and train one of these things and compare/contrast it to the MLP equivalent.