Language models can explain neurons in language models

Although the vast majority of our explanations score low, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found that we were able to improve scores by:

  • Iterating on explanations. We can increase scores by asking GPT-4 to propose possible counterexamples and then revising the explanations in light of their activations.
  • Using larger models to give explanations. The average score rises as the explainer model becomes more capable. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
  • Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.

We are open-sourcing our datasets and visualization tools for the GPT-4-written explanations of all 307,200 GPT-2 neurons, as well as the code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 through explanations.

We found over 1,000 neurons with explanations that scored at least 0.8, meaning that, according to GPT-4, they account for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 did not understand. We hope that as explanations improve, we will be able to rapidly uncover interesting qualitative understanding of the models' computations.
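The scoring idea described above can be sketched roughly as follows: a score near 1.0 means activations simulated from the natural-language explanation track the neuron's real activations. This is a minimal illustration using a plain Pearson correlation; `explanation_score` and the toy activation values are hypothetical, not the exact scoring code being open-sourced.

```python
import math

def explanation_score(actual, simulated):
    """Correlation between a neuron's real activations and the
    activations predicted ("simulated") from its explanation.
    Hypothetical helper for illustration only."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_s = sum(simulated) / n
    cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(actual, simulated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_a == 0 or var_s == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_s)

# A neuron whose firing the explanation predicts well scores near 1.0:
actual = [0.0, 2.1, 0.1, 1.9, 0.0]
simulated = [0.0, 2.0, 0.0, 2.0, 0.0]
score = explanation_score(actual, simulated)
```

An explanation scoring at least 0.8 under a measure like this would account for most of the variation in when the neuron fires strongly.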

Language models, a staple tool in natural language processing, have vastly improved accuracy across many AI tasks. Their ability to capture the complexities of language without being explicitly programmed is a remarkable feat, and has produced superior results on tasks such as question answering, machine translation, and sentiment analysis. Recently, more research has been conducted to understand which features of these models allow them to succeed and why they outperform traditional methods. One intriguing result is that the neurons in language models can be interpreted as representations of linguistic features such as words, parts of speech, and syntax.

At the heart of these models are artificial neural networks; in the case of GPT-2 and similar modern language models, these are transformer architectures rather than recurrent networks. The output of each neuron in the network is computed from its inputs and the network's weights, and individual neurons can play distinct roles in the operation of the language model. For example, some neurons appear specialized for identifying particular words, while others pick up on certain syntactic patterns.
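The per-neuron computation described above is simple to write down. Below is a minimal sketch: a weighted sum of inputs plus a bias, passed through a GELU nonlinearity (the activation used in GPT-2's MLP layers, here via its common tanh approximation). The function name and the input values are illustrative, not taken from any released code.

```python
import math

def neuron_output(inputs, weights, bias):
    """Output of a single neuron: weighted sum of inputs plus bias,
    passed through a GELU activation (tanh approximation)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

# Illustrative values: two inputs, two weights, one bias.
out = neuron_output([1.0, -0.5], [0.8, 0.3], 0.1)
```

Interpreting a neuron then amounts to finding a natural-language description of which inputs drive this output high.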

At Ikaroa, we are researching the inner workings of language models and the neurons that drive their behavior. We are exploring ways to interpret these neurons as meaningful representations of language and to uncover the logic behind the models' computations. Our findings are allowing us to better understand the principles governing language models and to better optimize them for different tasks. By interpreting the neurons in language models, we hope to predict a model's output more accurately and gain new insights into how language works.
