Learning to play Minecraft with Video PreTraining

The Internet contains a wealth of publicly available videos that we can learn from. You can watch a person give a great presentation, a digital artist draw a beautiful sunset, and a Minecraft player build a complicated house. However, these videos only provide a record of what happened, not exactly how it was achieved, meaning you won’t know the exact sequence of mouse movements and key presses. If we would like to build large-scale base models in these domains, as we have done in language with GPT, this lack of action labels presents a new challenge not present in the language domain, where the “action labels” are simply the next words in a sentence.

In order to use the wealth of unlabeled video data available on the Internet, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by collecting a small dataset from contractors where we record not only their video but also the actions they took, which in our case are keystrokes and mouse movements. With this data we train an Inverse Dynamics Model (IDM), which predicts the action being taken at each step of the video. Importantly, the IDM can use both past and future information to guess the action at each step. This task is much easier, and thus requires much less data, than the behavioral cloning task of predicting actions given only previous video frames, which requires inferring what the person wants to do and how to achieve it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.
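To make the pipeline concrete, here is a minimal toy sketch of the four steps described above. It is an illustration under heavy assumptions, not the actual VPT implementation: the "video" is a 1-D position, the actions are steps of -1, 0, or +1, both models are simple lookup tables rather than neural networks, and every name (`expert_action`, `record_episode`, `fit_majority`, `behavioral_cloning`) is hypothetical.

```python
from collections import Counter, defaultdict

def expert_action(pos):
    """Hypothetical expert: step toward the origin, then stand still."""
    return (pos < 0) - (pos > 0)

def record_episode(start, length=6):
    """Return (frames, actions) with frames[t + 1] = frames[t] + actions[t]."""
    frames, actions = [start], []
    for _ in range(length):
        action = expert_action(frames[-1])
        actions.append(action)
        frames.append(frames[-1] + action)
    return frames, actions

def fit_majority(pairs):
    """Lookup-table classifier: map each feature to its most common label."""
    counts = defaultdict(Counter)
    for feature, label in pairs:
        counts[feature][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}

# 1. Small contractor dataset: frames plus recorded actions.
labeled = [record_episode(start) for start in (-3, 4)]

# 2. Train the IDM. It is non-causal: its feature uses the *next* frame too.
idm = fit_majority(
    (frames[t + 1] - frames[t], actions[t])
    for frames, actions in labeled
    for t in range(len(actions))
)

# 3. Large "web" dataset: frames only, the actions are thrown away.
unlabeled = [record_episode(start)[0] for start in range(-10, 11)]

# 4. Pseudo-label every step of the unlabeled videos with the IDM.
pseudo_labeled = [
    (frames, [idm[frames[t + 1] - frames[t]] for t in range(len(frames) - 1)])
    for frames in unlabeled
]

def behavioral_cloning(dataset):
    """Causal policy: predict the action from the current frame only."""
    return fit_majority(
        (frames[t], actions[t])
        for frames, actions in dataset
        for t in range(len(actions))
    )

policy = behavioral_cloning(pseudo_labeled)  # IDM labels + behavioral cloning
baseline = behavioral_cloning(labeled)       # contractor data only

print(len(baseline), "states covered by the baseline")
print(len(policy), "states covered after pseudo-labeling")
```

In this toy world the IDM only has to read the action off the difference between consecutive frames, so two short contractor episodes are enough to train it. The cloned policy has to learn the expert's intent from the current frame alone, and it only covers the full range of positions once the IDM has labeled the larger unlabeled set.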

