GPT-4 is the latest milestone in OpenAI’s effort to scale up deep learning. GPT-4 is a large multimodal model that accepts image and text inputs and emits text outputs, and it performs at a human level on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers, whereas GPT-3.5 scored around the bottom 10%. We spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.
Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a powerful supercomputer for our workload. A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time. We will continue to work on reliable scaling and refine our methodology so we can predict future capabilities more accurately, something we consider critical for safety.
GPT-4’s text input capability is now available via ChatGPT (with a waitlist) and the API. We are working closely with a single partner to prepare the image input capability for wider availability. We are also releasing OpenAI Evals, our framework for automated model evaluation, which anyone can use to report shortcomings in our models.
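As an illustration of what API access to the text input capability looks like, the sketch below assembles a request payload for a chat-style completion. This is a minimal sketch, not official documentation: the payload shape and the `"gpt-4"` model name follow the publicly documented chat completions API, and the `build_chat_request` helper and its prompts are hypothetical. Sending the payload would additionally require an API key and an HTTP call, which are omitted here.

```python
import json

def build_chat_request(prompt: str, model: str = "gpt-4") -> dict:
    """Assemble a JSON-serializable payload for a chat completion request.

    The messages list alternates roles; a system message steers overall
    behavior and the user message carries the actual text input.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }

# Build a payload and show the JSON that would be POSTed to the API.
payload = build_chat_request("Explain what a bar exam tests, in one sentence.")
print(json.dumps(payload, indent=2))
```

In a real call, this payload would be sent to the chat completions endpoint with an authorization header; the structure above is the part that stays the same regardless of transport.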