We have trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
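
To make the contrast concrete, here is a minimal sketch of how the two reward signals differ on a single worked solution. It is an illustration only, not the training setup itself: the `Solution` class, the `outcome_reward` and `process_rewards` functions, and the per-step labels are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    """A hypothetical container for one model-generated solution."""
    steps: List[str]            # reasoning steps, in order
    step_correct: List[bool]    # assumed human label for each step
    final_answer: str


def outcome_reward(solution: Solution, reference_answer: str) -> float:
    """Outcome supervision: one reward, based only on the final answer."""
    return 1.0 if solution.final_answer == reference_answer else 0.0


def process_rewards(solution: Solution) -> List[float]:
    """Process supervision: one reward per reasoning step."""
    return [1.0 if ok else 0.0 for ok in solution.step_correct]


# Toy problem: solve 2x + 3 = 11. The solution below reaches the right
# answer through a faulty intermediate step.
example = Solution(
    steps=["2x + 3 = 11", "2x = 9", "x = 4"],
    step_correct=[True, False, False],  # step 3 does not follow from step 2
    final_answer="4",
)

print(outcome_reward(example, "4"))  # full reward despite the bad reasoning
print(process_rewards(example))      # the faulty steps receive no reward
```

In this toy case the outcome signal cannot tell the flawed solution apart from a sound one, while the per-step signal marks exactly where the reasoning went wrong.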