state-of-the-art coding performance with fewer than 10 B
More advanced and challenging multi-task evaluation