m-ric
posted an update 3 days ago
Are Agents capable enough for Data Science? ⇒ Measure their performance with DSBench 📊

A team from Tencent AI wanted to evaluate agentic systems on data science (DS) tasks, but they noticed that existing agentic benchmarks were severely limited in several respects: they covered only text, with no tables or images; they were specific to certain packages; and they only performed exact-match evaluation…

➡️ So they set out to build a much more exhaustive approach, to finally make the definitive DS agent benchmark.

The DSBench dataset
▪️ DSBench has 466 data analysis tasks and 74 data modelling tasks
▪️ The tasks are sourced from ModelOff and Kaggle, the platforms hosting the most popular data science competitions
▪️ Differences from previous DS benchmarks:
❶ This benchmark leverages various modalities on top of text: images, Excel files, tables
❷ Complex tables: sometimes several tables must be combined to answer one question
❸ The context is richer, with longer descriptions.
▪️ Evaluation metrics: the benchmark is scored with an LLM as a judge, using a specific prompt.
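
To make the contrast with exact-match scoring concrete, here is a minimal sketch of what LLM-as-a-judge scoring can look like. Everything in it is an assumption for illustration: the prompt template, the function names, and the `judge` callable (a stand-in for any function that sends a prompt to an LLM and returns its text reply) are hypothetical, and this is not the prompt or code used by the DSBench authors.

```python
from typing import Callable

# Hypothetical prompt template, written for illustration only.
# It is NOT the judging prompt used in DSBench.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Does the model answer match the reference? Reply with 'yes' or 'no'."
)


def exact_match(prediction: str, reference: str) -> bool:
    """Strict string comparison: brittle when answers differ only in format."""
    return prediction.strip() == reference.strip()


def judge_match(
    question: str,
    reference: str,
    prediction: str,
    judge: Callable[[str], str],
) -> bool:
    """Ask a judge model whether the prediction matches the reference.

    `judge` is a hypothetical callable: prompt text in, reply text out.
    """
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction
    )
    verdict = judge(prompt)
    # Treat any reply starting with "yes" (case-insensitive) as a match.
    return verdict.strip().lower().startswith("yes")


def accuracy(verdicts: list[bool]) -> float:
    """Fraction of tasks judged correct; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

With exact match, an answer of "50%" against a reference of "0.5" counts as wrong; a judge model can be asked to accept it, which is the kind of flexibility this scoring style is meant to give.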

Insights from evaluating agents
▪️ Their evaluation confirms that using LLMs in an agent setup, for instance by allowing them to run a single step of code execution, is more costly (especially with multi-turn frameworks like AutoGen) but also performs much better than the vanilla LLM.
▪️ The sets of tasks solved by different models (like GPT-3.5 vs Llama-3-8B) have quite low overlap, which suggests that different models tend to try very different approaches.
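
One simple way to quantify this kind of overlap is the Jaccard index between the sets of task IDs each model solved. The sketch below uses made-up task IDs and a hypothetical helper name; it is not the paper's analysis code, and the numbers are not results from the paper.

```python
def jaccard_overlap(solved_a: set[str], solved_b: set[str]) -> float:
    """Shared solved tasks divided by tasks solved by at least one model."""
    union = solved_a | solved_b
    if not union:
        return 0.0
    return len(solved_a & solved_b) / len(union)


# Made-up task IDs, for illustration only.
model_a = {"task_01", "task_02", "task_03", "task_04"}
model_b = {"task_03", "task_04", "task_05", "task_06"}

overlap = jaccard_overlap(model_a, model_b)  # 2 shared tasks out of 6 in total
only_b = model_b - model_a  # tasks that only the second model solved
```

A low value means the two models mostly succeed on different tasks, so looking at `only_b` (and its mirror image) shows what each model brings that the other misses.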

This new benchmark is really welcome, can't wait to try transformers agents on it! 🤗

Read their full paper 👉 DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (2409.07703)