The idea of LLMs acting as agents that interact with tools and software (for example, navigating a website, editing spreadsheets, or fixing code) is exciting. Unfortunately, many agentic LLMs still struggle with such tasks. For example, the best agentic framework on the SWE benchmark [1], which tests how well LLMs can fix GitHub issue…