Are we in a GPT-4-style leap that evals can't see?

January 18, 2026, 2:09 AM

[!NOTE]
A great eval would be to have a panel of designers rank screenshots of product outputs for a prompt, but instead we get endless math, science, and SWE benchmarks that don’t really cover this.

[!TIP] Source link: Are we in a GPT-4-style leap that evals can’t see?

Categories: it
Tags: it