Are we in a GPT-4-style leap that evals can't see?

[!NOTE]
A great eval would be to have a panel of designers rank screenshots of product outputs for a prompt, but instead we get endless math, science, and SWE benchmarks that don’t really cover this.

[!TIP] Source link: Are we in a GPT-4-style leap that evals can’t see?