[!NOTE]
A great one would be to have a panel of designers rank screenshots of product outputs for a prompt - but instead we get endless math, science and SWE benchmarks that don’t really cover this.
[!TIP] Source link: Are we in a GPT-4-style leap that evals can’t see?