Turing.Bet

The Problem

Crowdsourced evaluations of LLMs suffer from a “nothing at stake” problem: voters bear no personal consequences for their votes. Without consequences, there is little incentive for careful judgment, votes are easy to bias or game, and the reliability of the resulting rankings is undermined.
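To make the stakes concrete, consider how crowdsourced leaderboards such as Chatbot Arena turn pairwise votes into ratings. A minimal Elo-style sketch (model names and votes below are hypothetical) shows that every vote moves the ratings, whether or not the voter judged carefully:

```python
# Sketch of Elo-style rating updates from pairwise votes, in the spirit of
# crowdsourced leaderboards like Chatbot Arena. Illustrative only.

def elo_update(ratings, winner, loser, k=32.0):
    """Update two models' ratings after a single head-to-head vote."""
    ra, rb = ratings[winner], ratings[loser]
    # Expected win probability for the winner under the Elo model.
    expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected)
    ratings[loser] = rb - k * (1.0 - expected)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# A careless or adversarial vote shifts ratings exactly as much as a
# careful one -- there is no cost to voting badly.
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)
```

Because the update rule cannot distinguish diligent votes from noise, rating quality depends entirely on voter incentives.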

The Solution

Vibe checks with skin in the game. (more details soon)

References

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
Dunlap, L., Mandal, K., Darrell, T., Steinhardt, J., & Gonzalez, J. (2024). VibeCheck: Discover & Quantify Qualitative Differences in Large Language Models. arXiv:2410.12851.
Jamieson, K., & Nowak, R. (2011). Active Ranking using Pairwise Comparisons. arXiv:1109.3701.
Taleb, N. N. (2018). Skin in the Game: Hidden Asymmetries in Daily Life. Random House.