Turing.Bet
The Problem
Crowdsourced evaluations of LLMs, such as Chatbot Arena (Chiang et al., 2024), suffer from a "nothing at stake" problem: voters bear no personal consequences for their votes. With nothing to lose, there is little incentive for careful judgment, which invites sloppy or biased voting and undermines the reliability of the results.
The Solution
Vibe checks (Dunlap et al., 2024) with skin in the game: voters stake something on their judgments, so a careless vote carries a cost. (More details soon.)
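As an illustration only, since the actual mechanism has not been published, here is a minimal sketch of what "skin in the game" could look like for a pairwise vibe check. Every detail below is a hypothetical assumption: voters stake an amount on their preferred model answer, the matchup resolves by stake-weighted majority, and losing stakes are redistributed pro rata to the winning side.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    voter: str
    choice: str   # "A" or "B": which model's answer the voter prefers
    stake: float  # amount the voter puts at risk on this judgment

def settle(votes: list[Vote]) -> dict[str, float]:
    """Resolve one pairwise matchup and pay out stakes.

    Hypothetical rule (not the published Turing.Bet mechanism): the side
    with more total stake wins; the losing side's stakes are split among
    winners in proportion to each winner's own stake. Returns each
    voter's payout (stake returned plus winnings, or zero if lost).
    """
    pot = {"A": 0.0, "B": 0.0}
    for v in votes:
        pot[v.choice] += v.stake

    if pot["A"] == pot["B"]:  # tie: everyone gets their stake back
        return {v.voter: v.stake for v in votes}

    winner = "A" if pot["A"] > pot["B"] else "B"
    loser_pot = pot["B" if winner == "A" else "A"]

    payouts = {}
    for v in votes:
        if v.choice == winner:
            # stake back + pro-rata share of the losing side's pot
            payouts[v.voter] = v.stake + (v.stake / pot[winner]) * loser_pot
        else:
            payouts[v.voter] = 0.0  # stake lost: the "skin in the game"
    return payouts

# Example: two voters back A, one backs B.
votes = [Vote("alice", "A", 10.0), Vote("bob", "A", 5.0), Vote("carol", "B", 6.0)]
print(settle(votes))  # {'alice': 14.0, 'bob': 7.0, 'carol': 0.0}
```

Under this toy rule, voting against the eventual consensus is costly, which is exactly the consequence that free, anonymous voting lacks; a production design would also need to handle manipulation of the resolution itself.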
References
Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
Dunlap, L., Mandal, K., Darrell, T., Steinhardt, J., & Gonzalez, J. (2024). VibeCheck: Discover & Quantify Qualitative Differences in Large Language Models. arXiv:2410.12851.