Selected Publications
* denotes equal contributions
Are they Lovers or Friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Doğruöz, Alice Oh, Najoung Kim
ACL 2026
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Eunsu Kim*, Junyeong Park*, Na Min An*, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
CVPR 2026
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
Eunsu Kim*, Haneul Yoo*, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh
Preprint, Under Review
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
Seyoung Song*, Seogyeong Jeong*, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh
EMNLP 2025 (Findings)
Diffusion Models Through a Global Lens: Are They Culturally Inclusive?
Zahra Bayramli*, Ayhan Suleymanzade*, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh
ACL 2025 Oral, NAACL 2025 C3NLP Workshop
Text-to-image diffusion models can create compelling images from prompts, but their ability to represent cultural nuances remains limited. This work introduces CultDiff, a benchmark that evaluates models on generating culturally specific images across ten countries, revealing significant disparities in cultural relevance, especially for underrepresented regions.
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
ACL 2025 (Findings)
An evaluation framework that assesses LLMs through an interview-style process: an interviewer LLM evaluates other LLMs by providing feedback and asking follow-up questions, enabling more comprehensive capability assessment than static benchmarks allow.
Uncovering Factor Level Preferences to Improve Human-Model Alignment
Juhyun Oh*, Eunsu Kim*, Jiseon Kim, Wenda Xu, William Yang Wang, Alice Oh
EMNLP 2025 (Findings)
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, et al.
NeurIPS D&B 2025
A hand-crafted benchmark evaluating LLMs' everyday cultural knowledge across 16 countries/regions in 13 languages, including low-resource ones. Results show LLMs perform significantly better for cultures highly represented online, with up to a 57% gap in GPT-4 performance.
CLIcK: Evaluation of Cultural and Linguistic Intelligence in Korean
Eunsu Kim, Juyoung Suk, Philhoon Oh, Haneul Yoo, James Thorne, Alice Oh
LREC-COLING 2024
A culturally aware evaluation benchmark with 1,995 instances across 11 categories of Korean culture, spanning from everyday life to specialized subjects, as well as Korean grammar and linguistics.