Selected Publications
* denotes equal contributions
Are they Lovers or Friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Doğruöz, Alice Oh, Najoung Kim
ACL 2026
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Eunsu Kim*, Junyeong Park*, Na Min An*, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
CVPR 2026
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
Eunsu Kim*, Haneul Yoo*, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh
Preprint, Under Review
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
Seyoung Song*, Seogyeong Jeong*, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh
EMNLP 2025 (Findings)
Diffusion Models Through a Global Lens: Are They Culturally Inclusive?
Zahra Bayramli*, Ayhan Suleymanzade*, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh
ACL 2025 Oral, NAACL 2025 C3NLP Workshop
Text-to-image diffusion models can create compelling images from prompts, but their ability to represent cultural nuances remains limited. This work introduces CultDiff, a benchmark that evaluates models on generating culturally specific images across ten countries, revealing significant disparities in cultural relevance, especially for underrepresented regions.
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
ACL 2025 (Findings)
An evaluation framework that assesses LLMs through an interview-style process: an interviewer LLM evaluates other LLMs by providing feedback and asking follow-up questions, enabling more comprehensive capability assessment than static benchmarks allow.
Uncovering Factor Level Preferences to Improve Human-Model Alignment
Juhyun Oh*, Eunsu Kim*, Jiseon Kim, Wenda Xu, William Yang Wang, Alice Oh
EMNLP 2025 (Findings)
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, et al.
NeurIPS D&B 2025
A hand-crafted benchmark evaluating LLMs' everyday cultural knowledge across 16 countries/regions in 13 languages, including low-resource ones. Results show LLMs perform significantly better for cultures highly represented online, with up to a 57% gap in GPT-4 performance.
CLIcK: Evaluation of Cultural and Linguistic Intelligence in Korean
Eunsu Kim, Juyoung Suk, Philhoon Oh, Haneul Yoo, James Thorne, Alice Oh
LREC-COLING 2024
A culturally aware evaluation benchmark with 1,995 instances across 11 categories of Korean culture, spanning from everyday life to specialized subjects, as well as Korean grammar and linguistics.