Eunsu Kim


I am a visiting scholar at Carnegie Mellon University, working with Professor Sherry Tongshuang Wu. I am also a master's student advised by Professor Alice Oh at the School of Computing, KAIST.

My research aims to develop AI systems and agents that serve as meaningful bridges: connecting individuals, societies, and humans with intelligent agents.
I am currently focusing on two questions: (1) How effectively can large language models (LLMs) assist humans in real-world contexts? (2) How well do they understand and represent diverse multicultural and multilingual societies?

My ongoing projects explore human-AI collaboration and the evaluation of the social and cultural capabilities of vision large language models (VLLMs). If you would like to collaborate with me or have any questions, feel free to contact me!

Email  /  Scholar  /  LinkedIn  /  Twitter(X)  /  CV

Affiliations

Carnegie Mellon University (CMU)

Visiting Scholar, Human-Computer Interaction Institute (HCII), School of Computer Science. Host professor: Sherry Wu. 2025.09–present

Korea Advanced Institute of Science and Technology (KAIST)

M.S. in Computer Science, Advisor: Alice Oh 2023.09–present

B.S. in Electrical Engineering 2019.03–2023.08 GPA: 4.02/4.3, Major GPA: 4.15/4.3 (Summa Cum Laude)

Latest News!

🏆 Two papers will be presented at NeurIPS 2025 workshops!
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation @ Efficient Reasoning Workshop
ML-IAM: Emulating Integrated Assessment Models With Machine Learning @ Tackling Climate Change with Machine Learning Workshop
Dec 2025

🇺🇸 Starting as a visiting scholar at Carnegie Mellon University (CMU)! Sep 2025

🏆 Two papers accepted to EMNLP 2025 Findings!
Uncovering Factor Level Preferences to Improve Human-Model Alignment | MUG-Eval
Aug 2025

🏆 Three papers accepted to ACL 2025 — two in Findings and one as Main (Oral)!
LLM-as-an-Interviewer | Spotting Out-of-Character Behavior | Diffusion Models Through a Global Lens
May 2025

🏆 The paper "When Tom Eats Kimchi" received an Outstanding Paper Award at the NAACL C3NLP workshop! Congrats to my interns! Mar 2025

Selected Publications (See All)

* denotes equal contribution

  • BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
    Eunsu Kim*, Haneul Yoo*, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh
    Preprint, Under Review
    paper
  • MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
    Seyoung Song*, Seogyeong Jeong*, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh
EMNLP 2025 (Findings)
    paper
  • Diffusion Models Through a Global Lens: Are They Culturally Inclusive?
    Zahra Bayramli*, Ayhan Suleymanzade*, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh
ACL 2025 (Oral), NAACL 2025 C3NLP Workshop
    arXiv TL;DR

Text-to-image diffusion models can now create visually compelling, detailed images from textual prompts, but their ability to accurately represent cultural nuances remains an open question. We introduce the CultDiff benchmark to evaluate whether state-of-the-art diffusion models can generate culturally specific images spanning ten countries. Through a fine-grained analysis of different similarity aspects, we show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented countries and regions, revealing significant disparities in cultural relevance, description fidelity, and realism compared to real-world reference images. Using the collected human evaluations, we develop CultDiff-S, a neural image-image similarity metric that predicts human judgments on real and generated images containing cultural artifacts. Our work highlights the need for more inclusive generative AI systems and equitable dataset representation across a wide range of cultures.

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
    Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
ACL 2025 (Findings)
    arXiv codebase TL;DR

LLM-as-an-Interviewer is an evaluation framework that assesses the capabilities of LLMs through an interview-style process: an LLM acting as the interviewer evaluates other LLMs by providing feedback and asking follow-up questions, enabling a more comprehensive assessment of their capabilities.

  • Uncovering Factor Level Preferences to Improve Human-Model Alignment
    Juhyun Oh*, Eunsu Kim*, Jiseon Kim, Wenda Xu, William Yang Wang, Alice Oh
    EMNLP 2025 (Findings)
    arXiv
  • BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jose Camacho-Collados, Alice Oh
NeurIPS 2024 (Datasets and Benchmarks)
    arXiv Dataset TL;DR

    Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages.

  • CLIcK: Evaluation of Cultural and Linguistic Intelligence in Korean
    Eunsu Kim, Juyoung Suk, Philhoon Oh, Haneul Yoo, James Thorne, Alice Oh
    LREC-COLING 2024
    arXiv Dataset TL;DR

We construct and release CLIcK, a culturally aware evaluation benchmark dataset encompassing 1,995 instances across 11 categories that represent facets of Korean culture, ranging from everyday life to specific subject areas, as well as Korean grammar and linguistics.

Misc

Besides research, I love bread 🥯🥐🥨, table tennis 🏓, and learning new sports. I recently started tennis and yoga!

Template from Jon Barron's wonderful work.