Eunsu Kim
I am a master's student advised by Professor Alice Oh at the School of Computing, KAIST.
My current research focuses on the evaluation of LLMs, specifically:
(i) What to evaluate: Exploring the direction in which LLMs should progress and examining current LLM behavior from that perspective [3, 4].
(ii) How to evaluate: Developing evaluation frameworks and metrics that measure the true capabilities of LLMs [2, 5].
(iii) Interesting behaviors during evaluation: Digging into the behaviors observed in (i) and (ii) [1, 5].
I believe accurate evaluation in the proper context can guide LLMs to develop in meaningful and appropriate directions.
Currently, I am working on developing an LLM evaluation framework that is reliable and interpretable, and on benchmarking the cultural awareness of (V)LMs in various interesting scenarios.
If you would like to collaborate with me or have any questions, feel free to contact me!
My happiest moment so far!
Education
Korea Advanced Institute of Science and Technology (KAIST)
M.S. in Computer Science, Advisor: Alice Oh
2023.09-present
B.S. in Electrical Engineering
2019.03-2023.08
● GPA: 4.02/4.3, Major GPA: 4.15/4.3 (Summa Cum Laude)
Publications
* denotes equal contributions
[9] Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
Juhyun Oh, Eunsu Kim, Alice Oh
ICLR 2025 Workshop on Reasoning and Planning
paper
[8] When Tom Eats Kimchi: Evaluating Cultural Awareness of Multimodal Large Language Models in Cultural Mixture Contexts
Jun Seong Kim, Kyaw Ye Thu, Javad Ismayilzada, Junyeong Park, Eunsu Kim, Huzama Ahmad, Na Min An, James Thorne, Alice Oh
NAACL 2025 C3NLP Workshop (Outstanding Paper)
TL;DR
In a highly globalized world, it is important for multimodal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs, where the elements in the input represent multiple cultures. For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs over-rely on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to mixed cultures, we introduce MIXCUBE, a cross-cultural awareness benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to a 58% difference in accuracy between the original and perturbed cultural settings for low-resource cultures.
[7] Diffusion Models Through a Global Lens: Are They Culturally Inclusive?
Zahra Bayramli, Ayhan Suleymanzade, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh
ACL 2025 (Oral); NAACL 2025 C3NLP Workshop
arXiv
TL;DR
Text-to-image diffusion models have recently enabled the creation of visually compelling, detailed images from textual prompts. However, their ability to accurately represent various cultural nuances remains an open question. In our work, we introduce the CultDiff benchmark, evaluating whether state-of-the-art diffusion models can generate culturally specific images spanning ten countries. Through a fine-grained analysis of different similarity aspects, we show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented regions, revealing significant disparities in cultural relevance, description fidelity, and realism compared to real-world reference images. With the collected human evaluations, we develop a neural image-image similarity metric, CultDiff-S, to predict human judgments on real and generated images with cultural artifacts. Our work highlights the need for more inclusive generative AI systems and equitable dataset representation across a wide range of cultures.
[6] LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
ACL 2025 Findings
arXiv
codebase
TL;DR
LLM-as-an-Interviewer is an evaluation framework that assesses the capabilities of LLMs through an interview-style process. In this approach, the LLM acting as the interviewer evaluates other LLMs by providing feedback and asking follow-up questions, enabling a more comprehensive assessment of their capabilities.
[5] Uncovering Factor Level Preferences to Improve Human-Model Alignment
Juhyun Oh*, Eunsu Kim*, Jiseon Kim, Wenda Xu, William Yang Wang, Alice Oh
Under Review
arXiv
[4] BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jose Camacho-Collados, Alice Oh
NeurIPS Datasets and Benchmarks Track, 2024
arXiv
Dataset
TL;DR
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages.
[3] CLIcK: Evaluation of Cultural and Linguistic Intelligence in Korean
Eunsu Kim, Juyoung Suk, Philhoon Oh, Haneul Yoo, James Thorne, Alice Oh
LREC-COLING 2024
arXiv
Dataset
TL;DR
We construct and release CLIcK, a culturally-aware evaluation benchmark dataset encompassing 1,995 instances across 11 categories representing facets of the Korean culture, ranging from everyday life to specific subject areas, as well as Korean grammar and linguistics.
[2] Multi-FAct: Assessing Multilingual LLMs' Multi-Regional Knowledge using FActScore
Sheikh Shafayat, Eunsu Kim*, Juhyun Oh*, Alice Oh
COLM 2024, Workshop on Global AI Cultures at ICLR 2024
arXiv
TL;DR
We introduce a novel pipeline tailored for evaluating factuality in a multilingual setting. Our approach first adapts FActScore (Min et al., 2023) to accommodate multiple languages, and we release this pipeline as open source.
[1] The Generative AI Paradox in Evaluation: "What It Can Solve, It May Not Evaluate"
Juhyun Oh*, Eunsu Kim*, Inha Cha*, Alice Oh
EACL SRW 2024
arXiv
TL;DR
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset.