Akari Asai

Research Scientist @ Allen Institute for AI
Incoming Assistant Professor @ Carnegie Mellon University


I am an incoming Assistant Professor at the Language Technologies Institute, Carnegie Mellon University (starting Fall 2026), with an affiliate appointment in the Machine Learning Department, and a Research Scientist on the OLMo team at the Allen Institute for AI (2025-2026).

I am hiring 2-3 Ph.D. students at CMU in the 2025-2026 application cycle. Please check out FAQ for more info.

I completed my Ph.D. in NLP at the Paul G. Allen School of Computer Science & Engineering, University of Washington, where I was fortunate to be advised by Prof. Hannaneh Hajishirzi. I also spent time at Meta AI Research as a visiting student researcher under the supervision of Dr. Wen-tau Yih. My Ph.D. pioneered retrieval-augmented LMs: LMs that integrate large-scale text data via retrieval during inference (thesis (PDF), video (YouTube)).

Prior to joining UW, I obtained a B.E. in Electrical Engineering and Computer Science from The University of Tokyo, Japan.

Current Research Focus: My research focuses on natural language processing and machine learning, with a particular emphasis on large language models (LLMs). I’m interested in building LMs and agents that are more reliable, modular, and open, aimed at real-world impact in science, code, and global information access.

  • Developing Augmented LMs: We design, train, and deploy augmented LMs and agents that collaborate with complementary modules (retrieval, tool use, multi-LM coordination, and more), moving beyond the limits of scaling a single monolithic model, and we introduce new training and inference algorithms for these methods. Recent work includes advanced retrieval-augmented LMs such as Self-RAG, and DR Tulu, the first end-to-end open deep research agent for open-ended, long-form tasks, trained with reinforcement learning with evolving rubrics. We also tackle system-level challenges in scalability and efficiency (e.g., MassiveDS, BPR) and extend these capabilities to multimodal settings (Pangea, MM-RAG NeurIPS Competition).

  • Understanding and Mitigating Failure Modes of LMs: We systematically investigate where and why LMs fail, including hallucinations, copyright infringement, and unreliable reasoning, and we design mechanisms to improve their reliability and safety. Projects such as When Not to Trust LMs, the copyright–utility trade-offs studied in CopyBench, and the capability–hallucination trade-off analyses in Binary RAR exemplify our efforts to make LMs more trustworthy and robust.

  • Deploying Augmented LMs in High-Impact Domains: We apply our methods to real-world challenges that demand factuality, transparency, and accessibility. Examples include AI for science (OpenScholar, used by tens of thousands of scientists for literature synthesis), AI for code (CodeRAGBench), and AI for linguistic equity (XOR QA, AfriQA, CORA), broadening global access to reliable information.

Selected recognitions include MIT Technology Review 35 Innovators Under 35 (2025 Global & 2024 Japan), Forbes 30 Under 30 Asia in Science 2025, EECS Rising Stars 2022, and the IBM Global Ph.D. Fellows 2022-202. Our work has been covered by Forbes, Nature News, and MIT Technology Review, and is used in libraries such as Hugging Face, LlamaIndex, and LangChain. Most recently, the Ai2 OpenScholar Public Demo has supported 50k researchers across scientific disciplines in synthesizing literature.

Public office hours and application materials: To help lower barriers to starting research, pursuing a Ph.D. in this field, or navigating the job search, I host weekly office hours every Friday, open to all. Feel free to sign up (via Google Calendar!).

Inspired by many wonderful friends who have shared their own materials to promote equity and access, I’ve also made my past application materials available:

news

Nov 19, 2025 Super excited to share DR Tulu - an open, end-to-end trained deep research agent for long-form, real-world research tasks. We introduce a new RL recipe, Reinforcement Learning with Evolving Rubrics (RLER), to tackle the inherently hard-to-verify nature of deep research. Check out our paper and a static demo. A live demo is coming soon so please stay tuned!
Oct 02, 2025 I spoke with the Delta Institute Podcast about my path to CS/NLP, recent progress in augmented LMs & agents, and remaining challenges.
Sep 30, 2025 I gave an invited lecture on retrieval and retrieval-augmented LMs at CMU Advanced NLP and LLMs! The slides and lecture video are publicly available.
Sep 09, 2025 OpenScholar has been highlighted in Nature News - “Can researchers stop AI making up citations?”.
Sep 08, 2025 Honored to be named to the MIT Technology Review Innovators Under 35!

selected publications

See my full publications at the publication page!

  1. DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
    Rulin Shao*, Akari Asai*, Shannon Zejiang Shen*, Hamish Ivison*, Varsha Kishore, and 16 more authors
    Preprint, 2025
  2. OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs
    Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, and 20 more authors
    Preprint (Under Review), 2025
  3. Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
    Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, and 3 more authors
    In Advances in Neural Information Processing Systems (NeurIPS), 2024
  4. Fine-grained Hallucination Detection and Editing for Language Models
    Abhika Mishra, Akari Asai, Yizhong Wang, Vidhisha Balachandran, Graham Neubig, and 2 more authors
    In Conference on Language Modeling (COLM), 2024
  5. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi
    In The Twelfth International Conference on Learning Representations (ICLR; Oral, Top 1%), 2024
  6. Reliable, Adaptable, and Attributable Language Models with Retrieval
    Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer, and 2 more authors
    arXiv preprint, 2024
  7. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories
    Alex Mallen*, Akari Asai*, Victor Zhong, Rajarshi Das, Daniel Khashabi, and 1 more author
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL; Oral, Best Video Paper Award – Most Viewed), 2023
  8. Task-aware Retrieval with Instructions
    Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, and 3 more authors
    In Findings of the Association for Computational Linguistics: ACL 2023 (Findings Spotlight), 2023
  9. Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks
    Akari Asai, Matt Gardner, and Hannaneh Hajishirzi
    In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL; Oral), 2022
  10. One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval
    Akari Asai, Xinyan Yu, Jungo Kasai, and Hanna Hajishirzi
    In Advances in Neural Information Processing Systems (NeurIPS), 2021
  11. XOR QA: Cross-lingual Open-Retrieval Question Answering
    Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and 1 more author
    In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL; Oral), 2021
  12. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
    Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto
    In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
  13. Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
    Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong
    In International Conference on Learning Representations (ICLR), 2020