Lovisa Hagström
PhD student at Chalmers University of Technology

I conduct research on Natural Language Processing under the supervision of Richard Johansson in the Data Science and AI division at Chalmers. I'm interested in language model controllability, interpretability and prediction provenance. In my research, I investigate factual knowledge in language models and methods for adding knowledge to a model.


I'm happy to discuss my research and NLP in general so do reach out if you have anything on your mind!

Publications

2025

Language Model Re-rankers are Steered by Lexical Similarities
Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge
arXiv

Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Denitsa Saynova, Lovisa Hagström, Moa Johansson, Richard Johansson, Marco Kuhlmann
arXiv

2024

A Reality Check on Context Utilisation for Retrieval-Augmented Generation
Lovisa Hagström, Sara Vera Marjanović, Haeun Yu, Arnav Arora, Christina Lioma, Maria Maistro, Pepa Atanasova, Isabelle Augenstein
arXiv

2023

The Effect of Scaling, Retrieval Augmentation and Form on the Factual Consistency of Language Models
Lovisa Hagström, Denitsa Saynova, Tobias Norlund, Moa Johansson and Richard Johansson
EMNLP

A Picture is Worth a Thousand Words: Natural Language Processing in Context
Lovisa Hagström
Licentiate thesis at Chalmers University of Technology

2022

How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
Lovisa Hagström and Richard Johansson
COLING

What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge
Lovisa Hagström and Richard Johansson
ACL Student Research Workshop

Can We Use Small Models to Investigate Multimodal Fusion Methods?
Lovisa Hagström, Tobias Norlund and Richard Johansson
CLASP Conference on (Dis)embodiment

2021

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
Tobias Norlund, Lovisa Hagström and Richard Johansson
BlackboxNLP Workshop

Knowledge Distillation for Swedish NER models: A Search for Performance and Efficiency
Lovisa Hagström and Richard Johansson
Nordic Conference on Computational Linguistics (NoDaLiDa)

Teaching

I supervise master's theses at Chalmers and would be happy to discuss potential thesis projects with students or organizations interested in NLP. Simply send me an email!

I also work as a teaching assistant for the courses Applied Mathematical Thinking, Algorithms for Machine Learning and Inference, and Applied Machine Learning at Chalmers.

PhD Dissertation Defense

Title: Language Models and Knowledge Representations.
When: September 2025.
Where: Room TBA, Chalmers University of Technology, Gothenburg, Sweden.

My doctoral studies have mainly focused on the intersection between language models (LMs) and knowledge representations. Given that LMs are increasingly used as simpler interfaces to factual knowledge, we need models that are not only accurate, but also factually consistent, updatable and, ultimately, reliable. The body of work discussed during my PhD defense touches upon these topics in different ways:

📑 #1

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?

BlackboxNLP Workshop 2021

LM hallucinations, i.e. inaccurate and potentially misleading model outputs, may originate from training on incomplete knowledge sources. For example, text data suffers from reporting bias, such that it mainly reports novel rather than trivial information. In this paper, we hypothesise that LMs may form more complete knowledge representations if they are trained on more comprehensive sources of knowledge, such as information from modalities other than text. We investigate whether LMs can acquire and represent knowledge from training on visual information and then express it in the textual modality. We propose a dataset, Memory Colors, for evaluating this transfer capability, and a model, CLIP-BERT, that can be trained on visual information. Our results show that it is possible to leverage training on visual information to complement the knowledge of an LM.
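
As an illustration of the kind of textual probing used to measure this knowledge, the following is a minimal sketch of cloze-style querying with a masked LM. It uses a plain BERT checkpoint from Hugging Face as a stand-in; the actual Memory Colors query templates and the CLIP-BERT model are not reproduced here.

    # Minimal sketch: probe a masked LM for visual commonsense (object colours)
    # via cloze-style queries. Plain BERT is used as a stand-in model; the
    # Memory Colors templates and CLIP-BERT weights are not reproduced here.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    queries = [
        "The color of a banana is [MASK].",
        "The color of snow is [MASK].",
    ]

    for query in queries:
        predictions = fill_mask(query, top_k=3)
        top = ", ".join(f"{p['token_str']} ({p['score']:.2f})" for p in predictions)
        print(f"{query} -> {top}")

A text-only model may rank the correct colour low for objects whose colour is rarely stated explicitly in text, which is exactly the reporting-bias gap the paper targets.
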
📑 #2

The Effect of Scaling, Retrieval Augmentation and Form on the Factual Consistency of Language Models

EMNLP 2023

LMs struggle to provide consistent answers to paraphrased questions asking for factual information. This is problematic, as LMs should be consistent when used as interfaces to factual knowledge. In this paper, we investigate potential approaches for improving the factual consistency of LMs. We find that retrieval augmentation, as represented by the Atlas model, improves model consistency more than model upscaling: an Atlas model with 330M parameters and access to Wikipedia outperforms a Llama model with 65B parameters. We also find that syntactic form impacts consistency: LMs are more likely to fail to provide consistent predictions if the consistent prediction would result in an unidiomatic sentence.
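
To make the notion of factual consistency concrete, here is a simplified sketch (not the paper's experimental setup, which evaluates models such as Atlas and Llama): the same fact is queried with several paraphrases and the answers are compared.

    # Simplified consistency probe: does a masked LM return the same answer
    # for paraphrases of the same factual query? This is only a toy proxy
    # for the consistency evaluation described in the paper.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    paraphrases = [
        "The capital of France is [MASK].",
        "France's capital city is [MASK].",
        "[MASK] is the capital of France.",
    ]

    answers = [fill_mask(p, top_k=1)[0]["token_str"] for p in paraphrases]
    print(answers, "consistent" if len(set(answers)) == 1 else "inconsistent")
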
📑 #3

A Reality Check on Context Utilisation for Retrieval-Augmented Generation

arXiv (under review for ACL 2025)

Retrieval-augmented generation (RAG), i.e. retrieving contexts from an external knowledge source, can be used to alleviate limitations of the parametric knowledge of LMs. However, investigations of how LMs utilise retrieved information of varying complexity in real-world scenarios have been limited to synthetic contexts. In this paper, we introduce DRUID, a dataset with real-world queries and contexts. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complex and diverse context settings found in the real world, yielding misleading results when used to study LM context utilisation. Overall, this paper underscores the need for context utilisation studies aligned with real-world conditions to represent and improve performance in real-world RAG settings.
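
The core measurement behind context utilisation can be sketched as follows: prompt a model for a verdict on a claim with and without a retrieved evidence passage, and check whether the context changes the prediction. The prompt format, example claim and model below are illustrative only and are not taken from DRUID.

    # Illustrative context-utilisation probe for claim verification: compare
    # the model's verdict with and without retrieved evidence. GPT-2 is used
    # only as a small runnable stand-in; the DRUID data and the models
    # evaluated in the paper are not reproduced here.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2", max_new_tokens=5)

    claim = "The Eiffel Tower is located in Berlin."
    evidence = "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."

    without_context = f"Claim: {claim}\nVerdict (Supported or Refuted):"
    with_context = f"Evidence: {evidence}\n{without_context}"

    for prompt in (without_context, with_context):
        completion = generator(prompt)[0]["generated_text"][len(prompt):]
        print(repr(completion.strip()))
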
📑 #4

Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion

arXiv (under review for ACL 2025)

Moving back to the parametric knowledge of LMs, this paper inspects previous interpretations of LMs for fact completion from the perspective of diagnostic data. Previous model interpretations pay little attention to the samples used to elicit and study factual associations in LMs, and it is not known whether samples with different characteristics elicit different factual associations. In this paper, we identify four distinct prediction scenarios for fact completion: generic language modeling, guesswork, heuristics recall and exact fact recall. We introduce a method for creating datasets with samples of each prediction scenario and show how two popular interpretability methods (causal tracing and studies of information flow) yield distinct results for each scenario. Results for the exact fact recall and generic language modeling scenarios confirm previous conclusions about the importance of mid-range MLP sublayers for fact recall, while results for guesswork and heuristics recall indicate a critical role of late MLP sublayers at the last token position. In summary, this paper contributes resources for a more extensive and granular study of fact completion in LMs, together with analyses that provide a more nuanced understanding of how LMs process fact-related queries.
📑 #5

Language Model Re-rankers are Steered by Lexical Similarities

arXiv (under review for ACL 2025)

Focusing again on RAG, we scrutinise the performance of LM re-rankers used to refine retrieval results for RAG. LM re-rankers are more expensive to use than simpler methods based on lexical matching, such as BM25, under the assumption that they are better at identifying helpful contexts. We evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets to measure the benefit of using LM re-rankers compared to simpler methods. Our results show that LM re-rankers struggle to outperform a simple BM25 re-ranker on DRUID. Leveraging a novel separation metric based on BM25 scores, we find that LM re-rankers rely heavily on simple lexical similarities, such that a helpful passage with low lexical similarity to the query is likely to go undetected. Taken together, this paper identifies and explains novel weaknesses of LM re-rankers and points to the need for more adversarial and realistic evaluation datasets to accurately estimate LM re-ranker performance for RAG.
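
For a concrete picture of the lexical-versus-semantic contrast at play, below is a minimal sketch comparing BM25 scores with the scores of an off-the-shelf cross-encoder re-ranker on a toy query. The passages, the model name and the scoring setup are illustrative and do not reproduce the paper's evaluation or its separation metric.

    # Toy comparison of a lexical ranker (BM25) and an LM re-ranker
    # (cross-encoder). The second passage answers the query with little
    # lexical overlap, the kind of case where LM re-rankers are expected
    # to help. Model and passages are illustrative only.
    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder

    query = "Who painted the ceiling of the Sistine Chapel?"
    passages = [
        "Michelangelo painted the ceiling of the Sistine Chapel between 1508 and 1512.",
        "The frescoes on the chapel's vault were created by the Florentine artist Michelangelo.",
        "The Sistine Chapel is a chapel in the Apostolic Palace in Vatican City.",
    ]

    # Lexical ranking: BM25 scores based on word overlap with the query.
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    bm25_scores = bm25.get_scores(query.lower().split())

    # LM re-ranking: a cross-encoder scores each (query, passage) pair jointly.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    lm_scores = reranker.predict([(query, p) for p in passages])

    for passage, b, s in zip(passages, bm25_scores, lm_scores):
        print(f"BM25={b:.2f}  LM={s:.2f}  {passage}")
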