I am particularly interested in understanding the mechanisms behind language processing and reasoning, both in language models (interpretability) and in the human mind (cognitive modeling).
ICLR
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal*, Haruki Shirakami*, Bernhard Schölkopf, Abulhair Saparov, and Mrinmaya Sachan
In The Thirteenth International Conference on Learning Representations, Apr 2025
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present MathGAP, a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications of their arithmetic proof structure, enabling systematic studies of easy-to-hard generalization with respect to the complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
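To make the proof-structure specification concrete, here is a minimal illustrative sketch of the idea for the simplest (linear) case. It is not the MathGAP implementation; all names, templates, and parameters are hypothetical.

import random

def generate_linear_problem(depth):
    # Hypothetical generator: chain `depth` additive "transfer" facts onto an
    # initial container fact, producing problem text, a step-by-step
    # chain-of-thought trace, and the gold answer.
    agents = ["Alice", "Bob", "Carol", "Dan", "Eve"]
    total = random.randint(2, 10)
    sentences = [f"{agents[0]} has {total} apples."]
    trace = []
    for i in range(depth):
        delta = random.randint(1, 10)
        giver = agents[1 + i % (len(agents) - 1)]  # cycle through the other agents
        sentences.append(f"{giver} gives {agents[0]} {delta} more apples.")
        trace.append(f"{total} + {delta} = {total + delta}")
        total += delta
    sentences.append(f"How many apples does {agents[0]} have now?")
    return " ".join(sentences), trace, total

problem, trace, answer = generate_linear_problem(depth=4)

Deeper proofs correspond to longer chains; nonlinear proof structures would instead let one derived fact feed several later inference steps.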
@inproceedings{opedal2025mathgap,
  title     = {Math{GAP}: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs},
  author    = {Opedal*, Andreas and Shirakami*, Haruki and Schölkopf, Bernhard and Saparov, Abulhair and Sachan, Mrinmaya},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  eprint    = {2410.13502},
  archiveprefix = {arXiv},
  primaryclass  = {cs.LG},
  url       = {https://arxiv.org/abs/2410.13502},
}
EMNLP
On the Role of Context in Reading Time Prediction
Andreas Opedal, Eleanor Chodroff, Ryan Cotterell, and Ethan Wilcox
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
We present a new perspective on how readers integrate context during real-time language comprehension. Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit (e.g., a word) is an affine function of its in-context information content. We first observe that surprisal is only one out of many potential ways that a contextual predictor can be derived from a language model. Another is the pointwise mutual information (PMI) between a unit and its context, which turns out to yield the same predictive power as surprisal when controlling for unigram frequency. Moreover, both PMI and surprisal are correlated with frequency, which means that neither contains information about context alone. In response, we propose a technique in which we project surprisal onto the orthogonal complement of frequency, yielding a new contextual predictor that is uncorrelated with frequency. Our experiments show that the proportion of variance in reading times explained by context is substantially smaller when context is represented by the orthogonalized predictor. From an interpretability standpoint, this indicates that previous studies may have overstated the role that context plays in predicting reading times.
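The orthogonalization step can be sketched in a few lines of numpy (our illustration, not the paper's code): regress surprisal on log unigram frequency and keep the residuals, which are uncorrelated with frequency by construction.

import numpy as np

def orthogonalize(surprisal, log_freq):
    # Project surprisal onto the orthogonal complement of frequency:
    # fit surprisal ~ intercept + log_freq by least squares and return
    # the residuals, which have zero sample correlation with log_freq.
    X = np.column_stack([np.ones_like(log_freq), log_freq])
    beta, *_ = np.linalg.lstsq(X, surprisal, rcond=None)
    return surprisal - X @ beta

The residual predictor can then be entered into a reading-time regression in place of raw surprisal, isolating the contribution of context from that of frequency.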
@inproceedings{opedal2024role,
  title     = {On the Role of Context in Reading Time Prediction},
  author    = {Opedal, Andreas and Chodroff, Eleanor and Cotterell, Ryan and Wilcox, Ethan},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  month     = nov,
  year      = {2024},
  publisher = {Association for Computational Linguistics},
  address   = {Miami},
  eprint    = {2409.08160},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  url       = {https://arxiv.org/abs/2409.08160},
}
ICML
Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?
Andreas Opedal*, Alessandro Stolfo*, Haruki Shirakami, Ying Jiao, Ryan Cotterell, Bernhard Schölkopf, Abulhair Saparov, and Mrinmaya Sachan
In Forty-first International Conference on Machine Learning, Jul 2024
There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning-science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning, and solution execution. We construct tests for each step in order to understand whether current LLMs display the same cognitive biases as children do. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.
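To illustrate the kind of controlled contrast such tests rely on, consider the consistency effect from the learning-science literature, in which comparison problems are harder when the relational keyword conflicts with the required operation. A hypothetical minimal-pair template (the paper's neuro-symbolic generator is considerably richer) might look like:

def comparison_problem(a, b, consistent=True):
    # Same quantities, but the keyword "more" either matches the required
    # operation (addition) or conflicts with it (subtraction).
    if consistent:
        text = (f"Alice has {a} marbles. Bob has {b} more marbles than Alice. "
                f"How many marbles does Bob have?")
        answer = a + b
    else:
        # Assumes a > b so the subtraction yields a valid count.
        text = (f"Alice has {a} marbles. Alice has {b} more marbles than Bob. "
                f"How many marbles does Bob have?")
        answer = a - b
    return text, answer

Holding the arithmetic fixed while varying only the phrasing is what allows the bias to be attributed to the text-comprehension step rather than to calculation.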
@inproceedings{opedal2024language,
  title     = {Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?},
  author    = {Opedal*, Andreas and Stolfo*, Alessandro and Shirakami, Haruki and Jiao, Ying and Cotterell, Ryan and Schölkopf, Bernhard and Saparov, Abulhair and Sachan, Mrinmaya},
  booktitle = {Forty-first International Conference on Machine Learning},
  month     = jul,
  year      = {2024},
  url       = {https://arxiv.org/abs/2401.18070},
}
ACL
Efficient Semiring-Weighted Earley Parsing
Andreas Opedal, Ran Zmigrod, Tim Vieira, Ryan Cotterell, and Jason Eisner
In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023
We present Earley’s (1970) context-free parsing algorithm as a deduction system, incorporating various known and new speed-ups. In particular, our presentation supports a known worst-case runtime improvement from Earley’s (1970) O(N³|G||R|), which is unworkable for the large grammars that arise in natural language processing, to O(N³|G|), which matches the complexity of CKY on a binarized version of the grammar G. Here N is the length of the sentence, |R| is the number of productions in G, and |G| is the total length of those productions. We also provide a version that achieves runtime of O(N³|M|) with |M| ≤ |G| when the grammar is represented compactly as a single finite-state automaton M (this is partly novel). We carefully treat the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate the possibility of deduction cycles, and further generalize Stolcke’s method to compute the weights of sentence prefixes. We also provide implementation details for efficient execution, ensuring that on a preprocessed grammar, the semiring-weighted versions of our methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.
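For readers unfamiliar with the deduction-system view, the three classic unweighted Earley rules are given below in standard notation (ours, not a reproduction of the paper's presentation, whose contribution lies in the speed-ups and the semiring-weighted generalization layered on top). An item [i, j, A → α • β] asserts that α derives the input from position i to position j.

\[
\frac{[i,\,j,\; A \to \alpha \bullet B \beta]}{[j,\,j,\; B \to \bullet\, \gamma]}
\;\;(B \to \gamma \in R) \quad \text{(predict)}
\]
\[
\frac{[i,\,j,\; A \to \alpha \bullet w_{j+1} \beta]}{[i,\,j{+}1,\; A \to \alpha\, w_{j+1} \bullet \beta]}
\quad \text{(scan)}
\]
\[
\frac{[i,\,k,\; A \to \alpha \bullet B \beta] \qquad [k,\,j,\; B \to \gamma\, \bullet]}{[i,\,j,\; A \to \alpha B \bullet \beta]}
\quad \text{(complete)}
\]

In the semiring-weighted generalization, each item carries a weight and complete combines antecedent weights multiplicatively, which is why deduction cycles must first be eliminated by Stolcke-style preprocessing.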
@inproceedings{opedal-etal-2023-efficient,
  title     = {Efficient Semiring-Weighted {E}arley Parsing},
  author    = {Opedal, Andreas and Zmigrod, Ran and Vieira, Tim and Cotterell, Ryan and Eisner, Jason},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = jul,
  year      = {2023},
  url       = {https://aclanthology.org/2023.acl-long.204},
  doi       = {10.18653/v1/2023.acl-long.204},
}