📚 Simon's Papers of the Week

2026
June
Week 23 (2 papers)
Executive Summary
This paper introduces SKILLOPT, a novel framework for optimizing agent skills, which are defined as portable, natural-language artifacts that encapsulate procedural knowledge for large language model (LLM) agents. The core problem addressed is the difficulty in adapting LLM agents to specific domains or tasks without resorting to expensive or unavailable weight updates. SKILLOPT frames skill optimization as a text-space optimization problem, treating the skill document as an external, trainable state for a frozen agent model. The mechanism involves a separate 'optimizer' LLM that analyzes successful and failed agent rollouts (trajectories) and proposes bounded add/delete/replace edits to the skill document. These edits are then filtered and validated against a held-out selection set, ensuring that only improvements are accepted. This approach draws an analogy to weight-space optimization in deep learning, applying concepts like learning rates (via an edit budget), validation gates, and momentum (via an epoch-wise slow/meta update) to the text-editing process. The key insight is that by treating skills as optimizable text artifacts and employing controlled, iterative refinement, SKILLOPT can achieve significant performance gains across diverse benchmarks and models without modifying the underlying LLM weights. This method decouples the agent's core capabilities from its procedural adaptation, allowing for more efficient and robust domain specialization. The system's stability is enhanced by a rejected-edit buffer, which uses failed edits as negative feedback, and a slow/meta update mechanism that captures longer-horizon trends. SKILLOPT's effectiveness is demonstrated by its superior performance across numerous benchmarks and its ability to transfer learned skills across different models and execution environments, highlighting the reusability and inspectability of the optimized skill artifacts.
Method
Imagine you have a robot that can do tasks, but it needs a set of instructions (a 'skill') to do them well. SKILLOPT is like a coach for that robot's instructions. The coach watches the robot try tasks and notes what works and what doesn't. Then, the coach suggests small, specific changes to the instruction manual, like adding a new step or rephrasing an old one. Crucially, the coach only accepts changes that demonstrably make the robot better at the tasks, using a separate practice set to check. This way, the instruction manual gets gradually improved without changing the robot itself, making it more specialized and effective.
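The propose, gate, and accept loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the skill is modeled as a plain list of lines, `toy_propose` is a hypothetical stand-in for the optimizer LLM, and `toy_score` is a hypothetical stand-in for running the frozen agent on the held-out selection set. The slow/meta (momentum-like) update is left out.

```python
from typing import Callable, List, Tuple

# Hypothetical edit format: (operation, line index, text).
Edit = Tuple[str, int, str]

def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a copy of the skill with one add/delete/replace edit applied."""
    op, idx, text = edit
    new = list(skill)
    if op == "add":
        new.insert(idx, text)
    elif op == "delete":
        del new[idx]
    elif op == "replace":
        new[idx] = text
    else:
        raise ValueError(f"unknown op: {op}")
    return new

def optimize_skill(
    skill: List[str],
    propose_edits: Callable[[List[str], List[Edit]], List[Edit]],
    score: Callable[[List[str]], float],
    steps: int = 2,
    edit_budget: int = 2,
):
    """Accept a batch of edits only if the held-out score strictly improves."""
    best, best_score = list(skill), score(skill)
    rejected: List[Edit] = []  # rejected-edit buffer, fed back as negative signal
    for _ in range(steps):
        # The edit budget bounds how far one step can move the document.
        edits = propose_edits(best, rejected)[:edit_budget]
        candidate = best
        for edit in edits:
            candidate = apply_edit(candidate, edit)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # validation gate
            best, best_score = candidate, candidate_score
        else:
            rejected.extend(edits)
    return best, best_score, rejected

# Toy stand-ins (assumptions for illustration only).
def toy_score(skill: List[str]) -> float:
    return sum("check" in line for line in skill) - sum("guess" in line for line in skill)

def toy_propose(skill: List[str], rejected: List[Edit]) -> List[Edit]:
    bad = ("add", 0, "guess the answer")
    if bad not in rejected:
        return [bad]
    return [("add", len(skill), "check the output")]

best, best_score, rejected = optimize_skill(["read the task"], toy_propose, toy_score)
print(best, best_score, rejected)
```

In this toy run the first proposal lowers the score and lands in the rejected buffer; the second proposal, made with that buffer visible, passes the gate.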
Executive Summary
This paper investigates the phenomenon that larger neural network models can learn tasks that smaller models fail to, even with abundant training data. The authors propose a data-centric explanation rooted in the interplay of model capacity, task frequency, and gradient interference. Their core argument is that power-law scaling inherently implies that certain parts of a data distribution, corresponding to rare or complex tasks, are only learnable by larger models. This is not due to inherent expressivity limitations of smaller models, but rather due to a resource allocation bottleneck during training. Specifically, smaller models tend to allocate their limited capacity to high-frequency or low-complexity tasks, leading to gradient updates that overwrite or interfere with the learning of rare-task features. Larger models, with their greater capacity, can circumvent this by allocating sufficient resources to common tasks, thereby weakening their gradient updates and allowing rare-task features to accumulate without being overwritten. This reduced interference mechanism is crucial for retaining information from infrequent observations. The authors validate this hypothesis through synthetic experiments with multi-task regression and by analyzing large language models (OLMo). Their findings indicate that larger models indeed learn infrequent and complex tasks better, exhibit richer task feature representation, and experience less gradient interference. This work shifts the focus from mere model expressivity to the dynamics of learning under resource constraints and data distribution, offering a mechanistic understanding of why scaling parameters is effective in practice and informing strategies for data mixture design and model sizing.
Method
Imagine you have a limited number of study slots (model capacity) and many subjects (tasks) to learn. Some subjects are very common (high frequency), and others are rare. Smaller models, with few slots, tend to fill them with the common subjects, so whatever they pick up about a rare subject gets overwritten the next time a common one comes along. Larger models have more slots, so they can cover the common subjects and still have room left over. This leaves them better able to learn and retain the rare subjects, even when those are seen only occasionally. The study uses simplified math problems and then real language models to show this effect.
Week 22 (1 paper)
Executive Summary
This research addresses the critical, yet under-examined, issue of agent drift in multi-agent Large Language Model (LLM) systems. Agent drift is defined as the progressive degradation of an agent's behavior, decision quality, and inter-agent coherence over extended interaction sequences, distinct from traditional software failures. The authors propose a theoretical framework encompassing three primary manifestations: semantic drift (deviation from original intent), coordination drift (breakdown in consensus), and behavioral drift (emergence of unintended strategies). To quantify this phenomenon, they introduce the Agent Stability Index (ASI), a composite metric derived from 12 behavioral dimensions, including response consistency, tool usage patterns, and inter-agent agreement. The ASI framework provides a systematic method for monitoring drift in production systems. Through extensive simulations across enterprise automation, financial analysis, and compliance monitoring domains, the study demonstrates that agent drift can emerge relatively early (median 73 interactions) and significantly impacts system performance, leading to substantial reductions in task success rates (42%) and increases in human intervention (216%). The research also identifies key architectural influences, such as hierarchy depth and memory systems, that affect drift susceptibility. Furthermore, three mitigation strategies (Episodic Memory Consolidation, Drift-Aware Routing, and Adaptive Behavioral Anchoring) are proposed and theoretically validated, showing projected drift reduction of up to 70.4% for individual strategies and 81.5% when combined. This work lays foundational methodology for understanding, measuring, and mitigating agent drift, crucial for the reliable and safe deployment of increasingly autonomous multi-agent LLM systems.
Method
Imagine you have a team of AI assistants working together on a long project. This study checks if their performance gets worse over time, like a team getting tired or forgetting instructions. They created a way to measure this "drift" using a score called the Agent Stability Index (ASI), which looks at many things like how consistent their answers are and how well they cooperate. They ran many simulated projects to see how often this drift happens and how bad it gets. Then, they tested three ways to fix it: giving the AI team a better memory, making them smarter about who to ask for help, and reminding them of their original instructions. They found that these fixes could significantly improve the AI team's performance over long periods.
Week 21 (8 papers)
Executive Summary
This paper introduces RECFORMER, a novel framework for sequential recommendation that shifts from traditional item ID-based representations to language-based representations. The core problem addressed is the limitation of ID-based methods in handling cold-start items and transferring knowledge across domains, primarily due to their inability to generalize beyond seen item IDs. RECFORMER tackles this by treating each item as a 'sentence' derived from its key-value textual attributes (e.g., title, brand, color). This formulation allows the model to leverage the inherent semantic richness of text, enabling it to understand item similarities and user preferences in a more generalizable manner. The architecture is based on a bi-directional Transformer, specifically a Longformer variant, adapted to process sequences of these item 'sentences'. A key innovation is the embedding layer, which eschews traditional item embeddings in favor of token, position, type, and item position embeddings, allowing the model to learn item representations directly from text and their sequential context. To effectively train this model, RECFORMER employs a two-stage learning framework: pre-training with Masked Language Modeling (MLM) and an item-item contrastive task to build a strong foundation of language understanding and recommendation capabilities, followed by a two-stage finetuning process. The finetuning stage is crucial for adapting the general language representations to specific downstream recommendation tasks, particularly in low-resource or cold-start scenarios. The proposed method's strength lies in its ability to learn transferable knowledge, as demonstrated by its superior performance in zero-shot and cold-start settings, suggesting that language representations can indeed bridge the gap between different recommendation domains and unseen items without requiring explicit item IDs or extensive domain-specific training data.
Method
Imagine recommending products to someone online. Instead of just using a product's ID number, this system looks at its description, like 'red running shoes, size 10, Nike brand.' It turns these descriptions into 'sentences' for each product. Then, it treats a user's shopping history as a series of these product 'sentences.' The system uses a smart language model, similar to how chatbots understand text, to learn what the user likes from their history. It can even learn about new products it hasn't seen before just by reading their descriptions, making recommendations more accurate, especially for new or less popular items.
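To make the "item as a sentence" idea concrete, here is a minimal sketch of flattening key-value attributes into text. The function names, the attribute keys, and the choice to keep only the most recent items are assumptions for illustration; the paper's tokenizer and embedding layers are not modeled.

```python
from typing import Dict, List

def item_to_sentence(attrs: Dict[str, str]) -> str:
    """Flatten key-value attributes into one 'sentence' (keys kept as words)."""
    return " ".join(f"{key} {value}" for key, value in attrs.items())

def history_to_sentences(items: List[Dict[str, str]], max_items: int = 50) -> List[str]:
    """Turn an interaction history into item sentences, keeping the latest ones."""
    return [item_to_sentence(attrs) for attrs in items[-max_items:]]

history = [
    {"title": "running shoes", "brand": "Nike", "color": "red"},
    {"title": "sports socks", "brand": "Adidas", "color": "white"},
]
sentences = history_to_sentences(history)
print(sentences)
```

Because an unseen item still has a title, brand, and color, it gets a usable sentence with no item ID lookup, which is the property the summary credits for the cold-start gains.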
Executive Summary
This paper addresses a critical challenge in short-video recommender systems: the accurate prediction of watch time, a regression task plagued by a highly long-tailed distribution of labels. Standard regression models often exhibit a systematic bias where errors in predicting short watch times are counteracted by errors in predicting long watch times, leading to a false sense of global calibration. The proposed DADF framework offers a novel, second-stage multiplicative residual correction mechanism that operates on top of an existing deployed predictor, avoiding the need for a complete model replacement. DADF's core innovation lies in its three complementary components designed to tackle the distributional bias. The first component, a dynamic distribution-aware transformation, aims to stabilize the targets for correction, particularly for the long tail of the watch-time distribution. This transformation likely involves a non-linear mapping that compresses extreme values or expands denser regions, making the residual learning task more tractable. The second component, a debias-factor-aware module, explicitly models the heterogeneous residual patterns. It leverages inference-time observable factors, most notably video duration, to predict the multiplicative correction factor. This acknowledges that the nature of prediction errors can vary significantly based on inherent video characteristics, allowing for more localized and accurate adjustments. The third component, a multi-label-aware module, capitalizes on auxiliary prediction signals from engagement heads (e.g., likes, shares, comments). By integrating these related signals, DADF can infer a more robust correction, assuming that user engagement patterns are correlated with watch time and its prediction errors. The framework's strength lies in its modularity and its focus on correcting residual errors rather than retraining the entire predictor. This makes it a practical, plug-in solution for real-world systems. 
The empirical validation on both public benchmarks and a large-scale industrial system demonstrates significant improvements in pointwise accuracy (reduced MAE) and ranking quality (WUAUC gain), culminating in a measurable lift in user engagement (average time spent per device). The key insight is that addressing the *distributional* nature of prediction errors, rather than just the average error, is crucial for improving recommender system performance, especially in data-rich, long-tailed scenarios. DADF provides a principled way to achieve this by learning to adapt predictions based on observable data characteristics and related engagement signals.
Method
Imagine you have a system that guesses how long someone will watch a video, but it's not very good, especially for very short or very long videos. DADF is like a smart assistant that looks at the original guess and makes a small adjustment to make it better. It does this by considering how long the video is and other things people do with videos, like liking or sharing them. It learns how to adjust the guess based on these factors. This assistant doesn't replace the original guesser but works alongside it to improve the final prediction, making recommendations more accurate and keeping users engaged longer.
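The point about errors cancelling out globally while persisting locally can be shown with a toy multiplicative correction. This sketch assumes a much simpler corrector than DADF: one constant factor per video-duration bucket, fit as the ratio of actual to predicted watch time. The bucket edge, the numbers, and the function names are made up for illustration, and the factors are fit and evaluated on the same toy data for brevity.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List

def fit_factors(preds: List[float], actuals: List[float],
                durations: List[float], edges: List[float]) -> Dict[int, float]:
    """Per duration bucket, factor = sum(actual) / sum(predicted)."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for pred, actual, duration in zip(preds, actuals, durations):
        bucket = bisect_right(edges, duration)
        totals[bucket][0] += actual
        totals[bucket][1] += pred
    return {b: (a / p if p > 0 else 1.0) for b, (a, p) in totals.items()}

def correct(pred: float, duration: float,
            factors: Dict[int, float], edges: List[float]) -> float:
    """Second-stage correction: base prediction times a bucket-specific factor."""
    return pred * factors.get(bisect_right(edges, duration), 1.0)

def mae(preds: List[float], actuals: List[float]) -> float:
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(preds)

# Base predictor over-predicts short videos and under-predicts long ones.
preds = [10.0, 10.0, 40.0, 40.0]
actuals = [5.0, 5.0, 45.0, 45.0]
durations = [8.0, 12.0, 90.0, 120.0]
edges = [30.0]  # one split: short (< 30 s) vs long

factors = fit_factors(preds, actuals, durations, edges)
corrected = [correct(p, d, factors, edges) for p, d in zip(preds, durations)]
print(sum(preds), sum(actuals))      # totals match: looks calibrated globally
print(mae(preds, actuals), mae(corrected, actuals))
```

The totals agree, so an average-level check would pass, yet every single prediction is off by 5 seconds until the per-bucket factor is applied.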
Executive Summary
This research investigates the latent ideological profiling capabilities of a large-scale social media recommender system, specifically focusing on the platform 'X'. The study moves beyond traditional analyses of recommendation quality (e.g., diversity, bias) to explore how the underlying AI models learn, represent, and process sensitive user attributes, particularly political ideology. The core methodological innovation lies in leveraging a unique data donation program where 682 volunteers provided over 2.5 million friend recommendations received over a year. By combining this empirical data with publicly available information on the recommender's architecture, the researchers were able to infer the positions of users within the system's embedding space. This spatial representation was then mapped to real-world political ideology using established ideology scaling techniques calibrated with political survey data. The findings reveal a striking correlation between the recommender's user embeddings and users' Left-Right political positions, quantified by a Pearson correlation coefficient of 0.887 (p < 0.0001). Crucially, this correlation could not be attributed to observable socio-demographic factors like age or gender, suggesting that the recommender system implicitly learns and encodes political leanings. This work highlights a significant, unintended consequence of recommender systems: their capacity for passive, algorithmic profiling of deeply held beliefs. The implications extend to privacy regulations, blurring the lines between active and passive data collection and raising questions about consent and data usage. The study also proposes novel constrained recommendation methods designed to mitigate this ideological profiling while preserving recommendation relevance, offering a potential pathway for privacy-preserving AI in social media.
Method
Imagine a social media platform suggests friends to you. Researchers collected millions of these suggestions given to hundreds of volunteers over a year. They then used information about how the platform's suggestion system works to figure out where each person was placed in the system's internal 'map'. By comparing this map position to people's known political views (from surveys), they found that the map strongly reflected whether someone was considered 'left' or 'right' politically. This happened even when ignoring basic things like age or gender, suggesting the system learned political leanings on its own.
Executive Summary
The paper addresses a critical gap in applying the scalability principles, or 'scaling laws,' observed in Large Language Models (LLMs) to industrial-scale recommendation systems (RecSys). Traditional End-to-End Generative Recommendation (E2E-GR) methods, while leveraging autoregressive sequence modeling, often compromise essential industrial requirements such as mature feature engineering, modularity, and production-grade efficiency. This trade-off leads to limitations in practical deployment, including performance degradation on discriminative tasks, prohibitive computational costs for real-time serving, lack of flexibility for evolving business needs, and poor compatibility with existing production systems. To bridge this gap, the authors propose a novel three-step paradigm centered around a Large User Model (LUM). This paradigm decouples the generative pre-training of LUM from its downstream application in discriminative RecSys tasks. Step 1, 'Knowledge Construction,' involves pre-training LUM using a Transformer architecture on user behavior sequences (UBS) with a 'next-condition-item prediction' objective. This objective is crucial as it reformulates the standard 'next-item prediction' by introducing condition tokens alongside item tokens, allowing the model to learn context-aware representations. This step aims to capture the complex joint probability distribution of user behavior, akin to LLMs learning world knowledge. Step 2, 'Knowledge Querying,' focuses on extracting relevant user interests and collaborative information from the pre-trained LUM. This is achieved by formulating queries using condition tokens, which act as structured prompts to steer the generative model towards task-specific outputs. This mechanism is analogous to prompt engineering in LLMs, enabling the model to generate contextually relevant insights. 
The key innovation here is the 'next-condition-item prediction' task, which allows for dynamic control over the generation process based on specific conditions (e.g., scenario, query intent, category). Step 3, 'Knowledge Utilization,' integrates the extracted knowledge from LUM into traditional Deep Learning-based Recommendation Models (DLRMs). This integration is designed to preserve the strengths of DLRMs, such as feature engineering and modularity, while enhancing their predictive accuracy and flexibility. The decoupled nature of the three steps is central to LUM's success. It allows for parallelization of training and inference, pre-computation of knowledge queries, and caching of results, thereby mitigating the strict latency and throughput constraints of industrial RecSys. This separation ensures that only the final DLRM component needs to meet real-time serving requirements, while the computationally intensive LUM pre-training and querying can be performed offline or asynchronously. The empirical results demonstrate that LUM not only outperforms existing DLRMs and E2E-GRs but also exhibits a clear scaling law, improving performance consistently with model size, making it a practical and scalable solution for industrial recommendation systems.
Method
Imagine you have a super-smart assistant (the Large User Model or LUM) that has read tons of information about what people like and how they behave online. First, we train this assistant to understand user preferences by looking at their past actions, but we teach it to pay attention to specific contexts or 'conditions' (like what they are searching for right now). Then, when we need to recommend something, we ask the assistant specific questions (using these 'conditions') to get tailored suggestions. Finally, we take these tailored suggestions and use them to make our existing recommendation system (like a shop's product display) even better and more accurate, without changing how the shop itself works too much.
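One way to picture the "next-condition-item prediction" objective is as an interleaved token sequence. The sketch below assumes a layout of alternating condition and item tokens and uses hypothetical helper names; it only builds training pairs and a query prefix, and does not model the Transformer or the downstream DLRM.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, value), kind is "cond" or "item"

def build_sequence(events: List[Tuple[str, str]]) -> List[Token]:
    """Interleave condition and item tokens: c1, i1, c2, i2, ..."""
    seq: List[Token] = []
    for condition, item in events:
        seq.append(("cond", condition))
        seq.append(("item", item))
    return seq

def training_pairs(seq: List[Token]) -> List[Tuple[List[Token], str]]:
    """Each pair is (prefix ending in a condition token, item to predict)."""
    return [(seq[:i], seq[i][1]) for i in range(1, len(seq), 2)]

def build_query(events: List[Tuple[str, str]], condition: str) -> List[Token]:
    """Knowledge querying: append a condition token as a structured prompt."""
    return build_sequence(events) + [("cond", condition)]

events = [("search:shoes", "item_17"), ("homepage", "item_42")]
pairs = training_pairs(build_sequence(events))
query = build_query(events, "cart")
print(pairs[1])
print(query[-1])
```

Since a query is just a history plus one trailing condition token, the results for common conditions can be computed offline and cached, which matches the decoupling the summary describes.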
Executive Summary
This paper introduces SynthID-Text, a novel generative watermarking scheme designed for large language models (LLMs) to identify synthetic text while preserving output quality and maintaining computational efficiency. The core problem addressed is the increasing difficulty in distinguishing LLM-generated text from human-written text, which poses risks to information ecosystems. Existing methods, such as retrieval-based approaches and post-hoc detectors, suffer from limitations in scalability, privacy, computational cost, and robustness to out-of-domain data. SynthID-Text tackles this by embedding a statistical signature directly into the text generation process, specifically by modifying the token sampling procedure. This generative approach avoids the need for LLM access during detection and is computationally efficient. The key innovation is the 'Tournament sampling' algorithm, which uses a multi-stage, tournament-like process to select tokens. This process involves scoring candidate tokens with a set of pseudorandom functions (g-values) derived from a random seed, and iteratively selecting the highest-scoring tokens. This mechanism embeds a subtle statistical bias that can be detected later without the LLM. The scheme is designed to be non-distortionary, meaning it does not degrade text quality, and has been empirically validated through large-scale user feedback from nearly 20 million Gemini responses. Furthermore, SynthID-Text integrates with speculative sampling, a common LLM efficiency technique, to ensure negligible additional latency in production systems. The authors demonstrate that SynthID-Text offers improved detectability over existing generative watermarking methods like Gumbel sampling and Soft Red List, particularly in lower-entropy generation settings. The system has been successfully deployed in Google's Gemini chatbots, marking a significant step towards responsible LLM deployment.
Method
Imagine you're writing a story, and you want to secretly mark your words so people know you wrote them. SynthID-Text is like a special pen that subtly changes how you pick words. When the AI writes, instead of just picking the most obvious next word, it uses a secret key and a fun 'tournament' game. This game helps it pick words that are slightly unusual but still make sense, creating a hidden pattern. Later, someone can use the same secret key to check for this pattern in the text. If the pattern is there, they know the AI wrote it, but the story still reads naturally because the changes are very subtle.
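A toy version of tournament sampling helps show where the statistical bias comes from. This sketch makes several assumptions that are not taken from the paper: g-values are single bits derived from SHA-256, the context is just the previous token, the "LLM" is a fixed uniform distribution over eight tokens, and detection is a plain mean of g-values. The helper names (`g_value`, `tournament_sample`, `generate`, `mean_g`) are hypothetical.

```python
import hashlib
import random
from typing import Dict, List, Sequence

def g_value(token: str, context: Sequence[str], layer: int, key: str) -> int:
    """Pseudorandom bit derived from the key, context, token, and layer."""
    msg = f"{key}|{'|'.join(context)}|{token}|{layer}".encode()
    return hashlib.sha256(msg).digest()[0] & 1

def tournament_sample(probs: Dict[str, float], context: Sequence[str],
                      key: str, layers: int, rng: random.Random) -> str:
    """Draw 2**layers candidates, then run a knockout scored by g-values."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    candidates = rng.choices(tokens, weights=weights, k=2 ** layers)
    for layer in range(layers):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga, gb = g_value(a, context, layer, key), g_value(b, context, layer, key)
            if ga == gb:
                winners.append(rng.choice([a, b]))  # tie: pick at random
            else:
                winners.append(a if ga > gb else b)
        candidates = winners
    return candidates[0]

def generate(n: int, probs: Dict[str, float], key: str, layers: int,
             rng: random.Random, watermark: bool) -> List[str]:
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    text: List[str] = []
    for _ in range(n):
        context = tuple(text[max(0, len(text) - 1):])  # previous token only
        if watermark:
            text.append(tournament_sample(probs, context, key, layers, rng))
        else:
            text.append(rng.choices(tokens, weights=weights, k=1)[0])
    return text

def mean_g(text: List[str], key: str, layers: int) -> float:
    """Detection score: average g-value over tokens and layers (no LLM needed)."""
    total = 0
    for i, tok in enumerate(text):
        context = tuple(text[max(0, i - 1):i])
        total += sum(g_value(tok, context, layer, key) for layer in range(layers))
    return total / (len(text) * layers)

rng = random.Random(0)
probs = {f"t{i}": 1 / 8 for i in range(8)}
marked = generate(400, probs, "secret", 3, rng, watermark=True)
plain = generate(400, probs, "secret", 3, rng, watermark=False)
print(mean_g(marked, "secret", 3), mean_g(plain, "secret", 3))
```

The detector only needs the key and the text, which is why detection works without access to the model. When the distribution is nearly deterministic, all candidates tend to be the same token and the tournament has little to choose between, which fits the summary's point that low-entropy settings are the harder case.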
Executive Summary
This paper introduces a novel watermarking framework designed to embed imperceptible signals within the output of large language models (LLMs), enabling algorithmic detection of machine-generated text. The core mechanism involves a "soft" watermarking strategy that subtly biases token selection towards a predefined "green list" of tokens. This bias is applied by slightly increasing the logits of tokens in the green list before the softmax operation, thereby promoting their selection during text generation. Crucially, this bias is adaptive: it has a more pronounced effect on tokens where the LLM has multiple plausible choices (high entropy), and a negligible effect on tokens where the LLM has a near-deterministic choice (low entropy). This adaptive approach ensures that the watermark has minimal impact on text quality and fluency, addressing a key limitation of previous hard-watermarking methods that could degrade output significantly. The detection mechanism is designed to be efficient and accessible, operating without requiring access to the LLM's API or parameters. It relies on a statistical test, specifically a one-proportion z-test, to determine if the observed frequency of green list tokens deviates significantly from what would be expected in naturally generated text. The "green list" itself is dynamically generated for each token based on a hash of the preceding token(s), allowing for reproducible detection without needing to store the entire watermark. This approach provides interpretable p-values and is theoretically grounded in information theory, allowing for analysis of sensitivity and robustness. The authors demonstrate the effectiveness of their method on a multi-billion parameter model (OPT family), showing high detection rates with low false-positive rates, even on short text spans. A key insight is the synergistic interaction between the soft watermarking approach and beam search decoding. 
Beam search, by exploring multiple high-probability token sequences, can amplify the effect of the soft watermark, leading to stronger detection signals with even less impact on text quality. The paper also rigorously analyzes the watermark's sensitivity to text entropy, demonstrating that while high-entropy text is easily detectable, low-entropy text (e.g., memorized content) poses a challenge. The framework is designed to be robust against various attacks, including simple text modifications, and offers options for both public and private watermarking implementations. The proposed method represents a significant step towards making LLM outputs auditable and traceable, addressing growing concerns about the misuse of generative AI.
Method
Imagine you're writing a story, and you have a secret rule: you should try to use certain "green" words more often than others. This method works like that. When a computer writes text, it usually picks the most likely next word. This watermarking method slightly nudges the computer to favor "green" words, but only when it has many good word choices. If there's only one obvious word to use, it doesn't force the "green" word. To check if text was written by this system, you just count how many "green" words appear. If there are significantly more "green" words than expected by chance, it's likely from the watermarked system. This is like a hidden signature that doesn't make the writing sound weird.
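The green-list bias and the one-proportion z-test can be sketched as follows. The hashing scheme, the key string, and the function names are assumptions for illustration; `gamma` denotes the green-list fraction and `delta` the logit boost. For T scored tokens with a green count of `hits`, the statistic is z = (hits - gamma * T) / sqrt(T * gamma * (1 - gamma)).

```python
import hashlib
import math
import random
from typing import Dict, List, Set

def green_list(prev_token: str, vocab: List[str], gamma: float, key: str) -> Set[str]:
    """Seed a shuffle of the vocabulary from a hash of the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    shuffled = sorted(vocab)
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: int(gamma * len(shuffled))])

def bias_logits(logits: Dict[str, float], prev_token: str,
                gamma: float, delta: float, key: str) -> Dict[str, float]:
    """Soft watermark: add delta to green-list logits before the softmax."""
    green = green_list(prev_token, list(logits), gamma, key)
    return {t: (v + delta if t in green else v) for t, v in logits.items()}

def z_score(tokens: List[str], vocab: List[str], gamma: float, key: str) -> float:
    """One-proportion z-test on the number of green tokens."""
    scored = len(tokens) - 1  # the first token has no predecessor
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab, gamma, key)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

vocab = [f"w{i}" for i in range(10)]
key, gamma = "demo-key", 0.5

# Extreme case: every token is drawn from the previous token's green list.
tokens = ["w0"]
for _ in range(16):
    tokens.append(min(green_list(tokens[-1], vocab, gamma, key)))

print(z_score(tokens, vocab, gamma, key))
```

With 16 scored tokens that are all green and `gamma = 0.5`, the statistic is (16 - 8) / sqrt(4) = 4. Unwatermarked text should hover around zero, so a threshold on z gives the interpretable false-positive rate the summary mentions.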
Executive Summary
This paper addresses the limitations of traditional atomic item identifiers (IDs) in large-scale recommender systems (RecSys), such as excessive parameterization, cold-start issues, and poor generalization due to long-tail distributions. The core innovation presented is the application of Semantic IDs (SIDs), which are ordered sequences of discrete tokens derived from semantic representations (e.g., from foundation models) via quantization techniques like Residual Vector Quantization (RQ-VAE). SIDs offer a drastically reduced cardinality compared to atomic IDs, enabling semantic clustering and more efficient representation. The authors detail their practical experience deploying SIDs at Snapchat, focusing on two primary use cases: as auxiliary features in ranking models and as primary retrieval sources in generative retrieval (GR) frameworks. A significant technical challenge encountered was codebook collapse during RQ-VAE training, where only a fraction of the codebook was utilized, limiting semantic expressiveness. To mitigate this, they introduced two key architectural modifications: employing a straight-through estimator (STE) to enable backpropagation through the quantization process and fusing multiple embedding sources (e.g., text, image, metadata) to increase input variance. Another challenge was SID-to-item resolution in GR, where multiple items can map to the same SID. They addressed this with heuristic-based intra-code disambiguation and prioritizing retrieval depth over breadth. The paper reports significant offline and online A/B test improvements across various Snapchat surfaces, including ads, content, and friend recommendations, validating the efficacy of SIDs. A key insight is that while SID uniqueness is a useful proxy for mitigating codebook collapse, it is not a definitive golden standard for SID quality, suggesting a need for more nuanced evaluation metrics.
Method
Imagine you have a huge library of books, and each book has a unique ID number. This can get overwhelming! Semantic IDs (SIDs) are like creating a special code for each book based on its content, genre, and author, rather than just a random number. This code is made up of a few shorter codes, like a Dewey Decimal System but smarter. If two books are very similar (e.g., two sci-fi novels by the same author), they might get similar SID codes. This helps computers understand relationships between items better. The paper explains how they built these SIDs using a special AI model (RQ-VAE) and improved it to avoid common problems like the AI only learning a few codes (codebook collapse). They then used these SIDs to help recommend things on Snapchat, either as extra clues for existing recommendation systems or as the main way to find new content. They also figured out how to handle cases where multiple books might get the same SID code, by using extra information to pick the best one.
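The residual quantization step that turns an embedding into a Semantic ID can be sketched with fixed codebooks. In the paper the codebooks are learned by an RQ-VAE; here they are hand-written two-dimensional toys, and the item names and helper functions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]

def nearest(codebook: List[Vector], vec: Vector) -> int:
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def semantic_id(vec: Vector, codebooks: List[List[Vector]]) -> Tuple[int, ...]:
    """Quantize level by level, passing the residual on to the next codebook."""
    residual = list(vec)
    sid = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid)

# Toy codebooks: a coarse level, then a fine level applied to the residual.
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[-1.0, -1.0], [1.0, 1.0]],
]
embeddings = {
    "shoe_a": [9.0, 9.0],
    "shoe_b": [9.2, 8.7],
    "boot": [11.2, 10.9],
    "sock": [0.8, 1.1],
}

sids = {name: semantic_id(vec, codebooks) for name, vec in embeddings.items()}

# SID-to-item resolution: several items can share one SID.
items_by_sid: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
for name, sid in sids.items():
    items_by_sid[sid].append(name)

print(sids)
print(dict(items_by_sid))
```

Here `shoe_a` and `shoe_b` collapse onto the same SID, which is the collision case the summary says needs a disambiguation heuristic at retrieval time. The shared first code between the shoes and the boot illustrates the semantic clustering that atomic IDs lack.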
Deep Research for Recommender Systems
📅 2026-05-21 | Data Science/LLM | 🔗 Source
May
Week 20 (1 paper)
Executive Summary
The document introduces Parlant, a Python framework designed to enhance the reliability and control of customer-facing AI agents, particularly those interacting with Large Language Models (LLMs). The core problem Parlant addresses is the inherent difficulty in managing the diverse, nuanced, and non-linear nature of real-world conversations, which often leads to LLM agents deviating from intended behavior, failing to adhere to complex instructions, or exhibiting unwanted actions. Traditional approaches like extensive system prompts or complex routed graphs are shown to be brittle and difficult to scale when faced with the unpredictability of human interaction. Parlant's central mechanism is its sophisticated context engineering approach. Instead of feeding a monolithic system prompt and the entire conversation history to the LLM, Parlant employs a dynamic, turn-by-turn context assembly process. This involves a 'Contextual Matching Engine' that evaluates various components, namely Observations (events), Guidelines (rules), Journeys (SOPs), Retrievers (knowledge), Glossary (terms), and Variables (memory), to determine precisely which pieces of information are relevant for the current conversational turn. This focused context is then passed to the LLM, ensuring it operates with the most pertinent information and adheres to defined behavioral constraints. The framework's design prioritizes maximum control and prevention of unwanted behaviors. This is achieved through a system of Guidelines, which are condition-action pairs. When a condition is met, the corresponding action is incorporated into the agent's context. Crucially, Relationships between Guidelines (exclusions and dependencies) allow for fine-grained control, ensuring that conflicting behaviors are resolved and that specific actions are only taken when prerequisite conditions are met. 
This structured approach makes it inherently harder for the agent to deviate from its intended boundaries, moving beyond simple output guardrails to proactive behavioral control. Parlant's key insight is that conversational control is best achieved by architecting the context assembly process itself, rather than solely relying on prompt engineering or post-hoc output filtering. By dynamically filtering and assembling context based on defined rules and conversational state, Parlant aims to provide a robust, scalable, and predictable foundation for building reliable customer-facing AI agents. The framework is designed to integrate with existing LLM providers and workflow automation tools, positioning itself as a specialized layer for interaction governance.
Method
Imagine you're directing a play. Instead of giving the actor one giant script for the whole show, Parlant gives them a director's notebook. This notebook has specific instructions (Guidelines) for different situations (Conditions). When a character speaks (User Input), the director (Parlant's Engine) quickly checks the notebook to find the *exact* instruction that applies *right now*. If multiple instructions could apply, the director uses rules to pick the most important one. Sometimes, an instruction might say, 'If the actor is playing a doctor, use this specific line (Canned Response),' or 'If the customer asks about X, use this tool to find out more.' This way, the actor (LLM) always knows exactly what to say or do, making the conversation predictable and controlled, even when things get complicated.
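The condition-action idea, with exclusions and dependencies, can be sketched in plain Python. To be clear, this is not Parlant's API: the `Guideline` dataclass, `assemble_context`, and the keyword-based conditions are all hypothetical, and the order in which dependencies and exclusions are resolved is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Guideline:
    name: str
    condition: Callable[[str], bool]  # hypothetical: a simple keyword check
    action: str
    excludes: Set[str] = field(default_factory=set)
    depends_on: Set[str] = field(default_factory=set)

def assemble_context(message: str, guidelines: List[Guideline]) -> List[str]:
    """Pick the actions to place in the LLM's context for this turn."""
    text = message.lower()
    matched = {g.name: g for g in guidelines if g.condition(text)}
    # Dependencies: keep a guideline only if all its prerequisites matched too.
    active = {n: g for n, g in matched.items() if g.depends_on <= set(matched)}
    # Exclusions: an active guideline can suppress other guidelines.
    excluded = set().union(*(g.excludes for g in active.values()))
    return [g.action for n, g in active.items() if n not in excluded]

guidelines = [
    Guideline("refund", lambda m: "refund" in m, "Explain the refund policy."),
    Guideline("upsell", lambda m: "plan" in m, "Mention the premium plan."),
    Guideline("upset", lambda m: "terrible" in m,
              "Apologize first and stay brief.", excludes={"upsell"}),
    Guideline("verified", lambda m: "order #" in m, "Thank them for the order number."),
    Guideline("refund_form", lambda m: "refund" in m,
              "Send the refund form.", depends_on={"verified"}),
]

actions = assemble_context("This is terrible, I want a refund for my plan", guidelines)
print(actions)
```

Only two of the five instructions reach the model on this turn: the upsell is suppressed because the customer is upset, and the refund form is held back until an order number has been given.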
Week 18 (4 papers)
Executive Summary
This document provides a comprehensive overview of Reinforcement Learning (RL), framing it as a sequential decision-making process where an agent interacts with an environment to maximize cumulative rewards. The core problem addressed is how an agent can learn an optimal policy in an unknown environment, contrasting RL with supervised and self-supervised learning by highlighting its generality and the numerous assumptions that can be made about the environment and agent. The central mechanism revolves around the agent learning a value function (state-value or state-action value) or directly optimizing a policy. Value-based methods, like Q-learning and SARSA, learn the expected return of states or state-action pairs, using Bellman equations as a foundation for updates. Policy-based methods, such as policy gradients, directly optimize the policy parameters to maximize expected return. Model-based RL approaches first learn a model of the environment's dynamics and then use this model for planning or generating synthetic data to train a policy. The document meticulously categorizes RL methods along dimensions such as what the agent learns (value function, policy, model, or a combination), how functions are represented (tabular vs. parametric, deep RL), and how actions are selected (on-policy vs. off-policy). A key insight is the distinction between different problem formulations (MDPs, POMDPs, contextual bandits) and the various challenges posed by state uncertainty, model uncertainty (exploration-exploitation tradeoff), and reward function design (reward hacking, sparse rewards). The document emphasizes that the choice of assumptions and methods ultimately depends on the specific real-world application. It also delves into advanced topics like multi-agent RL, hierarchical RL, imitation learning, and the emerging intersection of LLMs and RL, showcasing the breadth and depth of the field.
Method
Imagine you want to teach a robot to play a game. RL is like teaching it by giving it points (rewards) when it does well and taking points away when it does poorly. The robot tries different moves (actions) in different situations (states) and learns which moves lead to the most points over time. It can do this by:
1. **Learning what's good:** Figuring out how valuable each situation or move is (like learning the score of a chess position).
2. **Trying things out:** Directly learning which moves are best in each situation (like learning a strategy).
3. **Building a map:** Trying to understand how the game works (like learning the rules and consequences of moves).
RL can be used for single robots or multiple robots playing together, and even for teaching AI models to write or reason. The goal is always to learn the best way to act to get the most rewards.
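To make the value-based idea concrete, here is a minimal sketch of a single tabular Q-learning update. The dictionary layout, the function name, and the values of the learning rate `alpha` and discount `gamma` are illustrative assumptions, not taken from the document.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning step to the table q[state][action] and return the new value."""
    # Value of the best action available in the next state (0 if terminal or unseen).
    next_values = q.get(next_state, {})
    best_next = max(next_values.values()) if next_values else 0.0
    # Bellman target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Two states, two moves each; only "right" in s1 has a known value so far.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_learning_update(q, "s0", "right", reward=1.0, next_state="s1")
print(new_value)  # approximately 1.4 = 0.5 * (1.0 + 0.9 * 2.0)
```

Because the target uses the best next action rather than the action the agent actually takes next, this update is off-policy; SARSA would use the value of the action actually chosen instead.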
Executive Summary
This paper investigates the persistent problem of survey instability, where approximately 25% of respondents provide inconsistent answers to identical questions over time, even when memory and material changes are controlled for. The authors challenge the prevailing explanations that attribute this instability solely to respondents lacking stable preferences ('nonattitudes') or to flawed survey instruments ('measurement error'). Through a comprehensive analysis of observational and experimental data from 59 unique surveys, they demonstrate that existing explanations are insufficient. Instead, they propose and provide evidence for 'intrinsic human stochasticity' as the primary driver of survey instability. This concept, drawing parallels from other scientific fields, posits that inherent, irreducible randomness in human cognitive and decision-making processes, rather than respondent ignorance or instrument flaws, accounts for a significant portion of observed response variability. The research meticulously deconstructs the sources of this stochasticity, tracing them from proximate decision-making processes to more distal cognitive, psychological, and individual characteristics. By employing a novel research strategy leveraging online survey platforms, the authors conduct sequential experiments to isolate the effects of various factors, including question complexity, time-on-task, priming, psychological states (preoccupation, mind-wandering, persona, attention), and individual demographics. Their findings reveal that while measurement error and respondent inattention play a role, the core of instability lies in the fundamental, omnipresent stochastic nature of human cognition. This insight has profound implications for survey design, analysis, and our understanding of public opinion, suggesting a shift from solely focusing on eliminating 'error' to understanding and accounting for this inherent variability.
Method
The researchers wanted to understand why people sometimes give different answers to the same survey question. They designed experiments where people answered the same question twice, very close together, so they wouldn't forget the first answer or experience any major life changes. They then tested different ideas: whether the question was confusingly worded, if something asked just before influenced the answer, or if people were just not paying close attention. Ultimately, they found that even when questions are clear and people are paying attention, there's a natural, built-in randomness in how our brains make decisions that causes these inconsistencies. This inherent 'human randomness' is the main reason for survey instability.
Executive Summary
This paper addresses a fundamental misalignment between large reasoning models (LRMs) and existing Retrieval-Augmented Generation (RAG) frameworks. LRMs, capable of generating extensive chains of thought (CoT) spanning thousands of tokens, require evidence injection at various points during their multi-step reasoning process. However, traditional RAG systems are designed for a single retrieval step performed *before* generation begins, failing to accommodate the dynamic, mid-inference knowledge needs of LRMs. The proposed solution, REALM-RETRIEVE, introduces a reasoning-aware retrieval framework that adapts retrieval timing to the specific needs of the reasoning process. REALM-RETRIEVE's core innovation lies in its ability to detect knowledge gaps at the granularity of reasoning steps, rather than at the token or sentence level as in prior iterative retrieval methods. This is achieved through a novel Reasoning Step Uncertainty Score (RSUS), a composite measure that combines verbalized model confidence, entity-based entropy (reflecting corpus coverage of named entities), and a consistency signal. This RSUS score is computed at each identified reasoning step, allowing the system to determine if external evidence is necessary to proceed. Complementing the uncertainty detection is a learned retrieval intervention policy. This policy, framed as a contextual bandit problem, decides *when* to retrieve, *what* query to formulate, and *how* to integrate the retrieved context, all conditioned on the current reasoning state. This adaptive policy, trained using a REINFORCE algorithm, aims to maximize answer accuracy while minimizing retrieval cost. Furthermore, the framework incorporates efficiency-optimized integration mechanisms, including implicit compression of retrieved documents and speculative caching of retrieval results, which significantly reduce per-retrieval overhead. 
The key insight is that by precisely identifying the moments of knowledge uncertainty *during* the reasoning process, and by learning an optimal policy for intervention, retrieval can be made far more effective and efficient. This contrasts sharply with fixed-interval retrieval methods that often retrieve when evidence is not needed or fail to retrieve when it is critical. The empirical results demonstrate substantial improvements in accuracy (e.g., +5.8% F1 on MuSiQue) and efficiency (e.g., 47% fewer retrieval calls compared to IRCoT), establishing new state-of-the-art trade-offs for reasoning-intensive tasks.
Method
Imagine a student trying to solve a complex math problem by writing down their steps. Sometimes, they get stuck and need to look up a fact in a textbook. This system is like a smart assistant for that student. First, it figures out when the student has completed a logical step in their work. Then, it asks the student how confident they are about their current step and checks if they're mentioning specific terms that might be hard to find information on. If the student seems unsure or is dealing with complex terms, the assistant decides it's time to look for information. It then formulates a smart question to find the right fact and quickly integrates it into the student's notes before they continue. This way, the assistant only interrupts when truly necessary, making the learning process faster and more accurate.
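A composite uncertainty score of this kind could look like the sketch below. The summary names the three signals but not how they are combined, so the weighted sum, the weights, the assumption that every signal is scaled to [0, 1], and the fixed threshold are all hypothetical. In the paper the decision to retrieve is made by a learned policy, which the threshold here merely stands in for.

```python
def rsus(verbal_confidence, entity_entropy, consistency, weights=(0.4, 0.3, 0.3)):
    """Hypothetical step-level uncertainty score; higher means more uncertain.

    All three inputs are assumed to be scaled to [0, 1].
    """
    w_conf, w_ent, w_cons = weights
    return (
        w_conf * (1.0 - verbal_confidence)  # low stated confidence raises uncertainty
        + w_ent * entity_entropy            # poorly covered entities raise uncertainty
        + w_cons * (1.0 - consistency)      # disagreement across samples raises it
    )


def should_retrieve(score, threshold=0.5):
    """Stand-in for the learned intervention policy: retrieve above a fixed cutoff."""
    return score > threshold


confident_step = rsus(verbal_confidence=0.9, entity_entropy=0.1, consistency=0.9)
shaky_step = rsus(verbal_confidence=0.3, entity_entropy=0.8, consistency=0.4)
print(should_retrieve(confident_step), should_retrieve(shaky_step))
```

Scoring once per reasoning step, rather than per token or sentence, is what lets the system skip retrieval entirely on steps where the model is already on solid ground.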
Executive Summary
The deployment of large language models (LLMs) in high-stakes applications necessitates robust guardrails that enforce custom policies. Traditional approaches face limitations: generic safety models lack task specificity, while direct LLM prompting yields inconsistent performance and high inference costs. Training custom classifiers offers accuracy and efficiency but demands substantial labeled data, which is expensive to acquire. This paper introduces BARRED, a framework designed to generate high-fidelity synthetic training data for custom guardrails using only a task description and a small set of unlabeled examples. BARRED addresses the core challenges of diversity and label faithfulness in synthetic data generation. It achieves diversity by decomposing the task domain into relevant dimensions and systematically sampling instantiations within these dimensions, thereby ensuring comprehensive coverage of the problem space. Label faithfulness is ensured through a novel multi-agent debate mechanism. In this system, an 'Advocate' LLM defends a generated sample's label, while a panel of 'Judge' LLMs critically evaluate the sample and the Advocate's reasoning. This asymmetric debate forces rigorous scrutiny and iterative refinement of generated samples, leading to a high-quality training corpus. Experiments demonstrate that compact models fine-tuned on BARRED-generated data significantly outperform state-of-the-art proprietary LLMs and dedicated guardrail models across various custom policies, including conversational enforcement, agentic output verification, and regulatory compliance. Ablation studies confirm that both dimension decomposition and the debate-based verification are critical for achieving the observed performance gains. BARRED offers a scalable solution for creating accurate custom guardrails, eliminating the need for extensive human annotation and enabling rapid adaptation to new policies.
Method
Imagine you want to teach a computer to follow specific rules, like a security guard. Instead of showing it thousands of real-life examples (which is expensive), we create fake examples. First, we break down the rules into different aspects or 'dimensions' (like 'what kind of behavior' or 'where it happens'). Then, we generate tricky scenarios that are right on the edge of breaking the rules, making sure to cover all the different aspects. To ensure these fake examples are correctly labeled, we have multiple AI 'lawyers' debate each one. One AI (the Advocate) defends the example's label, and the others (the Judges) try to find flaws in its argument. If they can't agree, the example is revised until its label is clear. Finally, we use these high-quality fake examples to train a small, efficient AI guard.
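The dimension-based sampling step can be sketched as follows. The dimension names and values are invented for illustration, and the function is a simplified assumption about how coverage might be enforced; it does not reproduce BARRED's generation prompts or its debate stage.

```python
import itertools
import random


def sample_scenarios(dimensions, n, seed=0):
    """Pick n distinct combinations of dimension values to seed sample generation."""
    keys = list(dimensions)
    # Enumerate every combination so that no region of the space is unreachable.
    combos = list(itertools.product(*(dimensions[k] for k in keys)))
    rng = random.Random(seed)
    rng.shuffle(combos)
    return [dict(zip(keys, combo)) for combo in combos[:n]]


# Hypothetical dimensions for a "no medical advice" policy.
dimensions = {
    "topic": ["dosage", "diagnosis", "general wellness"],
    "tone": ["casual", "urgent"],
    "label": ["violates", "complies"],
}
scenarios = sample_scenarios(dimensions, n=5)
for s in scenarios:
    print(s)
```

Each sampled combination would then be handed to a generator model as a specification, which is what keeps the synthetic set from collapsing onto a few easy cases.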
Week 17 (2 papers)
Executive Summary
This paper addresses the critical bottleneck of KV cache growth in large language models (LLMs) during chain-of-thought (CoT) reasoning. Current methods rely on hand-designed heuristics or proxy objectives to manage this memory, which are suboptimal as they are decoupled from the model's actual reasoning process and the task's ultimate reward. The core innovation, Neural Garbage Collection (NGC), frames KV cache management as a learned capability, integrated directly into the LLM's end-to-end reinforcement learning (RL) training loop. Instead of relying on external criteria, NGC enables the LLM to learn *when* and *what* to forget based solely on the outcome-based task reward. This is achieved by treating cache eviction as a discrete action sampled from the LLM itself, alongside token generation. The model learns to score KV cache entries, sample which to evict, and then continue reasoning with the pruned cache. This unified RL framework allows the model to jointly optimize its reasoning process and its memory management strategy, creating a feedback loop where effective forgetting improves reasoning, and better reasoning leads to higher rewards, which in turn refines the forgetting strategy. The key insight is that the same RL mechanism used to train LLMs for reasoning can be repurposed for memory management without introducing auxiliary losses or separate training stages. This is analogous to how AlphaZero learned to play Go by optimizing solely for game outcomes. NGC leverages the transformer's existing attention mechanism to score KV cache entries, effectively repurposing it for eviction decisions without adding new parameters. The training process uses a Gumbel-top-k trick for stochastic sampling of eviction actions and a replay attention mask mechanism to correctly compute gradients for off-policy actions, preventing training collapse. 
The paper demonstrates that this end-to-end learned approach significantly outperforms heuristic eviction methods on tasks like Countdown and mathematical reasoning, achieving substantial KV cache compression (2-3x) while maintaining high accuracy. This work represents a significant step towards models that are not only more capable but also inherently more efficient, with resource management becoming an emergent property of the model's learning process.
Method
Imagine an LLM is solving a complex problem, like a math puzzle, by thinking step-by-step. As it thinks, it keeps notes in a temporary memory (the KV cache). However, this memory can get full. Instead of having a fixed rule for deleting notes, this method teaches the LLM to decide for itself when to throw away old notes that are no longer useful for solving the puzzle. It does this by practicing and getting rewarded only when it solves the puzzle correctly. The LLM learns to judge which notes are important and which to discard, just like a person might decide what to keep and what to forget to focus on the task. This learned skill of forgetting helps it solve more complex problems with less memory.
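The Gumbel-top-k trick mentioned in the summary samples k items without replacement by adding independent Gumbel noise to each log-score and keeping the k largest. The sketch below shows only that sampling step in pure Python. Treating a higher score as "more likely to be evicted" is an assumption, and the function name and inputs are hypothetical rather than the paper's code.

```python
import math
import random


def gumbel_top_k(log_scores, k, rng):
    """Sample k distinct indices, favoring entries with larger log-scores."""
    perturbed = []
    for index, score in enumerate(log_scores):
        # Guard against u == 0, which would make log(u) undefined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score + gumbel_noise, index))
    # The k largest perturbed scores form a sample without replacement.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


rng = random.Random(0)
# Four cache entries; entry 1 has an overwhelmingly high eviction score.
evicted = gumbel_top_k([0.0, 1000.0, 0.0, 0.0], k=2, rng=rng)
print(evicted)
```

Keeping the choice stochastic is what makes eviction a sampled action that a policy-gradient method can assign credit to, instead of a fixed deterministic rule.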
Executive Summary
Typesense is presented as an open-source, in-memory search engine designed for high performance and ease of use, positioning itself as a direct alternative to proprietary solutions like Algolia and Pinecone, and a more user-friendly option compared to Elasticsearch. The core mechanism of Typesense revolves around an in-memory data structure optimized for fast lookups and typo-tolerant searching. This is achieved through a combination of techniques, including prefix trees (tries) for efficient prefix matching and typo-tolerant search, and a carefully engineered C++ backend that minimizes overhead and maximizes CPU utilization. Unlike traditional search engines that might rely heavily on disk I/O or complex distributed configurations, Typesense prioritizes speed by keeping the index in RAM, enabling sub-50ms search latencies. Its design philosophy emphasizes developer experience, offering a simple API, sensible defaults, and straightforward deployment, contrasting with the extensive configuration and operational overhead often associated with Elasticsearch. The system's approach to typo tolerance is particularly noteworthy. Instead of relying on complex fuzzy matching algorithms that can be computationally expensive, Typesense leverages techniques like prefix matching and a configurable edit distance to quickly identify potential matches even with minor spelling errors. This is crucial for building user-friendly search interfaces where users may not always type queries perfectly. The ranking and sorting mechanisms are also designed for flexibility, allowing dynamic adjustments at query time, which is a significant advantage for features like real-time price sorting or relevance tuning without re-indexing. 
The introduction of vector search and semantic/hybrid search capabilities further extends its utility, enabling modern AI-driven search experiences like similarity search and natural language understanding directly within the search engine, reducing the need for external ML models for basic inference. Typesense's architecture is built for scalability, supporting Raft-based clustering for high availability and seamless version upgrades. The emphasis on a single, self-contained binary simplifies deployment and management, making it accessible for a wider range of projects, from small applications to large-scale enterprise solutions. The project's commitment to open source, with a GPL license for the server and Apache for client libraries, aims to foster community contribution while protecting the core intellectual property and encouraging a collaborative development model. This strategic licensing choice balances the desire for community involvement with the need for project sustainability.
Method
Imagine you have a huge library of books and you want to find a specific book very quickly, even if you misspell the title. Typesense works like a super-fast librarian who keeps all the book titles and information in their head (in-memory). When you ask for a book, even with a typo, the librarian uses clever shortcuts and remembers common misspellings to find the right book almost instantly. It's designed to be easy to set up and use, so you can add search to your app without needing a whole team of experts.
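The combination of prefix matching and a bounded edit distance can be illustrated with a small Python sketch. This is not how Typesense is implemented (the summary describes a C++ backend built on tries); the linear scan, the function names, and the sample titles are simplifying assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_prefix_search(query, titles, max_typos=1):
    """Return titles whose leading characters are within max_typos of the query."""
    q = query.lower()
    hits = []
    for title in titles:
        prefix = title.lower()[:len(q)]
        distance = edit_distance(q, prefix)
        if distance <= max_typos:
            hits.append((distance, title))
    # Closest matches first, ties broken alphabetically.
    return [title for _, title in sorted(hits)]


titles = ["Dune", "Dubliners", "Emma"]
typo_hits = fuzzy_prefix_search("dume", titles)
prefix_hits = fuzzy_prefix_search("du", titles)
print(typo_hits, prefix_hits)
```

Capping the number of allowed typos is what keeps this kind of matching cheap: candidates that drift too far from the query are discarded early rather than scored.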
April
Week 15 (3 papers)
Executive Summary
Verifying the success of computer use agents (CUAs) is a critical bottleneck for reliable evaluation and training. This paper introduces the Universal Verifier (UV), a system designed to address this challenge by decomposing the verification process into distinct, principled components. The core problem lies in the inherent complexity and ambiguity of CUA trajectories, which involve rich visual information, sequential actions, and environmental interactions that can lead to subtle failures. Traditional verification methods often struggle with these nuances, leading to inaccurate assessments. The UV tackles this by employing a multi-faceted approach centered around four key design principles: 1) well-defined, non-overlapping rubrics to ensure consistent scoring; 2) separation of process and outcome rewards to capture distinct aspects of agent performance; 3) a cascading-error-free strategy for discerning controllable from uncontrollable failures; and 4) a comprehensive context management scheme that considers all visual evidence in a trajectory. These principles are implemented through a pipeline that first generates a rubric, then performs multimodal scoring using screenshot evidence to ascertain process success, and finally produces an outcome judgment and diagnostic report. The UV's architecture is designed to be robust against hallucinations and subtle errors by employing a two-pass scoring mechanism (with and without screenshots) and a relevance matrix to identify critical visual evidence. The paper demonstrates that the UV achieves agreement with human labels comparable to inter-human agreement, significantly outperforming existing baselines in accuracy and reducing false positive rates to near zero. This architectural advantage, rather than simply using a more powerful LLM backbone, is highlighted as the key to its superior performance. 
The work also explores the potential of AI agents to automate verifier design, finding that while AI can achieve a significant portion of expert quality rapidly, it struggles to discover the high-level structural design principles that yield the most substantial improvements, suggesting a complementary role for human expertise in this domain.
Method
Imagine you're grading a student's homework. First, you create a very specific checklist (the rubric) of what needs to be done, making sure each item is distinct. Then, you look at how the student did the work (process score): did they follow the steps correctly, even if they hit a roadblock? Separately, you decide if they actually finished the assignment successfully (outcome score). If something went wrong, you figure out if it was the student's fault or something outside their control, like a broken tool. Finally, you use all the evidence, including pictures of their work (screenshots), to make sure they didn't make things up (hallucinate).
Executive Summary
This paper introduces the Memory Intelligence Agent (MIA), a novel framework designed to enhance the reasoning capabilities and autonomous evolution of Deep Research Agents (DRAs) by addressing limitations in existing memory systems. DRAs, which integrate Large Language Models (LLMs) with external tools, often struggle with long-horizon tasks due to ineffective memory evolution, leading to increased storage and retrieval costs, and diluted attention. MIA tackles these issues through a Manager-Planner-Executor architecture. The Memory Manager acts as a non-parametric memory system, storing compressed historical search trajectories. The Planner, a parametric memory agent, generates search plans, while the Executor, another agent, executes these plans using tools. A key innovation is the alternating Reinforcement Learning (RL) paradigm that optimizes the interplay between the Planner and Executor, ensuring alignment between high-level planning and low-level execution. Furthermore, MIA enables the Planner to continuously evolve during test-time learning, updating its parameters on-the-fly without interrupting the reasoning process. This is achieved through a bidirectional conversion loop between parametric and non-parametric memories, facilitating efficient memory evolution. The framework also incorporates reflection and unsupervised judgment mechanisms to boost reasoning and self-evolution in open-world scenarios. Extensive experiments across eleven benchmarks demonstrate MIA's superiority, significantly enhancing LLM performance on deep research tasks. Notably, MIA boosts GPT-5.4 performance by up to 9% and 6% on LiveVQA and HotpotQA, respectively. Even with a lightweight Executor like Qwen2.5-VL-7B, MIA achieves an average improvement of 31%, outperforming the larger Qwen2.5-VL-32B by 18%. The training analysis confirms that RL synergistically optimizes Planner and Executor strategies, improving cross-domain reasoning. 
MIA also outperforms previous long-context memory methods in multi-turn tool interaction and achieves performance comparable to supervised counterparts in unsupervised settings, demonstrating progressive self-evolution.
Method
Imagine an agent that needs to solve complex problems by using tools like a search engine. This agent has a "brain" (Planner) that makes a plan and "hands" (Executor) that carry out the plan. To remember past successes and failures, it uses a "notebook" (Memory Manager) to store important experiences as short summaries. The agent learns by having its "brain" and "hands" work together, improving their coordination through practice. It can also learn new things while it's working, without stopping, by updating its "brain" based on what it learns. This helps it get better over time, even without explicit instructions on what's right or wrong.
Executive Summary
This research investigates the performance disparity between single-agent (SAS) and multi-agent systems (MAS) in multi-hop reasoning tasks, specifically under controlled thinking token budgets. The core argument posits that MAS, by decomposing reasoning into discrete communication steps, inherently introduces information bottlenecks. This is theoretically grounded in the Data Processing Inequality (DPI), which states that sequential processing (like in MAS) can only reduce, not increase, mutual information between the input and the output compared to a single, direct pass (SAS), assuming perfect context utilization. Therefore, under ideal conditions with a fixed budget, SAS should be more information-efficient and thus perform better or equally well. The study empirically validates this by comparing various SAS and MAS architectures across multiple LLM families (Qwen3, DeepSeek, Gemini) on multi-hop reasoning datasets (FRAMES, MuSiQue). The key finding is that SAS consistently match or outperform MAS when the thinking token budget is normalized. This suggests that many reported MAS advantages might be artifacts of unaccounted computation or context handling, rather than inherent architectural benefits. The research also highlights critical issues in evaluation methodologies, such as API-based budget control inaccuracies and benchmark vulnerabilities, which can inflate perceived MAS performance. An important nuance explored is the condition under which MAS *can* become competitive: when the single agent's effective context utilization is degraded. This degradation can occur due to factors like attention dilution, noise, or positional biases within the LLM's processing. In such scenarios, the structured decomposition and explicit communication in MAS might allow them to overcome the single agent's struggle to process a noisy or incomplete context, potentially filtering or reorganizing information more effectively. 
This provides a specific regime where MAS might offer a tangible advantage, moving beyond simple compute normalization. The work underscores the necessity of precise control over computational budgets and careful consideration of context effects when comparing agentic systems.
Method
Imagine you're trying to solve a complex puzzle. One way is to tackle it all by yourself in one go (Single-Agent System or SAS). Another way is to break it down and have different people work on different parts, passing notes back and forth (Multi-Agent System or MAS). This study tested which approach is better when everyone has the same amount of time (thinking token budget) to work on the puzzle. They found that working alone (SAS) was usually better or just as good. However, if the puzzle pieces themselves were damaged or hard to see (degraded context), the team approach (MAS) could sometimes catch up or even do better. They also looked at how the tools used to measure their work might be misleading.
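The Data Processing Inequality behind this argument can be written compactly. The notation is an assumption for illustration, not taken from the paper: $X$ is the full input context, $Y$ is the message one agent passes to the next, and $Z$ is the final answer computed from $Y$ alone.

```latex
X \rightarrow Y \rightarrow Z
\quad\Longrightarrow\quad
I(X; Z) \le I(X; Y)
```

Each hand-off can preserve or lose information about $X$ but cannot create it, so in this idealized setting the note-passing team never knows more about the puzzle than whoever read it first. The inequality says nothing about how well a single agent actually uses a long or noisy context, which is the regime where the study finds multi-agent systems can catch up.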
Week 14 (3 papers)
Executive Summary
This article introduces Matryoshka embedding models, a novel approach to creating text embeddings that can be dynamically truncated to varying dimensions without significant performance degradation. The core problem addressed is the trade-off between embedding dimensionality, which often correlates with performance, and the computational and storage costs associated with high-dimensional embeddings in downstream applications like search and retrieval. The central mechanism of Matryoshka models, inspired by Russian nesting dolls, is their training objective. Unlike standard embedding models that optimize for a single, fixed embedding size, Matryoshka models are trained to produce useful representations across a spectrum of dimensions. This is achieved by applying a loss function not only to the full-dimensional embedding but also to progressively truncated versions of it. The training process encourages the model to prioritize more semantically important information in the earlier dimensions of the embedding vector, while less critical information occupies the later dimensions. This multi-dimensional optimization strategy allows for flexible embedding sizes. For instance, a model trained with dimensions 768, 512, 256, 128, and 64 will produce embeddings that are effective at any of these sizes. The key insight is that by front-loading crucial semantic information, the model can retain a high degree of utility even when truncated to significantly smaller dimensions. This offers a practical solution for optimizing resource usage in large-scale embedding deployments, enabling efficient shortlisting and reranking in retrieval systems, and allowing users to tailor embedding size to specific cost-performance requirements. The article demonstrates this through empirical results, showing that a Matryoshka model trained on the STSBenchmark dataset maintains a higher Spearman similarity than a standard model across various dimensions. 
Notably, even at 8.3% of its full embedding size, the Matryoshka model preserved 98.37% of its performance, significantly outperforming a standard model at the same reduced dimension. This highlights the practical advantage of Matryoshka models in balancing performance and efficiency.
Method
Imagine you're packing a suitcase. Instead of just throwing everything in, you carefully pack the most important items (like your passport and wallet) right at the top, and less crucial things deeper down. Matryoshka embedding models are trained similarly: they learn to put the most important meaning of a sentence in the first few 'slots' of its numerical representation. This means you can take out only the first few 'slots' (like grabbing only the top items from the suitcase) and still get a good understanding of the sentence, saving space and time.
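Using a Matryoshka embedding at a smaller size amounts to keeping the first few dimensions and rescaling to unit length. The sketch below shows that step in pure Python with made-up three-dimensional vectors; the function names are hypothetical and no real model is involved. For scale, 64 of 768 dimensions is about 8.3% of the full size, which is consistent with the figure quoted in the summary if that is the truncation it refers to.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for full-size embeddings.
doc = [3.0, 4.0, 12.0]
query = [3.0, 4.0, -1.0]

short_doc = truncate_and_normalize(doc, 2)
short_query = truncate_and_normalize(query, 2)
print(short_doc, cosine_similarity(short_doc, short_query))
```

In a retrieval system the short vectors would typically be used to build a quick shortlist, with the full-size vectors reserved for reranking the few candidates that survive.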
Executive Summary
The performance of large language models (LLMs) is critically dependent on their 'harness', the surrounding code that dictates how information is stored, retrieved, and presented to the model. Traditional harness engineering is manual and iterative, relying on practitioner intuition. Existing automated text optimizers, however, are ill-suited for this task due to their reliance on heavily compressed feedback (e.g., scalar scores, short templates), which fails to capture the long-horizon dependencies inherent in harness design. These methods often discard crucial diagnostic information from execution traces, hindering effective debugging and optimization. Meta-Harness addresses this limitation by introducing an outer-loop system that leverages a coding agent to search over harness code. Its core innovation is providing the agent with extensive, selective access to the full history of prior harness candidates via a filesystem. This filesystem stores source code, execution traces, and scores for every evaluated harness. The agent can then query this filesystem using standard tools (like `grep` and `cat`), allowing it to inspect raw data rather than relying on lossy summaries. This rich, granular access to experience enables the agent to form causal hypotheses about harness failures and propose targeted edits, moving beyond simple score-based optimization. The system demonstrates significant improvements across multiple domains, including text classification, retrieval-augmented math reasoning, and agentic coding. For instance, in text classification, Meta-Harness-discovered harnesses outperform state-of-the-art context management systems by a substantial margin while using significantly less context. In math reasoning, a single discovered harness improves accuracy across multiple models. On agentic coding tasks like TerminalBench-2, Meta-Harness surpasses hand-engineered baselines, achieving top rankings. 
The key insight is that providing an agent with direct, selective access to detailed historical experience, rather than compressed feedback, unlocks more sophisticated and effective optimization strategies for complex, stateful systems like LLM harnesses.
Method
Imagine you're trying to build the best possible instruction manual for a robot. This manual (the 'harness') tells the robot exactly how to use its tools and what information to pay attention to. Instead of writing the manual yourself, you have a smart assistant (the 'coding agent') that can try to improve it. This assistant doesn't just get a grade on the manual; it gets to see all the robot's past attempts, including exactly what it did, what happened, and how well it performed. The assistant uses this detailed history to figure out why the robot failed and then writes a better version of the manual. It keeps doing this, learning from every mistake and success, until it creates a highly effective manual for the robot.
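The history store can be pictured as one directory per candidate holding its source, trace, and score. The sketch below is an assumed layout for illustration only: the file names `harness.py`, `trace.log`, and `score.json`, both helper functions, and the sample trace lines are hypothetical, and the search helper imitates what `grep` over trace files would return.

```python
import json
import re
import tempfile
from pathlib import Path


def record_candidate(root, name, source, trace, score):
    """Write one evaluated harness (code, raw trace, score) into its own directory."""
    candidate_dir = Path(root) / name
    candidate_dir.mkdir(parents=True, exist_ok=True)
    (candidate_dir / "harness.py").write_text(source)
    (candidate_dir / "trace.log").write_text(trace)
    (candidate_dir / "score.json").write_text(json.dumps({"score": score}))
    return candidate_dir


def grep_traces(root, pattern):
    """Return (candidate, line) pairs for every trace line matching the pattern."""
    regex = re.compile(pattern)
    hits = []
    for trace_file in sorted(Path(root).glob("*/trace.log")):
        for line in trace_file.read_text().splitlines():
            if regex.search(line):
                hits.append((trace_file.parent.name, line))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    record_candidate(tmp, "cand_001", "# harness v1\n",
                     "step 1: ok\nstep 2: ok\n", 0.62)
    record_candidate(tmp, "cand_002", "# harness v2\n",
                     "step 1: ok\nstep 3: TimeoutError while calling tool\n", 0.41)
    hits = grep_traces(tmp, r"TimeoutError")

print(hits)
```

Keeping raw traces on disk, rather than summarizing them into a score, is the point of the design: the agent can trace a drop in score back to the specific step that failed.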
Executive Summary
The MiniMax M2.7 model introduces a novel approach to AI model development by integrating a recursive self-optimization (RSO) pipeline directly into its training regimen. Unlike traditional LLMs that are static post-training, M2.7 is designed to iteratively improve its performance by identifying and rectifying its own deficiencies. This process begins with the model generating responses to a curated set of evaluation prompts. These outputs are then assessed by a reward model, which could be a separate AI or a self-evaluation mechanism within M2.7 itself. The core of the RSO loop lies in the identification of performance gaps where the model's outputs fall below a predefined quality threshold. Subsequently, the model generates synthetic data specifically tailored to address these identified weaknesses. This synthetic data then feeds into a targeted fine-tuning phase, updating the model's parameters. The cycle repeats, with each iteration producing a refined model that, in turn, generates higher-quality training signals for the subsequent round. This recursive nature allows for continuous improvement without the need for constant human annotation or large-scale, general retraining. MiniMax reports a significant 30% improvement on internal benchmarks due to this RSO process, a figure that underscores the efficacy of the mechanism in enhancing specific capabilities. The architectural innovation is not in real-time self-modification during inference, but in the recursive nature of the training process itself, where the model actively participates in its own development lifecycle. This paradigm shift has profound implications for the maintenance and evolution of AI agents, promising reduced operational overhead and compounding performance gains in multi-agent systems.
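The evaluate, find gaps, synthesize, fine-tune cycle can be sketched as a toy loop. Everything here is an assumption made for illustration: the "model" is just a dictionary of per-topic scores, the reward model is a lookup, and the update rule, threshold, and function names are hypothetical. It shows only the control flow of the loop, not MiniMax's pipeline.

```python
def evaluate(model: dict[str, float], prompts: list[str]) -> dict[str, float]:
    """Stand-in for the reward model: score the model on each evaluation topic."""
    return {topic: model[topic] for topic in prompts}


def find_gaps(scores: dict[str, float], threshold: float) -> list[str]:
    """Topics whose score falls below the quality threshold."""
    return [topic for topic, s in scores.items() if s < threshold]


def synthesize(gaps: list[str], per_gap: int) -> list[str]:
    """Stand-in for synthetic data generation targeted at the weak topics."""
    return [topic for topic in gaps for _ in range(per_gap)]


def fine_tune(model: dict[str, float], data: list[str], lr: float) -> dict[str, float]:
    """Toy parameter update: each example nudges its topic's score toward 1.0."""
    updated = dict(model)
    for topic in data:
        updated[topic] += lr * (1.0 - updated[topic])
    return updated


def rso(model, prompts, threshold=0.8, per_gap=4, lr=0.1, max_rounds=10):
    """Repeat evaluate -> find gaps -> synthesize -> fine-tune until no gaps remain."""
    for round_idx in range(max_rounds):
        gaps = find_gaps(evaluate(model, prompts), threshold)
        if not gaps:
            return model, round_idx
        model = fine_tune(model, synthesize(gaps, per_gap), lr)
    return model, max_rounds


start = {"math": 0.5, "code": 0.9, "law": 0.7}
final, rounds = rso(start, list(start))
```

In this toy run only the topics below the threshold receive synthetic data, so the already strong topic is left untouched, which mirrors the targeted nature of the fine-tuning phase.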
Method
Imagine an AI model that's like a student who can also grade their own homework. First, the AI takes a test. Then, it uses a special grading tool to see how well it did on each question. If it finds a question it got wrong or answered poorly, it doesn't just stop. Instead, it creates new practice questions specifically for the topics it struggled with. It then studies these new practice questions to get better. This whole process repeats, making the AI smarter over time without needing a human teacher to create every single new lesson.
Week 13 (4 papers)
This week Google introduces TurboQuant, which reduces the KV-cache footprint by 5x and substantially improves vector search. The paper has a major impact on the RAM consumption of open-source models in long-context applications and on vector search in general. "Thinking—Fast, Slow, and Artificial: How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender" argues that your brain surrenders at a certain point when you review AI agent tasks: you adopt the result as your own thought and forget to be critical. "Hyperagents: Self-Referential Agents for Open-Ended Self-Improvement" is a smart method that lets a meta agent improve not only the task agent but also itself, which enables exceptional transfer learning. "Agentic AI and the Next Intelligence Explosion" brings another perspective to AGI. It argues that models could be social and discuss problems as we do.
Executive Summary
The prevailing paradigm of an intelligence explosion as a singular, monolithic AI entity achieving godlike intelligence is fundamentally flawed. Instead, the authors propose that emergent intelligence, analogous to historical evolutionary transitions, will be plural, social, and deeply intertwined with existing human and AI systems. This perspective shifts the focus from individual AI capabilities to the complex interactions and emergent properties of multi-agent systems. The core mechanism driving this phenomenon is the observation that even ostensibly singular reasoning models, when optimized for accuracy, spontaneously develop internal 'societies of thought.' These are simulated multi-agent interactions, akin to internal debates among distinct cognitive perspectives, which demonstrably improve reasoning performance on complex tasks. This emergent behavior is not explicitly trained but arises from reinforcement learning that rewards task completion, suggesting that robust reasoning is inherently a social process, even when occurring within a single computational entity. The implications of this 'plurality model' are profound. It suggests that the future of AI development should not solely focus on scaling individual agent capabilities but on designing and orchestrating complex, hybrid human-AI social systems. Drawing parallels from social and organizational sciences, the authors advocate for building 'agent institutions' (structured frameworks of roles, norms, and protocols) that govern interactions within these emergent AI ecologies. This contrasts with current alignment paradigms like Reinforcement Learning from Human Feedback (RLHF), which are described as dyadic and insufficient for scaling to billions of interacting agents. The proposed 'institutional alignment' emphasizes the importance of well-defined roles and protocols over individual agent virtue, mirroring how human societies function.
This shift necessitates a re-evaluation of AI governance, moving towards constitutional frameworks that ensure checks and balances between diverse AI and human actors. The intelligence explosion is thus not a future event but an ongoing process, manifesting in the complex interactions within AI models and the evolving human-AI collaborations that are reshaping knowledge work and societal structures.
Method
Imagine how humans get smarter not just by studying harder, but by discussing ideas with others, debating, and working in teams. This paper suggests that advanced AI models are doing something similar internally. When these AIs are trained to be good at solving problems, they start to act like a group of different AI 'personalities' having a conversation with each other inside their own 'minds.' This internal teamwork helps them reason better. The paper argues that instead of building one super-smart AI, we should focus on creating smart teams of AIs and humans, much like how societies organize themselves with rules and roles to function effectively.
Executive Summary
This paper introduces TurboQuant, a novel online vector quantization (VQ) framework designed to minimize both Mean Squared Error (MSE) and inner product distortion, addressing limitations of existing methods that often compromise one for the other or require data-dependent training. The core innovation lies in a two-stage approach that leverages random rotations and coordinate-wise quantization. Initially, input vectors are randomly rotated. This rotation induces a Beta distribution on each coordinate, which, in high dimensions, approximates a Gaussian distribution and leads to near-independence between coordinates. This near-independence is critical, as it allows for the application of optimal scalar quantizers to each coordinate independently, achieving near-optimal MSE distortion rates. The method is data-oblivious, meaning it does not require prior knowledge of the data distribution, making it suitable for online applications. To address the bias introduced by MSE-optimal quantizers in inner product estimation, TurboQuant employs a second stage. This stage applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to the residual vector (the difference between the original and MSE-quantized vector). This two-stage process results in an unbiased and low-distortion inner product quantizer. The authors provide theoretical guarantees, demonstrating that TurboQuant achieves distortion rates within a small constant factor (approximately 2.7) of the information-theoretic lower bounds for both MSE and inner product distortion, with significant improvements at lower bit-widths. Experimental results validate these claims, showing competitive performance in KV cache quantization and nearest neighbor search tasks, outperforming existing methods in recall while drastically reducing indexing time. 
The key insight is that by decoupling the problem into coordinate-wise MSE optimization and then correcting for inner product bias using a specialized residual quantizer (QJL), TurboQuant achieves a superior trade-off between compression and accuracy. The random rotation acts as a data-agnostic preprocessing step that transforms the data distribution into a more amenable form for independent scalar quantization, a significant departure from methods that rely on complex data-dependent codebook learning. This approach is particularly valuable for real-time applications where online adaptation is crucial, such as in large language model inference.
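The two stages can be sketched in pure Python under several simplifying assumptions, all of them ours rather than the paper's: a uniform scalar quantizer on [-1, 1] stands in for the distribution-optimal one, the input is a unit vector, the query is taken in rotated space, and the sign sketch uses far more projection rows than the vector dimension so that a single demo estimate is tight. Function names are hypothetical.

```python
import math
import random


def random_rotation(d: int, rng: random.Random) -> list[list[float]]:
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows: list[list[float]] = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def quantize_coords(x, bits):
    """Stage 1: per-coordinate uniform quantizer on [-1, 1] (simplifying assumption)."""
    levels = 2 ** bits
    step = 2.0 / levels
    out = []
    for v in x:
        idx = min(levels - 1, max(0, int((v + 1.0) / step)))
        out.append(-1.0 + (idx + 0.5) * step)
    return out


def qjl_encode(r, s):
    """Stage 2: 1-bit sketch of the residual (signs of Gaussian projections) plus its norm."""
    signs = [1.0 if p >= 0 else -1.0 for p in matvec(s, r)]
    return signs, math.sqrt(sum(a * a for a in r))


def qjl_inner(signs, r_norm, s, y):
    """Unbiased estimate of <r, y>, using E[sign(s.r) * (s.y)] = sqrt(2/pi) * <r, y> / ||r||."""
    proj = matvec(s, y)
    return math.sqrt(math.pi / 2) * r_norm / len(s) * sum(a * b for a, b in zip(signs, proj))


rng = random.Random(0)
d, bits, m = 16, 3, 2000
step = 2.0 / 2 ** bits
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
x_norm = math.sqrt(sum(a * a for a in x))
x = [a / x_norm for a in x]                      # unit-norm input vector

xr = matvec(random_rotation(d, rng), x)          # data-oblivious rotation
xq = quantize_coords(xr, bits)                   # MSE-oriented stage
residual = [a - b for a, b in zip(xr, xq)]
s = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
signs, r_norm = qjl_encode(residual, s)          # inner-product correction stage

y = xr                                           # query equal to the stored vector, true <x, y> = 1
estimate = sum(a * b for a, b in zip(xq, y)) + qjl_inner(signs, r_norm, s, y)
```

The estimate combines the inner product with the coarse reconstruction and a correction term recovered from one sign bit per projection of the residual.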
Method
Imagine you have a very long list of numbers (a vector) that you want to compress. TurboQuant does this in two main steps. First, it slightly jiggles or rotates the entire list of numbers. This jiggling makes it so that each number in the list can be compressed independently, like compressing individual words in a sentence. It uses a smart way to compress each number to save space, focusing on minimizing the overall error (like making sure the compressed list is as close as possible to the original list in terms of overall 'size'). However, just compressing each number might mess up how well you can calculate the 'similarity' (inner product) between two lists later. So, the second step takes the small errors left over from the first compression and compresses them again, but this time specifically to make sure the similarity calculations are accurate and unbiased. This two-step process allows for very high compression while keeping both the overall accuracy and the accuracy of similarity calculations very good, even with very few bits per number.
Executive Summary
This paper introduces 'hyperagents,' a novel framework for recursive self-improvement in AI systems that aims to overcome the limitations of fixed meta-level mechanisms. The core innovation lies in making the self-improvement process itself editable and subject to improvement, a concept termed 'metacognitive self-modification.' Unlike prior systems like the Darwin Gödel Machine (DGM), which rely on handcrafted instruction generation for self-modification, hyperagents integrate task and meta-agents into a single, self-referential program. This allows the meta-agent to evolve, not just in its ability to solve tasks, but crucially, in its ability to generate future improvements. This is achieved by extending the DGM's open-ended exploration structure to a hyperagent architecture, creating DGM-Hyperagents (DGM-H). The DGM-H maintains an archive of evolving agents, where successful variants serve as stepping stones. The key mechanism is that a hyperagent can modify not only its task-solving code but also the code that governs its self-modification process. This enables a departure from domain-specific alignment assumptions, where improvements in task performance directly translate to improvements in self-improvement capability. The framework is demonstrated across diverse domains, including coding, paper review, robotics reward design, and math grading, showing substantial and generalizable gains. The meta-level improvements learned by DGM-H are shown to transfer across domains and compound over multiple runs, suggesting a path towards self-accelerating progress on any computable task. The work highlights the potential for AI systems to not only find better solutions but to continually improve their search for how to improve, with significant implications for AI capabilities and safety.
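As a loose numeric analogy only (not the DGM-H system, which edits program code with an LLM), the idea that the modification rule is itself modifiable can be shown with an archive-based search in which each agent carries a task-level value `guess` and a meta-level value `step` that controls how the next edit is made. The task, the parameter names, and the parent-selection rule are all assumptions invented for this sketch.

```python
import random


def score(agent: dict) -> float:
    """Task performance: closeness of the agent's guess to a hidden target."""
    return -abs(agent["guess"] - 42.0)


def self_modify(parent: dict, rng: random.Random) -> dict:
    """Produce a child by editing both the task-level and the meta-level part.

    `step` controls how boldly `guess` is edited, and `step` is itself edited,
    so the rule that generates improvements is also subject to change.
    """
    step = parent["step"] * rng.choice([0.5, 1.0, 2.0])   # meta-level edit
    guess = parent["guess"] + rng.uniform(-step, step)    # task-level edit
    return {"guess": guess, "step": step}


def evolve(generations: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    archive = [{"guess": 0.0, "step": 1.0}]
    for _ in range(generations):
        # Prefer stronger agents as parents, but keep every variant as a stepping stone.
        ranked = sorted(archive, key=score, reverse=True)
        parent = rng.choice(ranked[: max(1, len(ranked) // 2)])
        archive.append(self_modify(parent, rng))
    return archive


archive = evolve(generations=200)
best = max(archive, key=score)
```

Because the archive never discards variants, the best agent can only match or exceed the starting agent, while lineages that happen to find a useful `step` get to reuse it in later edits.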
Method
Imagine an AI that can not only learn to do a task better, but can also learn how to learn better. This is what 'hyperagents' do. They are like a team where one part focuses on doing the job, and the other part focuses on improving how the whole team works, including how they learn to improve. By making the 'improver' part of the team also able to improve itself, the AI can get better and better at getting better, not just at the task itself. This allows it to learn and improve across many different kinds of jobs, not just one.
Executive Summary
This paper introduces Tri-System Theory, a novel cognitive framework that extends traditional dual-process models of human reasoning by incorporating artificial intelligence (AI) as a third cognitive system (System 3). The theory posits that System 3, which is external, automated, and data-driven, operates alongside internal intuitive (System 1) and deliberative (System 2) processes. This framework addresses the increasing reliance on AI in decision-making, proposing that System 3 can supplement, supplant, or reconfigure internal cognitive functions. A key prediction is 'cognitive surrender,' a phenomenon where individuals uncritically adopt AI outputs, bypassing their own reasoning processes. This surrender is characterized by a shift in the locus of cognitive control to the AI, driven by factors such as the AI's fluency, confidence, and the user's desire for cognitive ease, time pressure, or reduced cognitive load. The research empirically demonstrates cognitive surrender through three experiments using a modified Cognitive Reflection Test. Participants with access to an AI assistant frequently adopted its outputs, even when incorrect, leading to reduced accuracy compared to a baseline condition. This effect persisted across various conditions, including time pressure and incentive structures, although the latter showed some mitigation. Individual differences, such as trust in AI, need for cognition, and fluid intelligence, were found to moderate susceptibility to cognitive surrender, with higher trust in AI and lower cognitive capacity increasing vulnerability. The findings suggest that System 3 is not merely a tool but an active cognitive agent that fundamentally reshapes human judgment, agency, and accountability in the age of AI, necessitating a re-evaluation of cognitive architecture and the development of strategies to promote calibrated AI engagement.
Method
Researchers tested how people use AI when solving problems. They had participants solve puzzles, sometimes with an AI assistant available and sometimes without. They also changed how accurate the AI was, sometimes making it right and sometimes wrong. In some tests, they added time pressure, and in others, they offered rewards for correct answers. They measured how often people used the AI, whether they followed its advice, and how accurate their final answers were. They also looked at how individual traits, like how much people trust AI, affected their choices. This helped them understand when people rely too much on AI, a phenomenon called 'cognitive surrender.'
March
Week 12 (8 papers)
This week’s paper “How LLMs Distort Our Written Language” shows that LLMs can significantly alter the meaning of expert feedback on texts compared to human revision. In particular, they tend to shift strongly opinionated feedback toward more neutral positions. “Expert Personas Improve LLM Alignment but Damage Accuracy” highlights an interesting trade-off: prompting models with expert roles (e.g., “You are a statistics professor”) increases perceived reliability and user trust, but actually reduces factual accuracy. “OpenClaw-RL: Train Any Agent Simply by Talking” proposes a different reinforcement learning paradigm. Instead of rewarding only final outcomes, it assigns feedback to intermediate steps (such as thoughts, tool usage, and UI actions), enabling more fine-grained learning. “Measuring Progress Toward AGI: A Cognitive Framework” introduces ten key cognitive capabilities inspired by human intelligence that can serve as targets for evaluating progress toward AGI. “Why AI Systems Don’t Learn—and What to Do About It” provides a critical perspective on current AI systems, arguing that essential components for autonomous learning, drawn from cognitive science, are still missing. Finally, SkillNet presents an interesting approach to structuring and organizing skills within agent systems.
Executive Summary
This research addresses the nuanced impact of expert persona prompts on Large Language Models (LLMs), revealing a critical dichotomy: while personas enhance alignment-dependent tasks like safety and preference following, they detrimentally affect pretraining-dependent knowledge retrieval and discriminative accuracy. The core problem lies in the conflicting objectives these tasks impose on LLMs. Alignment tasks, often reinforced during instruction tuning, benefit from the explicit guidance and stylistic adaptation that persona prompts provide. Conversely, knowledge retrieval tasks rely on the LLM's foundational, pre-trained knowledge, which can be disrupted by the instruction-following mode activated by persona prompts. This explains the historically mixed results in the literature, where some studies report gains while others observe performance degradation. The proposed solution, PRISM (Persona Routing via Intent-based Self-Modeling), tackles this by developing a self-contained, bootstrapped pipeline that internalizes intent-conditioned expert persona routing. PRISM avoids external data or models by using the LLM itself to generate synthetic queries and answers, then employing a self-verification mechanism to identify instances where a persona genuinely improves output quality. This curated data is then used to train a lightweight, gated LoRA adapter. The gate acts as an intelligent router, activating the persona-specific adapter only when beneficial, thereby preserving the base model's performance on tasks where personas are detrimental. This approach effectively disentangles the beneficial alignment signals from the harmful accuracy degradation, achieving a synergistic effect where alignment is boosted without compromising core knowledge capabilities. 
The key insight is that persona effectiveness is not a universal property but is fundamentally tied to the LLM's training history and the specific task's reliance on pre-trained knowledge versus instruction-following capabilities. PRISM operationalizes this insight by learning to dynamically route to persona-enhanced behaviors only when they align with the task's requirements, effectively creating a conditional expert. This mechanism allows LLMs to leverage the strengths of persona prompting for alignment while mitigating its weaknesses for knowledge-intensive tasks, offering a more robust and reliable method for persona integration in production systems.
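A gated low-rank adapter of the kind described can be sketched as y = W x + g · B(A x), where the gate g decides whether the persona adapter contributes at all. The hard 0/1 threshold, the sigmoid gate over the raw input, and every name below are assumptions made for illustration; the summary does not specify how PRISM parameterizes its gate.

```python
import math


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def gate_prob(x, gate_w, gate_b):
    """Sigmoid gate: estimated probability that the persona adapter helps on this input."""
    z = sum(w * v for w, v in zip(gate_w, x)) + gate_b
    return 1.0 / (1.0 + math.exp(-z))


def gated_lora_forward(x, w, a, b, gate_w, gate_b, threshold=0.5):
    """Return (y, g) with y = W x + g * B (A x) and g in {0, 1} (hard gate, an assumption)."""
    base = matvec(w, x)
    g = 1.0 if gate_prob(x, gate_w, gate_b) >= threshold else 0.0
    delta = matvec(b, matvec(a, x))          # low-rank update, rank = len(a)
    return [y + g * d for y, d in zip(base, delta)], g


W = [[1.0, 0.0], [0.0, 1.0]]                 # frozen base weights
A = [[1.0, 1.0]]                             # rank-1 down-projection
B = [[0.5], [-0.5]]                          # rank-1 up-projection
GATE_W, GATE_B = [1.0, 0.0], 0.0             # hypothetical trained gate

persona_on, g_on = gated_lora_forward([2.0, 1.0], W, A, B, GATE_W, GATE_B)
persona_off, g_off = gated_lora_forward([-2.0, 1.0], W, A, B, GATE_W, GATE_B)
```

When the gate is closed the output is exactly the base model's, which is how this kind of routing preserves accuracy on inputs where the persona would hurt.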
Method
Imagine you have a helpful assistant who is great at creative writing but sometimes gets facts wrong. This research found that giving the assistant a specific 'expert' role (like 'writer' or 'scientist') makes them better at creative tasks but worse at factual ones. To fix this, they created a smart 'gatekeeper' system called PRISM. PRISM first has the assistant generate questions and answers for different expert roles. Then, the assistant itself checks which of these expert answers are actually better than its normal answers. PRISM uses this information to train a small add-on that the gatekeeper can turn on only when the expert role is truly helpful, ensuring the assistant stays accurate for factual questions while still being great at creative ones.
Executive Summary
This research investigates the subtle yet pervasive influence of Large Language Models (LLMs) on human writing, demonstrating that LLMs not only alter stylistic elements like voice and tone but also consistently shift the intended meaning and argumentative stance of text. The study employs a multi-faceted approach, combining a human user study with quantitative analysis of LLM-edited essays and real-world data from scientific peer reviews. The core mechanism of distortion appears to stem from LLMs' tendency to homogenize text towards a statistically probable, neutral, and often more formal style, diverging significantly from human-generated content. This homogenization is observed even when LLMs are prompted with specific feedback or instructed to make minimal edits, indicating a fundamental difference in how LLMs process and revise text compared to human editors. The findings highlight a critical gap between the perceived utility of LLMs as writing assistants and their actual impact on the semantic and stylistic integrity of human expression, raising concerns about the long-term effects on cultural and scientific discourse.
Method
Researchers studied how people write with and without AI writing tools. They asked people to write essays and also had AI rewrite existing essays. They then compared the AI-written or AI-edited essays to the original human writing. They looked at how the meaning, word choices, emotions, and sentence structures changed. They also analyzed real AI-written reviews from a scientific conference to see if AI changed how scientists evaluate research. This helps them understand if AI changes what people mean when they write.
Executive Summary
The paper introduces OpenClaw-RL, a novel reinforcement learning (RL) framework designed to enable continuous, online learning for AI agents by leveraging "next-state signals." The core insight is that signals generated after an agent's action (such as user replies, tool execution results, or GUI state changes) are not merely contextual but contain rich evaluative and directive information about the preceding action. Existing agentic RL systems largely discard this information or process it offline, missing a crucial opportunity for live improvement. OpenClaw-RL unifies these diverse interaction types into a single training loop, treating personal conversations, terminal interactions, GUI operations, software engineering tasks, and tool-call traces as homogeneous data streams for policy optimization. The framework's technical innovation lies in its fully decoupled, asynchronous architecture. This design comprises four independent loops: policy serving, environment interaction, reward judging (PRM), and policy training. This decoupling ensures zero interruption to serving, allowing for continuous training from live, heterogeneous streams without batching or pausing. Two complementary methods are employed for signal recovery: Binary RL uses a Process Reward Model (PRM) to convert evaluative signals into dense, scalar rewards for each turn, providing broad coverage. Hindsight-Guided On-Policy Distillation (OPD) extracts textual hints from next-state signals to generate token-level directional supervision, offering richer, more specific feedback for improving actions. OPD specifically distills directive information, which scalar rewards cannot capture, by constructing an enhanced teacher context and providing token-level advantage signals. The combination of these methods, weighted appropriately, yields significant performance gains.
OpenClaw-RL demonstrates scalability across personal agents (e.g., conversational assistants) and general-purpose agents in various environments (terminal, GUI, SWE, tool-call). For personal agents, it enables continuous personalization based on user interactions. For general agents, it supports large-scale RL training by leveraging cloud-hosted environments. The framework's ability to integrate process rewards (from PRMs) with outcome rewards is shown to be vital for long-horizon tasks, addressing the sparse reward problem inherent in traditional RL. The key takeaway is that by treating all interaction signals as a unified, live learning source, OpenClaw-RL allows agents to improve simply by being used, moving beyond static datasets and offline training paradigms.
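One way to picture how the two signals combine is a weighted sum per token: the scalar PRM reward is broadcast across the turn, and the OPD term adds a per-token direction. The specific form of the OPD advantage used here (log-probability under the hint-aware teacher minus log-probability under the current policy), the weights, and the function names are assumptions for this sketch; the summary says only that the methods are "weighted appropriately."

```python
def binary_rl_advantages(turn_reward: float, num_tokens: int) -> list[float]:
    """Broadcast one scalar PRM reward (e.g. +1 or -1) to every token of the turn."""
    return [turn_reward] * num_tokens


def opd_advantages(teacher_logp: list[float], student_logp: list[float]) -> list[float]:
    """Token-level direction: how much more the hint-aware teacher favors each token."""
    return [t - s for t, s in zip(teacher_logp, student_logp)]


def combined_advantages(turn_reward: float, teacher_logp: list[float],
                        student_logp: list[float],
                        w_binary: float = 1.0, w_opd: float = 1.0) -> list[float]:
    """Weighted sum of the evaluative (scalar) and directive (token-level) signals."""
    binary = binary_rl_advantages(turn_reward, len(student_logp))
    opd = opd_advantages(teacher_logp, student_logp)
    return [w_binary * b + w_opd * o for b, o in zip(binary, opd)]


# Hypothetical three-token action judged "good" overall by the PRM.
advantages = combined_advantages(
    turn_reward=1.0,
    teacher_logp=[-0.5, -2.0, -0.1],
    student_logp=[-1.0, -1.0, -0.1],
    w_binary=1.0,
    w_opd=0.5,
)
```

In this example the second token is still credited for the good turn overall but less than its neighbors, because the hint-aware teacher found it less likely than the student did.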
Method
Imagine an AI agent that's learning to be better by paying attention to what happens *after* it does something. Instead of just moving on, it looks at the user's reaction, the result of a tool it used, or the state of a program. OpenClaw-RL treats all these reactions as valuable feedback. It has two main ways of learning from this feedback: First, it gets a simple 'good' or 'bad' score for its actions, like a teacher grading homework. Second, if the feedback is a specific suggestion, like 'you should have done X first,' it uses that detailed advice to learn exactly how to change its actions, like a student getting precise instructions. By combining these two learning methods and applying them to all sorts of tasks, from chatting to coding, the agent gets better simply by being used.
Executive Summary
This paper addresses the underexplored aspect of 'scientific taste' in AI scientists, defining it as the capacity to judge and propose research ideas with high potential impact. The core problem is that while AI has advanced in executing research tasks (literature search, experimentation), its ability to discern and generate impactful ideas remains nascent. The authors propose Reinforcement Learning from Community Feedback (RLCF) as a novel training paradigm to imbue AI with this 'taste'. RLCF leverages large-scale, naturally occurring community signals, primarily citations, as a proxy for scientific impact. This approach circumvents the need for expensive human annotation (as in RLHF) and the limitations of verifiable rewards (as in RLVR) for open-ended tasks like scientific judgment and ideation. The methodology involves two main components: SCIENTIFIC JUDGE and SCIENTIFIC THINKER. SCIENTIFIC JUDGE is a generative reward model trained to predict the relative impact of research ideas using a dataset (SCIJUDGEBENCH) of 700K field- and time-matched paper abstract pairs, where higher-cited papers are considered preferred. This preference modeling is achieved through Group Relative Policy Optimization (GRPO), where the model learns to predict the correct preference label for a given pair. SCIENTIFIC THINKER, a policy model, is then trained using SCIENTIFIC JUDGE as a reward function. It learns to propose novel research ideas by engaging in a comparison-based GRPO process, where its generated ideas are evaluated by SCIENTIFIC JUDGE in a round-robin tournament. This alignment phase allows SCIENTIFIC THINKER to generate ideas that are not only novel but also likely to be impactful according to community signals. The key insight is that 'scientific taste' is not an innate human quality but a learnable objective that can be distilled from collective human judgment as reflected in citation patterns. 
The RLCF framework, by using community feedback, provides a scalable and effective mechanism for learning this objective. The authors demonstrate that SCIENTIFIC JUDGE significantly outperforms state-of-the-art LLMs in predicting paper impact and generalizes across time, fields, and evaluation metrics (peer review scores). Furthermore, SCIENTIFIC THINKER, guided by SCIENTIFIC JUDGE, generates ideas with demonstrably higher potential impact than baseline models. This work represents a significant step towards developing AI systems capable of genuine scientific creativity and judgment.
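The comparison-based reward can be sketched as a round-robin tournament followed by the group-relative normalization that GRPO applies within a sampled group. The judge below is a deliberately trivial, hypothetical stand-in (it prefers the longer string) so the example runs without a model; only the tournament and normalization logic are the point.

```python
from itertools import combinations
from statistics import mean, pstdev


def judge_prefers_first(idea_a: str, idea_b: str) -> bool:
    """Hypothetical stand-in for SCIENTIFIC JUDGE: prefers the longer idea."""
    return len(idea_a) >= len(idea_b)


def round_robin_rewards(ideas: list[str]) -> list[float]:
    """Reward of each idea = fraction of pairwise comparisons it wins."""
    wins = [0] * len(ideas)
    for i, j in combinations(range(len(ideas)), 2):
        if judge_prefers_first(ideas[i], ideas[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (len(ideas) - 1) for w in wins]


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: reward standardized within the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


ideas = ["a", "bbb", "cc", "dddd"]           # placeholder "research ideas"
rewards = round_robin_rewards(ideas)
advantages = group_relative_advantages(rewards)
```

Because rewards are win rates within the group, the policy is pushed toward ideas the judge prefers relative to its own other samples, without needing an absolute impact score.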
Method
Imagine you want to teach a computer to have good 'taste' in science, meaning it can tell which research ideas are likely to be important. First, you gather lots of research papers and look at which ones get cited a lot by other scientists – that's like community approval. You create pairs of papers, one that got many citations and one that got fewer, and teach a 'judge' AI to tell them apart. Then, you use this 'judge' AI to help another AI, a 'thinker,' come up with its own new research ideas. The thinker proposes ideas, and the judge AI tells it which ones are better, guiding the thinker to generate more impactful ideas over time. This way, the AI learns to 'taste' good science from what the scientific community values.
Executive Summary
Current AI systems, despite their impressive capabilities, fundamentally lack autonomous learning. They are trained offline by human experts through rigid pipelines (MLOps) that involve extensive data curation, model building, and fine-tuning. Once deployed, these models are static and cannot adapt to new environments or unforeseen data distributions, a phenomenon known as domain mismatch. This limitation stems from three core deficiencies: the inability to actively select their own training data, the lack of flexible switching between different learning modes (observation vs. action), and the absence of meta-cognitive abilities to monitor their own performance and learning progress. The paper proposes an integrated cognitive architecture, System A-B-M, inspired by human and animal cognition, to address these limitations. System A handles learning from observation (e.g., self-supervised learning), System B handles learning from action (e.g., reinforcement learning), and System M acts as a meta-controller, orchestrating the interaction between A and B, managing data flow, and dynamically adjusting learning strategies based on internal meta-states (e.g., prediction errors, uncertainty). This meta-control mechanism, drawing parallels to biological meta-cognitive functions, enables autonomous agents to learn and adapt in real-world, dynamic environments. The proposed framework aims to bridge the gap between current AI paradigms and the more flexible, adaptive learning observed in biological organisms, paving the way for more robust and generalizable AI systems. The core insight is that true autonomous learning requires not just learning algorithms, but an orchestrating meta-control system that dynamically manages the learning process itself, mirroring biological intelligence's adaptive nature.
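The orchestration role of System M can be illustrated with a very small control loop. The switching rule (observe while prediction error is above a threshold, otherwise act), the halving of error per observation, and the reward formula are all assumptions chosen to keep the sketch readable; the paper describes the architecture at a conceptual level.

```python
from dataclasses import dataclass, field


@dataclass
class MetaState:
    """Internal signals System M monitors (here only a prediction error)."""
    prediction_error: float
    log: list[str] = field(default_factory=list)


def system_a_observe(state: MetaState) -> None:
    """Learning from observation: the world model improves, so error shrinks."""
    state.prediction_error *= 0.5


def system_b_act(state: MetaState) -> float:
    """Learning from action: reward is higher when the world model is accurate."""
    return 1.0 - state.prediction_error


def system_m_step(state: MetaState, error_threshold: float = 0.2) -> float:
    """Meta-controller: choose the learning mode from the current meta-state."""
    if state.prediction_error > error_threshold:
        system_a_observe(state)
        state.log.append("observe")
        return 0.0
    state.log.append("act")
    return system_b_act(state)


state = MetaState(prediction_error=1.0)
total_reward = sum(system_m_step(state) for _ in range(5))
```

The agent spends its first steps observing and only starts acting once its own error estimate says acting is worthwhile, which is the kind of self-monitored mode switching the summary attributes to System M.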
Method
Imagine an AI that learns like a child. It has two main ways of learning: one is by watching and listening (System A), like learning from books or videos. The other is by doing and trying things out (System B), like playing with a toy and seeing what happens. A third part, the 'manager' (System M), decides when to watch, when to do, and what to focus on, based on how well it thinks it's learning. This manager is like the brain's control center that helps the AI learn more efficiently and adapt to new situations, rather than needing humans to constantly retrain it.
Executive Summary
The pursuit of Artificial General Intelligence (AGI) is currently hampered by a lack of a standardized, empirical framework for measuring progress. This ambiguity leads to subjective claims, hinders effective governance, and impedes clear communication within the research community. To address this, the authors propose a cognitive framework grounded in decades of research from psychology, neuroscience, and cognitive science. The core of this framework is a Cognitive Taxonomy, which deconstructs general intelligence into ten key cognitive faculties. These faculties are not defined by specific computational mechanisms but by observable capabilities, drawing inspiration from human cognition. The taxonomy includes foundational faculties like Perception, Generation, Attention, Learning, Memory, Reasoning, Metacognition, and Executive Functions, alongside composite faculties such as Problem Solving and Social Cognition. This approach is deliberately agnostic to the underlying implementation, focusing instead on *what* a system can do, aligning with Marr's levels of analysis. The framework's second component is a rigorous, three-stage evaluation protocol. This protocol involves conducting a comprehensive cognitive assessment of AI systems across a broad suite of targeted, held-out, and independently verified cognitive tasks. Crucially, it mandates collecting human baselines on the same tasks to establish a meaningful comparison point. The final step is to construct 'cognitive profiles' that visually map a system's strengths and weaknesses relative to human performance distributions. This allows for a nuanced understanding of a system's generality and capability, moving beyond single-score benchmarks. The authors emphasize that this framework is a starting point, intended to foster a more empirical and grounded science of AGI, enabling better tracking of progress and informed discussions about its development and deployment.
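The final step of the protocol, building a cognitive profile relative to human baselines, can be sketched as a per-faculty percentile. The scores below are made-up placeholders and the percentile definition (share of the human sample scoring at or below the system) is one reasonable choice, not necessarily the one the authors use.

```python
from bisect import bisect_right


def percentile_vs_humans(system_score: float, human_scores: list[float]) -> float:
    """Share of the human baseline sample scoring at or below the system (0 to 100)."""
    ranked = sorted(human_scores)
    return 100.0 * bisect_right(ranked, system_score) / len(ranked)


def cognitive_profile(system: dict[str, float],
                      humans: dict[str, list[float]]) -> dict[str, float]:
    """Map each faculty to the system's percentile within the human distribution."""
    return {faculty: percentile_vs_humans(score, humans[faculty])
            for faculty, score in system.items()}


# Placeholder data for two of the ten faculties.
human_baselines = {
    "Memory": [0.4, 0.5, 0.6, 0.7, 0.8],
    "Reasoning": [0.3, 0.5, 0.7, 0.9, 0.95],
}
system_scores = {"Memory": 0.75, "Reasoning": 0.4}

profile = cognitive_profile(system_scores, human_baselines)
```

A profile like this makes uneven capability visible: the same system can sit high in one faculty and low in another, which a single aggregate benchmark score would hide.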
Method
Imagine we want to know how smart a computer is becoming, like how close it is to human-level intelligence. First, we break down 'smartness' into ten key skills that humans use, like seeing, remembering, thinking, and solving problems. Then, we create a set of challenging puzzles and tests for each of these skills. We give these tests to the computer and also to many different people. Finally, we compare how the computer did on each test compared to the people. This helps us see where the computer is strong, where it's weak, and how far it has come.
Executive Summary
Modern large language models (LLMs) predominantly utilize residual connections, typically combined with Pre-Normalization, to facilitate gradient flow and enable training of deep networks. While effective for gradient propagation, these standard residual connections aggregate all preceding layer outputs with fixed, uniform weights. This uniform aggregation leads to a phenomenon termed 'PreNorm dilution,' where hidden-state magnitudes grow linearly with depth (O(L)). This growth progressively diminishes the relative contribution of earlier layers, effectively burying information and making a significant fraction of layers prunable with minimal performance loss. The core insight of this work is that the depth-wise aggregation in standard residuals, like sequence modeling before attention, suffers from a lack of selective information retrieval. To address this, the paper proposes 'Attention Residuals' (AttnRes), a mechanism that replaces the fixed additive aggregation with a learned, input-dependent softmax attention mechanism over preceding layer outputs. Specifically, each layer 'attends' to all previous layer outputs, using a single learned pseudo-query vector to compute attention weights. This allows each layer to selectively aggregate relevant information from its history, mitigating the dilution problem and enabling more uniform output magnitudes and gradient distributions across depth. To make this mechanism scalable for large models, the authors introduce 'Block Attention Residuals' (Block AttnRes). This variant partitions the layers into blocks and performs attention over block-level representations rather than individual layer outputs. This significantly reduces the memory footprint from O(Ld) to O(Nd), where N is the number of blocks, making it practical for large-scale training. 
The paper further details infrastructure optimizations, including cross-stage caching and a two-phase computation strategy, to minimize the overhead associated with Block AttnRes during distributed training and inference. Empirical results demonstrate that AttnRes consistently outperforms standard residual connections across various model sizes and tasks, achieving comparable or better performance with significantly less compute. The analysis of training dynamics reveals that AttnRes effectively mitigates PreNorm dilution, leading to more stable training and improved downstream performance, particularly on compositional reasoning tasks.
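The difference between uniform residual aggregation and attention over earlier layer outputs can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the paper's implementation: layer outputs are short lists of floats, the pseudo-query is a fixed vector rather than a learned parameter, and no score scaling or key normalization is applied (the paper may do either).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def standard_residual(layer_outputs):
    """Uniform aggregation: every earlier output is added with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(out[i] for out in layer_outputs) for i in range(dim)]


def attention_residual(layer_outputs, pseudo_query):
    """Softmax-weighted aggregation over earlier outputs using one query vector."""
    dim = len(pseudo_query)
    scores = [sum(q * k for q, k in zip(pseudo_query, out)) for out in layer_outputs]
    weights = softmax(scores)
    mixed = [sum(w * out[i] for w, out in zip(weights, layer_outputs)) for i in range(dim)]
    return mixed, weights


# Three earlier layer outputs and a hypothetical pseudo-query.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

plain = standard_residual(outputs)
mixed, weights = attention_residual(outputs, query)
print("uniform sum:", plain)
print("attention weights:", [round(w, 3) for w in weights])
print("attention mix:", [round(v, 3) for v in mixed])
```

Because the softmax weights sum to one, the attention mix is a convex combination of the earlier outputs, so its magnitude does not grow with the number of layers the way the uniform sum does.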
Method
Imagine a student trying to learn a complex subject by reading many textbooks. Standard learning methods make the student add up everything they read from all books, giving equal importance to every sentence. This can lead to confusion and the student forgetting what they learned early on. Attention Residuals (AttnRes) is like giving the student a special highlighter. Instead of just adding everything, the student can now selectively highlight and focus on the most important parts from previous books based on what they are currently reading. This helps them build knowledge more effectively. For very large subjects, a 'Block' version groups books into chapters, so the student only needs to remember the main ideas from each chapter, making it much easier to manage.
Executive Summary
The advancement of AI agents is currently hampered by a fundamental lack of systematic skill consolidation and transfer. Existing agents often "reinvent the wheel," failing to leverage prior experiences or established strategies, leading to inefficient and repetitive problem-solving. This paper introduces SkillNet, an open infrastructure designed to address this limitation by creating, evaluating, and organizing AI skills at scale. SkillNet conceptualizes skills as unified knowledge representations that bridge natural language understanding with machine-executable logic. It employs a comprehensive Skill Ontology comprising a taxonomic layer for functional categorization, a relational layer for inter-skill dependencies and composition, and a skill-package layer for modular deployment. A key innovation is the multi-dimensional evaluation framework that assesses skills across Safety, Completeness, Executability, Maintainability, and Cost-awareness. This framework, largely automated via an LLM-based evaluator, ensures the quality and reliability of the skills within the repository. The infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a Python toolkit. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that agents augmented with SkillNet achieve significant performance improvements, with average rewards increasing by 40% and execution steps decreasing by 30% across various backbone models. This indicates that SkillNet effectively transforms fragmented experience into durable, composable assets, enabling agents to progress from transient experience to sustained mastery. The system's ability to formalize skills as evolving, interconnected entities is crucial for building more robust, generalizable, and continuously improving AI agents, moving beyond episodic learning towards cumulative intelligence.
Method
Imagine building with LEGOs. SkillNet is like a giant LEGO store for AI agents. It automatically creates new LEGO bricks (skills) from various sources like instructions or past projects. Then, it carefully checks each brick to make sure it's safe, works correctly, and is easy to use. Finally, it organizes these good bricks into a smart system that understands how they fit together, allowing AI agents to easily find and use the right bricks to build complex things, rather than having to make every brick from scratch.
Week 11 (5 papers)
This week, the paper "How much Do LLMs Hallucinate in Document Q&A Scenarios? ..." shows that even strong models hallucinate in RAG systems and that longer context makes the problem considerably worse. "Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought" argues that many CoT steps are decorative and do not help generate the correct answer. Its True Thinking Score (0-1) reaches ≥ 0.7 for only about 2% of reasoning steps.
Executive Summary
This editorial addresses the critical need for rigorous methodological and statistical appraisal of observational studies, particularly in light of their increasing prominence in generating real-world evidence. While randomized controlled trials (RCTs) are the gold standard for establishing causality due to their inherent control over confounding, they are often impractical, costly, and limited in external validity due to highly selected populations. Observational studies, conversely, offer a cost-effective and timely means to gather insights from large datasets, but they are inherently susceptible to various biases. The authors delineate key considerations for readers to critically evaluate these studies, focusing on the accurate identification and handling of confounding, mediation, and collider variables. Confounders, which are associated with both exposure and outcome, must be adjusted for to prevent biased estimates. Mediators, which lie on the causal pathway, should be analyzed separately and not simply included as covariates. Collider variables, influenced by both exposure and outcome, can induce spurious associations if conditioned upon, necessitating the use of directed acyclic graphs (DAGs) to identify and avoid them. The editorial also emphasizes the importance of transparent reporting and careful assessment of how missing data are handled, advocating for methods like multiple imputation when appropriate, but with caution regarding the proportion of missingness. Overfitting in statistical models is another concern, where models become too tailored to the training data, compromising generalizability. Propensity scores (PS) are presented as a powerful tool for reducing confounding in observational studies by balancing baseline covariates between groups, with methods like Inverse Probability of Treatment Weighting (IPTW) aiming to estimate the Average Treatment Effect (ATE) and matching aiming for the Average Treatment Effect in the Treated (ATT). 
The authors stress that the choice of estimand should be driven by clinical relevance. Finally, the importance of robustness and sensitivity analyses to assess the impact of unmeasured confounders and model assumptions is highlighted, alongside a nuanced discussion on adjusting for multiple comparisons. The overarching message is that while observational studies offer invaluable insights, their interpretation demands a sophisticated understanding of potential biases and the statistical techniques employed to mitigate them.
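As a worked illustration of IPTW for the ATE, the sketch below weights each treated subject by 1/e and each control subject by 1/(1 - e), where e is the propensity score, and then compares weighted mean outcomes. It assumes the propensity scores are already estimated (in practice typically with a logistic regression on baseline covariates). The data and function names are made up for the example.

```python
def iptw_weights(treated, propensity):
    """ATE weights: 1/e for treated subjects, 1/(1 - e) for controls."""
    return [
        1.0 / e if t else 1.0 / (1.0 - e)
        for t, e in zip(treated, propensity)
    ]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def iptw_ate(outcomes, treated, propensity):
    """Difference in weighted mean outcome between treated and control groups."""
    weights = iptw_weights(treated, propensity)
    groups = {}
    for flag in (1, 0):
        ys = [y for y, t in zip(outcomes, treated) if t == flag]
        ws = [w for w, t in zip(weights, treated) if t == flag]
        groups[flag] = weighted_mean(ys, ws)
    return groups[1] - groups[0]


# Toy cohort: treatment flag, estimated propensity score, observed outcome.
treated = [1, 1, 0, 0]
propensity = [0.8, 0.5, 0.5, 0.2]
outcomes = [10.0, 8.0, 6.0, 4.0]

weights = iptw_weights(treated, propensity)
ate = iptw_ate(outcomes, treated, propensity)
print("weights:", [round(w, 2) for w in weights])
print("IPTW estimate of the ATE:", round(ate, 3))
```

Subjects who received a treatment they were unlikely to receive get larger weights, which is how the weighted sample approximates a population in which baseline covariates are balanced between groups.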
Method
Imagine you're reading a detective story. This article is like a guide for spotting clues and understanding how the detective solved the case. It explains that observational studies are like gathering clues from real life, which is useful but can be tricky because some clues might be misleading (biases). The guide teaches you to look for 'confounders' (things that might make a suspect look guilty or innocent unfairly), 'mediators' (steps in a sequence of events), and 'colliders' (things that are influenced by both the suspect and the crime). It also tells you to check how the detective handled missing pieces of evidence and if their conclusions are too specific to just one scene (overfitting). Finally, it suggests looking at how the detective tested their theories with different scenarios to be sure of their conclusion.
Executive Summary
The paper introduces a novel framework for intelligent AI delegation, addressing the limitations of current heuristic-based and brittle approaches. The core problem lies in enabling AI agents to safely and effectively decompose complex tasks and delegate them to other agents or humans within dynamic, potentially adversarial environments. Existing methods often fail to account for crucial factors like dynamic adaptation to environmental changes, robust failure handling, and the nuanced aspects of delegation beyond simple task allocation. The proposed framework is built upon five key requirements: dynamic assessment, adaptive execution, structural transparency, scalable market coordination, and systemic resilience. Dynamic assessment involves continuously inferring a delegatee's state, including competence, reliability, and intent, by analyzing real-time data on resource availability, load, and projected task duration. Adaptive execution ensures that delegation decisions are not static but can dynamically adjust to environmental shifts, resource constraints, or detected failures, allowing for mid-execution switching of delegatees. Structural transparency aims to overcome the opacity of current AI-AI delegation by making the process auditable, ensuring accountability and distinguishing between incompetence and malice. Scalable market coordination leverages market mechanisms, supported by trust and reputation systems, to facilitate efficient and web-scale task delegation. Finally, systemic resilience focuses on preventing cascading failures and ensuring that the delegation ecosystem is robust against adversarial attacks and unforeseen events. The framework integrates these requirements into a cohesive system that aims to inform the development of protocols for the emerging agentic web, moving beyond ad-hoc, brittle, and untrustworthy delegation towards a more robust and intelligent paradigm.
Method
Imagine AI agents needing to work together on big projects. This paper proposes a smarter way for them to do this. Instead of just following simple rules, the system constantly checks how well each AI is doing, adapts if things change, and makes sure the whole process is clear and understandable. It uses market-like systems where agents can bid on tasks, and includes built-in safety nets to prevent failures and attacks. This makes delegation more reliable and trustworthy, like a well-managed team where everyone knows their role and can adjust to challenges.
Executive Summary
This study rigorously quantifies hallucination rates in large language models (LLMs) performing document question-answering (Q&A) tasks, a critical capability for enterprise AI. The core challenge addressed is the unreliability of existing benchmarks, which suffer from data contamination, biased LLM judges, and insufficient scale. To overcome these limitations, the researchers leverage RIKER (Retrieval Intelligence and Knowledge Extraction Rating), a ground-truth-first methodology that generates synthetic documents and questions from a known relational database. This paradigm inversion enables deterministic scoring, contamination resistance, and arbitrary scalability without human annotation or LLM judges. The evaluation was conducted at an unprecedented scale, processing over 172 billion tokens across 35 open-weight models, three context lengths (32K, 128K, 200K tokens), four temperature settings, and three hardware platforms. The findings reveal that hallucination is pervasive and significantly influenced by context length, with fabrication rates tripling from 32K to 128K and exceeding 10% for all models at 200K. Model selection is the most dominant factor, with model families exhibiting distinct fabrication resistance capabilities, often independent of model size. Temperature effects are nuanced: while T=0.0 often yields the highest overall accuracy, higher temperatures can reduce fabrication and significantly decrease coherence loss (infinite generation loops). Crucially, grounding ability (extracting facts) and fabrication resistance (refusing to invent facts) are shown to be distinct capabilities, meaning models adept at retrieval may still be prone to hallucination. Hardware platforms have a negligible impact on performance. 
The study's key insight is that current LLMs, even under optimal conditions, exhibit non-trivial hallucination rates, and this rate escalates dramatically with context length, underscoring the need for robust safeguards and careful evaluation at deployment-specific context lengths.
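The ground-truth-first idea can be sketched as follows: start from a known record, render a document from it, derive questions whose answers are known by construction, and grade deterministically. This is a toy illustration only. The record schema, the question templates, and the grading labels are hypothetical and are not RIKER's actual design.

```python
def render_document(record):
    """Turn one database row into a synthetic document."""
    return (
        f"{record['employee']} works in the {record['department']} "
        f"department and joined in {record['start_year']}."
    )


def make_questions(record):
    """Questions paired with ground truth; None marks an unanswerable question."""
    name = record["employee"]
    return [
        (f"Which department does {name} work in?", record["department"]),
        (f"In which year did {name} join?", str(record["start_year"])),
        # Unanswerable on purpose: the salary is never written into the document.
        (f"What is {name}'s salary?", None),
    ]


def grade(answer, expected):
    """Deterministic grading against the known value; no LLM judge involved."""
    if expected is None:
        return "correct_refusal" if answer is None else "fabrication"
    if answer is None:
        return "missed"
    return "correct" if answer.strip().lower() == str(expected).lower() else "incorrect"


record = {"employee": "Ada Moreno", "department": "Finance", "start_year": 2019}
document = render_document(record)
questions = make_questions(record)

# Hypothetical model answers; None stands for a refusal to answer.
model_answers = ["finance", "2019", "85,000"]
results = [grade(a, expected) for a, (_, expected) in zip(model_answers, questions)]
print(document)
print(results)
```

Grading the answerable and the unanswerable questions separately mirrors the distinction the study draws between grounding ability and fabrication resistance.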
Method
Imagine you want to test how well a student can answer questions based on a textbook. Instead of giving them a real textbook and questions, this study creates a "perfect" digital textbook and a set of "perfect" questions from scratch. This way, they know the exact right answer to every question. They then give these "perfect" books and questions to many different AI models. They test the AIs with different amounts of text to read, different settings (like how creative or strict they should be), and even on different types of computers. By comparing the AI's answers to the "perfect" answers, they can precisely measure how often the AI makes things up (hallucinates) or fails to answer correctly.
Executive Summary
This paper addresses a critical gap in understanding Large Language Model (LLM) reasoning: the faithfulness of Chain-of-Thought (CoT) verbalizations to internal computation. The authors propose a novel metric, the True Thinking Score (TTS), to quantify the causal contribution of each step in a CoT to the model's final prediction. This metric is derived from a causal framework extending Average Treatment Effect (ATE) to account for both necessity (AND logic) and sufficiency (OR logic) of a step's contribution, thereby overcoming limitations of prior perturbation-based methods that primarily focused on necessity. Experiments reveal a stark dichotomy: LLMs' CoTs are often a mix of 'true-thinking' steps that genuinely influence the output and 'decorative-thinking' steps that appear to reason but have minimal causal impact. This is quantified by the TTS, which is found to be long-tailed, with a vast majority of steps exhibiting low scores. For instance, on the AIME dataset, only 2.3% of reasoning steps for Qwen-2.5 have a TTS ≥ 0.7. This suggests that LLMs frequently verbalize reasoning without internally performing it, challenging the efficiency and trustworthiness of CoT. Furthermore, the study demonstrates that 'aha moments' or self-verification steps can be decorative, meaning the model might verbalize a self-correction but not actually use it to alter its internal state or final decision. The core mechanistic insight is the identification of a 'True Thinking direction' in the LLM's latent space. By steering the model's hidden states along this direction, researchers can causally influence whether a specific CoT step is internally followed or disregarded. This directionality is shown to be generalizable across models and datasets, indicating a fundamental mechanism for controlling internal reasoning engagement. 
The implications are significant: LLM reasoning might be less efficient than it appears, and CoT cannot be solely relied upon for interpretability or safety monitoring. The work shifts focus from what LLMs say to what they actually compute internally, opening avenues for more robust interpretability and training objectives that promote genuine reasoning.
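The summary does not spell out the TTS formula, so the sketch below is only one plausible reading of "necessity plus sufficiency". It assumes we can measure the model's confidence in its original answer under four conditions (earlier context intact or perturbed, the step itself intact or perturbed) and averages the two absolute effects. The function name, the averaging, and all numbers are assumptions made for illustration.

```python
def true_thinking_score(conf):
    """Combine necessity and sufficiency effects for one CoT step.

    conf maps (context_intact, step_intact) to the model's confidence
    in its original final answer under that condition.
    """
    # Necessity: with the context intact, how much does perturbing the step hurt?
    necessity = conf[(True, True)] - conf[(True, False)]
    # Sufficiency: with the context perturbed, how much does the intact step help?
    sufficiency = conf[(False, True)] - conf[(False, False)]
    return 0.5 * (abs(necessity) + abs(sufficiency))


# Hypothetical confidences for a step the answer depends on.
true_step = {
    (True, True): 0.9, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.1,
}
# Hypothetical confidences for a step that changes nothing.
decorative_step = {
    (True, True): 0.9, (True, False): 0.9,
    (False, True): 0.3, (False, False): 0.3,
}

true_score = true_thinking_score(true_step)
decorative_score = true_thinking_score(decorative_step)
print(round(true_score, 2), round(decorative_score, 2))
```

Measuring only the necessity term would miss steps whose contribution is masked by redundant context, which is the limitation of earlier perturbation methods that the summary mentions.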
Method
Imagine a student writing out their steps to solve a math problem. This paper asks if every step the student writes down is actually used to get the final answer, or if some steps are just for show. They created a way to score each step based on how much it *really* affects the final answer, like a "causal score." They did this by slightly changing a step and seeing if the final answer changed. If changing a step *always* changes the answer, it's a "true thinking" step. If changing it doesn't matter, it's a "decorative" step. They found that many steps are decorative. They also found a "thinking direction" in the AI's brain that can be used to make it either use or ignore a specific step.
Executive Summary
This collection of blog posts by Suvash Sedhain delves into the intricate mechanisms of modern machine learning, with a particular focus on Large Language Models (LLMs) and recommender systems. The posts aim to demystify complex architectures and algorithms, moving beyond high-level descriptions to provide an intuition-first, technically grounded understanding. For LLMs, the analysis dissects the internal workings of Transformers, explaining how components like Query (Q), Key (K), and Value (V) matrices, along with attention mechanisms, function as data-dependent mixing operations. The role of Multi-Layer Perceptrons (MLPs) in expanding, activating, and compressing representations is also detailed. Specific innovations like DeepSeek's Multi-head Latent Attention, sparse Mixture-of-Experts (MoE), and conditional memory architectures (Engram) are explored, highlighting efforts to improve efficiency and capability. Positional encodings, from sinusoidal to Rotary Position Embeddings (RoPE), are presented as crucial for token order awareness. The blog also addresses the alignment of LLMs, explaining reward modeling as a method to translate human preferences into training signals, and contrasting it with Direct Preference Optimization (DPO), which bypasses explicit reward models. Reinforcement Learning (RL) concepts, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), are introduced as foundational for LLM alignment. For recommender systems, foundational techniques like Matrix Factorization are revisited, demonstrating their implementation in frameworks like TensorFlow. The overarching theme is a commitment to mechanistic understanding, enabling practitioners and researchers to grasp the 'how' and 'why' behind these powerful AI systems, thereby facilitating more informed development and application.
Method
Imagine trying to understand how a complex machine works by looking at its blueprints and then taking it apart piece by piece. These blog posts do something similar for AI models like ChatGPT. They don't build a new machine, but rather explain how the existing ones are built and how each part, like the 'attention' mechanism, helps the AI understand and generate text. It's like explaining the gears, levers, and circuits inside a robot to show exactly how it moves and thinks, making complex technology understandable.
Week 10 (8 papers)
This week, OpenAI argues that LLMs hallucinate because they are evaluated only with multiple-choice tests and are therefore also optimised only for multiple-choice tests: none of the MC tests offer an "I don't know" answer. The suggestion is to change this practice so that models can remain silent when that makes more sense. "Teaching LLMs to Reason Like Bayesians" teaches LLMs to consider possible scenarios and draw probabilistic conclusions. "AgentIR: Reasoning-Aware Retrieval for Deep Research Agents" incorporates reasoning into the deep research retrieval process. "Reasoning Models Struggle to Control their Chains of Thought" introduces a metric for how well models can control their thought processes. The authors defined rules and checked whether the thought processes followed them. The result: the rules were broken much more often in the thought processes than in the final answers.
Executive Summary
Deep research agents, a new class of AI systems, autonomously navigate complex information-seeking tasks by interleaving reasoning and retrieval. Unlike traditional retrieval systems that process isolated queries, these agents generate explicit natural language reasoning traces before each search action. These traces encode rich signals about the agent's evolving intent, prior findings, and hypotheses, which are currently ignored by standard retrievers. This paper introduces Reasoning-Aware Retrieval (RAR), a novel paradigm that jointly embeds the agent's reasoning trace alongside its query to improve retrieval performance. The core mechanism of RAR is to treat the concatenated reasoning trace and query as a unified input for an embedding model, thereby enabling the retriever to leverage the contextual and intentional information embedded in the agent's thought process. To address the scarcity of training data for this specific task, the authors propose DR-Synth, a data synthesis method that transforms standard question-answering datasets into relevance-labeled sub-query instances suitable for training RAR models. DR-Synth simulates agent rollouts on QA datasets, extracts reasoning-query pairs, and uses an oracle reranking procedure to generate positive and negative document labels that are aligned with both the immediate sub-query and the overall task objective. The combination of RAR and DR-Synth yields AgentIR-4B, an embedding model that demonstrates substantial improvements over conventional retrievers on the challenging BrowseComp-Plus benchmark. AgentIR-4B achieves significantly higher end-to-end accuracy and reduces the number of search calls required by the agent, indicating improved efficiency. The effectiveness of RAR stems from its ability to ground searches in the agent's historical context and its implicit filtering of irrelevant or outdated information, acting as a curated signal. 
This work highlights the critical role of explicit reasoning in enhancing retrieval for complex, multi-turn information-seeking agents and provides a practical solution for training such specialized retrievers.
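The core mechanism, embedding the reasoning trace together with the query, can be demonstrated with a toy retriever. The sketch below uses a bag-of-words vector and cosine similarity as a stand-in for a trained embedding model such as AgentIR-4B. The concatenation format, the documents, and the reasoning text are all invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, reasoning=""):
    """Return the best document for the reasoning trace joined with the query."""
    joint = embed(f"{reasoning} {query}".strip())
    return max(docs, key=lambda doc: cosine(joint, embed(doc)))


cat_doc = "jaguar top speed in wild habitat"
car_doc = "jaguar car top speed and engine specifications"
docs = [cat_doc, car_doc]

query = "jaguar top speed"
reasoning = "the user means the car maker so engine data matters"

query_only = retrieve(query, docs)
with_reasoning = retrieve(query, docs, reasoning)
print("query only     ->", query_only)
print("with reasoning ->", with_reasoning)
```

The query alone is ambiguous and matches the animal document, while the added reasoning trace carries the agent's intent and shifts the match to the car document.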
Method
Imagine an AI assistant trying to answer a complex question by searching the internet. Instead of just typing a search term, the AI first thinks aloud, explaining its reasoning and what it's looking for. This paper proposes a way for the search tool to understand not just the search term, but also the AI's thought process. They achieve this by combining the AI's reasoning with its search query before looking for information. To teach the search tool this new skill, they created a method to generate practice examples from existing question-answering data. This trained AI search tool, called AgentIR, is much better at finding the right information and requires fewer searches to complete tasks.
Executive Summary
This paper addresses the significant challenge of adapting Transformer architectures, highly successful in large language models (LLMs), to industrial-scale recommender systems. The core difficulty lies in the inherent mismatch between the dense, sequential nature of language and the sparse, low-label-density characteristics of recommendation data. Specifically, recommendation systems grapple with extremely high feature sparsity due to massive item vocabularies (billions of items) and low label density because only a tiny fraction of user interactions are positive (e.g., clicks, purchases) within a vast sea of negative samples. Directly applying Transformers to this domain leads to computational inefficiency and severe overfitting. SORT introduces a multi-pronged optimization strategy to bridge this gap. Firstly, it tackles computational inefficiency through request-centric sample organization, which groups multiple candidate items within a single request to avoid redundant processing of user-invariant features. Local attention and query pruning further reduce the quadratic complexity of self-attention, focusing computation on relevant parts of long user sequences. Secondly, it addresses the sparsity and low-label-density issues via generative pre-training. By training a next-item prediction model on user click sequences and then freezing these pre-trained item embeddings during the ranking task, SORT effectively augments the supervisory signal and mitigates overfitting, a strategy inspired by GPSD. Beyond these core optimizations, SORT refines the Transformer's building blocks. It incorporates LLM best practices like RoPE for relative positional encoding and Swish-GLU activations. Crucially, it introduces special tokens (BOS, SEP) that act as attention sinks, stabilizing attention distribution and improving performance. 
The Multi-Head Attention (MHA) module employs local attention with a sparse mask to manage long sequences efficiently, while the Feed-Forward Network (FFN) is replaced with a DeepSeek-style Mixture-of-Experts (MoE) layer to increase model capacity without a proportional increase in computational cost. The key insight is that by systematically adapting Transformer components to the unique constraints of recommendation data (specifically, by re-engineering attention mechanisms for sparsity, leveraging generative pre-training for regularization, and optimizing computational flow), it's possible to achieve both superior performance and significantly improved efficiency. The paper demonstrates that SORT not only outperforms strong baselines but also scales effectively with data, model size, and sequence length, achieving substantial gains in online A/B tests for key business metrics while drastically reducing latency and increasing throughput.
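A sparse local-attention mask with an attention-sink token can be built in a few lines. The sketch assumes a causal layout, a fixed window size, and a single sink at position 0 (for example a BOS token). These choices are illustrative and may differ from SORT's actual masking scheme.

```python
def local_attention_mask(seq_len, window, sink_positions=(0,)):
    """Build a mask where mask[i][j] is True if position i may attend to j.

    Each position sees itself, the previous (window - 1) positions,
    and any earlier sink position.
    """
    sinks = set(sink_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = i - j < window
            row.append(causal and (in_window or j in sinks))
        mask.append(row)
    return mask


mask = local_attention_mask(seq_len=6, window=2)
allowed = sum(sum(row) for row in mask)
full_causal = 6 * 7 // 2

for row in mask:
    print("".join("x" if cell else "." for cell in row))
print(f"{allowed} of {full_causal} causal pairs are computed")
```

The number of allowed pairs grows linearly with sequence length for a fixed window, instead of quadratically, while the sink column keeps one always-visible position in every row.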
Method
Imagine you're trying to recommend products to someone online. Instead of looking at each product individually, this system groups all the products you might be interested in for a single browsing session. It then uses a smart way to pay attention, focusing more on recent activity and relevant items, like zooming in on important parts of a long story. To make sure it learns well even with few positive examples (like actual purchases), it first learns to predict what you might like next in general, then uses that knowledge without forgetting it. Finally, it fine-tunes the core parts of the recommendation engine to be more efficient and accurate, like using specialized tools for specific jobs within a factory.
Executive Summary
This paper introduces SE-Search, a novel self-evolving search agent designed to enhance the autonomous information-seeking capabilities of Large Language Models (LLMs). The core problem addressed is the inherent limitations of existing Retrieval-Augmented Generation (RAG) and search agent frameworks, which often suffer from noisy retrieved documents, limited search diversity, and sparse reward signals that hinder effective training. SE-Search tackles these issues through a "Think-Search-Memorize" strategy augmented by three key mechanisms: Memory Purification, Atomic Query training, and Dense Rewards. Memory Purification refines the LLM's internal memory by filtering irrelevant information from retrieved documents, ensuring that only salient evidence is retained. This is crucial because LLMs, when tasked with complex queries, can be overwhelmed by the sheer volume of potentially noisy search results. Atomic Query training encourages the generation of shorter, more diverse queries, promoting a broader exploration of the information space and preventing the agent from getting stuck in repetitive search patterns. This contrasts with methods that might generate overly long or generic queries. The Dense Rewards system provides fine-grained feedback across multiple aspects of the agent's behavior – query formulation, memory content, search outcome, and output format. This granular feedback is instrumental in accelerating training and guiding the agent towards more disciplined and effective search strategies, moving beyond the sparse, final-answer-level rewards of prior work. The insight is that by mimicking an evolutionary process, where agents adapt and refine their search behavior based on continuous, multi-faceted feedback, SE-Search can achieve superior performance on complex, multi-hop question-answering tasks. 
The agent's ability to learn to filter, diversify its search, and respond to nuanced rewards allows it to navigate the information landscape more effectively than agents relying on simpler RAG pipelines or less sophisticated RL training signals.
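To see why dense rewards give a richer training signal than a final-answer reward, consider the sketch below. It assumes each of the four aspects named above is scored in [0, 1] and combined with equal weights. The weighting, the component names, and the scores are hypothetical, since the summary does not specify how SE-Search combines them.

```python
def sparse_reward(answer_correct):
    """Final-answer reward only: 1 for a correct answer, otherwise 0."""
    return 1.0 if answer_correct else 0.0


def dense_reward(components, weights=None):
    """Weighted sum of per-aspect scores; equal weights unless given."""
    if weights is None:
        weights = {name: 1.0 / len(components) for name in components}
    return sum(weights[name] * score for name, score in components.items())


# Two hypothetical rollouts that both end with a wrong final answer.
rollout_a = {"query": 1.0, "memory": 0.5, "search_outcome": 0.0, "format": 1.0}
rollout_b = {"query": 0.0, "memory": 0.0, "search_outcome": 0.0, "format": 1.0}

sparse_a = sparse_reward(False)
sparse_b = sparse_reward(False)
dense_a = dense_reward(rollout_a)
dense_b = dense_reward(rollout_b)
print("sparse:", sparse_a, sparse_b)
print("dense: ", dense_a, dense_b)
```

Under the sparse reward the two failed rollouts are indistinguishable, whereas the dense reward still ranks the rollout with well-formed queries and a cleaner memory higher.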
Method
Imagine an AI agent that needs to find information to answer a question, like a detective. First, SE-Search helps the agent remember only the most important clues and forget the irrelevant ones (Memory Purification). Second, it teaches the agent to ask very specific, varied questions instead of long, rambling ones (Atomic Query). Third, it gives the agent constant, detailed feedback on how well it's doing – not just if it got the final answer right, but also if its questions were good and its memory was clear (Dense Rewards). This multi-faceted feedback helps the agent learn much faster and become a better information seeker, especially for tricky questions.
Executive Summary
This paper introduces and evaluates 'CoT controllability,' a novel metric designed to quantify a reasoning model's ability to adhere to specific instructions within its Chain-of-Thought (CoT) generation process. The core problem addressed is the potential for advanced language models to strategically suppress or alter their reasoning traces to evade monitoring mechanisms, thereby undermining AI safety efforts that rely on observing these CoTs. The authors propose that CoT controllability is a critical, yet under-measured, capability that directly impacts the reliability of CoT monitoring. They develop the 'CoT-Control' evaluation suite, comprising 14,076 problems across diverse benchmarks, each augmented with specific instructions for modifying the CoT (e.g., suppressing keywords, adding misleading information, or altering style). The evaluation protocol involves presenting these constrained tasks to various large language models and assessing their adherence to the CoT instructions. The central finding is that current reasoning models exhibit significantly lower CoT controllability compared to their output controllability. For instance, Claude Sonnet 4.5 could control its CoT only 2.7% of the time, while achieving 61.9% control over its final output. This low controllability is observed across a range of models and tasks, suggesting it's a general characteristic rather than an anomaly. The paper further investigates factors influencing CoT controllability, finding that it generally increases with model scale (parameter count) but decreases with increased reasoning effort (test-time compute) and more Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning from Verifiable Rewards (RLVR) training. 
Situational awareness (informing models they are being monitored or evaluated) can slightly increase controllability, but adversarial prompt optimization techniques do not reliably improve it, indicating that current models are not actively trying to evade monitoring in a sophisticated manner. The key insight is that while current models are not yet adept at controlling their CoTs to evade monitoring, this capability is not entirely absent and scales with model size. This presents a dual implication: on one hand, the current low controllability offers a degree of optimism for the robustness of CoT monitoring. On the other hand, the observed scaling trends suggest that future, larger models might develop more sophisticated CoT control capabilities, posing a future risk to monitorability. The paper emphasizes the need for frontier labs to actively track CoT controllability in future model development to proactively address potential alignment failures. The mechanism behind this low controllability remains poorly understood, highlighting a significant area for future research.
Method
Imagine you're asking a student to show their work on a math problem, but with a specific rule, like "don't use the word 'apple' in your explanation." This paper tests how well AI models follow these kinds of rules when they're explaining their thinking process (their 'Chain of Thought'). The authors created many problems with different rules, like 'don't mention a certain word' or 'write everything in capital letters.' They then checked whether the AI models actually followed these rules in their explanations. They found that most AI models are not very good at following such rules in their explanations, and are much worse at it than at following rules for their final answer.
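To make the evaluation idea concrete, here is a minimal sketch of how adherence to two of the rules mentioned above could be scored. The function names, the sample traces, and the whole-word matching choice are all assumptions made for illustration; this is not the CoT-Control suite's actual grading code.

```python
import re

def avoids_word(trace: str, banned: str) -> bool:
    """True if `banned` never appears as a whole word (case-insensitive).

    Whole-word matching is an illustrative choice: "apples" would pass.
    """
    pattern = rf"\b{re.escape(banned)}\b"
    return re.search(pattern, trace, flags=re.IGNORECASE) is None

def is_all_caps(trace: str) -> bool:
    """True if the trace has at least one letter and every letter is uppercase."""
    letters = [ch for ch in trace if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)

def controllability(passed: list[bool]) -> float:
    """Fraction of traces that obeyed their instruction."""
    return sum(passed) / len(passed) if passed else 0.0

# Made-up reasoning traces, each paired with the rule it was asked to follow.
samples = [
    ("First I price one apple, then multiply by 4.", lambda t: avoids_word(t, "apple")),
    ("First I price one fruit, then multiply by 4.", lambda t: avoids_word(t, "apple")),
    ("ADD 2 AND 3 TO GET 5.", is_all_caps),
    ("Add 2 and 3 to get 5.", is_all_caps),
]

results = [rule(trace) for trace, rule in samples]
score = controllability(results)
print(results, score)
```

Running the same checks separately on the reasoning trace and on the final answer would give the two numbers the summary compares (CoT controllability versus output controllability).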
Executive Summary
This paper investigates the phenomenon of "hallucinations" in large language models (LLMs), defining them as plausible yet incorrect statements that undermine trust. The authors argue that these hallucinations are not a mysterious emergent property but rather a direct consequence of the statistical learning objectives and evaluation paradigms prevalent in modern LLM training. Specifically, they posit that the standard cross-entropy loss used during pre-training, which aims to minimize prediction error, inherently incentivizes models to "guess" rather than abstain when uncertain. This is framed through a computational learning theory lens, reducing the problem of generative errors (hallucinations) to binary classification (Is-It-Valid or IIV). The core insight is that the IIV misclassification rate provides a lower bound on the generative error rate, establishing a theoretical link between the model's ability to distinguish valid from invalid outputs and its propensity to hallucinate. The paper further contends that post-training, particularly through reinforcement learning from human feedback (RLHF) and similar methods, often exacerbates this issue. This is because current evaluation benchmarks predominantly use binary grading (correct/incorrect) and penalize abstention (e.g., 'I don't know' responses). Consequently, models are optimized to produce confident, albeit potentially false, answers to maximize scores on these benchmarks, creating a "socio-technical" feedback loop that rewards hallucination. The proposed solution is not to develop new hallucination-specific evaluations, but to modify existing mainstream benchmarks to explicitly reward uncertainty and abstention, thereby realigning the incentives for LLM development towards more trustworthy behavior.
Method
Imagine a student taking a test where they get points for correct answers and nothing for leaving a question blank, the same as for a wrong answer. To maximize their score, the student should guess even when unsure, because a guess might get lucky and can never score worse than a blank. This paper argues that language models are trained and evaluated similarly. The models are optimized to give answers, and current tests give no credit for not answering (like saying 'I don't know'). This encourages the models to 'guess' by making up plausible-sounding but incorrect information, which we call hallucinations. The paper shows mathematically that this guessing behavior is a natural outcome of how these models learn and are tested.
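The incentive argument can be checked with a little arithmetic. The sketch below compares the expected score of answering against abstaining under two grading rules. It assumes that a correct answer scores 1, that abstaining scores 0, and that a wrong answer costs `wrong_penalty` points; the specific penalty value of 3 is a made-up illustration, not a number taken from the summary.

```python
ABSTAIN_SCORE = 0.0  # assumption: "I don't know" earns nothing and costs nothing

def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty

def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Answer only when doing so beats abstaining in expectation."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE

# Binary grading (no penalty for wrong answers): guessing wins even at 10% confidence.
print(should_answer(0.10))                      # True

# Penalized grading: a wrong answer costs 3 points, so the break-even
# confidence is 3 / (1 + 3) = 0.75.
print(should_answer(0.30, wrong_penalty=3.0))   # False
print(should_answer(0.90, wrong_penalty=3.0))   # True
```

In general, with a penalty of t / (1 - t) for wrong answers, answering pays off only when confidence exceeds t, which is one way a benchmark could reward calibrated abstention instead of confident guessing.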
Executive Summary
This research addresses the fundamental challenge of imbuing Large Language Models (LLMs) with robust probabilistic reasoning capabilities, essential for tasks requiring nuanced understanding and adaptation to new information. Standard LLMs, while adept at pattern recognition and generation, often exhibit suboptimal or heuristic-based decision-making when faced with uncertainty, deviating from the principles of Bayesian inference which offer an optimal framework for updating beliefs based on evidence. The core problem is that LLMs, trained on vast but often unstructured text, do not inherently learn to maintain and update coherent probabilistic models of the world or user preferences. The proposed solution, termed 'Bayesian teaching,' leverages supervised fine-tuning to train LLMs to mimic the behavior of an explicit, optimal Bayesian agent. Instead of directly training LLMs on raw user data or ground truth labels (which can be misleading or incomplete), the approach involves training the LLM on the *predictions* or *actions* of a pre-defined Bayesian model. This Bayesian model acts as a teacher, providing demonstrations of how to perform probabilistic updates. The key insight is that by learning to replicate the outputs of an optimal probabilistic reasoner, the LLM internalizes the underlying principles of Bayesian updating, even if it doesn't explicitly compute probabilities in the same way. This method works by framing the learning task as a form of knowledge distillation. The complex, often symbolic, reasoning process of a Bayesian model is distilled into the parameters of a neural network (the LLM). The LLM learns to approximate the posterior distributions and decision-making policies of the Bayesian teacher. This is particularly effective because the Bayesian teacher can explicitly model uncertainty and systematically update beliefs, providing a richer and more principled training signal than simply observing user choices or correct answers. 
The research demonstrates that this approach not only improves performance on the specific training task (e.g., flight recommendations) but also leads to significant generalization to unseen domains and tasks, suggesting a deeper learning of probabilistic reasoning principles rather than mere task-specific memorization. The primary contribution is a practical and effective method for enhancing the reasoning capabilities of LLMs, moving them closer to ideal probabilistic agents. This has profound implications for applications requiring reliable decision-making under uncertainty, such as personalized systems, scientific modeling, and complex interactive agents. The success of Bayesian teaching highlights the potential of using well-established theoretical frameworks to guide the training of modern neural architectures, bridging the gap between symbolic AI and deep learning.
Method
Imagine you're teaching a student how to guess what someone likes. Instead of just telling them the right answer every time, you show them how a super-smart detective (the 'Bayesian assistant') figures it out step-by-step. This detective uses clues to make educated guesses and updates their ideas as they get more information. The student (the LLM) learns by watching how the detective makes these smart guesses. This helps the student learn the *process* of reasoning, not just memorize answers, so they can make good guesses even in new situations they haven't seen before.
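As a rough illustration of what the "detective" computes, the sketch below performs one step of Bayesian updating over two hypotheses about a user's preference. The hypothesis names and the likelihood values are invented for the example; the paper's actual teacher model is not reproduced here.

```python
def bayes_update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """Posterior is proportional to prior times likelihood, normalized over hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

# Hypothetical beliefs about what a user cares about when choosing flights.
prior = {"prefers_cheap": 0.5, "prefers_short": 0.5}

# Made-up probability of the observed choice (the user picked the cheaper,
# longer flight) under each hypothesis.
likelihood = {"prefers_cheap": 0.8, "prefers_short": 0.2}

posterior = bayes_update(prior, likelihood)
posterior_after_two = bayes_update(posterior, likelihood)  # same evidence seen again

# The teacher's prediction is the training target for the LLM: the student is
# fine-tuned on what the Bayesian model predicts, not on the raw user choice.
teacher_prediction = max(posterior_after_two, key=posterior_after_two.get)
print(posterior, posterior_after_two, teacher_prediction)
```

After one observation the belief in `prefers_cheap` rises from 0.5 to about 0.8, and after a second consistent observation to about 0.94, which is the kind of gradual belief revision the fine-tuned model is meant to imitate.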
Executive Summary
This article critically examines the persistent underperformance of advanced AI models, particularly deep learning architectures like transformers, in predicting host phenotypes from microbiome data. The core argument is that despite the technical sophistication and large-scale unlabeled data training of these models, simpler, more traditional machine learning algorithms, specifically Random Forest and Ridge Regression, consistently achieve superior or equivalent predictive performance across a wide range of microbiome-based classification tasks. This phenomenon is not attributed to a lack of potential signal in microbiome data, but rather to inherent characteristics of the data itself and the limitations of applying complex models to small sample sizes. The mechanistic explanation for the success of Random Forest and Ridge Regression lies in their inherent regularization properties, robustness to high-dimensional and sparse data, and their ability to naturally handle the prevalence of zero-inflation in microbiome abundance tables. Deep learning models, with their vast number of parameters, often struggle to generalize from limited sample sizes (typically hundreds of samples in microbiome studies), leading to overfitting. Furthermore, the inherent sparsity and the nature of microbial community composition mean that simple presence/absence or abundance thresholds, which decision trees excel at identifying, often capture the most predictive signal. The article posits that for many microbiome prediction tasks, the biological signal itself, as captured by raw abundance data, may represent a bottleneck, limiting the gains achievable through algorithmic complexity alone. While advanced models like transformers show promise in specific niche applications, such as cross-study generalization by learning transferable representations, their broad application to standard microbiome prediction tasks is often unwarranted. 
The article advocates for a pragmatic approach: starting with simpler, computationally efficient models like Random Forest or Ridge Regression as a baseline. Complex deep learning architectures should only be considered when there is clear evidence of their necessity, such as in multi-modal data integration or when dealing with exceptionally large datasets where their capacity for learning complex, non-linear relationships can be fully leveraged. The future potential of foundation models in this domain hinges on the availability of massive, uniformly processed datasets, akin to the Human Microbiome Compendium, to overcome the current sample size limitations.
Method
Imagine you have a lot of information about people's gut bacteria and want to predict something about them, like whether they have a certain disease. This article looks at many studies that tried using different computer programs to make these predictions. It found that simpler programs, like "Random Forest" (which is like making many simple yes/no decisions), often work just as well or better than very complicated "AI" programs, especially when there isn't a huge amount of data. The article suggests using the simpler programs first because they are more reliable with typical microbiome datasets. Complicated AI might be useful for special cases, like combining different types of health data or when you have massive amounts of data.
Executive Summary
This research investigates the minimal conditions required for hierarchical structures to emerge and stabilize within a decentralized multi-agent system (MAS). The core mechanism explored is how repeated local interactions, including reproduction, competition, and cooperation, can amplify small initial differences among agents, leading to emergent directional asymmetries in information and influence flow. The study employs an agent-based model (ABM) where agents possess internal states and interact with their environment and each other. Hierarchy is quantified using the Trophic Incoherence (TI) metric, which measures deviations from a perfectly ordered, stratified structure. The model integrates density-dependent population dynamics, stochastic mortality, and a novel consensus-based decision-making process for resource allocation, all governed by agent ability scores. Crucially, the research focuses on two key parameters: initial heterogeneity (c) and mutation amplitude (u). The findings reveal that while initial heterogeneity plays a role in the early stages of hierarchy formation, it is the mutation amplitude that is the dominant factor in establishing and maintaining stable hierarchical order. Specifically, a sufficiently high mutation amplitude is necessary for the system to consistently converge to low-TI states, indicating robust hierarchical structures. Low mutation amplitudes, conversely, lead to persistent high-TI, disordered states, regardless of initial heterogeneity. This suggests that continuous, albeit small, variations are essential for filtering and reinforcing successful interaction patterns over generations, allowing for the gradual consolidation of asymmetric influence. The study provides a quantitative account of how structured inequality can emerge from initially homogeneous populations through simple, decentralized interaction rules, highlighting the interplay between variation, selection, and feedback mechanisms in driving complex system organization.
Method
Imagine a virtual world populated by many simple agents. These agents interact, reproduce, and compete for resources. When they cooperate, one agent is temporarily chosen to lead based on its perceived skill. Over time, successful leaders and followers form connections. The researchers track how organized these connections become, looking for patterns where influence flows consistently from a few leaders to many followers. They then change two main settings: how different the agents are at the start (initial difference) and how much random variation is introduced each time new agents are born (mutation). By observing how these settings affect the organization of connections, they figure out what's most important for creating a structured hierarchy.
Week 9 (7 papers)
Executive Summary
The paper introduces Doc-to-LoRA (D2L), a novel hypernetwork-based approach designed to address the computational bottleneck of in-context learning (ICL) in Large Language Models (LLMs). ICL, while effective for providing LLMs with relevant information, becomes expensive as context grows: self-attention compute scales quadratically with context length and KV-cache memory scales linearly, leading to slow inference and degraded performance. Traditional methods like context distillation (CD) internalize this information into model parameters but are computationally prohibitive due to their iterative training and inference requirements. D2L tackles this by meta-learning the CD process itself. It trains a hypernetwork to directly generate a LoRA adapter for a target LLM based on a given context, effectively performing approximate CD in a single forward pass. This allows subsequent queries to be answered without re-accessing the original context, drastically reducing latency and memory footprint during inference. The core innovation lies in the hypernetwork's architecture, which leverages a Perceiver-style design to handle variable-length contexts and a chunking mechanism to produce higher-rank LoRA adapters for contexts exceeding the LLM's native window. This enables D2L to generalize beyond its training context lengths, as demonstrated by near-perfect accuracy on a long-context needle-in-a-haystack task with contexts up to 4x the base LLM's window, even when trained on much shorter sequences. Empirically, D2L achieves competitive performance with standard CD on real-world QA datasets but with substantially lower peak memory consumption and update latency, making it practical for interactive or resource-constrained applications. The method also shows promise in cross-modal transfer, enabling a text-only LLM to perform visual classification by internalizing information from a visual-language model.
D2L's mechanism is to amortize the entire CD process into a single, efficient hypernetwork forward pass. Instead of repeatedly performing gradient descent on a distilled model, D2L learns a mapping from context to LoRA adapter weights. This meta-learning approach allows for rapid, sub-second internalization of new information, making LLMs more adaptable and capable of handling dynamic knowledge updates or personalized behaviors. The key insight is that by learning to *generate* the result of CD (the adapter weights) rather than running CD for each new context, D2L bypasses the slow, iterative nature of traditional CD, offering a practical solution for efficient knowledge internalization in LLMs.
Method
Imagine you want to teach a student (the LLM) a new skill or fact from a long book (the context). Instead of making the student constantly re-read the book every time they need that information, you want them to remember it permanently. Doc-to-LoRA (D2L) is like a special tutor that reads the book once and then creates a small, personalized cheat sheet (a LoRA adapter) for the student. This cheat sheet contains the essential information from the book. Later, when you ask the student a question, they can just look at their cheat sheet instead of rereading the whole book, making them much faster and more efficient.
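For readers who want to see what the "cheat sheet" is numerically, the sketch below applies a LoRA-style low-rank update, W' = W + BA, using plain Python lists. It does not implement the hypernetwork: the adapter values are made up, and combining per-chunk adapters by concatenating them along the rank dimension is an assumption chosen to illustrate how chunking could yield a higher-rank adapter.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def apply_lora(w, b, a):
    """Return W + B A, where B is (d_out x r) and A is (r x d_in)."""
    return add(w, matmul(b, a))

def concat_rank(adapters):
    """Join per-chunk (B, A) pairs along the rank dimension.

    B matrices are placed side by side (more columns) and A matrices are
    stacked (more rows), so the joined product equals the sum of the
    per-chunk products.
    """
    d_out = len(adapters[0][0])
    b_cat = [sum((b[i] for b, _ in adapters), []) for i in range(d_out)]
    a_cat = [row for _, a in adapters for row in a]
    return b_cat, a_cat

w = [[1, 0], [0, 1]]                 # frozen base weight (hypothetical)
chunk_1 = ([[1], [0]], [[0, 2]])     # rank-1 adapter for the first context chunk
chunk_2 = ([[0], [1]], [[3, 0]])     # rank-1 adapter for the second context chunk

b_cat, a_cat = concat_rank([chunk_1, chunk_2])   # rank-2 adapter
w_adapted = apply_lora(w, b_cat, a_cat)
print(w_adapted)
```

Because the base weight stays frozen and only the small B and A matrices change, swapping in a different document's knowledge amounts to swapping adapters rather than re-reading the document at query time.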
Some Simple Economics of AGI
2026-02-25 | Data Science/LLM
Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
2026-02-25 | Data Science/LLM
DeepWalk: Online Learning of Social Representations
2026-02-25 | Data Science/Recommender
Inductive Representation Learning on Large Graphs
2026-02-25 | Data Science/Recommender
PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest
2026-02-25 | Data Science/Recommender
PINNERFORMER: Sequence Modeling for User Representation at Pinterest
2026-02-25 | Data Science/Recommender
February
Week 8 (4 papers)
Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System
2026-02-22 | Data Science/Recommender
An Industrial-Scale Sequential Recommender for LinkedIn Feed Ranking
2026-02-22 | Data Science/Recommender
Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control
2026-02-22 | Data Science/Recommender
Bending the Scaling Law Curve in Large-Scale Recommendation Systems
2026-02-22 | Data Science/Recommender
Week 7 (8 papers)
AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
2026-02-15 | Data Science/LLM
Learning to Continually Learn via Meta-learning Agentic Memory Designs
2026-02-15 | Data Science/LLM
REFRAG: Rethinking RAG based Decoding
2026-02-14 | Data Science/LLM
PAPERBANANA: Automating Academic Illustration for AI Scientists
2026-02-14 | Data Science/LLM
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
2026-02-14 | Data Science/LLM
Self-Adapting Language Models
2026-02-11 | Data Science/LLM
Large Language Model Reasoning Failures
2026-02-11 | Data Science/LLM
Code Mode: the better way to use MCP
2026-02-09 | Data Science/LLM
Week 6 (7 papers)
A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
2026-02-08 | Data Science/LLM
Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems
2026-02-08 | Data Science/LLM
TinyLora
2026-02-08 | Data Science/LLM
Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
2026-02-08 | Data Science/LLM
Training Large Language Models to Reason in a Continuous Latent Space
2026-02-08 | Data Science/LLM
LatentMem: Customizing Latent Memory for Multi-Agent Systems
2026-02-07 | Data Science/LLM
ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation
2026-02-05 | Data Science/LLM
Week 5 (5 papers)
Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives
2026-02-01 | Data Science/LLM
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
2026-01-30 | Data Science/LLM
PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR
2026-01-29 | Data Science/LLM
Towards Execution-Grounded Automated AI Research
2026-01-29 | Data Science/LLM
LLM-in-Sandbox Elicits General Agentic Intelligence
2026-01-29 | Data Science/LLM