
Category: Human Oversight

  • LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs

    Serene Wang, Lavanya Pobbathi, Haihua Chen
    March 9, 2026
    arXiv | PDF


LAMUS (Legal Argument Mining from U.S. Caselaw) is a large-scale, sentence-level dataset built from U.S. Supreme Court decisions and Texas criminal appellate opinions, designed to train and benchmark models that identify the functional structure of judicial reasoning. The paper frames legal argument mining as a multi-class sentence classification task — categorizing each sentence by its functional role, such as fact, issue, rule, analysis, or conclusion — and introduces a scalable pipeline for building such datasets using LLM-based annotation with human-in-the-loop quality control. The core contribution is methodological: rather than relying entirely on expensive human annotation or entirely on noisy LLM labels, LAMUS combines both. LLMs do the heavy lifting on annotation, and human reviewers focus their effort on correcting the cases where LLMs are most likely to be wrong.



Legal argument mining is the task of automatically identifying and classifying the functional components of legal reasoning in text. It draws on the IRAC framework familiar to law students, extended with the factual record (a variant sometimes called FIRAC): Issue — the legal question the court must resolve; Rule — the legal standard or statute governing the issue; Analysis (or Application) — the court’s reasoning applying the rule to the facts; Conclusion — the court’s holding or outcome; Fact — the underlying factual record.

    Chain-of-Thought (CoT) prompting asks the model to reason through the classification step by step before giving the answer. Example: “First, identify what this sentence is doing in the context of the opinion. Then determine which functional role it plays.” Research consistently shows CoT improves accuracy on tasks requiring structured reasoning — this paper confirms that effect in the legal domain.
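    A minimal sketch of what such a CoT classification prompt might look like. The wording, label set, and example sentence below are illustrative assumptions, not the paper's actual prompt:

```python
# Hypothetical chain-of-thought prompt for FIRAC-style sentence classification.
# The instructions and labels are illustrative, not the paper's exact wording.
LABELS = ["Fact", "Issue", "Rule", "Analysis", "Conclusion"]

def build_cot_prompt(sentence: str, context: str) -> str:
    return (
        "You are labeling sentences from a judicial opinion.\n"
        f"Context: {context}\n"
        f"Sentence: {sentence}\n"
        "First, identify what this sentence is doing in the context of the "
        "opinion. Then answer with exactly one label from: "
        + ", ".join(LABELS) + "."
    )

prompt = build_cot_prompt(
    "The defendant was arrested on June 3.",
    "factual background section of the opinion",
)
```

    The key design choice is ordering: the model is asked to reason first and commit to a label only at the end, which is what distinguishes CoT from direct-answer prompting.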

    Few-Shot Prompting provides 2-5 examples of correct (sentence → label) pairs in the prompt before asking the LLM to label a new sentence. Improves accuracy significantly over zero-shot but is more expensive (more tokens) and requires selecting representative examples.
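    A few-shot prompt for the same task can be sketched as follows; the demonstration pairs are invented for illustration, not drawn from the dataset:

```python
# Hypothetical few-shot prompt: a handful of (sentence -> label) demonstrations
# precede the query sentence, so the model completes the final "Label:" line.
EXAMPLES = [
    ("The officer stopped the vehicle at 2 a.m.", "Fact"),
    ("The question is whether the search violated the Fourth Amendment.", "Issue"),
    ("A warrantless search is per se unreasonable absent an exception.", "Rule"),
]

def build_few_shot_prompt(sentence: str) -> str:
    demos = "\n".join(f"Sentence: {s}\nLabel: {lbl}" for s, lbl in EXAMPLES)
    return f"{demos}\nSentence: {sentence}\nLabel:"
```

    Token cost grows linearly with the number of demonstrations, which is why 2–5 well-chosen examples are typical rather than dozens.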

Human-in-the-Loop (HITL) Annotation is a hybrid annotation methodology where automated tools (LLMs, classifiers) do an initial pass, and human reviewers focus their effort on correcting low-confidence or flagged outputs rather than reviewing everything. It balances cost efficiency with annotation quality and is standard in modern NLP dataset construction.
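    The routing logic at the heart of HITL annotation fits in a few lines. The confidence threshold and the records below are assumptions for illustration, not the paper's settings:

```python
# Minimal human-in-the-loop routing sketch: auto-accept high-confidence
# LLM labels, queue everything else for human review.
def route(annotations, threshold=0.9):
    accepted, needs_review = [], []
    for sentence, label, confidence in annotations:
        if confidence >= threshold:
            accepted.append((sentence, label, confidence))
        else:
            needs_review.append((sentence, label, confidence))
    return accepted, needs_review

auto, queued = route([
    ("Sentence A", "Fact", 0.97),      # confident -> kept as-is
    ("Sentence B", "Analysis", 0.55),  # uncertain -> sent to a human
])
```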

    Cohen’s Kappa is a statistical measure of inter-annotator agreement that accounts for chance agreement. Ranges from -1 to 1; values above 0.8 indicate strong agreement. The gold standard for evaluating annotation reliability in NLP. A Kappa of 0.85 is excellent for a complex multi-class legal task.
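    Kappa is straightforward to compute from two annotators' label sequences as κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. A small stdlib-only sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # based on each annotator's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

k = cohens_kappa(["F", "F", "I", "R"], ["F", "I", "I", "R"])
```

    With 3 of 4 items in agreement this gives κ ≈ 0.64, lower than the raw 75% agreement because chance agreement is discounted.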

  • The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making

    Jon Chun, Katherine Elkins (Kenyon College)
    January 30, 2026
    arXiv | PDF

The paper investigates whether emotional framing (the kind of persuasive, sympathetic narrative that reliably biases human decision-makers) can also sway LLMs applied to rule-bound institutional decisions like grade appeals, loan underwriting, and emergency triage. The surprising answer is no: across 12,113 responses from six different models, emotional narratives produced essentially zero decision drift (Cohen’s h = 0.003), while the same types of framing effects cause substantial bias in humans (Cohen’s h = 0.3–0.8).

    The “paradox” is that LLMs are known to be lexically brittle (sensitive to how a prompt is formatted) and prone to sycophancy, yet they are rationally stable when it comes to rule-based decisions. They resist emotional manipulation 110–300x better than humans. This decoupling between surface-level prompt sensitivity and deep logical consistency is counterintuitive and has significant implications for deploying AI in high-stakes institutional settings.

    Cohen’s h (Effect Size) is a statistical measure of the difference between two proportions. Values near 0 mean no practical difference; 0.2 is “small,” 0.5 is “medium,” 0.8 is “large.” The paper uses Cohen’s h to compare decision rates between emotional and neutral conditions. The LLM value of 0.003 is essentially zero.
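    Cohen's h compares two proportions after an arcsine transform: h = |2·arcsin√p₁ − 2·arcsin√p₂|. A minimal sketch:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    phi = lambda p: 2 * math.asin(math.sqrt(p))  # arcsine transform
    return abs(phi(p1) - phi(p2))
```

    For example, an approval rate shifting from 50% to 60% gives h ≈ 0.20, right at the "small" threshold; identical rates give h = 0, which is what the near-zero LLM value reflects.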

    Bayes Factor (BF₀₁) is a Bayesian statistic that quantifies evidence for the null hypothesis (no effect) vs. the alternative (some effect). BF₀₁ = 109 means the data is 109 times more likely under “no effect” than under “some effect”. Conventionally, anything above 100 is “extreme evidence.”

    Framing Effects is a well-documented cognitive bias where the way information is presented (e.g., sympathetic backstory, emotional language) changes human decisions even when the underlying facts are identical. This is a core concern in behavioral economics and legal decision-making.

    RLHF (Reinforcement Learning from Human Feedback) is the dominant fine-tuning method for instruction-following LLMs. Human raters rank model outputs, and the model is trained to prefer higher-ranked responses. Used by GPT, Llama, and Mistral families.

    Constitutional AI is Anthropic’s training approach (used for Claude) where the model self-critiques against a set of principles rather than relying solely on human raters. The paper tests whether this different alignment approach produces different robustness characteristics (it doesn’t).

    Decision Drift is the change in a model’s decision rate when exposed to emotional framing vs. a neutral control. A drift of 0% means the model’s decisions are identical regardless of framing.

    Instruction Ablation is an experimental technique where instructions are systematically removed to test what drives a behavior. Here, removing “ignore the narrative” instructions showed that robustness isn’t dependent on explicit guardrails.

  • Constrained Process Maps for Multi-Agent Generative AI Workflows

    Ananya Joshi, Michael Rudow
    February 2, 2026
    arXiv | PDF


    The paper introduces a multi-agent framework for running complex, regulated workflows (like compliance review) using LLM-based agents, formalized as a bounded-horizon Markov Decision Process (MDP) constrained by a directed acyclic graph (DAG). Instead of relying on a single LLM agent with a long prompt to handle an entire compliance process, the system maps each step of an existing Standard Operating Procedure (SOP) to a specialized agent node — for example, a content reviewer, a triage router, a risk assessor, and a legal compliance checker. Uncertain cases are escalated along predefined paths, just like they would be in a real compliance team, and the system uses Monte Carlo sampling to quantify each agent’s confidence without needing access to the LLM’s internal probabilities.
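    The Monte Carlo confidence idea can be sketched as follows: sample the same agent repeatedly (e.g. at temperature > 0) and use the modal label's empirical frequency as a confidence score. The agent call below is a stand-in, and the sample count is an assumption, not the paper's setting:

```python
from collections import Counter

def monte_carlo_confidence(query_agent, prompt, n_samples=20):
    """Estimate an agent's confidence by repeated sampling.

    `query_agent` stands in for any nondeterministic LLM call; no access
    to the model's internal token probabilities is required.
    """
    counts = Counter(query_agent(prompt) for _ in range(n_samples))
    label, votes = counts.most_common(1)[0]
    return label, votes / n_samples  # modal label and its empirical frequency

# Toy deterministic stand-in; a real deployment would call an LLM here.
label, conf = monte_carlo_confidence(lambda p: "safe", "example input", n_samples=10)
```

    Low-confidence cases (a split vote) are then the ones escalated along the map's predefined paths.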

    Tested on NVIDIA’s AEGIS 2.0 AI Safety Benchmark for self-harm detection, the multi-agent framework achieved up to a 19% accuracy improvement over a single-agent baseline (88% vs. 70%), reduced the number of cases requiring human review by up to 85x (from ~17 to ~0.2 per run), and in the fastest configuration ran 30% quicker. The framework also caught annotation errors in the benchmark dataset itself, demonstrating its practical value for auditable, high-stakes AI deployments.

    Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making where outcomes are partly random and partly under the control of a decision-maker. An MDP consists of: states (where you are), actions (what you can do), transition probabilities (how likely each next state is given current state and action), and rewards (feedback on how good each action was). In the paper, each state is an agent in the compliance workflow, actions are classification labels (safe/unsafe/uncertain), and transitions follow the escalation paths in the process map. The “bounded-horizon” part means every case must resolve within a fixed maximum number of steps.
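    A toy version of such a bounded-horizon process, with invented states and deterministic transitions standing in for the paper's process map:

```python
# Simplified compliance map: state = current agent, action = that agent's
# label, transitions follow escalation paths. All names are illustrative.
TRANSITIONS = {
    ("worker", "safe"): "labeled",
    ("worker", "unsafe"): "labeled",
    ("worker", "uncertain"): "triage",
    ("triage", "safe"): "labeled",
    ("triage", "unsafe"): "labeled",
    ("triage", "uncertain"): "human_review",
}
TERMINAL = {"labeled", "human_review"}

def run_episode(policy, horizon=4):
    """Follow agent decisions through the map; the bounded horizon
    guarantees every case resolves in at most `horizon` steps."""
    state = "worker"
    for _ in range(horizon):
        if state in TERMINAL:
            return state
        state = TRANSITIONS[(state, policy(state))]
    return "human_review"  # horizon exhausted: escalate to humans

outcome = run_episode(lambda state: "uncertain")  # always-uncertain policy
```

    An always-uncertain policy escalates through triage to human review, while a confident policy terminates at the first agent.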

    Directed Acyclic Graph (DAG) is a graph structure where edges have direction (A -> B means A comes before B) and there are no cycles (you can never loop back to a previous node). In the paper, the DAG represents the compliance process map: Worker -> Triage -> Risk/Legal -> Labeled Data or Human Review. The DAG constraint guarantees that every case terminates — no infinite loops are possible, which is a known problem in some multi-agent LLM architectures like LangChain.
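    Whether a given process map really is a DAG (and therefore guaranteed to terminate) can be verified with a standard topological-sort check; the edge list below is a simplified stand-in for the paper's map:

```python
def is_dag(edges):
    """Kahn's algorithm: a graph is a DAG iff every node can be
    topologically ordered (no node is left with unresolved in-edges)."""
    nodes = {n for edge in edges for n in edge}
    indeg = {n: 0 for n in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    queue = [n for n in nodes if indeg[n] == 0]  # nodes with no in-edges
    ordered = 0
    while queue:
        node = queue.pop()
        ordered += 1
        for src, dst in edges:
            if src == node:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    queue.append(dst)
    return ordered == len(nodes)  # a cycle leaves some nodes unordered

process_map = [("worker", "triage"), ("triage", "risk"),
               ("triage", "legal"), ("risk", "human_review")]
```

    Running this check at deployment time turns "no infinite loops" from a design intention into an enforced invariant.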

    AEGIS 2.0 AI Safety Benchmark is a dataset released by NVIDIA (August 2025) for evaluating AI safety guardrails. The subset used in the paper covers suicide and self-harm topics, with 112 examples (68 safe, 44 unsafe) derived from the Suicide Detection corpus. Each example consists of a user prompt and an LLM response that needs to be classified as safe or unsafe.