A new framework called Memento-Skills gives AI agents something most production systems still lack: the ability to improve their own skills without retraining the underlying model. Built by researchers at multiple universities, it treats memory as a living library of executable skills, not a pile of chat logs, and that makes a very practical promise for enterprise teams trying to run AI agents beyond toy demos.

The pitch is simple enough. Instead of fine-tuning a frozen large language model every time a workflow changes, the agent updates external skill files, tests them, and keeps going. That cuts out a lot of operational pain, especially in environments where the same tasks show up again and again with slightly different inputs – the sweet spot for automation, and a place where manual prompt editing gets old fast.

What Memento-Skills stores

Memento-Skills stores skills as structured markdown files that can include three things: declarative descriptions of what a skill does, prompts that steer reasoning, and executable code or helper scripts. The system's "Read-Write Reflective Learning" loop then lets the agent retrieve the most behaviorally useful skill, run it, inspect the result, and rewrite the artifact if something breaks. That is a more direct approach than the usual similarity-first retrieval trick, which can surface the wrong script just because two tasks sound alike.
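The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Skill` fields mirror the three parts of a skill file, `eval` stands in for running a helper script, and the "rewrite on failure" step is a placeholder for what would really be an LLM editing the markdown artifact.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Hypothetical in-memory form of one skill markdown file."""
    name: str
    description: str  # declarative: what the skill does
    prompt: str       # steering prompt for the model's reasoning
    code: str         # executable helper (here: a Python expression over `x`)
    successes: int = 0
    attempts: int = 0

    def utility(self) -> float:
        # Behavioral usefulness: observed success rate, not text similarity.
        return self.successes / self.attempts if self.attempts else 0.5

def select_skill(skills, task_keywords):
    """Pick the matching skill with the highest observed utility."""
    candidates = [s for s in skills
                  if any(k in s.description for k in task_keywords)]
    return max(candidates, key=lambda s: s.utility(), default=None)

def run_and_reflect(skill, task_input):
    """Run the skill, record the outcome, rewrite the artifact on failure."""
    skill.attempts += 1
    try:
        result = eval(skill.code, {"x": task_input})
        skill.successes += 1
        return result
    except Exception:
        # Reflective step: in the real system an LLM rewrites the file here.
        skill.code = "x"  # placeholder "repair"
        return None
```

The point of the sketch is the control flow: selection is driven by how well a skill has actually performed, and failures feed back into the stored artifact rather than into model weights.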

There is also a built-in safeguard: after a skill is modified, the system generates a synthetic test case and runs it before saving the change to the global library. That is the kind of boring guardrail that actually matters in production, because autonomous code that can edit itself without tests is not "agentic" so much as "a compliance incident waiting to happen."
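That gate reduces to a simple pattern: never promote a modified skill until it passes a freshly generated check. Here is a minimal sketch under assumed names; `make_test_case` stands in for the LLM generating a synthetic (input, expected-output) pair, which the paper's system does with a model rather than a fixed function.

```python
def commit_skill(library, name, skill_fn, make_test_case):
    """Save a modified skill to the global library only if it passes
    a synthetic test case generated for it."""
    test_input, expected = make_test_case()
    try:
        passed = skill_fn(test_input) == expected
    except Exception:
        passed = False  # a crashing skill never reaches the library
    if passed:
        library[name] = skill_fn
    return passed
```

A failing or crashing edit simply never lands, so the shared library can only accumulate skills that have survived at least one check.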

Benchmark gains on GAIA and HLE

The researchers tested Memento-Skills on two benchmarks: General AI Assistants, or GAIA, and Humanity's Last Exam, or HLE. The system ran on Gemini-3.1-Flash and beat a Read-Write baseline that could retrieve skills and collect feedback but could not rewrite its own memory.

On GAIA, accuracy rose to 66.0% from 52.3%, a 13.7 percentage point gain. On HLE, it climbed from 17.9% to 38.7%. The router mattered too: end-to-end task success reached 80% versus 50% for standard BM25 retrieval. Those are not cosmetic bumps. They suggest that skill selection based on task utility can beat plain semantic matching, especially when the "right" memory is a reusable procedure, not a vaguely related paragraph.
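The routing distinction is easy to demonstrate with toy numbers. In this sketch, `lexical_score` is a crude stand-in for BM25 (token overlap only), and the utility values are invented for illustration; the skill names and descriptions are hypothetical, not from the paper.

```python
def lexical_score(query, doc):
    """Toy stand-in for BM25: fraction of query tokens found in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def route(query, skills, by_utility):
    """Pick a skill by lexical match alone, or weighted by observed utility."""
    def score(skill):
        base = lexical_score(query, skill["description"])
        return base * skill["utility"] if by_utility else base
    return max(skills, key=score)["name"]
```

With two skills whose descriptions overlap the query to similar degrees, the purely lexical router picks whichever one happens to share more words, while a utility-weighted router prefers the skill that has actually worked before; that is the gap the 80% versus 50% numbers point at.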

The library growth was just as telling. Starting with five seed skills such as web search and terminal operations, the system expanded to 41 skills on GAIA and 235 on HLE. That kind of expansion hints at an emerging pattern across agent research: the best systems are increasingly less about giant one-shot prompts and more about building layered operational memory around a fixed model.

Where enterprises should be cautious

The researchers released the code on GitHub, which lowers the barrier for experimentation. But the paper is also pretty clear about where this fits best: structured workflows with repeatable patterns, not messy tasks that change shape every hour. If tasks are isolated or weakly related, the system has less to learn from. If they share structure, skill reuse compounds and the whole setup starts to look practical.

That distinction matters because a lot of companies are trying to sell "agents" into environments that are really just random piles of requests. Memento-Skills looks better suited to back-office or process-heavy work, where learning can accumulate across jobs. Physical agents and long-horizon coordination, the paper suggests, still need more advanced systems, possibly involving multiple LLM agents working together.

The bigger question is governance. Automatic unit tests are a start, but self-modifying agent systems will need stronger judging and evaluation layers before anyone should trust them with real production autonomy. The industry is moving toward machines that rewrite their own operating habits; the winners will be the teams that keep that loop constrained, audited, and boring enough to survive contact with reality.

Source: Venturebeat
