AI Memory · LongMemEval

NodePlus memory system scores 93.8% on LongMemEval

The industry’s toughest long-term AI memory benchmark.

Most AI systems forget what happened five minutes ago. NodePlus doesn’t. Our memory layer was measured against a rigorous 500-question benchmark that tests whether AI can recall, reason about, and act on information from conversation histories spanning weeks and months.

But the score isn’t the point. NodePlus is a working business system that writes and ships code, repairs bugs, builds new features, and answers questions about your operation, your regulations, and your compliance. We took the test to show what it can do; the real benefit is the whole system working together.

Overall accuracy

93.8%

500 questions · 6 categories

Independent judge-model scoring

§ I · What It's Actually For

We didn’t build NodePlus to win a benchmark. We built it to run a business.

LongMemEval measures one layer of the system: long-term memory. We ran it to show that foundation is solid. The real benefit is everything below working as one.

Writes & ships code

Generates production code against your stack and conventions, then opens the pull request for review.

Repairs bugs

Traces failures to root cause across the codebase and proposes the minimal, targeted fix.

Builds new features

Takes a request from spec to a working, reviewed change, with the right context already loaded.

Knows your business

Contracts, SOPs, schemas, decisions, and operational history, searchable and cited to the source.

Tracks regulations

Multi-state regulatory requirements kept current and scoped to the right jurisdiction for each question.

Supports compliance

Accurate recall of test results, audit history, and the verified record behind every answer.

Memory is what makes the rest reliable. Code, fixes, features, and answers are only as good as what the system remembers about your business. The 93.8% below is evidence that the memory holds, not the product itself.

§ II · Inside The Benchmark

This isn’t a toy demo. It tests six memory capabilities that matter in real workflows.

LongMemEval breaks long-term memory into the distinct skills a business actually relies on, and scores each one independently.

93.6%

Knowledge Updates

When facts change over time, does the system use the latest information?

86.5%

Multi-Session Recall

Can it pull relevant context from conversations that happened days or weeks ago?

100%

Single-Session (Assistant)

Does it track what the assistant said earlier in the same conversation?

97.1%

Single-Session (User)

Does it track what the user said earlier in the same conversation?

93.3%

Preference Tracking

When a user states preferences, does the system remember and apply them?

97.0%

Temporal Reasoning

Can it answer when things happened, in what order, and how they relate over time?
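
As a consistency check, the six category scores above can be combined into the overall figure. The sketch below is illustrative only: the per-category question counts are an assumption taken from the public LongMemEval release (this page states only the 500-question total), and the correct-answer counts are back-calculated from the rounded percentages.

```python
# Per-category (accuracy %, question count).
# ASSUMPTION: the counts come from the public LongMemEval release,
# not from this page, which only states the 500-question total.
CATEGORIES = {
    "Knowledge Updates": (93.6, 78),
    "Multi-Session Recall": (86.5, 133),
    "Single-Session (Assistant)": (100.0, 56),
    "Single-Session (User)": (97.1, 70),
    "Preference Tracking": (93.3, 30),
    "Temporal Reasoning": (97.0, 133),
}


def overall_accuracy(categories):
    """Question-weighted accuracy, recovering whole correct-answer counts."""
    total = sum(n for _, n in categories.values())
    # Round each category back to a whole number of correct answers.
    correct = sum(round(pct / 100 * n) for pct, n in categories.values())
    return correct, total, 100 * correct / total


correct, total, pct = overall_accuracy(CATEGORIES)
print(f"{correct}/{total} = {pct:.1f}%")  # 469/500 = 93.8%
```

Under those assumed counts, the headline number is a question-weighted figure. A plain average of the six percentages would come out near 94.6% instead, because the two largest categories carry more weight.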

§ III · How We Got Here

A multi-model ensemble, not a single monolithic model.

NodePlus is a system of specialized components working together. Each plays to a different strength of long-term memory.

Retrieval-Augmented Generation

RAG with a BM25 candidate union for broad recall, so relevant context is found whether the query matches by meaning or by keyword.
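
A candidate union of this kind can be sketched in a few lines. This is a minimal illustration, not the NodePlus implementation: the BM25 scorer is the textbook Okapi formula with a non-negative IDF, the function names and toy passages are hypothetical, and the "dense" hits are a hard-coded stand-in for whatever embedding retriever is actually used.

```python
import math
from collections import Counter


def bm25_top_k(query, docs, k=2, k1=1.5, b=0.75):
    """Rank docs with Okapi BM25; return indices of the top k nonzero scores."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many docs each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    scored = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    ranked = sorted(scored, key=lambda p: (-p[0], p[1]))
    return [i for score, i in ranked[:k] if score > 0]


def candidate_union(dense_ids, keyword_ids):
    """Order-preserving union: dense hits first, then keyword-only hits."""
    return list(dict.fromkeys([*dense_ids, *keyword_ids]))


# Toy passages, made up for illustration.
docs = [
    "the potency test finished on March 15",
    "the customer asked to reschedule the audit",
    "SOP-12 covers sample intake and labeling",
]
dense_ids = [1]  # stand-in for an embedding retriever's output (hypothetical)
keyword_ids = bm25_top_k("SOP-12 labeling", docs)
candidates = candidate_union(dense_ids, keyword_ids)
```

Here the exact identifier "SOP-12" is caught by the keyword side even though the stand-in dense retriever missed it, which is the case the union is meant to cover.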

Canonical Claims Layer

A dedicated store of verified, high-confidence facts the system can lean on without re-deriving them from raw history.

QUBO-Based Deduplication

Noise is eliminated before the model ever sees the context, so the prompt carries signal rather than redundant fragments.
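
To make the idea concrete, here is one generic way deduplication can be posed as a QUBO (quadratic unconstrained binary optimization) problem. Everything in it is an assumption for illustration: the page does not state the objective NodePlus uses, the Jaccard similarity and penalty weight are placeholders, and the brute-force search stands in for a real QUBO solver. Each passage gets a binary keep/drop variable; keeping a relevant passage lowers the energy, and keeping two similar passages together raises it.

```python
from itertools import product


def jaccard(a, b):
    """Token-set overlap between two passages, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def build_qubo(passages, relevance, penalty=2.0):
    """Upper-triangular Q: reward relevance on the diagonal,
    penalize keeping similar pairs off the diagonal."""
    n = len(passages)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = -relevance[i]
        for j in range(i + 1, n):
            q[i][j] = penalty * jaccard(passages[i], passages[j])
    return q


def solve_qubo_brute_force(q):
    """Minimize x^T Q x over binary x by enumeration (small n only)."""
    n = len(q)
    best_x, best_energy = None, float("inf")
    for x in product((0, 1), repeat=n):
        energy = sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))
        if energy < best_energy:
            best_x, best_energy = x, energy
    return best_x


# Toy passages and relevance scores, made up for illustration.
passages = [
    "the test finished on March 15",
    "the test finished on March 15 per the lab",
    "the user prefers weekly summaries",
]
relevance = [1.0, 0.9, 0.8]

keep = solve_qubo_brute_force(build_qubo(passages, relevance))
kept = [p for p, flag in zip(passages, keep) if flag]
```

With these toy inputs the near-duplicate second passage is dropped and the other two are kept. Enumeration is exponential in the number of passages, so anything beyond a handful of candidates would need a heuristic or dedicated solver.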

AP Memory Lattice

Associative, long-term knowledge storage that surfaces non-obvious connections a flat search would miss.

Category-Aware Routing

Each query is directed to the model configuration with the strongest track record for that type of question.

Three independent answer candidates are generated for each question, and the answer is taken from the source with the best track record for that question’s category. With this ensemble, NodePlus’s memory accuracy exceeds what any one of its models achieves alone.
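
The per-category selection described above can be sketched as a lookup table built from each source's track record. This is a simplified illustration under stated assumptions: the source names, category labels, accuracy figures, and candidate answers are all hypothetical, and the real routing logic may weigh more than a single accuracy number.

```python
# Hypothetical track record: accuracy per (source, category).
# All names and numbers here are made up for illustration.
TRACK_RECORD = {
    "source_a": {"temporal": 0.91, "preference": 0.80},
    "source_b": {"temporal": 0.88, "preference": 0.93},
    "source_c": {"temporal": 0.95, "preference": 0.85},
}


def build_routing_table(track_record):
    """Map each category to the source with the highest recorded accuracy."""
    categories = {c for scores in track_record.values() for c in scores}
    return {
        c: max(track_record, key=lambda s: track_record[s].get(c, 0.0))
        for c in categories
    }


def select_answer(category, candidates, routing_table):
    """Pick the candidate produced by the preferred source for this category."""
    return candidates[routing_table[category]]


routing = build_routing_table(TRACK_RECORD)

# Three candidate answers for one temporal question (made-up values).
candidates = {
    "source_a": "March 14",
    "source_b": "March 15",
    "source_c": "March 15",
}
answer = select_answer("temporal", candidates, routing)
```

In a setup like this, the track record would normally be measured on questions held out from the ones being scored, so the routing table reflects general strength per category.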

§ IV · Built for Regulated Industries

When your AI tells an auditor a test finished on March 15th, it had better be right.

NodePlus is purpose-built for industries where getting it right matters: regulated labs, food safety, financial services, and compliance-heavy operations. Our anchor customer is a multi-state regulated lab business where accurate recall of SOPs, test results, and compliance history is not optional.

A 93.8% score on a benchmark designed to stress-test long-term recall gives our customers evidence that the memory layer is ready for production.

Regulated labs · Food safety · Financial services · Compliance ops

§ V · The Benchmark

What LongMemEval actually measures

LongMemEval was developed by researchers to evaluate long-term memory in AI dialogue systems. It consists of 500 questions across six categories, each designed to test a different facet of memory: recall of what the user said, recall of what the assistant said, preference retention, multi-session continuity, temporal reasoning, and reasoning over updated information.

Scoring uses automated evaluation by an independent judge model, ensuring consistent and reproducible results. Every question has a verified ground-truth answer derived from the original conversation data.

Built to run your business, proven on the benchmark

See NodePlus on your operation, from shipping code and fixing bugs to regulations and compliance, and book a briefing on the system behind the 93.8% result.
