The industry’s toughest long-term AI memory benchmark.
Most AI systems forget what happened five minutes ago. NodePlus doesn’t. Our memory layer was measured against LongMemEval, a rigorous 500-question benchmark that tests whether AI can recall, reason about, and act on information from conversation histories spanning weeks and months.
But the score isn’t the point. NodePlus is a working business system that writes and ships code, repairs bugs, builds new features, and answers questions about your operation, your regulations, and your compliance.
LongMemEval measures one layer of the system: long-term memory. We ran it to show that foundation is solid. The real benefit is everything below working as one.
Generates production code against your stack and conventions, then opens the pull request for review.
Traces failures to root cause across the codebase and proposes the minimal, targeted fix.
Takes a request from spec to a working, reviewed change, with the right context already loaded.
Contracts, SOPs, schemas, decisions, and operational history, searchable and cited to the source.
Multi-state regulatory requirements kept current and scoped to the right jurisdiction for each question.
Accurate recall of test results, audit history, and the verified record behind every answer.
Memory is what makes the rest reliable. Code, fixes, features, and answers are only as good as what the system remembers about your business. The 93.8% score below is evidence that the memory holds; it is not the product itself.
LongMemEval breaks long-term memory into the distinct skills a business actually relies on, and scores each one independently.
When facts change over time, does the system use the latest information?
Can it pull relevant context from conversations that happened days or weeks ago?
Does it track what the assistant said earlier in the same conversation?
Does it track what the user said earlier in the same conversation?
When a user states preferences, does the system remember and apply them?
Can it answer when things happened, in what order, and how they relate over time?
NodePlus is a system of specialized components working together. Each plays to a different strength of long-term memory.
RAG with a BM25 candidate union for broad recall, so relevant context is found whether the query matches by meaning or by keyword.
A dedicated store of verified, high-confidence facts the system can lean on without re-deriving them from raw history.
Noise is eliminated before the model ever sees the context, so the prompt carries signal rather than redundant fragments.
Associative, long-term knowledge storage that surfaces non-obvious connections a flat search would miss.
Each query is directed to the model configuration with the strongest track record for that type of question.
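As a rough illustration of the hybrid retrieval component listed above, the sketch below scores documents with a plain Okapi BM25 implementation and unions the keyword hits with candidates from a semantic retriever. It is not NodePlus's implementation: the function names, the `k1` and `b` defaults, and the sample documents are all assumptions made for the example, and the dense (meaning-based) candidates are passed in as a ready-made list because the source does not say which embedding model is used.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in a non-empty list against the query (Okapi BM25)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k(scores, k):
    """Indices of the k best-scoring documents, dropping zero-score ones."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]

def candidate_union(dense_ids, keyword_ids):
    """Merge two candidate lists, keeping first-seen order and no duplicates."""
    seen, merged = set(), []
    for i in list(dense_ids) + list(keyword_ids):
        if i not in seen:
            seen.add(i)
            merged.append(i)
    return merged

# Hypothetical conversation snippets, for illustration only.
docs = [
    "the lab updated SOP-114 last week",
    "customer prefers email reports",
    "calibration procedure changed in March",
]
dense_ids = [2]  # assumed output of a semantic retriever
keyword_ids = top_k(bm25_scores("SOP-114", docs), k=2)
print(candidate_union(dense_ids, keyword_ids))  # [2, 0]
```

The union keeps a document if either retriever finds it, which is why an exact identifier such as a procedure number can still surface when a meaning-based search ranks it low.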
Three independent answer candidates are generated, and the best-performing source is selected per question category. This ensemble approach means NodePlus’s memory accuracy exceeds what any single model achieves alone.
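One way to picture the per-category selection described above is a lookup table built from each source's track record. The sketch below is a minimal illustration under stated assumptions, not the production logic: the source names, category labels, and accuracy figures are placeholders invented for the example, and it assumes the track record comes from questions held out from the final evaluation.

```python
def best_source_per_category(track_record):
    """Map each category to the source with the highest recorded accuracy."""
    return {
        category: max(by_source, key=by_source.get)
        for category, by_source in track_record.items()
    }

def select_answer(category, candidates, routing, fallback):
    """Pick the candidate from the preferred source for this category."""
    source = routing.get(category, fallback)
    return source, candidates[source]

# Placeholder numbers for illustration only; these are not measured results.
track_record = {
    "temporal-reasoning": {"source_a": 0.80, "source_b": 0.90, "source_c": 0.85},
    "preference": {"source_a": 0.95, "source_b": 0.70, "source_c": 0.75},
}
routing = best_source_per_category(track_record)

# Three hypothetical answer candidates for a single question.
candidates = {
    "source_a": "answer from A",
    "source_b": "answer from B",
    "source_c": "answer from C",
}
print(select_answer("temporal-reasoning", candidates, routing, fallback="source_a"))
```

Categories missing from the table fall back to a default source, so a question of an unfamiliar type still gets an answer.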
NodePlus is purpose-built for industries where getting it right matters: regulated labs, food safety, financial services, and compliance-heavy operations. Our anchor customer is a multi-state regulated lab business where accurate recall of SOPs, test results, and compliance history is not optional.
93.8% accuracy on a benchmark designed to stress-test exactly these recall scenarios gives our customers confidence that the system is production-ready.
LongMemEval was developed by researchers to evaluate long-term memory in AI dialogue systems. It consists of 500 questions across six categories, each designed to test a different facet of memory: knowledge updates, multi-session recall, recall of what the assistant said, recall of what the user said, preference retention, and temporal reasoning.
Scoring uses automated evaluation by an independent judge model, ensuring consistent and reproducible results. Every question has a verified ground-truth answer derived from the original conversation data.
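For readers who want to see how judge verdicts turn into the headline number, here is a minimal sketch. It assumes every question carries equal weight and that the judge returns a simple correct/incorrect verdict per question; the function name and the toy verdicts are made up for the example. Under that equal-weight assumption, 93.8% of 500 questions corresponds to 469 judged correct.

```python
from collections import defaultdict

def score(verdicts):
    """Turn (category, is_correct) pairs into overall and per-category accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in verdicts:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_category

# Toy verdicts for illustration only; not benchmark data.
toy = [
    ("temporal", True),
    ("temporal", False),
    ("preference", True),
    ("preference", True),
]
overall, per_category = score(toy)
print(overall)        # 0.75
print(per_category)   # {'temporal': 0.5, 'preference': 1.0}
```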
Fewer "I don’t have that information" responses when the system should know the answer.
Accurate recall of decisions, preferences, and facts from prior conversations.
Reliable temporal reasoning: the system knows what happened when.
Confidence that your AI assistant is working from verified, up-to-date information.
See NodePlus on your operation, from shipping code and fixing bugs to regulations and compliance, and book a briefing on the system behind the 93.8% result.