NodePlus scores 66.6% on SWE-bench Pro, a benchmark of real software-engineering tasks across four languages.
Here is the part that matters: every system that scores higher on the public leaderboard is a closed, paid, cloud model. NodePlus runs entirely on local models, with no per-token bills and no third-party AI provider in the loop, and still lands among the leaders.
The score is not the point. NodePlus is a working system that writes and ships code, repairs bugs, and builds features against your stack. We took the test to show the engine is real.
SWE-bench Pro measures one thing: whether the system can resolve real issues in real repositories. We ran it to prove the coding engine holds up under pressure.
Generates production code against your stack and conventions, then opens the pull request for review.
Traces failures to root cause across the codebase and proposes the minimal, targeted fix.
Takes a request from spec to a working, reviewed change, with the right context already loaded.
Tested on Go, Python, TypeScript, and JavaScript repositories, not a single-language demo.
The whole pipeline runs on local models that NodePlus operates. No external API, no per-token metering, no surprise bill.
Source, prompts, and context are processed only by local models. Nothing is ever sent to a third-party AI provider.
On the public 35-model leaderboard, only two systems score higher than NodePlus, and both are closed, paid, cloud models. Every open-weights model on the board scores lower.
Reference figures from the public SWE-bench Pro leaderboard (35 models). Competing systems shown by tier; NodePlus from run pro_731_v2.
Scores below are on the 597 instances that executed cleanly. Across those instances the system resolved 81.6% of issues, with the largest gains on the hardest languages.
JavaScript repositories are the current frontier at 43.8% (64 / 146), up 20.6 percentage points from raw retrieval, and the focus of ongoing indexer work. A further 134 instances across the hardest repositories could not be scored at all, mainly because the benchmark harness fetched shallow clones missing the target commits; those are infrastructure failures, not quality results, and are excluded from the rates above.
NodePlus pairs a local model with a structured retrieval and memory pipeline. The same hardware, without that pipeline, scores far lower.
Maps the repository into a structured candidate set so the right files surface before any code is generated.
Enriches each task with RAG retrieval, a BM25 keyword union, a canonical-facts layer, and an associative memory lattice.
Long-term, cross-file knowledge that surfaces non-obvious connections a flat search would miss.
Generation runs entirely on open models that NodePlus hosts, with a local embedding model and no external API calls.
The Gateway adds 16.6 points over raw retrieval, and up to 39 points on the hardest languages. A local model that sits mid-pack on its own is lifted past frontier paid systems, with no tokens purchased.
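To make the "BM25 keyword union" idea concrete, here is a minimal sketch of how keyword hits can be unioned with hits from an embedding retriever. It is an illustration only, not NodePlus's implementation: the function names, the sample file paths, the `top_k` cutoff, and the BM25 parameters `k1` and `b` are all assumptions, and the canonical-facts layer and memory lattice are left out.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


def candidate_union(dense_hits, keyword_scores, top_k=2):
    """Union dense hits with the top-k nonzero BM25 hits, keeping first-seen order."""
    ranked = sorted(keyword_scores.items(), key=lambda item: -item[1])[:top_k]
    keyword_hits = [name for name, score in ranked if score > 0]
    return list(dict.fromkeys(dense_hits + keyword_hits))


# Hypothetical repository snippets, reduced to a few keywords each.
docs = {
    "auth/session.go": "session token refresh expiry handler",
    "auth/login.go": "login password form handler",
    "db/pool.go": "connection pool timeout retry",
}
dense_hits = ["auth/login.go"]  # assumed output of an embedding retriever
scores = bm25_scores("session token expiry", docs)
candidates = candidate_union(dense_hits, scores)
print(candidates)
```

In this toy case the embedding retriever surfaces only the login file, while the keyword pass adds the session file that literally contains the query terms. A union keeps both, which is the usual reason for pairing the two retrieval styles.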
Every result on this page was produced on 100% local models with no tokens purchased. For regulated labs, financial services, and compliance-heavy operations, that means proprietary code and customer data are never sent to a third-party AI provider, and costs do not scale with usage.
You get a SWE-bench Pro result that outscores most paid frontier APIs, with no third-party model in the loop, at a fixed and predictable cost.
SWE-bench Pro evaluates whether an AI system can resolve real, verified issues drawn from production open-source repositories. Each instance is a genuine engineering task: the system must read the codebase, locate the problem, and produce a patch that passes the project’s own tests.
This run covered 731 instances across four languages and eleven repositories. Of those, 134 could not be scored because of infrastructure failures: shallow clones missing the target commit on the hardest repositories, plus a handful of other errors. On the 597 instances that executed cleanly, NodePlus resolved 81.6%. The headline 66.6% is the stricter figure: it counts all 134 unscored instances as misses.
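The relationship between the two rates can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. The resolved count itself is not stated on this page, so it is back-calculated from the rounded 81.6% rate and should be read as an estimate; the function and variable names are illustrative.

```python
TOTAL_INSTANCES = 731     # instances in the run
UNSCORED_INSTANCES = 134  # infrastructure failures, counted as misses in the headline
CLEAN_RATE = 0.816        # resolve rate on instances that executed cleanly


def headline_rate(total, unscored, clean_rate):
    """Return (clean instances, estimated resolved count, rate over all instances)."""
    clean = total - unscored
    # Assumption: the resolved count is inferred from the rounded clean rate.
    resolved = round(clean * clean_rate)
    return clean, resolved, resolved / total


clean, resolved, headline = headline_rate(
    TOTAL_INSTANCES, UNSCORED_INSTANCES, CLEAN_RATE
)
print(f"clean instances: {clean}")
print(f"estimated resolved: {resolved}")
print(f"headline rate: {headline:.1%}")
```

Dividing the same estimated resolved count by the clean instances instead of all instances gives back the 81.6% figure, so the two published rates are consistent with each other.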
Anti-contamination controls blocked the benchmark from writing into or retrieving from production memory, so no result reflects prior exposure to the test set.
Production-grade code generation and bug fixes, validated against a public engineering benchmark.
Results that outscore most paid frontier APIs, with no per-token cost.
Your source code, prompts, and context are never sent to a third-party AI provider.
Predictable, fixed cost: the bill does not grow every time the team ships.
See NodePlus ship code, fix bugs, and build features against your stack, and book a briefing on the local pipeline behind the 66.6% result.