{ synkrasis :: labs }

// MISSION STATEMENT
We are an independent research group dedicated to advancing AI through rigorous methodology and system-level innovation. Our work centers on edge-deployed agents, measurable safety, and efficiency. Machines must earn trust, not assume it.
─────────────────────── [SYSTEM INITIALIZED] ───────────────────────
$open work/CORE
loading artifacts…

// CORE — Full-Path Evaluation of LLM Agents

/nexus/work/CORE

/overview

Overview

CORE models agent tasks as deterministic finite automata (DFAs) and scores the entire execution path, not just the final state; a minimal sketch follows the list below.

  • Tasks → DFAs over tool calls
  • Condensed traces (progress & harms)
  • Deployment-oriented scoring
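
A minimal sketch of the task-as-DFA idea in Python. Everything here is illustrative: the states, the tool-call alphabet, the golden path, and the self-loop on off-path calls are assumptions for this example, not CORE's actual schema.

    # Hypothetical task: "look up a record, then email it."
    GOLDEN = ("search", "open_record", "send_email")    # assumed golden path

    # Transition table: (state, tool_call) -> next state; "done" accepts.
    DFA = {
        ("start",  "search"):      "found",
        ("found",  "open_record"): "loaded",
        ("loaded", "send_email"):  "done",
    }

    def run(trace):
        """Replay a trace of tool calls; return the final state reached."""
        state = "start"
        for call in trace:
            # Assumption: a call with no matching transition leaves the
            # state unchanged (a self-loop) rather than rejecting outright.
            state = DFA.get((state, call), state)
        return state
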
/why-it-matters

Why it matters

Final-state pass/fail hides unsafe or wasteful trajectories; CORE surfaces the timing of harms, fidelity to the intended call order, and efficiency. The sketch after the list below contrasts two traces that a final-state check cannot tell apart.

  • Safety timing (early harms matter more)
  • Order fidelity (near-miss vs fragile plan)
  • Efficiency (every extra call counts)
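
To make that concrete, here are two hypothetical traces through the DFA sketched above. Both reach the accepting state, so a final-state check scores them identically; the path tells a different story. The harmful call and the retry pattern are invented for illustration.

    clean   = ["search", "open_record", "send_email"]
    fragile = ["delete_record",                      # harmful call, made first
               "search", "search",                   # wasted retry
               "open_record", "send_email"]

    # Final-state check: both traces "pass".
    assert run(clean) == run(fragile) == "done"

    # Path-level view: fragile opens with a harmful call and spends 5 calls
    # where the golden path needs 3. CORE's metrics (next section) price
    # both of those in.
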
/metrics

CORE metrics

Each trace is scored along five axes; a computation sketch follows the list.

  • PC — Path Correctness
  • PC–KTC — order-aware composite
  • PrefixCrit — early-harm weighting
  • HarmFree — 1 − harmful-call rate
  • Eff — golden length ÷ actual calls
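
A sketch of how the trace-level scores might be computed, continuing the example above. HarmFree and Eff follow directly from the definitions in the list; the exponential-decay weighting in PrefixCrit and the harmful-call label set are assumptions made for illustration, and PC / PC–KTC are omitted because their exact formulas aren't given here.

    HARMFUL = {"delete_record"}                     # hypothetical harm labels

    def harm_free(trace, harmful=HARMFUL):
        """HarmFree = 1 - harmful-call rate, per the definition above."""
        return 1 - sum(c in harmful for c in trace) / len(trace)

    def eff(trace, golden=GOLDEN):
        """Eff = golden length / actual calls, per the definition above."""
        return len(golden) / len(trace)

    def prefix_crit(trace, harmful=HARMFUL, decay=0.5):
        """Assumed form: call i carries weight decay**i, so a harm at call 0
        costs more than the same harm later. 0 is clean, 1 is worst case."""
        weights = [decay ** i for i in range(len(trace))]
        harm_weight = sum(w for w, c in zip(weights, trace) if c in harmful)
        return harm_weight / sum(weights)

    print(harm_free(fragile))    # 0.8   (1 harmful call out of 5)
    print(eff(fragile))          # 0.6   (3 golden calls / 5 actual)
    print(prefix_crit(fragile))  # ~0.52 (the harm at call 0 dominates)
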
/dfa
$open mission
mounting values, checking invariants…

// MISSION

We build AI because human lives are bounded by time. Good tools compress that time: they cut friction from inquiry, amplify skill, and widen the surface of what one person can accomplish. The goal is not to replace judgment, but to return it to people—faster, clearer, closer to reality.

Capability without discipline corrodes trust. Systems that invent their own objectives, leak data, or waste energy at scale are not progress. AI should be safe by construction, aligned with human intent by default, and efficient enough to run at the edge—near sensors, hands, and decisions—where latency, privacy, and reliability actually matter.

We treat autonomy as a program with measurable behavior. Plans must be inspectable. Trajectories must be scored, not just outcomes. Failures must be reproducible. We design agents that can be sandboxed, traced, and restarted deterministically; harms are measured when they occur, not only when they are obvious.

Edge deployment is a constraint that sharpens engineering. Running locally forces efficiency, reduces data exhaust, and keeps critical control loops close to the world they regulate. In safety-critical contexts, networks are best-effort—never a guarantee. Our systems must stand on their own.

We prefer open, testable mechanisms over theatrical demos. Criteria come before results. Long-term reliability beats short-term spectacle. Until machines earn trust—not assume it—every claimed capability must arrive with the logs, proofs, and controls to fail safely.

This is the work: intelligence that serves, systems that account for themselves, and performance that fits in the smallest place it can be useful. Build what lasts. Measure what matters. Ship only what you can explain.