Canonical3
Canonical3 is a data infrastructure project developing a universal data layer for artificial intelligence (AI). It aims to address issues of data fragmentation and unreliability by transforming raw, unstructured inputs into a standardized, verifiable, and agent-ready format. [1] [2]
Overview
Canonical3 is positioned as a foundational data layer designed to resolve a critical bottleneck in the deployment of AI systems. The project observes that AI agents, despite rapid advancements in models, often behave unreliably or fail outright because they rely on inconsistent and fragmented data sources. This issue, which the project's whitepaper terms the "Canonical Gap," stems from critical information being spread across disparate documents, logs, and sensor feeds without a common structure or format. [2]
The core solution proposed by Canonical3 is a framework called the Canonical Layer. This layer functions similarly to data normalization in relational databases, acting as an intermediary that standardizes information before it is consumed by AI agents or models. The objective is to establish a single, ordered, and trusted source of truth for data, enabling AI systems to operate with greater reliability, determinism, and auditability. The project was introduced publicly with the creation of its X (formerly Twitter) profile in December 2025 and the publication of its version 1.0 whitepaper on December 12, 2025. [3] [2]
According to project materials from early 2026, Canonical3 was reporting early adoption and traction metrics. These included over 50 terabytes of enterprise data undergoing active canonicalization, over 25 million events per day being normalized into its structured objects, and over 3,000 high-stakes procedures being mapped into computable workflows. The project is being developed by a core team of more than eight engineers and researchers. [1]
Architecture
Canonical3's architecture is designed as a foundational layer within a broader AI infrastructure stack and includes a detailed data processing pipeline to create its structured data objects.
AI Infrastructure Stack
The project situates itself as the base layer, or "Layer 1," in a three-layer conceptual model for AI infrastructure:
- Layer 1: The Canonical Layer (Canonical3): This base layer is responsible for ingesting raw data from various sources and normalizing it into structured, canonical objects. It is intended to function as the trusted memory and intelligence foundation for the entire stack.
- Layer 2: Infrastructure (Compute & Transport): The middle layer is composed of the systems that process and move the normalized data provided by the Canonical Layer.
- Layer 3: Orchestration (Agents & Models): The top layer consists of AI agents and models that consume the structured intelligence from the lower layers to perform tasks and make decisions. [1]
Data Processing Pipeline
The whitepaper details a multi-stage pipeline for transforming raw inputs into Canonical Objects:
- Ingestion: A unified loader ingests a wide variety of data formats, including documents (PDF, DOCX, HTML), datasets (CSV, logs), and real-time sensor streams (GPS, IMU, audio, video, IoT), attaching source metadata for traceability.
- Decomposition: The raw data is broken down into its fundamental components. Textual information is parsed into assertions, rules, and procedures, while sensor signals are analyzed to identify events, states, and environmental features.
- Normalization: The system applies principles from database normalization to semantic data. This stage aims to remove redundancy, enforce atomicity, and resolve inconsistencies to create a clean, logical representation.
- Schema Alignment: The decomposed and normalized data is then mapped to formal, domain-specific Canonical Schemas. These schemas provide a strict structure for data in fields such as healthcare procedures, financial compliance, or robotics.
- Attribute Typing: Each data attribute is assigned a specific type, unit, confidence score, and provenance information. This enriches the data, making it more explicit and machine-readable.
- Object Generation: Finally, the fully processed data is used to generate immutable, versioned Canonical Knowledge Objects (CKOs) and Canonical Sensory Objects (CSOs), which are then indexed for querying. [2]
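The pipeline stages above can be sketched as a typed transformation chain. The following TypeScript is a minimal illustration only; every type, function name, and field here is an assumption for clarity, not Canonical3's actual API, and decomposition and normalization are collapsed into one simplified step.

```typescript
// Illustrative sketch of the ingestion-to-object pipeline; all names are
// hypothetical, not Canonical3's real interfaces.

type RawInput = { source: string; payload: string };

interface CanonicalObject {
  id: string;
  version: number;
  schema: string; // the Canonical Schema this object is aligned to
  attributes: Record<string, { value: unknown; unit?: string; confidence: number }>;
  provenance: { source: string; ingestedAt: string };
}

// Decomposition + normalization, collapsed for brevity: split the payload
// into atomic assertions and deduplicate to remove redundancy.
function decompose(input: RawInput): string[] {
  return [...new Set(input.payload.split(".").map(s => s.trim()).filter(Boolean))];
}

// Schema alignment + attribute typing: map each assertion to a typed
// attribute carrying a confidence score, then stamp provenance metadata.
function toCanonicalObject(input: RawInput, schema: string): CanonicalObject {
  const attributes: CanonicalObject["attributes"] = {};
  decompose(input).forEach((assertion, i) => {
    attributes[`a${i}`] = { value: assertion, confidence: 1.0 };
  });
  return {
    id: `cko-${input.source}-0`,
    version: 1,
    schema,
    attributes,
    provenance: { source: input.source, ingestedAt: new Date().toISOString() },
  };
}

const cko = toCanonicalObject(
  { source: "policy.pdf", payload: "Agents must log actions. Logs are immutable." },
  "compliance.v1"
);
```

The key property the sketch preserves is that the output object is versioned, schema-tagged, and traceable to its source, which is what the pipeline's final indexing stage depends on.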
The architecture also incorporates a vector-graph hybrid index combined with canonical schema catalogs. This system is designed to support both semantic searches (for finding conceptually similar information) and deterministic, structured queries (for retrieving exact data based on defined schemas). [2]
Products
Canonical3's offerings are centered around its core data layer, the structured data objects it produces, and a specialized data notation language and toolset.
The Canonical Layer
The primary product is the Canonical Layer itself, a foundational platform that serves as an intermediary between raw data sources and AI applications. It standardizes diverse inputs into a shared, structured format, aiming to ensure that all data consumed by AI agents is consistent, reliable, and traceable. [1]
Canonical Objects
The Canonical Layer represents all processed information as two primary types of structured data primitives, designed to be predictable and interpretable by AI agents.
Canonical Knowledge Objects (CKOs)
CKOs represent static knowledge extracted from sources such as documents, policies, and procedural manuals. They are designed to capture rules, regulations, and operational guidelines in a clear, versioned, and machine-readable format. This allows AI agents to reason over a stable and explicit set of rules rather than interpreting unstructured text. [1]
Canonical Sensory Objects (CSOs)
CSOs represent dynamic, real-world data derived from event streams and environmental sensors. These objects normalize inputs from sources like GPS, IMU (Inertial Measurement Unit), and other sensor feeds. This process ensures consistent units, timing, and semantics, creating a standardized and unified view of real-world events for an AI system. [1]
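The unit and timing normalization that CSOs are described as performing can be illustrated with a short sketch. The record shapes and field names below are assumptions for the example, not the project's actual object schema.

```typescript
// Hypothetical sketch of normalizing heterogeneous sensor readings into a
// CSO-like record with consistent units (metres) and timing (Unix ms).

interface Reading {
  sensor: string;
  value: number;
  unit: "ft" | "m";
  ts: number | string; // epoch ms or ISO 8601 string, depending on the feed
}

interface NormalizedEvent {
  sensor: string;
  metres: number;
  timestampMs: number;
}

function normalize(r: Reading): NormalizedEvent {
  return {
    sensor: r.sensor,
    metres: r.unit === "ft" ? r.value * 0.3048 : r.value, // 1 ft = 0.3048 m
    timestampMs: typeof r.ts === "string" ? Date.parse(r.ts) : r.ts,
  };
}

const events = [
  { sensor: "altimeter", value: 100, unit: "ft" as const, ts: "2026-01-01T00:00:00Z" },
  { sensor: "lidar", value: 30.5, unit: "m" as const, ts: 1767225600000 },
].map(normalize);
```

After normalization both readings share the same units and time base, which is the precondition for merging feeds into the "unified view of real-world events" described above.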
CanL3 Notation Language
Canonical3 provides an open-source data format and platform called CanL3, which stands for Canonical3 Notation Language. CanL3 is a human-readable, text-based format positioned as a more compact and efficient alternative to JSON, particularly for optimizing Large Language Model (LLM) token usage. Performance benchmarks claim the format is up to 36% smaller than JSON by byte size and uses 45% fewer tokens with certain models. [4]
The CanL3 platform includes several components:
- Developer Tools: The project provides a suite of tools, including an interactive Command-Line Interface (CLI) for data exploration, a VS Code extension for syntax highlighting, and TypeScript-first APIs for serialization, querying, and data modification.
- CanL3 Schema Language (TSL): CanL3 includes TSL, its own schema definition language used in .schema.CanL3 files. TSL allows for the definition of data types and the enforcement of 13 different validation constraints, such as required, pattern (regex), unique, and min/max value or length. [4]
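A schema-driven validator of the kind TSL describes can be sketched in a few lines. The snippet below implements four of the thirteen constraint kinds (required, pattern, min, max); the constraint representation and API shape are illustrative assumptions, not CanL3's real interface.

```typescript
// Minimal sketch of TSL-style constraint validation; the constraint set and
// error format are assumptions for illustration only.

type Constraint =
  | { kind: "required" }
  | { kind: "pattern"; regex: RegExp }
  | { kind: "min"; value: number }
  | { kind: "max"; value: number };

type Schema = Record<string, Constraint[]>;

// Check each field of `data` against its declared constraints and collect
// human-readable violation messages.
function validate(schema: Schema, data: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [field, constraints] of Object.entries(schema)) {
    const v = data[field];
    for (const c of constraints) {
      if (c.kind === "required" && v === undefined) errors.push(`${field}: required`);
      if (v === undefined) continue; // remaining constraints only apply to present values
      if (c.kind === "pattern" && !c.regex.test(String(v))) errors.push(`${field}: pattern`);
      if (c.kind === "min" && Number(v) < c.value) errors.push(`${field}: min`);
      if (c.kind === "max" && Number(v) > c.value) errors.push(`${field}: max`);
    }
  }
  return errors;
}

const schema: Schema = {
  id: [{ kind: "required" }, { kind: "pattern", regex: /^cko-/ }],
  confidence: [{ kind: "min", value: 0 }, { kind: "max", value: 1 }],
};

const ok = validate(schema, { id: "cko-42", confidence: 0.9 });
const bad = validate(schema, { confidence: 1.5 });
```

Here `ok` is an empty error list, while `bad` reports the missing id and the out-of-range confidence, mirroring how strict-mode validation would reject a malformed object before it enters the canonical store.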
Features
The Canonical3 framework and its associated CanL3 toolset are designed to provide a range of capabilities for building reliable AI systems.
Platform Features
The core data layer is intended to enable the following systemic qualities:
- Data Normalization: The system ingests and standardizes heterogeneous data types into a shared format, creating a single source of truth.
- Reliable Agent Behavior: By providing AI systems with a consistent and unified data source, the platform aims to prevent conflicting interpretations and ensure more predictable agent behavior.
- Deterministic Workflows: The use of a single, trusted data state allows agent actions and decisions to follow clear, predictable rules based on structured inputs.
- Auditability by Design: Built-in versioning and data lineage allow all outcomes to be traced back to the specific source data and version used, making all AI operations fully auditable.
- Composability: The shared data foundation is designed to allow multiple, distinct AI agents to coordinate and operate on the same verified information, enabling the creation of more complex, interoperable systems. [1]
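The "auditability by design" claim rests on versioning plus lineage. As a hedged sketch of that idea (the store, its methods, and the parent-pointer scheme are all assumptions, not Canonical3 internals), each write can create a new immutable version pointing at its predecessor, so any outcome is traceable to the exact data version used:

```typescript
// Illustrative versioned store with lineage; names and structure are
// hypothetical, not Canonical3's implementation.

interface Versioned<T> {
  version: number;
  parent: number | null; // previous version, or null for the first write
  data: T;
}

class VersionedStore<T> {
  private history: Versioned<T>[] = [];

  // Append a new immutable version and return its version number.
  commit(data: T): number {
    const parent = this.history.length ? this.history.length - 1 : null;
    this.history.push({ version: this.history.length, parent, data });
    return this.history.length - 1;
  }

  at(version: number): Versioned<T> {
    return this.history[version];
  }

  // Walk parent pointers back to the origin: the audit trail for a version.
  lineage(version: number): number[] {
    const chain: number[] = [];
    let v: number | null = version;
    while (v !== null) {
      chain.push(v);
      v = this.history[v].parent;
    }
    return chain;
  }
}

const store = new VersionedStore<string>();
store.commit("speed limit: 50");
store.commit("speed limit: 60");
const trail = store.lineage(1); // [1, 0]
```

An agent decision recorded against version 1 can later be audited against exactly the rule text it saw, even after further updates.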
CanL3 Technical Features
The CanL3 notation language and its tooling offer specific technical advantages for data handling:
- Serialization: The format is designed for efficiency, aiming for a 32-45% smaller size than JSON in both bytes and tokens. It remains human-readable and provides round-trip-safe conversion with JSON.
- Querying and Navigation: The platform supports JSONPath-like queries, filter expressions, wildcards, and tree traversal. An LRU (Least Recently Used) cache is included to speed up repeated queries.
- Data Modification: A suite of APIs is available for CRUD (Create, Read, Update, Delete) operations, bulk operations such as merge and update, and change tracking via a diff function.
- Indexing and Performance: CanL3 supports Hash, BTree, and compound indexes for fast lookups and is optimized for stream processing of large files with low memory usage.
- Advanced Optimization: The platform incorporates numerous compression and optimization techniques, including Dictionary Encoding, Delta Encoding, Run-Length Encoding, Bit Packing, and Numeric Quantization. It also features a tokenizer-aware optimization strategy for LLMs.
- Schema and Validation: CanL3 supports runtime data validation against schemas defined in TSL, with options for strict mode enforcement and the auto-generation of TypeScript types from schemas. [4]
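Two of the optimization techniques listed above, delta encoding and run-length encoding, are standard and easy to sketch. These are textbook versions for illustration, not CanL3's actual implementation.

```typescript
// Textbook delta and run-length encoding, two of the techniques the CanL3
// feature list names; not CanL3's internal code.

// Delta encoding: store the first value, then successive differences.
// Timestamps and counters compress well because the deltas are small.
function deltaEncode(xs: number[]): number[] {
  return xs.map((x, i) => (i === 0 ? x : x - xs[i - 1]));
}

function deltaDecode(ds: number[]): number[] {
  const out: number[] = [];
  for (const d of ds) out.push(out.length ? out[out.length - 1] + d : d);
  return out;
}

// Run-length encoding: collapse runs of repeated values into [value, count]
// pairs, which shrinks columns with long constant stretches.
function rleEncode(xs: number[]): [number, number][] {
  const runs: [number, number][] = [];
  for (const x of xs) {
    const last = runs[runs.length - 1];
    if (last && last[0] === x) last[1]++;
    else runs.push([x, 1]);
  }
  return runs;
}

const timestamps = [1000, 1010, 1020, 1030];
const deltas = deltaEncode(timestamps); // [1000, 10, 10, 10]
const runs = rleEncode([5, 5, 5, 7, 7]); // [[5, 3], [7, 2]]
```

Both transforms are lossless, so a decoder recovers the original stream exactly, which is a prerequisite for the round-trip-safe conversion the format claims.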
Ecosystem
As of early 2026, the Canonical3 ecosystem was in its early stages of development, with a focus on integrations and developer community engagement. The project reports having live integrations with over ten agent frameworks and various "core systems," although specific names of these frameworks and systems have not been publicly disclosed. [2]
A key part of the ecosystem is the open-source CanL3 component. The source code for the notation language, parsers, and developer tools is available on GitHub under an MIT license, allowing developers to build with and contribute to the format. The project maintains a public presence through its official website, GitHub repository, and social media channels to engage with the developer community. [4]
Use Cases
The Canonical3 framework is presented as applicable across various industries where high-stakes, data-driven automation is required. The following are potential use cases cited by the project for its platform and the CanL3 format:
- Healthcare Triage: Normalizing patient history documents, lab results, and real-time monitoring data for consistent clinical assessment by AI agents.
- Robotics: Standardizing and merging data from multiple sensors, such as SLAM (Simultaneous Localization and Mapping), IMU, and cameras, to create a unified world model for autonomous navigation.
- Compliance and Finance: Automating the verification of actions against documented policies and complex regulatory rules by converting them into computable CKO workflows.
- Supply Chain Management: Unifying and reconciling logistics manifests, shipping updates, and inventory data from different partners and systems into a single, coherent view.
- Enterprise AI: Vectorizing and structuring internal knowledge bases, documents, and logs to create a reliable "enterprise brain" for internal chatbots and agents to query.
- Spatial Operations: Merging and normalizing telemetry data from satellites, drones, and other geospatial sensors for applications in agriculture, defense, and environmental monitoring.
- LLM Prompt Engineering: Using the compact CanL3 format to provide structured data to Large Language Models, aiming to reduce token costs and API expenses.
- Data Engineering: Facilitating stream processing for large datasets in data pipelines and using the format for structured log aggregation to simplify querying and analysis. [1] [4]
Tokenomics
Canonical3 plans to incorporate a "Tokenized Incentive Layer" to create a self-sustaining economy around the creation and maintenance of high-quality canonical data. The whitepaper also refers to this as an "Optional Incentive Layer" and mentions a "Governance" model, suggesting a token may be planned to facilitate decentralized network operations. [1] [2]
Token Utilities
The proposed utilities for the project's native token focus on rewarding data contributors:
- Creator Rewards: To provide perpetual reward flows to individuals and organizations that create and contribute valuable canonical datasets to the network.
- Query-Based Yield: To generate token-based yield for data owners each time their canonical data is queried, incentivizing the curation and maintenance of useful intelligence. [1]
Allocation and Governance
As of early 2026, specific details regarding the project's token—including its name, ticker, total supply, allocation model, and governance structure—have not been specified in available materials. [1] [2]
Confirmed Partnerships
Although the project states it has "live integrations across core systems," no specific enterprise partners or project collaborations have been officially named in the provided documentation as of early 2026. [1] [2]
Key People
Canonical3 is developed by a core team of more than eight engineers and researchers with experience in AI systems, data infrastructure, and applied machine learning.
Lavrentin Arutyunyan serves as the Chief Data Scientist for the project. He holds a PhD in mathematical and physical sciences from Lomonosov Moscow State University. His background is in applied mathematics and large-scale data systems. Prior to his role at Canonical3, Arutyunyan led teams responsible for AI evaluation, Reinforcement Learning from Human Feedback (RLHF) datasets, and production analytics at Yandex. At Canonical3, he leads the approach to data quality, alignment, and deterministic evaluation, ensuring that agents operate on reliable and verifiable intelligence. [1]