Beyond Words: Toward Multidimensional Concept Models for True AI Reasoning
I never liked the idea of "trust your gut," until I was older and it dawned on me:
Your limbic brain, that ancient sage with all its intrinsic knowledge, and ability to deeply comprehend your senses, has no faculty for language. Instead, it communicates with you by way of intuitions... gut feelings...
Trust your gut
In recent years, language models have captured the spotlight in artificial intelligence. Their ability to generate coherent text, mimic human reasoning in certain contexts, and process vast amounts of linguistic data has been nothing short of revolutionary. And yet, I believe that this obsession with language as the primary medium for AI is fundamentally misguided.
Today, I propose that the true path to genuine AI reasoning lies not in language models, but in what I call concept models: multidimensional systems that capture the distilled essence of our experiences through the underlying principles of our physical world.
The Limits of Language
At the heart of our premise is the realization that language is a projection, a convenient, socially agreed-upon system of labels we use to describe and communicate our experiences. When we speak, we select from an array of words and phrases to convey abstract concepts. But these labels, however diverse, are secondary to the core ideas they represent. Language is a tool, a surface-level projection that masks the deeper patterns governing reality.
The process of reasoning is fundamentally about pattern matching. It involves assessing relationships and evaluating, “Does this match that, and to what degree?” This process is entirely language agnostic.
Consider how a cup in a kitchen is for drinking, while a cup in a garden is a planter. The same object in different contexts takes on entirely different meanings, yet our brains resolve these ambiguities effortlessly. This kind of reasoning, mapping overlapping properties and applying them flexibly across contexts, is how novel problems are solved.
This is not an ability that requires language. Deaf and blind individuals, as well as creatures like crows and octopuses, demonstrate complex problem-solving skills without relying on sophisticated linguistic systems. True reasoning is about internalizing core concepts and navigating the world based on a deep, often subconscious, understanding of these patterns.
Concept Models: A New Paradigm for AI
While language models excel at processing and generating text based on statistical correlations, they remain bound to surface-level representations of information. They reassemble chains of thought that mirror human reasoning in form, not in essence. To capture true reasoning, we need models that internalize the fundamental aspects of our world, models that understand concepts beyond the veneer of language.
Enter concept models. Unlike language models, concept models would operate on a higher-dimensional, distilled representation of reality. They would not just process raw data, but abstract and integrate information at a conceptual level.
Consider the concept can hold water. This is not a single fact, but an emergent property composed of sub-concepts like is rigid, is non-porous, and has an approximately concave geometry. These building blocks form the foundation of understanding. A concept model does not simply recognize a cup as an object labeled “cup”; it understands what makes something a cup based on its functional properties. This allows it to recognize that a coconut shell, a folded leaf, or a deep rock cavity can also hold water.
This ability to deconstruct and recompose concepts is what allows humans to apply knowledge flexibly, solving new problems by analogy rather than memorization. This is precisely what today’s language models lack.
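As a rough sketch of this idea (all class and function names below are hypothetical, not a prescribed implementation), a concept like can hold water could be expressed as a predicate over a handful of concept primitives, so that anything satisfying the primitives qualifies, regardless of its label:

```python
from dataclasses import dataclass

@dataclass
class ObjectFeatures:
    """Hypothetical primitive features extracted from perception, not from a label."""
    rigidity: float         # 0.0 (soft) .. 1.0 (rigid)
    porosity: float         # 0.0 (sealed) .. 1.0 (fully porous)
    concavity_depth: float  # approximate depth of the concave region, in cm

def can_hold_water(obj: ObjectFeatures) -> bool:
    """Emergent concept composed of primitives: is rigid, is non-porous, is concave."""
    return obj.rigidity > 0.5 and obj.porosity < 0.2 and obj.concavity_depth > 1.0

# A ceramic cup and a coconut shell satisfy the same primitives; a paper towel does not.
cup = ObjectFeatures(rigidity=0.9, porosity=0.05, concavity_depth=8.0)
coconut_shell = ObjectFeatures(rigidity=0.8, porosity=0.10, concavity_depth=5.0)
paper_towel = ObjectFeatures(rigidity=0.1, porosity=0.90, concavity_depth=0.0)

print(can_hold_water(cup))            # True
print(can_hold_water(coconut_shell))  # True
print(can_hold_water(paper_towel))    # False
```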
Learning Through Interaction: Lessons from Video Games
One of the most promising arenas for developing these multidimensional concept models is in simulated environments, video games for instance.
Video games already provide an approximation of the physical world where the laws of geometry, physics, and interaction are at play. Studies in embodied learning suggest that several factors, both internal and external (such as the body and the environment), play a role in the development of an agent's cognitive capacities.
In these virtual worlds, an AI can experiment, break things, and learn the consequences of its actions over billions of interaction configurations. This mirrors the way children explore their surroundings, purposefully testing boundaries to understand the world’s limits.
A concept model in such an environment would not simply regurgitate patterns found in text. Instead, it would actively engage with a rich “soup” of sensory information, distilling from it the key “concept primitives”. By interacting with its environment, the model could learn to predict outcomes, understand cause and effect, and build an internal representation of the physical world. This would create a true conceptual map from which higher reasoning abilities can emerge.
Distillation: The Key to Dimensional Reduction
A common misconception is that training these rich, multidimensional models requires enormous amounts of high-fidelity sensory data. I posit that the dimensionality we need is already within reach because it is, by its nature, a distilled abstraction. I call these “concept primitives.” Concept primitives make up the fundamental features of both objects and relationships. Complex concepts are simply recomposed concept primitives. Our concept model need not maintain a massive corpus of interactions and relationships, simply a set of irreducible axioms.
So how does distillation work?
Distillation happens both at the sensory organ and in the brain itself. For instance, your ocular nerves are constantly receiving peripheral vision information, but your brain never acts upon that information unless something novel appears (unexpected motion, for instance). This pre-stage pattern matching can be thought of as “resolution,” or “focus.”
Imagine walking out of a dark room into the sunny outdoors. Your eyes receive the scene instantly, all at once, and it is momentarily confusing. It takes a moment for something interesting to occur… As time passes, with each individual processing tick (approximately 25 per second in humans), the eyes average the scene and filter out what is common.
It’s not until after this “recalibration” that you can rely on the inputs of your eyes. But even after this pre-filtering, the human brain does not record every pixel of every scene it sees; instead it focuses on essential features and searches for novelty (things that break the average).
Concept models can learn to distill the complexity of raw data into a set of fundamental principles in much the same way, by focusing attention on novelty, and canceling the noise, mimicking human neurobiology.
This process of distillation transforms a dense, high-dimensional reality into a manageable and efficient representation. Instead of memorizing every instance of a “cup,” the model extracts the core properties that define what makes an object functionally a cup. This is the essence of intelligence, the ability to abstract and generalize beyond specific instances.
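One way to picture this distillation, purely as an illustrative sketch, is a running average that cancels what is common across frames and keeps only what breaks it (the array shapes, decay factor, and threshold below are assumptions for the example, not a claim about the brain or a real system):

```python
import numpy as np

def distill_novelty(frames: np.ndarray, decay: float = 0.9, threshold: float = 0.3):
    """Keep only regions that deviate from the running average of the scene.

    frames: array of shape (time, height, width), values in [0, 1].
    Returns a list of boolean novelty masks, one per frame after the first.
    """
    running_avg = frames[0].astype(float)
    novelty_masks = []
    for frame in frames[1:]:
        deviation = np.abs(frame - running_avg)
        novelty_masks.append(deviation > threshold)                # things that break the average
        running_avg = decay * running_avg + (1 - decay) * frame    # slow "recalibration"
    return novelty_masks

# A mostly static scene with one moving bright spot: only the spot registers as novel.
rng = np.random.default_rng(0)
frames = np.full((10, 32, 32), 0.5) + rng.normal(0, 0.01, (10, 32, 32))
for t in range(10):
    frames[t, 5 + t, 5 + t] = 1.0
masks = distill_novelty(frames)
print(sum(m.sum() for m in masks))  # roughly ten "novel" pixels survive out of ~9,000
```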
A Mathematical Framework for Concept Models
To move beyond abstract ideas and into actionable design, it is useful to formalize our approach with a mathematical framework. In our hypothetical system, we encode the world’s understanding into four distinct matrix layers, each capturing a different facet of reasoning. This layered representation allows us to transform raw sensory input into meaningful, goal-directed actions through a series of mathematically tractable steps.
Matrix Layer 1: Physical Properties (World Model Training)
This layer, denoted M₁, is represented by a fixed-length tensor composed of matrices, each of which encapsulates a snapshot of the current state of the world by encoding raw physical properties.
This fixed-length tensor mimics short-term memory, with the first matrix representing the current focus or “attention,” and older matrices carrying a decay factor.
Each row or element of a matrix in M₁ represents an object or entity in the environment, characterized by a vector of features such as shape, texture, rigidity, and concavity. In effect, M₁ is a high-dimensional snapshot of “what exists” in the system at any given moment.
For example, if an object is a cup, its corresponding vector in M₁ might include values corresponding to its geometric dimensions, material properties, and spatial location. Mathematically, if we let xᵢ be the feature vector for object i, then:
M₁ = [x₁; x₂; …; xₙ]
where n is the number of objects detected in the environment.
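A minimal sketch of M₁ as a fixed-length rolling buffer of scene matrices, with the newest snapshot in the attention slot and older ones decayed (the feature layout, buffer length, and decay schedule are illustrative assumptions):

```python
import numpy as np

class M1ShortTermMemory:
    """Fixed-length tensor of scene snapshots; index 0 is the current focus ("attention"),
    older snapshots are down-weighted by a decay factor."""

    def __init__(self, n_objects: int, n_features: int, length: int = 8, decay: float = 0.8):
        self.snapshots = np.zeros((length, n_objects, n_features))
        self.decay = decay

    def observe(self, scene: np.ndarray) -> None:
        """scene: an (n_objects, n_features) matrix [x1; x2; ...; xn] for the current tick."""
        self.snapshots = np.roll(self.snapshots, shift=1, axis=0)
        self.snapshots[1:] *= self.decay   # older memories fade
        self.snapshots[0] = scene          # newest snapshot becomes the current focus

# Example: 3 objects, each described by (x, y, rigidity, porosity, concavity).
memory = M1ShortTermMemory(n_objects=3, n_features=5)
memory.observe(np.random.rand(3, 5))
```

Rolling the buffer and decaying the older matrices keeps the representation fixed-size while still preserving a short window of temporal history.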
This lends itself well to object detection and classification of sensory inputs like vision, pattern matching in audio inputs, etc.
Multimodal “sensory” input provides the encoded “features” for each object in high-dimensional space, including their relations to each other (the more sensory inputs, the higher the fidelity of the internal concept).
During training, we are able to build up a rich representation of not just the current “scene,” but causal effects, since the tensor’s “snapshots” represent temporal changes in the environment.
Backpropagation reinforces this “concept primitive,” which becomes a form of “inference” of cause and effect:
“Changes observed in previous matrices may lead to the present configuration”.
For instance, an object’s y coordinate decreases sequentially across the tensor (it is falling), until at some point the object becomes more than one object (it breaks).
With repeated exposure, it can be inferred that certain objects with a rapidly decreasing y value will likely decompose into fragments upon reaching the ground (y = 0).
This is a self-discovered concept, composed of primitives, whose predictive power is used later in M₃ during problem solving.
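As a toy illustration of how such a primitive might be read off the snapshots in M₁ (the detection rule, thresholds, and feature names are invented for the example):

```python
import numpy as np

def infer_falling(y_history: np.ndarray, min_drop_per_tick: float = 0.5) -> bool:
    """Self-discovered primitive: y decreasing steadily across snapshots => "falling"."""
    drops = y_history[:-1] - y_history[1:]
    return bool(np.all(drops > min_drop_per_tick))

def predict_fragmentation(y_history: np.ndarray, rigidity: float) -> bool:
    """With repeated exposure: a falling, rigid object that reaches y = 0 is predicted to break."""
    return infer_falling(y_history) and y_history[-1] <= 0.0 and rigidity > 0.7

y_history = np.array([10.0, 7.5, 4.8, 2.1, 0.0])        # y coordinate across successive ticks
print(predict_fragmentation(y_history, rigidity=0.9))    # True: expect fragments
```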
Matrix Layer 2: Bounding Constraints (Physical Laws)
The second layer, M₂, encodes the immutable constraints imposed by the physical world, essentially the laws of physics and other environmental rules. M₂ can be thought of as a collection of functions or constraint matrices that limit how the elements of M₁ can interact. For instance, M₂ ensures that only objects with the right properties (e.g., non-porosity and concavity) can hold water.
Mathematically, M₂ defines a set of constraints or transformation rules C such that:
C(M₁) → valid configurations
In practice, M₂ might be implemented as a series of inequalities or equations that filter out physically impossible states, much like how a physics engine restricts the possible movements of objects in a simulation.
There is no need to re-invent the wheel here, as sufficiently sophisticated physics engines exist.
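A sketch of how M₂ might be expressed as a set of predicate constraints that filter out physically impossible configurations; in practice, as noted above, this role would be delegated to an existing physics engine (the specific rules below are placeholders):

```python
from typing import Callable, Dict, List

# Each constraint maps an object's feature dict to True if that state is physically valid.
Constraint = Callable[[Dict[str, float]], bool]

M2_CONSTRAINTS: List[Constraint] = [
    lambda obj: obj["mass"] > 0,    # no massless objects
    lambda obj: obj["y"] >= 0,      # nothing below the ground plane
    # only non-porous, concave objects may be in a "holds water" state
    lambda obj: obj["holds_water"] < 0.5 or (obj["porosity"] < 0.2 and obj["concavity"] > 0),
]

def valid_configurations(objects: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """C(M1) -> valid configurations: keep only states allowed by every constraint."""
    return [obj for obj in objects if all(c(obj) for c in M2_CONSTRAINTS)]

scene = [
    {"mass": 0.3, "y": 1.0, "holds_water": 1.0, "porosity": 0.05, "concavity": 4.0},  # cup: kept
    {"mass": 0.1, "y": 1.0, "holds_water": 1.0, "porosity": 0.90, "concavity": 0.0},  # sieve: rejected
]
print(len(valid_configurations(scene)))  # 1
```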
Matrix Layer 3: Unknown Transform (Inferred Transformation Sequence)
With a corpus of concept primitives, causal effects, and “previous experience,” reasoning is mostly a search problem.
Using cosine similarity and a “next best fit” heuristic, approximate configurations can likely be found with standard loss functions.
The third layer, M₃, represents the “unknown transform”, the computational engine that bridges the gap between the current state and the desired outcome. This layer is where the system learns to infer the necessary transformations or actions that can transition the environment from its current state toward a goal state.
Here, M₃ can be viewed as a set of functions or transformation matrices that, when applied to the representation in M₁ (subject to the constraints in M₂), move the system closer to the goal. We can formalize this idea as:
T = M₃(M₁, M₂)
where T is the transformation set that minimizes the difference between the current state and a target state. Importantly, M₃ is “unknown” at the outset—it is discovered through interaction and experimentation, similar to how a crow learns by trial and error.
Matrix Layer 4: Desired State (Goal Representation)
Finally, the fourth layer, M₄, encodes the goal or the “desired state” of the system. This matrix outlines the target configuration that the AI aims to achieve. For example, if the objective is to transport water to a plant, M₄ would encapsulate the necessary conditions for the water’s successful delivery, such as being contained within an appropriate vessel.
The overall goal of the system is to find a transformation T (derived via M₃) that, when applied to the current state M₁ under the rules of M₂, produces an outcome that closely matches M₄. This can be expressed as an optimization problem:
Minimize: ‖ T(M₁, M₂) – M₄ ‖
Over: M₃
where ‖·‖ denotes a suitable norm that measures the discrepancy between the transformed state and the desired state.
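To make the optimization concrete, here is a minimal brute-force sketch: a dictionary of candidate transforms stands in for M₃, a toy apply_constraints function plays the role of M₂, and the chosen transform is the one whose outcome most closely matches M₄, scored with cosine similarity as the “next best fit” measure (all names and state vectors are hypothetical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def apply_constraints(state: np.ndarray) -> np.ndarray:
    """Stand-in for M2: clamp the state to physically valid values."""
    return np.clip(state, 0.0, None)

def infer_transform(m1: np.ndarray, m4: np.ndarray, candidates: dict):
    """Search for the transform T (the role of M3) that, applied to M1 under M2,
    lands closest to the desired state M4."""
    best_name, best_score = None, -np.inf
    for name, transform in candidates.items():
        outcome = apply_constraints(transform(m1))
        score = cosine_similarity(outcome.ravel(), m4.ravel())
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy state vector: [water_in_vessel, water_on_ground, vessel_at_plant]
m1 = np.array([0.0, 1.0, 0.0])   # current: water spilled, vessel elsewhere
m4 = np.array([1.0, 0.0, 1.0])   # goal: water contained, vessel at the plant
candidates = {
    "scoop_and_carry": lambda s: np.array([1.0, 0.0, 1.0]),
    "do_nothing":      lambda s: s,
    "tip_vessel_over": lambda s: np.array([0.0, 1.0, 1.0]),
}
print(infer_transform(m1, m4, candidates))  # ('scoop_and_carry', ~1.0)
```

In a real system the candidate set would not be enumerated by hand; M₃ is discovered through interaction and experimentation, but the scoring step against M₄ remains the same.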
Integrating the Layers: From Perception to Action
Together, these four layers provide a complete pipeline:
• M₁ (Perception): Capture the raw features of the environment.
• M₂ (Constraints): Apply the rules that govern physical interactions.
• M₃ (Inference): Discover the sequence of transformations needed to bridge the current state and the goal.
• M₄ (Goal): Define the target state the system strives to achieve.
This layered mathematical representation allows an AI system to operate more efficiently. Instead of processing unfiltered, high-dimensional data, the system works with distilled, meaningful representations. The “unknown transform” in M₃ is iteratively refined through interaction, much like how humans learn through feedback from the physical world.
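Putting the four layers together, a highly simplified control loop might look like the following, where each function is a placeholder standing in for its layer and the states are toy vectors:

```python
import numpy as np

def perceive() -> np.ndarray:
    """M1: capture the raw features of the environment (here, a random toy state)."""
    return np.random.rand(4)

def constrain(state: np.ndarray) -> np.ndarray:
    """M2: apply the rules governing physical interactions (here, clamp to a valid range)."""
    return np.clip(state, 0.0, 1.0)

def infer_action(state: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """M3: infer a transformation that nudges the current state toward the goal."""
    return 0.5 * (goal - state)   # move halfway toward the goal each tick

def act(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Apply the inferred transformation to the environment."""
    return state + action

goal = np.array([1.0, 0.0, 1.0, 0.5])   # M4: the desired state
state = constrain(perceive())
for tick in range(20):                   # iterative refinement through feedback
    state = constrain(act(state, infer_action(state, goal)))
print(np.linalg.norm(state - goal) < 0.01)  # True: the loop converges on the goal
```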
A New Frontier
This vision challenges the conventional AI paradigm by urging us to shift our focus from language models, adept at handling symbols and text, to concept models that embody the distilled essence of our world. This approach has profound implications:
• Deeper Understanding: By internalizing the core principles of physical reality, AI can move beyond surface-level pattern matching to genuine, adaptable reasoning.
• Multimodal Integration: Concept models inherently embrace multiple modalities (visual, tactile, auditory), aligning more closely with the way humans and animals interact with the world.
• Embodied Learning: Training in simulated environments such as video games not only offers a practical testing ground but also mirrors the natural learning processes of exploration and experimentation.
Takeaway
While language models have undoubtedly propelled the field of AI forward, they remain a narrow slice of what true artificial reasoning might encompass.
By developing multidimensional concept models, systems that distill the complexities of the world into core, actionable principles, we open the door to a new era of AI, one that is capable of understanding, adapting, and reasoning in ways that are far more aligned with the realities of human experience.
The future of AI is not written in words, but in the rich interplay of concepts that define our world. It is time we moved beyond language to unlock the full potential of machine reasoning.
Empirical Evidence
Read about the experiments conducted on existing state-of-the-art LLMs that demonstrate the viability of concept models and their extreme computational efficiency in comparison.