Technical Architecture

How VoxelTox Works

We treat drug discovery as a microscopic perception and control system — fusing 1D, 2D, 3D, and 4D data the way autonomous vehicles fuse LiDAR, cameras, and radar for spatial understanding.

1D

Large Language Model

Semantic Reasoning Engine

Fine-tuned LLM serving as the semantic reasoning backbone — parsing biological macromolecular sequences, compound structures, and global biomedical literature to generate informed molecular hypotheses.

Parses FASTA protein sequences and SMILES chemical notation

Processes millions of PubMed research papers for contextual reasoning

Establishes logical-semantic associations between functional targets

Proposes candidate molecules based on learned biochemical principles

Provides the hypothesis generation layer in our bi-directional system
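
As an illustration of the first input step listed above, here is a minimal sketch of reading FASTA records and running a cheap sanity check on a SMILES string. It is not the VoxelTox parser: the function names and the demo record are hypothetical, and the SMILES check only verifies bracket pairing, not chemical validity.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so join them.
            records[header] += line
    return records


def smiles_brackets_balanced(smiles: str) -> bool:
    """Cheap check only: '(' / ')' and '[' / ']' must pair up correctly."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


# Hypothetical demo record, wrapped over two lines.
demo = [">demo_protein hypothetical", "MKTAYIAK", "QRQISFVK"]
records = parse_fasta(demo)
aspirin_ok = smiles_brackets_balanced("CC(=O)Oc1ccccc1C(=O)O")
```

A production system would more likely rely on an established toolkit for both formats; the point here is only that both inputs are plain 1D strings before any model sees them.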

2D

Vision Transformer

Structural Image Understanding

High-resolution Cryo-EM images processed by Vision Transformer architectures to extract spatial features that bridge the gap between 2D observations and 3D structural reality.

Processes raw Cryo-EM micrographs and 2D density maps

Hierarchical spatial feature extraction from electron microscopy data

Resolves near-atomic structural details invisible to sequence-only methods

Provides complementary 2D → 3D reconstruction priors for the World Model

Leverages the rapidly growing Cryo-EM data repositories (EMDB)
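
Vision Transformers start by cutting an image into fixed-size patches and treating each flattened patch as one token, which is then linearly projected. The sketch below shows only that patching step on a toy grid, in plain Python for readability; the function name and the 4x4 stand-in "micrograph" are assumptions for illustration, not VoxelTox code.

```python
from typing import List

Image = List[List[float]]


def image_to_patches(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Each tile is flattened row-major; tiles are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Toy 4x4 grid whose pixel value encodes its position (row * 4 + column).
toy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = image_to_patches(toy, patch=2)  # 4 tokens, each of length 4
```

Real micrographs are far larger and would be handled with an array library, but the token layout is the same idea.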

3D/4D

World Model

Spatial-Temporal Micro-Environment

Inspired by occupancy forecasting in autonomous driving: voxelizes protein binding pockets into regular 3D spatial grids and predicts molecular dynamics without brute-force quantum or force-field calculations.

Converts protein binding pockets into regular 3D voxel grids

Learns from large structural and sequence datasets (PDB, UniProt)

Predicts molecular spatial occupancy changes over the time dimension

Models conformational collapse and induced-fit dynamics

Validates LLM proposals via spatial collision & energy stability testing
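
The voxelization and collision-testing ideas above can be sketched as follows. This is a simplified illustration under stated assumptions: atoms are reduced to their centre coordinates, a voxel counts as occupied if any atom centre falls inside it, and a "collision" is any voxel shared by pocket and ligand. The function names, coordinates, and 1.0-unit voxel size are hypothetical, and energy stability is not modeled here.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(atoms: Iterable[Point], origin: Point, size: float) -> Set[Voxel]:
    """Return the grid cells (i, j, k) that contain at least one atom centre."""
    occupied: Set[Voxel] = set()
    for x, y, z in atoms:
        occupied.add((
            int((x - origin[0]) // size),
            int((y - origin[1]) // size),
            int((z - origin[2]) // size),
        ))
    return occupied


def collides(pocket: Set[Voxel], ligand: Set[Voxel]) -> bool:
    """True if the two occupancy sets share any voxel."""
    return not pocket.isdisjoint(ligand)


origin = (0.0, 0.0, 0.0)
pocket = voxelize([(0.5, 0.5, 0.5), (2.2, 0.1, 0.1)], origin, size=1.0)
ligand_free = voxelize([(1.5, 0.5, 0.5)], origin, size=1.0)   # sits in the gap
ligand_clash = voxelize([(2.9, 0.9, 0.2)], origin, size=1.0)  # overlaps a pocket atom
```

Storing occupancy as a set of indices keeps the example short; a time series of such sets, one per frame, is the simplest way to picture the "4D" occupancy changes described above.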

Unified

Sensor Fusion Layer

Cross-Dimensional Feature Alignment

Cross-attention mechanisms map multi-dimensional data — 1D sequences, 2D microscopy, 3D structures, and 4D dynamics — into a shared latent space, enabling full-spectrum bi-directional reasoning.

Cross-attention aligns representations across all four data dimensions

1D logic + 2D features + 3D geometry + 4D dynamics → unified embedding

Shared latent space enables holistic reasoning no single modality can achieve

LLM proposes → Vision verifies → World Model validates → loop refines

Targets simulation-checked prediction rather than purely probabilistic screening
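
To make the cross-attention step concrete, here is a minimal sketch of scaled dot-product cross-attention, where tokens from one modality (queries) pull information from another (keys and values). It omits the learned query, key, and value projections and multi-head splitting that a trained model would use, and the two-dimensional embeddings are hypothetical placeholders.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Each query row becomes a softmax-weighted average of the value rows."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return fused


sequence_tokens = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical 1D embeddings
voxel_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical 3D embeddings
fused = cross_attention(sequence_tokens, voxel_tokens, voxel_tokens)
```

The output has one row per sequence token, each now expressed as a mixture of voxel features, which is the sense in which the modalities end up in a shared space.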

Key Differentiators

Why this approach is fundamentally different

Bi-directional Self-Correction

Unlike linear pipelines that generate once and output, VoxelTox implements a closed loop between reasoning (LLM) and simulation (World Model), iteratively refining until convergence.

LLM(propose) → WorldModel(simulate) → feedback → LLM(refine) → ... → converged_lead

Result: Only molecules that survive physics-grounded simulation reach the pipeline. False positive rate drops by orders of magnitude.
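
The propose, simulate, refine loop above can be sketched as a generic control loop. Everything here is a toy stand-in under stated assumptions: a "candidate" is a single number, the hypothetical simulator returns a signed penalty that is zero at the ideal value 3.0, and the hypothetical proposer moves halfway toward that ideal each round. It shows the feedback structure only, not any real model.

```python
from typing import Callable, Optional, Tuple


def refine_until_converged(
    propose: Callable[[Optional[float], Optional[float]], float],
    simulate: Callable[[float], float],
    tolerance: float,
    max_rounds: int = 50,
) -> Tuple[float, int]:
    """Alternate proposal and simulation until |penalty| <= tolerance.

    propose(previous_candidate, previous_penalty) returns a new candidate;
    simulate(candidate) returns a penalty, where 0.0 means fully stable.
    """
    candidate: Optional[float] = None
    penalty: Optional[float] = None
    for round_index in range(1, max_rounds + 1):
        candidate = propose(candidate, penalty)
        penalty = simulate(candidate)
        if abs(penalty) <= tolerance:
            return candidate, round_index
    raise RuntimeError("no candidate converged within max_rounds")


def toy_simulate(candidate: float) -> float:
    """Hypothetical simulator: signed distance from the ideal value 3.0."""
    return candidate - 3.0


def toy_propose(previous: Optional[float], penalty: Optional[float]) -> float:
    """Hypothetical proposer: start at 0.0, then correct by half the penalty."""
    if previous is None or penalty is None:
        return 0.0
    return previous - 0.5 * penalty


lead, rounds = refine_until_converged(toy_propose, toy_simulate, tolerance=0.01)
```

The explicit `max_rounds` cap matters in practice: a loop that refines "until convergence" needs a defined exit for candidates that never stabilize.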

Learned Physics, Not Brute-Force

Instead of running quantum-mechanical or force-field calculations for every atom, we learn molecular behavior from millions of observed structural transitions. The model develops physical intuition: it predicts where atoms will be rather than tracing how they got there.

Traditional MD: Days–Weeks

AlphaFold: Minutes (static)

VoxelTox: Seconds (4D)

Architecturally Multi-Modal

Biological data spans vastly different dimensionalities. Most approaches collapse high-dimensional data to fit single-modality models, losing spatial information irreversibly. VoxelTox is natively multi-dimensional.

1D: Sequences

2D: Cryo-EM

3D: Voxels

4D: Dynamics