Technical Architecture
How VoxelTox Works
We treat drug discovery as a microscopic perception and control system — fusing 1D, 2D, 3D, and 4D data the way autonomous vehicles fuse LiDAR, cameras, and radar for spatial understanding.
Large Language Model
Semantic Reasoning Engine
Fine-tuned LLM serving as the semantic reasoning backbone — parsing biological macromolecular sequences, compound structures, and global biomedical literature to generate informed molecular hypotheses.
Parses FASTA protein sequences and SMILES chemical notation
Processes millions of PubMed research papers for contextual reasoning
Establishes logical-semantic associations between functional targets
Proposes candidate molecules based on learned biochemical principles
Provides the hypothesis generation layer in our bi-directional system
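As a rough illustration of the first bullet above, the sketch below reads FASTA records and splits a SMILES string into tokens that a language model could consume. It is a minimal sketch, not VoxelTox's actual parser: the function names, the demo record, and the simplified token pattern are all assumptions made for this example, and a production pipeline would more likely rely on a cheminformatics toolkit for SMILES handling.

```python
import re
from typing import Dict, Iterable, List

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            # Sequence lines may wrap, so concatenate them.
            records[current] += line
    return records

# Simplified token pattern (an assumption for this sketch): bracket atoms,
# the two-letter halogens, two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens without validating its chemistry."""
    return SMILES_TOKEN.findall(smiles)

# Hypothetical demo inputs.
fasta_text = ">demo_protein hypothetical example\nMKTAYIAKQR\nQISFVKSHFS\n"
records = parse_fasta(fasta_text.splitlines())
aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
halogen_tokens = tokenize_smiles("ClCCBr")
```

The tokenizer only segments the string; it does not check valence, aromaticity, or ring closure consistency.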
Vision Transformer
Structural Image Understanding
High-resolution Cryo-EM images processed by Vision Transformer architectures to extract spatial features that bridge the gap between 2D observations and 3D structural reality.
Processes raw Cryo-EM micrographs and 2D density maps
Hierarchical spatial feature extraction from electron microscopy data
Resolves near-atomic structural details invisible to sequence-only methods
Provides complementary 2D → 3D reconstruction priors for the World Model
Leverages the rapidly growing Cryo-EM data repositories (EMDB)
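The first step of any Vision Transformer is to cut an image into fixed-size patches and project each patch to an embedding vector. The sketch below shows that step with NumPy on a synthetic array standing in for a micrograph. The image size, patch size, embedding width, and the random projection matrix (a placeholder for learned weights) are assumptions for illustration only.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch * patch), ordered row by row.
    """
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch)
    grid = grid.transpose(0, 2, 1, 3)  # (patch_rows, patch_cols, patch, patch)
    return grid.reshape(-1, patch * patch)

rng = np.random.default_rng(0)

# Synthetic stand-in for a micrograph; real data would come from a file reader.
micrograph = rng.normal(size=(64, 64)).astype(np.float32)

tokens = patchify(micrograph, patch=16)  # 16 patches of 256 pixels each

# Random placeholder for the learned linear patch projection.
projection = (rng.normal(size=(256, 32)) * 0.02).astype(np.float32)
embeddings = tokens @ projection  # one 32-dimensional vector per patch
```

In a trained model the projection is learned and position embeddings are added before the transformer layers; both are omitted here.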
World Model
Spatial-Temporal Micro-Environment
Inspired by autonomous driving occupancy forecasting — voxelizes protein binding pockets into continuous spatial grids and predicts molecular dynamics without brute-force quantum calculations.
Converts protein binding pockets into continuous spatial voxel grids
Learns from massive structural evolution datasets (PDB, UniProt dynamics)
Predicts molecular spatial occupancy changes over the time dimension
Models conformational collapse and induced-fit dynamics
Validates LLM proposals via spatial collision & energy stability testing
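To make the voxelization and collision-testing bullets above more concrete, here is a minimal sketch that bins atom coordinates into a boolean occupancy grid and flags a clash when pocket and ligand occupy the same voxel. It is a deliberate simplification and not the World Model itself: it uses atom centers only, ignores atomic radii and energies, and has no time dimension. The function names, grid size, and coordinates are hypothetical.

```python
import numpy as np

def voxelize(coords, origin, voxel_size, grid_shape):
    """Mark every voxel that contains at least one atom center.

    coords: (N, 3) positions in angstroms; atoms outside the grid are dropped.
    """
    grid = np.zeros(grid_shape, dtype=bool)
    offsets = np.asarray(coords, dtype=float) - np.asarray(origin, dtype=float)
    idx = np.floor(offsets / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[inside]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def collides(pocket_grid, ligand_grid):
    """Return True if any voxel is occupied by both pocket and ligand."""
    return bool(np.any(pocket_grid & ligand_grid))

# Hypothetical coordinates on an 8 x 8 x 8 grid with 1 angstrom voxels.
shape, origin, size = (8, 8, 8), (0.0, 0.0, 0.0), 1.0
pocket = voxelize([[0.5, 0.5, 0.5], [3.2, 1.1, 0.4]], origin, size, shape)
clashing_ligand = voxelize([[3.9, 1.8, 0.1]], origin, size, shape)
free_ligand = voxelize([[5.5, 5.5, 5.5]], origin, size, shape)

clash = collides(pocket, clashing_ligand)
no_clash = collides(pocket, free_ligand)
```

A time series of such grids, one per frame, would give the 4D occupancy sequence that a forecasting model could be trained on.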
Sensor Fusion Layer
Cross-Dimensional Feature Alignment
Cross-attention mechanisms map multi-dimensional data — 1D sequences, 2D microscopy, 3D structures, and 4D dynamics — into a shared latent space, enabling full-spectrum bi-directional reasoning.
Cross-attention aligns representations across all four data dimensions
1D logic + 2D features + 3D geometry + 4D dynamics → unified embedding
Shared latent space enables holistic reasoning no single modality can achieve
LLM proposes → Vision verifies → World Model validates → loop refines
Achieves deterministic prediction rather than probabilistic screening
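The cross-attention step named above can be sketched with plain NumPy. Here, tokens from one modality (for example sequence embeddings) act as queries over tokens from another (for example image patch embeddings), and both are projected into a shared width before attention is computed. All dimensions and the random weight matrices are assumptions for illustration; this is single-head scaled dot-product cross-attention, not a description of the production fusion layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (Nq, Dq) tokens from one modality.
    context: (Nc, Dc) tokens from another modality.
    Returns the fused tokens (Nq, d) and the attention weights (Nq, Nc).
    """
    q = queries @ w_q  # project queries into the shared width d
    k = context @ w_k  # project context keys into the same width
    v = context @ w_v  # project context values into the same width
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_tokens = rng.normal(size=(10, 24))    # hypothetical 1D sequence embeddings
patch_tokens = rng.normal(size=(16, 32))  # hypothetical 2D patch embeddings

d = 8  # shared latent width (assumed)
w_q = rng.normal(size=(24, d))
w_k = rng.normal(size=(32, d))
w_v = rng.normal(size=(32, d))

fused, attn = cross_attention(seq_tokens, patch_tokens, w_q, w_k, w_v)
```

Each row of `attn` shows how strongly one sequence token attends to each image patch, which is the mechanism that lets features from different dimensionalities share one embedding.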
Key Differentiators
Why this approach is fundamentally different
Bi-directional Self-Correction
Unlike linear pipelines that generate once and output, VoxelTox implements a closed loop between reasoning (LLM) and simulation (World Model), iteratively refining until convergence.
Result: Only molecules that survive physics-grounded simulation reach the pipeline. The false-positive rate drops by orders of magnitude.
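The control flow of such a propose-and-validate loop can be sketched in a few lines. In the example below, `propose` and `validate` are hypothetical callables standing in for the LLM and the World Model, and the toy versions operate on a single number so the loop can run on its own. The feedback format and the stopping rule are assumptions, not the actual VoxelTox interface.

```python
from typing import Callable, List, Optional, Tuple

def refine(
    propose: Callable[[Optional[float]], float],
    validate: Callable[[float], Tuple[bool, Optional[float]]],
    max_rounds: int = 10,
) -> Tuple[Optional[float], List[float]]:
    """Alternate proposal and validation until a candidate passes.

    Returns the accepted candidate (or None if the round budget runs out)
    together with every candidate that was tried.
    """
    feedback: Optional[float] = None
    history: List[float] = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        history.append(candidate)
        ok, feedback = validate(candidate)
        if ok:
            return candidate, history
    return None, history

def toy_propose(feedback: Optional[float]) -> float:
    """Toy proposer: start from a fixed guess, then follow the feedback."""
    return 5.0 if feedback is None else feedback

def toy_validate(radius: float) -> Tuple[bool, Optional[float]]:
    """Toy validator: accept radii that fit a limit, else suggest half."""
    limit = 2.0
    if radius <= limit:
        return True, None
    return False, radius * 0.5

accepted, tried = refine(toy_propose, toy_validate)
rejected, short_history = refine(toy_propose, toy_validate, max_rounds=2)
```

Bounding the number of rounds keeps the loop from running indefinitely when no candidate satisfies the validator.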
Learned Physics, Not Brute-Force
Instead of solving quantum mechanical equations for every atom, we learn molecular behavior from millions of observed structural transitions. The model develops physical intuition — predicting where atoms will be, not tracing how they got there.
Traditional MD: Days–Weeks
AlphaFold: Minutes (static)
VoxelTox: Seconds (4D)
Architecturally Multi-Modal
Biological data spans vastly different dimensionalities. Most approaches collapse high-dimensional data to fit single-modality models, losing spatial information irreversibly. VoxelTox is natively multi-dimensional.
1D: Sequences
2D: Cryo-EM
3D: Voxels
4D: Dynamics