AI Model Types
A visual guide to the foundational architectures powering modern AI — from recurrent networks to mixture of experts.
Recurrent Neural Network
RNN: Memory through time
RNNs process sequential data by maintaining a hidden state that carries information from previous time steps. At each step, the network reads an input and updates its hidden state, allowing it to model temporal dependencies. Variants like LSTMs and GRUs add gating mechanisms to better capture long-range patterns.
Key Ideas
- Hidden state acts as a memory of past inputs
- Processes sequences one element at a time (left to right)
- LSTMs/GRUs mitigate the vanishing gradient problem with gates
- Largely superseded by transformers for most NLP tasks
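The recurrence itself fits in a few lines of plain Python. The weight matrices below are arbitrary stand-ins for learned parameters — a minimal sketch of the forward pass, not a trainable implementation:

```python
import math

def rnn_step(h, x, W_h, W_x, b):
    """One recurrent step: h' = tanh(W_h @ h + W_x @ x + b)."""
    return [math.tanh(sum(W_h[i][j] * h[j] for j in range(len(h)))
                      + sum(W_x[i][k] * x[k] for k in range(len(x)))
                      + b[i])
            for i in range(len(h))]

# Arbitrary stand-in weights; a real RNN learns these by backpropagation.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [0.3]]
b = [0.0, 0.0]

# The hidden state h is carried from step to step — the network's "memory".
h = [0.0, 0.0]
for x in [[1.0], [0.5], [-0.2]]:
    h = rnn_step(h, x, W_h, W_x, b)
```

Note the sequential bottleneck: each step needs the previous hidden state, so the loop cannot be parallelized across time.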
Convolutional Neural Network
CNN: Patterns in space
CNNs apply learnable filters (kernels) that slide across input data to detect local patterns — edges, textures, shapes — and hierarchically compose them into higher-level features. Pooling layers downsample spatial dimensions, making the network robust to small translations of the input. The architecture excels at tasks where local spatial structure matters.
Key Ideas
- Convolutional filters learn local feature detectors
- Weight sharing makes the network efficient and translation-equivariant
- Pooling progressively reduces spatial resolution
- Stacking layers builds a hierarchy: edges → parts → objects
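A single convolution is just a small kernel slid across the input. The sketch below uses a hand-picked vertical-edge kernel on a tiny binary image (deep-learning "convolution" is technically cross-correlation, as here):

```python
def conv2d(image, kernel):
    """Valid cross-correlation: slide the kernel over every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A hand-crafted vertical-edge detector; a CNN learns such kernels from data.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
response = conv2d(image, kernel)  # strongest where the 0-region meets the 1-region
```

The same kernel weights are reused at every spatial position — that is the weight sharing the key ideas refer to.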
Transformer
Transformer: Attention is all you need
Transformers process entire sequences in parallel using self-attention, where every token attends to every other token to compute contextual representations. This replaces recurrence with a global view of the input. The architecture consists of stacked layers of multi-head self-attention followed by position-wise feed-forward networks, with residual connections and layer normalization.
Key Ideas
- Self-attention computes pairwise relationships between all tokens
- Multi-head attention captures different relationship types
- Positional encodings inject sequence order information
- Parallelizable — no sequential bottleneck like RNNs
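Scaled dot-product attention can be sketched directly. Here Q, K, and V are all the raw embeddings (a real transformer applies learned projection matrices first, and runs several heads in parallel):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Score this token against every token, including itself.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # attention distribution over all tokens
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three token embeddings attending to each other (no learned projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(X, X, X)
```

Every token's output is a weighted average of all value vectors, and each row's loop is independent of the others — which is why the whole computation parallelizes.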
Large Language Model
LLM: Scale unlocks emergence
LLMs are transformer-based models trained on massive text corpora to predict the next token. At sufficient scale (billions of parameters), they exhibit emergent capabilities like in-context learning, chain-of-thought reasoning, and instruction following. They are typically pre-trained with self-supervised learning and then fine-tuned or aligned with RLHF.
Key Ideas
- Autoregressive: predict the next token given all previous tokens
- Pre-trained on internet-scale text data
- Emergent abilities appear at scale (reasoning, code generation)
- Aligned via RLHF / constitutional AI for safety and helpfulness
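The autoregressive loop is easy to see with a deliberately tiny stand-in: a bigram count table over a toy corpus instead of a billion-parameter transformer. The generation loop — predict, append, feed back — has the same shape in a real LLM:

```python
from collections import defaultdict

# Toy "language model": count which token follows which in a tiny corpus.
corpus = "the cat sat on the mat the cat ran".split()
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def next_token(tok):
    """Greedy decoding: pick the most frequent successor of tok."""
    return max(counts[tok], key=counts[tok].get)

# Autoregressive generation: each prediction is fed back as the next input.
tok, out = "the", ["the"]
for _ in range(4):
    tok = next_token(tok)
    out.append(tok)
```

A real LLM conditions on the *entire* prefix (not just the last token) and samples from a learned distribution rather than taking a greedy count-based maximum.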
Vision-Language Model
VLM: Seeing and reading together
VLMs combine a vision encoder (often a ViT or CNN) with a language model, enabling the system to understand and reason about images alongside text. The two modalities are fused through cross-attention, projection layers, or shared embedding spaces. This allows tasks like image captioning, visual question answering, and multimodal reasoning.
Key Ideas
- Separate encoders for vision and language, fused through cross-modal attention
- Contrastive pre-training aligns image and text representations (CLIP-style)
- Can follow instructions grounded in visual context
- Foundation for multimodal assistants and agents
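CLIP-style retrieval reduces to cosine similarity in the shared embedding space. The vectors below are made-up stand-ins for what the vision and text encoders would produce:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings already projected into a shared space (CLIP-style);
# real encoders would be a ViT for the image and a text transformer.
image_emb = [0.9, 0.1, 0.2]
captions = {
    "a photo of a dog": [0.88, 0.15, 0.25],
    "a photo of a car": [0.10, 0.90, 0.30],
}
best = max(captions, key=lambda c: cosine(image_emb, captions[c]))
```

Contrastive pre-training is what makes this work: matching image–text pairs are pushed close together and mismatched pairs apart, so nearest-neighbor search in the shared space becomes zero-shot classification.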
Vision-Language-Action Model
VLA: See, understand, act
VLAs extend vision-language models into the physical world by adding an action output head. They take visual observations and language instructions as input and directly output low-level robot actions (joint angles, end-effector positions). This enables robots to follow natural language commands grounded in what they see, bridging perception, language understanding, and motor control.
Key Ideas
- End-to-end: perception + language → robot actions
- Trained on robot demonstration datasets with language annotations
- Generalize across tasks via language conditioning
- Key step toward general-purpose robotic manipulation
World Model
World Model: Imagine before you act
World models learn an internal representation of the environment's dynamics — they can predict what will happen next given a state and action, without actually taking the action. This lets agents plan by 'imagining' future trajectories in latent space. They are central to model-based reinforcement learning and emerging approaches to video generation and physical reasoning.
Key Ideas
- Learn a predictive model of environment dynamics
- Enable planning by simulating future states internally
- Reduce sample complexity by learning in imagination
- The idea that intelligent agents (and brains) build internal world models motivates architectures like Yann LeCun's JEPA
Mixture of Experts
MoE: Divide and conquer
MoE architectures contain many parallel 'expert' sub-networks and a learned router that directs each input token to only a few experts. This achieves massive parameter counts while keeping compute cost manageable — the model is sparse at inference time. MoE layers typically replace the feed-forward blocks in a transformer, enabling models to scale to trillions of parameters without proportional compute increase.
Key Ideas
- Router network selects top-k experts per token (sparse activation)
- Total parameters scale independently from per-token compute
- Load balancing loss prevents expert collapse
- Enables much larger models at similar inference cost
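Top-k routing can be sketched with toy "experts" (plain functions standing in for feed-forward blocks) and made-up router logits:

```python
import math

def moe_layer(x, experts, router_logits, k=2):
    """Sparse MoE: run only the top-k experts, gate outputs by softmax weight."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    gates = [e / sum(exps) for e in exps]  # renormalized over selected experts
    # Only the selected experts are evaluated — the rest cost nothing.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Four toy "experts"; the router's logits pick just two of them per input.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, 0.5, 1.5]  # hypothetical router output for one token
y = moe_layer(3.0, experts, logits)
```

The model "contains" four experts' worth of parameters but pays the compute cost of only two per token — the sparsity that lets MoE models scale total parameters without scaling per-token FLOPs.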
Generative Adversarial Network
GAN: Forger vs. detective
GANs consist of two networks trained in opposition: a generator that creates synthetic data and a discriminator that tries to tell real from fake. Through this adversarial game, the generator learns to produce increasingly realistic outputs. GANs pioneered high-fidelity image synthesis and remain important for tasks requiring sharp, realistic generation.
Key Ideas
- Min-max game between generator and discriminator
- Generator learns the data distribution without explicit density modeling
- Training can be unstable — mode collapse is a key challenge
- Largely overtaken by diffusion models for image generation
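The min-max objective can be evaluated directly on a toy 1-D setup. The networks here are hand-picked functions rather than trained models, just to show that a better generator drives the value (which the generator minimizes) down:

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def gan_value(D, G, reals, zs):
    """Minimax objective: E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in reals) / len(reals)
    fake_term = sum(math.log(1 - D(G(z))) for z in zs) / len(zs)
    return real_term + fake_term

# Toy 1-D setup: real data clusters near 3; noise z is near 0.
D = lambda x: sigmoid(x - 1.5)   # this discriminator rates larger x as "more real"
G_bad = lambda z: z              # untrained generator: outputs stay near 0
G_good = lambda z: z + 3         # "trained" generator: outputs land near the data
reals, zs = [2.8, 3.1, 3.0], [0.1, -0.2, 0.0]
```

In training, the discriminator takes gradient steps to *increase* this value while the generator takes steps to *decrease* it — the adversarial game the section describes.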
Diffusion Model
Diffusion: From noise to signal
Diffusion models learn to generate data by reversing a gradual noising process. During training, noise is incrementally added to data; the model learns to predict and remove that noise at each step. At generation time, it starts from pure noise and iteratively denoises to produce a clean sample. This approach yields state-of-the-art image, video, and audio generation.
Key Ideas
- Forward process gradually corrupts data with Gaussian noise
- Model learns the reverse (denoising) process
- Guided generation via classifier-free guidance and conditioning
- Can be accelerated with fewer steps (DDIM, consistency models)
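The forward (noising) process has a convenient closed form: any timestep can be sampled in one shot. The sketch below uses a DDPM-style linear beta schedule and shows how the signal coefficient √ᾱ_t decays toward zero:

```python
import math

# Linear noise schedule, as in the original DDPM setup.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bar, p = [], 1.0
for b in betas:
    p *= 1 - b
    alpha_bar.append(p)  # ᾱ_t: how much signal survives after t steps

def q_sample(x0, t, eps):
    """Forward process in closed form: x_t = √ᾱ_t·x0 + √(1-ᾱ_t)·ε."""
    return math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1 - alpha_bar[t]) * eps

# Early on the sample is mostly signal; by t = T-1 it is almost pure noise.
signal_early = math.sqrt(alpha_bar[0])
signal_late = math.sqrt(alpha_bar[T - 1])
```

The learned model runs this in reverse: given a noisy x_t, it predicts the noise ε so the sample can be stepped back toward t = 0.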
Autoencoder / VAE
Autoencoder: Compress, then create
Autoencoders learn to compress data into a compact latent representation and then reconstruct it. The encoder maps input to a low-dimensional bottleneck; the decoder maps it back. Variational Autoencoders (VAEs) add a probabilistic twist — the latent space is regularized to be a smooth distribution, enabling generation of new samples by sampling from it.
Key Ideas
- Encoder compresses input; decoder reconstructs it
- Bottleneck forces the model to learn meaningful representations
- VAEs regularize latent space for smooth interpolation and sampling
- Used as components in larger systems (latent diffusion, VQ-VAE)
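The VAE's "probabilistic twist" is the reparameterization trick: instead of sampling z directly (which blocks gradients), the encoder outputs a mean and log-variance and the noise is injected separately. The encoder and decoder below are trivial stand-ins for learned networks:

```python
import math
import random

random.seed(0)

def encode(x):
    """Stand-in for a learned encoder: returns latent mean and log-variance."""
    return 0.5 * x, -1.0

def reparameterize(mu, log_var):
    """Reparameterization trick: z = μ + σ·ε keeps sampling differentiable."""
    return mu + math.exp(0.5 * log_var) * random.gauss(0, 1)

def decode(z):
    """Stand-in for a learned decoder."""
    return 2.0 * z

mu, log_var = encode(1.0)
z = reparameterize(mu, log_var)   # a fresh sample from the latent distribution
x_hat = decode(z)
```

Because ε is drawn outside the network, gradients flow through μ and σ during training; a KL term in the loss keeps the latent distribution close to a standard Gaussian, which is what makes sampling new data from the prior work.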
State Space Model
SSM: Sequences without attention
State Space Models map input sequences to output sequences through a continuous latent state, inspired by classical control theory. Unlike transformers, they process sequences with linear recurrence that can be parallelized during training via convolution. Models like Mamba add input-dependent (selective) state transitions, achieving competitive performance with transformers at lower computational cost for long sequences.
Key Ideas
- Continuous-time state transition: h'(t) = Ah(t) + Bx(t)
- Can be computed as recurrence (inference) or convolution (training)
- Linear scaling with sequence length vs. quadratic for attention
- Selective state spaces (Mamba) add input-dependent dynamics
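The recurrence/convolution duality is easiest to see in a discretized scalar SSM, h_t = a·h_{t-1} + b·x_t with output y_t = c·h_t (the a, b, c below are arbitrary stand-ins for the discretized A, B, C matrices):

```python
a, b, c = 0.9, 1.0, 1.0  # stand-in scalars for the discretized A, B, C

def ssm_recurrence(xs):
    """Sequential form (inference): one state update per step."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def ssm_convolution(xs):
    """Equivalent convolutional form (training): kernel k_t = c·aᵗ·b."""
    kernel = [c * (a ** t) * b for t in range(len(xs))]
    return [sum(kernel[j] * xs[i - j] for j in range(i + 1))
            for i in range(len(xs))]

xs = [1.0, 0.0, 0.5, -1.0]
ys_r = ssm_recurrence(xs)
ys_c = ssm_convolution(xs)   # identical outputs, computed without recurrence
```

Because the recurrence is linear and time-invariant, unrolling it yields a fixed convolution kernel — so training can be parallelized as a convolution while inference uses the cheap O(1)-per-step recurrence. Mamba's selective variant makes a and b depend on the input, which breaks this fixed-kernel trick and requires a different parallel scan.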