Generative AI Track

Goal
Master the foundations of generative AI, from text representation and transformers to diffusion models and speech technologies, and build the skills to work with LLMs, image generation systems, and audio pipelines.
Curriculum
Foundations of Natural Language Processing
1. Text Representation & Embeddings
Understand how text is converted to numerical representations, from classical methods to modern dense embeddings.
Key Topics:
- Text preprocessing: tokenization, normalization, subword tokenization (byte-pair encoding, BPE)
- Bag-of-Words, TF-IDF, and their limitations
- Word2Vec (CBOW, Skip-gram) and GloVe intuition
- Embedding geometry: similarity, analogies, bias
- Contextual embeddings concept (leading to transformers)
Action Items:
- Build BoW and TF-IDF vectors and compare feature distributions
- Train Word2Vec on a corpus, visualize with t-SNE
- Explore pre-trained embeddings with Gensim
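Before reaching for scikit-learn, it helps to see how little machinery TF-IDF actually needs. The sketch below is a minimal pure-Python implementation (using scikit-learn's smoothed IDF formula, log((1+n)/(1+df)) + 1, so the numbers are comparable); the toy corpus is illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute smoothed TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF (scikit-learn's default): log((1+n)/(1+df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vectors

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vecs = tfidf(docs)
# Terms appearing in fewer documents ("cats") get a higher IDF than
# terms appearing in many ("the", "sat", "on").
```

Comparing these weights against CountVectorizer/TfidfVectorizer output is a good way to confirm your understanding of the formula before the first action item.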
Stanford CS224N — NLP with Deep Learning (Lectures 1-2)
scikit-learn Text Feature Extraction Guide
Word2Vec Paper
Gensim Word Embeddings Tutorial
2. Attention & Transformers
Master the transformer architecture that powers all modern language and diffusion models.
Key Topics:
- Self-attention mechanism and intuition
- Query, Key, Value formulation
- Multi-head attention and why it matters
- Positional encoding strategies
- Transformer encoder/decoder architecture
- Layer normalization and residual connections
Action Items:
- Implement scaled dot-product attention from scratch in PyTorch
- Visualize attention weights using a pre-trained model (bert-base-uncased)
- Build a minimal self-attention layer as nn.Module
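The first action item can be prototyped in NumPy before moving to PyTorch; the mechanism is identical. A minimal sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, with illustrative shapes:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries, d_k = 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 4))   # 5 values, d_v = 4
out, w = scaled_dot_product_attention(Q, K, V)
# Each output row is a convex combination of the value rows;
# each row of w sums to 1.
```

The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot, vanishing-gradient territory.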
The Illustrated Transformer (Jay Alammar)
Visualizing Neural Machine Translation
Transformer Neural Networks Explained
Attention Is All You Need
Harvard Annotated Transformer
3. Large Language Models
Understand how LLMs work, from pretraining to inference, and learn to work with them effectively.
Key Topics:
- Language modeling: next-token prediction
- Causal language models (GPT-style) vs. masked language models (BERT-style)
- Scaling laws and emergent capabilities
- Tokenizers: BPE, WordPiece, SentencePiece
- Working with HuggingFace Transformers
- Prompt engineering fundamentals
- Fine-tuning basics: LoRA and PEFT concepts
Action Items:
- Follow Karpathy's nanoGPT to build a mini LLM
- Run inference with various HuggingFace models
- Design effective prompts for different tasks
- Fine-tune a small model with LoRA
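Next-token prediction at its simplest is a bigram count model: predict the most frequent successor of the current token. LLMs perform exactly this task, just with a neural network conditioning on long contexts instead of one token. A toy sketch (the corpus is made up for illustration):

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count bigram transitions: P(next | current) is proportional
    to count(current, next)."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return counts

def predict_next(counts, token):
    """Greedy decoding: pick the most frequent successor."""
    return counts[token].most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat ran".split()
model = train_bigram(tokens)
# "the" is followed by "cat" twice and "mat" once, so greedy
# decoding after "the" yields "cat".
```

Karpathy's "Let's Build GPT" starts from exactly this bigram baseline before adding attention, so this sketch doubles as preparation for the first action item.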
Let's Build GPT (Karpathy)
HuggingFace NLP Course (Ch 1-4)
LoRA Paper
Diffusion Models Foundations
4. Generative Model Foundations
Build intuition for generative modeling before diving into diffusion-specific concepts.
Key Topics:
- Generative vs discriminative models
- Latent space and representation learning
- Variational Autoencoders (VAE) intuition
- GANs overview: generator, discriminator, adversarial training
- Limitations of VAEs and GANs leading to diffusion
Action Items:
- Implement a simple VAE on MNIST
- Explore latent space interpolation
- Understand mode collapse in GANs
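The one VAE idea worth internalizing before the MNIST exercise is the reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, I), so the randomness is isolated in eps and gradients can flow through mu and sigma. A NumPy sketch (toy values, no network attached):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, I).
    The stochastic node eps is outside the computation graph, so mu
    and log_var remain differentiable in a real VAE."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(42)
mu = np.array([0.0, 2.0])
log_var = np.array([0.0, 0.0])   # sigma = 1 in both dimensions
z = np.stack([reparameterize(mu, log_var, rng) for _ in range(10_000)])
# Empirical mean of z approaches mu; empirical std approaches sigma.
```

Parameterizing log-variance rather than sigma directly is the standard choice because it keeps sigma positive without constraints.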
Generating Sound with Neural Networks (VAE Explained)
VAE Tutorial (UvA Deep Learning)
Generative Adversarial Networks (GANs)
5. Diffusion Model Theory
Understand the mathematical foundations of denoising diffusion probabilistic models.
Key Topics:
- Forward process: adding noise progressively
- Reverse process: learning to denoise
- Noise schedules and variance
- Training objective: simplified loss function
- Score matching and the connection to score-based models
- Sampling: DDPM vs DDIM
Action Items:
- Implement forward diffusion process
- Train a simple DDPM on a 2D dataset or MNIST
- Experiment with different noise schedules
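The first action item fits in a few lines thanks to the closed-form forward process: x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β_t). A sketch with the linear schedule from the DDPM paper (β from 1e-4 to 0.02 over 1000 steps) on toy 2-D data:

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule, as in the original DDPM setup."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)   # monotonically decreasing toward 0

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 2))         # toy 2-D dataset
x_mid = forward_diffuse(x0, 500, alpha_bar, rng)
x_end = forward_diffuse(x0, T - 1, alpha_bar, rng)
# By t = T-1, alpha_bar is near zero, so x_end is almost pure noise.
```

Swapping in a cosine schedule here and plotting ᾱ_t is a quick way to do the third action item.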
How AI Images Actually Work
HuggingFace Diffusion Course (Unit 1-2)
A Practical Introduction to Diffusion Models
6. Modern Diffusion Architectures
Learn the architectures powering Stable Diffusion, DALL-E, and other state-of-the-art systems.
Key Topics:
- Stable Diffusion architecture
- U-Net for denoising
- Text conditioning with CLIP
- ControlNet for image conditioning
- Practical generation techniques
Action Items:
- Run Stable Diffusion with different prompts and settings
- Experiment with guidance scale effects
- Use ControlNet for structured generation
- Fine-tune with LoRA on custom concepts
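The "guidance scale" setting in the second action item is classifier-free guidance: at each denoising step, the model predicts noise twice (with and without the text conditioning) and the two predictions are combined. A minimal sketch of just that combination step, with random arrays standing in for real noise predictions:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one. scale = 1 recovers
    the conditional prediction; larger values follow the prompt more
    strongly at the cost of sample diversity."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.normal(size=(4, 4))   # stand-in: unconditional noise prediction
eps_c = rng.normal(size=(4, 4))   # stand-in: text-conditioned prediction
guided = cfg(eps_u, eps_c, 7.5)   # 7.5 is a common default in SD pipelines
```

This is why raising the guidance scale makes images match the prompt more literally but look less varied: the update over-shoots past the conditional prediction along the prompt direction.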
The U-Net Explained
The Illustrated Stable Diffusion
Stable Diffusion and ControlNet
ControlNet Tutorial (HuggingFace)
LoRA Training for Diffusion
Speech Technologies
7. Audio Fundamentals
Build essential knowledge of audio signal processing for speech applications.
Key Topics:
- Digital audio: sampling rate, bit depth
- Time vs frequency domain representations
- Spectrograms and Short-Time Fourier Transform
- Mel scale and mel spectrograms
- MFCCs for speech feature extraction
- Audio preprocessing: normalization, voice activity detection (VAD)
Action Items:
- Load and visualize audio with librosa
- Generate spectrograms and mel spectrograms
- Extract MFCCs from speech samples
- Build a basic preprocessing pipeline
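librosa wraps all of this, but the Short-Time Fourier Transform itself is worth building once by hand: slice the signal into overlapping frames, window each frame, FFT it. A NumPy-only sketch, verified on a synthetic 440 Hz tone (frame length and hop are illustrative choices):

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: overlapping frames, Hann window, FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies: frame_len//2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=-1))

sr = 8000
t = np.arange(sr) / sr                 # 1 second of audio
tone = np.sin(2 * np.pi * 440 * t)     # pure 440 Hz sine
spec = stft_magnitude(tone)
peak_bin = spec.mean(axis=0).argmax()
peak_hz = peak_bin * sr / 256          # bin index -> frequency in Hz
```

The frequency resolution here is sr/frame_len = 31.25 Hz per bin, which is exactly the time/frequency trade-off the Key Topics list alludes to: longer frames sharpen frequency resolution and blur timing. Mapping these linear-frequency bins through triangular mel filters gives the mel spectrogram.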
HuggingFace Audio Course
Audio Signal Processing for Machine Learning
8. Speech-to-Text (ASR)
Learn automatic speech recognition from fundamentals to modern end-to-end models.
Key Topics:
- ASR pipeline overview
- CTC loss and alignment-free training
- End-to-end ASR architectures
- Whisper: architecture, capabilities, multilingual support
- Streaming vs offline recognition
- Evaluation metrics: WER, CER
- Domain adaptation and fine-tuning
Action Items:
- Set up Whisper transcription pipeline
- Compare model sizes on accuracy/speed
- Test on accented and noisy audio
- Build a real-time transcription demo
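WER, the headline metric above, is just word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length; note it can exceed 1.0 when the hypothesis is much longer than the reference. A small sketch for comparing transcripts from the action items:

```python
def wer(reference, hypothesis):
    """Word Error Rate via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)
```

CER is the same computation over characters instead of words; it is the more informative metric for languages without whitespace word boundaries.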
HuggingFace ASR Tutorial (Unit 5)
End-to-End ASR System
Whisper Paper Explanation
9. Text-to-Speech (TTS)
Explore speech synthesis from text normalization to neural vocoders.
Key Topics:
- TTS pipeline: text → acoustic features → waveform
- Text normalization and grapheme-to-phoneme (G2P) conversion
- Acoustic models: Tacotron, FastSpeech
- Neural vocoders: WaveNet, HiFi-GAN
- Modern TTS: VITS, Bark, Coqui TTS
- Prosody control and expressive speech
- Voice cloning with minimal samples
Action Items:
- Run inference with Coqui TTS / Bark
- Compare quality across TTS models
- Experiment with prosody and emotion
- Try zero-shot voice cloning
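The first stage of the pipeline, text normalization, is easy to underestimate: "42" and "Dr." must become speakable words before any acoustic model sees them. A toy rule-based sketch (real TTS front ends use far larger rule sets or learned models; the abbreviation table here is illustrative):

```python
# Toy TTS text normalization: expand digits and a couple of
# abbreviations into speakable words.
_ONES = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]
_ABBREV = {"dr.": "doctor", "st.": "street"}

def normalize(text):
    words = []
    for token in text.lower().split():
        if token in _ABBREV:
            words.append(_ABBREV[token])
        elif token.isdigit():
            # Read numbers digit by digit; a real system would also
            # handle cardinals, ordinals, dates, currency, etc.
            words.extend(_ONES[int(d)] for d in token)
        else:
            words.append(token)
    return " ".join(words)
```

Note the ambiguity even in this tiny example: "St." is "street" after a name but "saint" before one, and "42" might be "forty-two" rather than "four two". This context dependence is why production systems treat normalization as a sequence task rather than a lookup table.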
HuggingFace TTS Tutorial (Unit 6)
TTS Course
Capstone Project
Build an automated video translation pipeline that combines ASR, LLM translation, and TTS to dub English videos into Russian with synchronized audio.
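A sketch of the overall pipeline shape. The three stage functions are stubs standing in for real models (e.g. Whisper for ASR, an LLM for translation, a neural TTS for synthesis), and the file name is hypothetical; the point is the data contract: timestamped segments flow through every stage so the dubbed audio can be aligned back to the video.

```python
def transcribe(audio_path):
    """ASR stage (stub): return timestamped text segments."""
    return [{"start": 0.0, "end": 2.0, "text": "Hello, world."}]

def translate(segments, target_lang="ru"):
    """Translation stage (stub): rewrite each segment's text,
    preserving the timestamps."""
    return [{**s, "text": f"[{target_lang}] {s['text']}"} for s in segments]

def synthesize(segments):
    """TTS stage (stub): one audio clip per segment, tagged with the
    original timing for alignment."""
    return [(s["start"], s["end"], f"<audio for: {s['text']}>")
            for s in segments]

def dub(audio_path):
    """ASR -> translation -> TTS."""
    return synthesize(translate(transcribe(audio_path)))

clips = dub("talk.mp4")   # hypothetical input file
```

The hard part the stubs hide is timing: Russian translations are often longer than the English source, so the real pipeline needs time-stretching or translation-length control to keep the dub synchronized.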