
Generative AI Track

Goal

Focus: Master the foundations of generative AI—from text representation and transformers to diffusion models and speech technologies. Build the skills to work with LLMs, image generation systems, and audio pipelines.

Curriculum

Foundations of Natural Language Processing

1. Text Representation & Embeddings

Understand how text is converted to numerical representations, from classical methods to modern dense embeddings.

Key Topics:

  • Text preprocessing: tokenization, normalization, subword tokenization (BPE)
  • Bag-of-Words, TF-IDF, and their limitations
  • Word2Vec (CBOW, Skip-gram) and GloVe intuition
  • Embedding geometry: similarity, analogies, bias
  • Contextual embeddings concept (leading to transformers)

Action Items:

  • Build BoW and TF-IDF vectors and compare feature distributions
  • Train Word2Vec on a corpus, visualize with t-SNE
  • Explore pre-trained embeddings with Gensim
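The BoW/TF-IDF comparison above can be sketched from scratch. This is a minimal plain-Python illustration of the TF-IDF weighting idea (with a smoothed IDF term); in practice you would use scikit-learn's `TfidfVectorizer` as the linked guide describes:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute smoothed TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({
            # term frequency * smoothed inverse document frequency
            term: (count / len(doc)) * math.log((1 + n_docs) / (1 + df[term]))
            for term, count in tf.items()
        })
    return weights

docs = ["the cat sat", "the dog sat", "quantum entanglement"]
w = tf_idf(docs)
# "the" occurs in two of three documents, so its weight is pushed down
# relative to rarer terms like "cat" or "quantum".
```

Comparing `w[0]["the"]` with `w[0]["cat"]` shows the limitation TF-IDF fixes: raw counts treat common and discriminative words identically.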
Resources:

  • [course · beginner · 3-4 hours] Stanford CS224N — NLP with Deep Learning (Lectures 1-2)
  • [tutorial · beginner · 2-3 hours] scikit-learn Text Feature Extraction Guide
  • [paper · intermediate · 2-3 hours] Word2Vec Paper
  • [tutorial · beginner · 2-3 hours] Gensim Word Embeddings Tutorial

2. Attention & Transformers

Master the transformer architecture that underpins modern language models and text-conditioned diffusion systems.

Key Topics:

  • Self-attention mechanism and intuition
  • Query, Key, Value formulation
  • Multi-head attention and why it matters
  • Positional encoding strategies
  • Transformer encoder/decoder architecture
  • Layer normalization and residual connections

Action Items:

  • Implement scaled dot-product attention from scratch in PyTorch
  • Visualize attention weights using a pre-trained model (bert-base-uncased)
  • Build a minimal self-attention layer as nn.Module
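The first action item can be sketched directly from the Q/K/V formulation. The action items call for PyTorch; this NumPy version shows the same math (Attention(Q, K, V) = softmax(QKᵀ/√d_k)V) with nothing hidden behind a framework:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (queries, keys)
    weights = softmax(scores, axis=-1)              # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, attn = scaled_dot_product_attention(Q, K, V)
```

The √d_k scaling keeps the dot products from saturating the softmax as dimensionality grows; multi-head attention runs several of these in parallel over projected Q/K/V.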
Resources:

  • [blog · intermediate · 2-3 hours] The Illustrated Transformer (Jay Alammar)
  • [blog · intermediate · 1-2 hours] Visualizing Neural Machine Translation
  • [course · intermediate · 1-2 hours] Transformer Neural Networks Explained
  • [paper · advanced · 3-4 hours] Attention Is All You Need
  • [tutorial · advanced · 4-5 hours] Harvard Annotated Transformer

3. Large Language Models

Understand how LLMs work, from pretraining to inference, and learn to work with them effectively.

Key Topics:

  • Language modeling: next-token prediction
  • Causal (GPT) vs masked (BERT) models
  • Scaling laws and emergent capabilities
  • Tokenizers: BPE, WordPiece, SentencePiece
  • Working with HuggingFace Transformers
  • Prompt engineering fundamentals
  • Fine-tuning basics: LoRA and PEFT concepts

Action Items:

  • Follow Karpathy's nanoGPT to build a mini LLM
  • Run inference with various HuggingFace models
  • Design effective prompts for different tasks
  • Fine-tune a small model with LoRA
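Next-token prediction can be seen in miniature with a bigram count model — a toy sketch, not how an LLM works internally, but the same objective (predict the next token from context) that nanoGPT trains a transformer on:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which — a toy 'language model'."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most likely next token and its estimated probability."""
    followers = counts[token]
    total = sum(followers.values())
    best, n = followers.most_common(1)[0]
    return best, n / total

corpus = ["the cat sat on the mat", "the cat ate", "the dog sat"]
model = train_bigram(corpus)
token, prob = predict_next(model, "the")  # "cat" follows "the" in 2 of 4 cases
```

A real LLM replaces the count table with a transformer over subword tokens and the whole preceding context, but the training signal is the same next-token distribution.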
Resources:

  • [course · intermediate · 4-5 hours] Let's Build GPT (Karpathy)
  • [course · intermediate · 6-8 hours] HuggingFace NLP Course (Ch 1-4)
  • [paper · advanced · 2-3 hours] LoRA Paper

Diffusion Models Foundations

4. Generative Model Foundations

Build intuition for generative modeling before diving into diffusion-specific concepts.

Key Topics:

  • Generative vs discriminative models
  • Latent space and representation learning
  • Variational Autoencoders (VAE) intuition
  • GANs overview: generator, discriminator, adversarial training
  • Limitations of VAEs and GANs leading to diffusion

Action Items:

  • Implement a simple VAE on MNIST
  • Explore latent space interpolation
  • Understand mode collapse in GANs
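The second and third action items meet in the latent space. A minimal NumPy sketch of the VAE reparameterization trick (z = μ + σ·ε) and linear latent interpolation — real explorations often prefer spherical (slerp) interpolation for Gaussian latents, but linear is the simplest starting point:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def lerp(z1, z2, n_steps):
    """Linearly interpolate between two latent vectors, endpoints included."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1 - t) * z1 + t * z2 for t in ts])

rng = np.random.default_rng(0)
z1 = reparameterize(np.zeros(16), np.zeros(16), rng)  # sample near the origin
z2 = reparameterize(np.ones(16), np.zeros(16), rng)   # sample near mu = 1
path = lerp(z1, z2, n_steps=8)  # 8 latent codes to feed the decoder
```

Decoding each point on `path` with a trained MNIST VAE should show one digit morphing smoothly into another — the qualitative check that the latent space is well organized.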
Resources:

  • [course · intermediate · 3-4 hours] Generating Sound with Neural Networks (VAE Explained)
  • [tutorial · intermediate · 3-4 hours] VAE Tutorial (UvA Deep Learning)
  • [course · intermediate · 4-5 hours] Generative Adversarial Networks (GANs)

5. Diffusion Model Theory

Understand the mathematical foundations of denoising diffusion probabilistic models.

Key Topics:

  • Forward process: adding noise progressively
  • Reverse process: learning to denoise
  • Noise schedules and variance
  • Training objective: simplified loss function
  • Score matching and score-based models connection
  • Sampling: DDPM vs DDIM

Action Items:

  • Implement forward diffusion process
  • Train a simple DDPM on a 2D dataset or MNIST
  • Experiment with different noise schedules
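The forward process in the first action item has a convenient closed form: x_t can be sampled directly from x_0 as x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β_t). A NumPy sketch with the standard DDPM linear schedule:

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """The linear noise schedule used in the original DDPM paper."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(28, 28))        # stand-in for an MNIST image
x_early = forward_diffuse(x0, 10, alpha_bar, rng)    # still mostly signal
x_late = forward_diffuse(x0, T - 1, alpha_bar, rng)  # essentially pure noise
```

Swapping `linear_beta_schedule` for a cosine schedule and comparing how fast ᾱ_t decays is exactly the "different noise schedules" experiment above.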
Resources:

  • [course · beginner · 1-2 hours] How AI Images Actually Work
  • [course · intermediate · 4-5 hours] HuggingFace Diffusion Course (Unit 1-2)
  • [course · advanced · 6-8 hours] A Practical Introduction to Diffusion Models

6. Modern Diffusion Architectures

Learn the architectures powering Stable Diffusion, DALL-E, and other state-of-the-art systems.

Key Topics:

  • Stable Diffusion architecture
  • U-Net for denoising
  • Text conditioning with CLIP
  • ControlNet for image conditioning
  • Practical generation techniques

Action Items:

  • Run Stable Diffusion with different prompts and settings
  • Experiment with guidance scale effects
  • Use ControlNet for structured generation
  • Fine-tune with LoRA on custom concepts
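The "guidance scale" in the action items refers to classifier-free guidance: at each denoising step the U-Net is run with and without the text conditioning, and the two noise predictions are combined. A minimal sketch of just that combination (the random arrays stand in for real U-Net outputs):

```python
import numpy as np

def apply_guidance(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the noise prediction toward the
    text-conditioned direction. scale = 1 reproduces the conditional
    prediction; larger scales follow the prompt more strongly (at the
    cost of diversity and, eventually, artifacts)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))   # stand-in: U-Net output, empty prompt
eps_cond = rng.normal(size=(4, 4))     # stand-in: U-Net output, text prompt

same = apply_guidance(eps_uncond, eps_cond, 1.0)    # = conditional prediction
strong = apply_guidance(eps_uncond, eps_cond, 7.5)  # a common default scale
```

This is why doubling the guidance scale changes images so visibly: it linearly extrapolates past the conditional prediction rather than interpolating toward it.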
Resources:

  • [course · intermediate · 1-2 hours] The U-Net Explained
  • [blog · intermediate · 2-3 hours] The Illustrated Stable Diffusion
  • [course · intermediate · 3-4 hours] Stable Diffusion and ControlNet
  • [tutorial · intermediate · 2-3 hours] ControlNet Tutorial (HuggingFace)
  • [tutorial · advanced · 2-3 hours] LoRA Training for Diffusion

Speech Technologies

7. Audio Fundamentals

Build essential knowledge of audio signal processing for speech applications.

Key Topics:

  • Digital audio: sampling rate, bit depth
  • Time vs frequency domain representations
  • Spectrograms and Short-Time Fourier Transform
  • Mel scale and mel spectrograms
  • MFCCs for speech feature extraction
  • Audio preprocessing: normalization, VAD

Action Items:

  • Load and visualize audio with librosa
  • Generate spectrograms and mel spectrograms
  • Extract MFCCs from speech samples
  • Build a basic preprocessing pipeline
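The mel scale underlying mel spectrograms and MFCCs is a simple formula: mel = 2595·log₁₀(1 + f/700) (the HTK convention; librosa defaults to the Slaney variant, which differs slightly). A NumPy sketch showing why mel filterbanks are denser at low frequencies:

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Mel filterbanks place band edges evenly in *mel* space, not in Hz:
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), num=10)
edges_hz = mel_to_hz(edges_mel)
# The resulting Hz edges bunch up at low frequencies and spread out at
# high frequencies, mirroring human pitch perception.
```

This is the warping step between a linear-frequency spectrogram and the mel spectrogram you will generate with librosa in the action items.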
Resources:

  • [course · beginner · 4-5 hours] HuggingFace Audio Course
  • [course · intermediate · 6-8 hours] Audio Signal Processing for Machine Learning

8. Speech-to-Text (ASR)

Learn automatic speech recognition from fundamentals to modern end-to-end models.

Key Topics:

  • ASR pipeline overview
  • CTC loss and alignment-free training
  • End-to-end ASR architectures
  • Whisper: architecture, capabilities, multilingual
  • Streaming vs offline recognition
  • Evaluation metrics: WER, CER
  • Domain adaptation and fine-tuning

Action Items:

  • Set up Whisper transcription pipeline
  • Compare model sizes on accuracy/speed
  • Test on accented and noisy audio
  • Build a real-time transcription demo
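Word Error Rate, the evaluation metric listed above, is Levenshtein edit distance computed over words and normalized by reference length. A self-contained implementation (libraries like `jiwer` do this for you, but it fits in a few lines):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

wer = word_error_rate("the cat sat on the mat",
                      "a cat sat on the mat")  # one substitution in six words
```

Note that WER can exceed 1.0 when the hypothesis inserts many spurious words; CER is the same computation over characters, which matters for languages without clear word boundaries.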
Resources:

  • [tutorial · intermediate · 3-4 hours] HuggingFace ASR Tutorial (Unit 5)
  • [course · intermediate · 1-2 hours] End-to-End ASR System
  • [course · intermediate · 1-2 hours] Whisper Paper Explanation

9. Text-to-Speech (TTS)

Explore speech synthesis from text normalization to neural vocoders.

Key Topics:

  • TTS pipeline: text → acoustic features → waveform
  • Text normalization and G2P conversion
  • Acoustic models: Tacotron, FastSpeech
  • Neural vocoders: WaveNet, HiFi-GAN
  • Modern TTS: VITS, Bark, Coqui TTS
  • Prosody control and expressive speech
  • Voice cloning with minimal samples

Action Items:

  • Run inference with Coqui TTS / Bark
  • Compare quality across TTS models
  • Experiment with prosody and emotion
  • Try zero-shot voice cloning
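Text normalization, the first stage of the pipeline above, rewrites digits, abbreviations, and symbols into speakable words before G2P. A deliberately tiny sketch covering only integers 0-99 (a production front-end handles dates, currency, ordinals, and much more):

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out integers 0-99 (a real front-end covers far more cases)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize(text):
    """Replace bare one- or two-digit numbers so the acoustic model
    only ever sees words."""
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)

print(normalize("meet me at 7 or 45 past"))
# -> meet me at seven or forty-five past
```

Ambiguity is what makes the real task hard: "1999" might be a year, a count, or part of a phone number, and the right expansion differs for each.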
Resources:

  • [tutorial · intermediate · 3-4 hours] HuggingFace TTS Tutorial (Unit 6)
  • [course · intermediate · 4-5 hours] TTS Course

Capstone Project

Build an automated video translation pipeline that combines ASR, LLM translation, and TTS to dub English videos into Russian with synchronized audio.
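One way to structure the capstone is as three swappable stages over timestamped segments. The sketch below uses stub functions in place of real models — Whisper for `transcribe`, an LLM call for `translate`, and a TTS engine for `synthesize` are all assumptions to be filled in; only the data flow and the timing-preservation idea are the point:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds into the source video
    end: float
    text: str

def transcribe(video_path):
    """Stub for ASR; Whisper would return timestamped segments like these."""
    return [Segment(0.0, 2.5, "Hello, world."),
            Segment(2.5, 5.0, "How are you?")]

def translate(segments):
    """Stub for LLM translation; a real stage would prompt a model with
    each segment (plus context) and keep the original timings."""
    table = {"Hello, world.": "Привет, мир.", "How are you?": "Как дела?"}
    return [Segment(s.start, s.end, table.get(s.text, s.text))
            for s in segments]

def synthesize(segments):
    """Stub for TTS; a real stage renders audio fitted to each time window
    (stretching or trimming to stay in sync with the video)."""
    return [(s.start, s.end, f"<audio:{s.text}>") for s in segments]

def dub(video_path):
    """ASR -> translation -> TTS, preserving segment timings for sync."""
    return synthesize(translate(transcribe(video_path)))

clips = dub("talk.mp4")
```

Keeping `start`/`end` attached to every segment through all three stages is what makes the final audio placeable back onto the video timeline; the hard engineering is fitting Russian speech, which is often longer than the English source, into each window.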