
Generative AI Track

Goal

Focus: Master the foundations of generative AI—from text representation and transformers to diffusion models and speech technologies. Build the skills to work with LLMs, image generation systems, and audio pipelines.

Curriculum

Foundations of Natural Language Processing

1. Text Representation & Embeddings

Understand how text is converted to numerical representations, from classical methods to modern dense embeddings.

Key Topics:

  • Text preprocessing: tokenization, normalization, subword tokenization (BPE)
  • Bag-of-Words, TF-IDF, and their limitations
  • Word2Vec (CBOW, Skip-gram) and GloVe intuition
  • Embedding geometry: similarity, analogies, bias
  • Contextual embeddings concept (leading to transformers)

Action Items:

  • Build BoW and TF-IDF vectors and compare feature distributions
  • Train Word2Vec on a corpus, visualize with t-SNE
  • Explore pre-trained embeddings with Gensim
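The BoW/TF-IDF comparison above can be sketched from scratch. This is a minimal plain-Python illustration of the TF-IDF weighting idea (with a smoothed IDF term); in practice you would use scikit-learn's `TfidfVectorizer` as the linked guide describes:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute smoothed TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({
            # term frequency * smoothed inverse document frequency
            term: (count / len(doc)) * math.log((1 + n_docs) / (1 + df[term]))
            for term, count in tf.items()
        })
    return weights

docs = ["the cat sat", "the dog sat", "quantum entanglement"]
w = tf_idf(docs)
# "the" occurs in two of three documents, so its weight is pushed down
# relative to rarer terms like "cat" or "quantum".
```

Comparing `w[0]["the"]` with `w[0]["cat"]` shows the limitation TF-IDF fixes: raw counts treat common and discriminative words identically.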
Resources:

  • [course · beginner · 3-4 hours] Stanford CS224N — NLP with Deep Learning (Lectures 1-2)
  • [tutorial · beginner · 2-3 hours] scikit-learn Text Feature Extraction Guide
  • [paper · intermediate · 2-3 hours] Word2Vec Paper
  • [tutorial · beginner · 2-3 hours] Gensim Word Embeddings Tutorial

2. Attention & Transformers

Master the transformer architecture that underpins modern language models and text-conditioned diffusion systems.

Key Topics:

  • Self-attention mechanism and intuition
  • Query, Key, Value formulation
  • Multi-head attention and why it matters
  • Positional encoding strategies
  • Transformer encoder/decoder architecture
  • Layer normalization and residual connections

Action Items:

  • Implement scaled dot-product attention from scratch in PyTorch
  • Visualize attention weights using a pre-trained model (bert-base-uncased)
  • Build a minimal self-attention layer as nn.Module
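The first action item can be sketched directly from the Q/K/V formulation. The action items call for PyTorch; this NumPy version shows the same math (Attention(Q, K, V) = softmax(QKᵀ/√d_k)V) with nothing hidden behind a framework:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (queries, keys)
    weights = softmax(scores, axis=-1)              # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, attn = scaled_dot_product_attention(Q, K, V)
```

The √d_k scaling keeps the dot products from saturating the softmax as dimensionality grows; multi-head attention runs several of these in parallel over projected Q/K/V.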
Resources:

  • [blog · intermediate · 2-3 hours] The Illustrated Transformer (Jay Alammar)
  • [blog · intermediate · 1-2 hours] Visualizing Neural Machine Translation
  • [course · intermediate · 1-2 hours] Transformer Neural Networks Explained
  • [paper · advanced · 3-4 hours] Attention Is All You Need
  • [tutorial · advanced · 4-5 hours] Harvard Annotated Transformer

3. Large Language Models

Understand how LLMs work, from pretraining to inference, and learn to work with them effectively.

Key Topics:

  • Language modeling: next-token prediction
  • Causal (GPT) vs masked (BERT) models
  • Scaling laws and emergent capabilities
  • Tokenizers: BPE, WordPiece, SentencePiece
  • Working with HuggingFace Transformers
  • Prompt engineering fundamentals
  • Fine-tuning basics: LoRA and PEFT concepts

Action Items:

  • Follow Karpathy's nanoGPT to build a mini LLM
  • Run inference with various HuggingFace models
  • Design effective prompts for different tasks
  • Fine-tune a small model with LoRA
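Next-token prediction can be seen in miniature with a bigram count model — a toy sketch, not how an LLM works internally, but the same objective (predict the next token from context) that nanoGPT trains a transformer on:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which — a toy 'language model'."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most likely next token and its estimated probability."""
    followers = counts[token]
    total = sum(followers.values())
    best, n = followers.most_common(1)[0]
    return best, n / total

corpus = ["the cat sat on the mat", "the cat ate", "the dog sat"]
model = train_bigram(corpus)
token, prob = predict_next(model, "the")  # "cat" follows "the" in 2 of 4 cases
```

A real LLM replaces the count table with a transformer over subword tokens and the whole preceding context, but the training signal is the same next-token distribution.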
Resources:

  • [course · intermediate · 4-5 hours] Let's Build GPT (Karpathy)
  • [course · intermediate · 6-8 hours] HuggingFace NLP Course (Ch 1-4)
  • [paper · advanced · 2-3 hours] LoRA Paper

Diffusion Models Foundations

4. Generative Model Foundations

Build intuition for generative modeling before diving into diffusion-specific concepts.

Key Topics:

  • Generative vs discriminative models
  • Latent space and representation learning
  • Variational Autoencoders (VAE) intuition
  • GANs overview: generator, discriminator, adversarial training
  • Limitations of VAEs and GANs leading to diffusion

Action Items:

  • Implement a simple VAE on MNIST
  • Explore latent space interpolation
  • Understand mode collapse in GANs
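The second and third action items meet in the latent space. A minimal NumPy sketch of the VAE reparameterization trick (z = μ + σ·ε) and linear latent interpolation — real explorations often prefer spherical (slerp) interpolation for Gaussian latents, but linear is the simplest starting point:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def lerp(z1, z2, n_steps):
    """Linearly interpolate between two latent vectors, endpoints included."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1 - t) * z1 + t * z2 for t in ts])

rng = np.random.default_rng(0)
z1 = reparameterize(np.zeros(16), np.zeros(16), rng)  # sample near the origin
z2 = reparameterize(np.ones(16), np.zeros(16), rng)   # sample near mu = 1
path = lerp(z1, z2, n_steps=8)  # 8 latent codes to feed the decoder
```

Decoding each point on `path` with a trained MNIST VAE should show one digit morphing smoothly into another — the qualitative check that the latent space is well organized.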
Resources:

  • [course · intermediate · 3-4 hours] Generating Sound with Neural Networks (VAE Explained)
  • [tutorial · intermediate · 3-4 hours] VAE Tutorial (UvA Deep Learning)
  • [course · intermediate · 4-5 hours] Generative Adversarial Networks (GANs)

5. Diffusion Model Theory

Understand the mathematical foundations of denoising diffusion probabilistic models.

Key Topics:

  • Forward process: adding noise progressively
  • Reverse process: learning to denoise
  • Noise schedules and variance
  • Training objective: simplified loss function
  • Score matching and score-based models connection
  • Sampling: DDPM vs DDIM

Action Items:

  • Implement forward diffusion process
  • Train a simple DDPM on a 2D dataset or MNIST
  • Experiment with different noise schedules
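The forward process in the first action item has a convenient closed form: x_t can be sampled directly from x_0 as x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β_t). A NumPy sketch with the standard DDPM linear schedule:

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """The linear noise schedule used in the original DDPM paper."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(28, 28))        # stand-in for an MNIST image
x_early = forward_diffuse(x0, 10, alpha_bar, rng)    # still mostly signal
x_late = forward_diffuse(x0, T - 1, alpha_bar, rng)  # essentially pure noise
```

Swapping `linear_beta_schedule` for a cosine schedule and comparing how fast ᾱ_t decays is exactly the "different noise schedules" experiment above.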
Resources:

  • [course · beginner · 1-2 hours] How AI Images Actually Work
  • [course · intermediate · 4-5 hours] HuggingFace Diffusion Course (Unit 1-2)
  • [course · advanced · 6-8 hours] A Practical Introduction to Diffusion Models

6. Modern Diffusion Architectures

Learn the architectures powering Stable Diffusion, DALL-E, and other state-of-the-art systems.

Key Topics:

  • Stable Diffusion architecture
  • U-Net for denoising
  • Text conditioning with CLIP
  • ControlNet for image conditioning
  • Practical generation techniques

Action Items:

  • Run Stable Diffusion with different prompts and settings
  • Experiment with guidance scale effects
  • Use ControlNet for structured generation
  • Fine-tune with LoRA on custom concepts
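The "guidance scale" in the action items refers to classifier-free guidance: at each denoising step the U-Net is run with and without the text conditioning, and the two noise predictions are combined. A minimal sketch of just that combination (the random arrays stand in for real U-Net outputs):

```python
import numpy as np

def apply_guidance(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the noise prediction toward the
    text-conditioned direction. scale = 1 reproduces the conditional
    prediction; larger scales follow the prompt more strongly (at the
    cost of diversity and, eventually, artifacts)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))   # stand-in: U-Net output, empty prompt
eps_cond = rng.normal(size=(4, 4))     # stand-in: U-Net output, text prompt

same = apply_guidance(eps_uncond, eps_cond, 1.0)    # = conditional prediction
strong = apply_guidance(eps_uncond, eps_cond, 7.5)  # a common default scale
```

This is why doubling the guidance scale changes images so visibly: it linearly extrapolates past the conditional prediction rather than interpolating toward it.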
Resources:

  • [course · intermediate · 1-2 hours] The U-Net Explained
  • [blog · intermediate · 2-3 hours] The Illustrated Stable Diffusion
  • [course · intermediate · 3-4 hours] Stable Diffusion and ControlNet
  • [tutorial · intermediate · 2-3 hours] ControlNet Tutorial (HuggingFace)
  • [tutorial · advanced · 2-3 hours] LoRA Training for Diffusion

Speech Technologies

7. Audio Fundamentals

Build essential knowledge of audio signal processing for speech applications.

Key Topics:

  • Digital audio: sampling rate, bit depth
  • Time vs frequency domain representations
  • Spectrograms and Short-Time Fourier Transform
  • Mel scale and mel spectrograms
  • MFCCs for speech feature extraction
  • Audio preprocessing: normalization, VAD

Action Items:

  • Load and visualize audio with librosa
  • Generate spectrograms and mel spectrograms
  • Extract MFCCs from speech samples
  • Build a basic preprocessing pipeline
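The mel scale underlying mel spectrograms and MFCCs is a simple formula: mel = 2595·log₁₀(1 + f/700) (the HTK convention; librosa defaults to the Slaney variant, which differs slightly). A NumPy sketch showing why mel filterbanks are denser at low frequencies:

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Mel filterbanks place band edges evenly in *mel* space, not in Hz:
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), num=10)
edges_hz = mel_to_hz(edges_mel)
# The resulting Hz edges bunch up at low frequencies and spread out at
# high frequencies, mirroring human pitch perception.
```

This is the warping step between a linear-frequency spectrogram and the mel spectrogram you will generate with librosa in the action items.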
Resources:

  • [course · beginner · 4-5 hours] HuggingFace Audio Course
  • [course · intermediate · 6-8 hours] Audio Signal Processing for Machine Learning

8. Speech-to-Text (ASR)

Learn automatic speech recognition from fundamentals to modern end-to-end models.

Key Topics:

  • ASR pipeline overview
  • CTC loss and alignment-free training
  • End-to-end ASR architectures
  • Whisper: architecture, capabilities, multilingual
  • Streaming vs offline recognition
  • Evaluation metrics: WER, CER
  • Domain adaptation and fine-tuning

Action Items:

  • Set up Whisper transcription pipeline
  • Compare model sizes on accuracy/speed
  • Test on accented and noisy audio
  • Build a real-time transcription demo
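Word Error Rate, the evaluation metric listed above, is Levenshtein edit distance computed over words and normalized by reference length. A self-contained implementation (libraries like `jiwer` do this for you, but it fits in a few lines):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

wer = word_error_rate("the cat sat on the mat",
                      "a cat sat on the mat")  # one substitution in six words
```

Note that WER can exceed 1.0 when the hypothesis inserts many spurious words; CER is the same computation over characters, which matters for languages without clear word boundaries.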
Resources:

  • [tutorial · intermediate · 3-4 hours] HuggingFace ASR Tutorial (Unit 5)
  • [course · intermediate · 1-2 hours] End-to-End ASR System
  • [course · intermediate · 1-2 hours] Whisper Paper Explanation

9. Text-to-Speech (TTS)

Explore speech synthesis from text normalization to neural vocoders.

Key Topics:

  • TTS pipeline: text → acoustic features → waveform
  • Text normalization and G2P conversion
  • Acoustic models: Tacotron, FastSpeech
  • Neural vocoders: WaveNet, HiFi-GAN
  • Modern TTS: VITS, Bark, Coqui TTS
  • Prosody control and expressive speech
  • Voice cloning with minimal samples

Action Items:

  • Run inference with Coqui TTS / Bark
  • Compare quality across TTS models
  • Experiment with prosody and emotion
  • Try zero-shot voice cloning
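Text normalization, the first stage of the pipeline above, rewrites digits, abbreviations, and symbols into speakable words before G2P. A deliberately tiny sketch covering only integers 0-99 (a production front-end handles dates, currency, ordinals, and much more):

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out integers 0-99 (a real front-end covers far more cases)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize(text):
    """Replace bare one- or two-digit numbers so the acoustic model
    only ever sees words."""
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)

print(normalize("meet me at 7 or 45 past"))
# -> meet me at seven or forty-five past
```

Ambiguity is what makes the real task hard: "1999" might be a year, a count, or part of a phone number, and the right expansion differs for each.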
Resources:

  • [tutorial · intermediate · 3-4 hours] HuggingFace TTS Tutorial (Unit 6)
  • [course · intermediate · 4-5 hours] TTS Course

Capstone Project

Build an automated video translation pipeline that combines ASR, LLM translation, and TTS to dub English videos into Russian with synchronized audio.
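One way to structure the capstone is as three swappable stages over timestamped segments. The sketch below uses stub functions in place of real models — Whisper for `transcribe`, an LLM call for `translate`, and a TTS engine for `synthesize` are all assumptions to be filled in; only the data flow and the timing-preservation idea are the point:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds into the source video
    end: float
    text: str

def transcribe(video_path):
    """Stub for ASR; Whisper would return timestamped segments like these."""
    return [Segment(0.0, 2.5, "Hello, world."),
            Segment(2.5, 5.0, "How are you?")]

def translate(segments):
    """Stub for LLM translation; a real stage would prompt a model with
    each segment (plus context) and keep the original timings."""
    table = {"Hello, world.": "Привет, мир.", "How are you?": "Как дела?"}
    return [Segment(s.start, s.end, table.get(s.text, s.text))
            for s in segments]

def synthesize(segments):
    """Stub for TTS; a real stage renders audio fitted to each time window
    (stretching or trimming to stay in sync with the video)."""
    return [(s.start, s.end, f"<audio:{s.text}>") for s in segments]

def dub(video_path):
    """ASR -> translation -> TTS, preserving segment timings for sync."""
    return synthesize(translate(transcribe(video_path)))

clips = dub("talk.mp4")
```

Keeping `start`/`end` attached to every segment through all three stages is what makes the final audio placeable back onto the video timeline; the hard engineering is fitting Russian speech, which is often longer than the English source, into each window.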