Module 4: Data Manipulation

Data Manipulation - Python Code and Data Pipelines

Goal

Manipulate and prepare real-world data for analysis and machine learning.

This page focuses on the practical skills needed to turn raw, messy data into clean, structured inputs that models and analyses can actually use.

Estimated Time Impact

10-14 hours total

Pandas (8h)

SQL (6h)

Git (4h)

1. Pandas & Data Cleaning

course

beginner

Kaggle: Pandas Micro-Course

6-8 hours

Why it matters:

Most real-world datasets are incomplete, inconsistent, and poorly formatted. Before any modeling or analysis can happen, data must be inspected, cleaned, and transformed.

Pandas is the primary tool used in Python for working with tabular data at scale, and it forms the backbone of most data pipelines.

What to expect:

Learners will:

Load and explore large CSV and tabular datasets
Filter, sort, and aggregate data efficiently
Handle missing values and inconsistent formats
Create new features from existing columns
Apply vectorized operations for performance

By the end of this section, learners will be able to clean and manipulate real-world datasets with concise, readable code.
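The cleaning steps listed above can be sketched in a few lines of pandas. This is a minimal illustration, not material from the Kaggle course: the `raw` table, its column names, the `clean_orders` helper, and the choice to fill missing prices with the median are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw data: every name and value here is invented for illustration.
raw = pd.DataFrame(
    {
        "city": [" Paris", "paris", "Berlin", "BERLIN ", "Rome"],
        "price": ["10.5", "12.0", None, "8.0", "n/a"],
        "quantity": [2, 1, 3, None, 4],
    }
)


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the hypothetical orders table."""
    out = df.copy()
    # Inconsistent formats: trim whitespace and normalize capitalization.
    out["city"] = out["city"].str.strip().str.title()
    # Convert text to numbers; unparseable entries such as "n/a" become NaN.
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    # Missing values: fill prices with the median (an assumed policy),
    # and treat a missing quantity as zero.
    out["price"] = out["price"].fillna(out["price"].median())
    out["quantity"] = out["quantity"].fillna(0).astype(int)
    # New feature built with a vectorized operation (no Python loop).
    out["revenue"] = out["price"] * out["quantity"]
    return out


cleaned = clean_orders(raw)

# Aggregate and sort: total revenue per city, largest first.
summary = (
    cleaned.groupby("city", as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Whether to fill, drop, or flag missing values depends on the dataset and the downstream analysis; the median fill here is only one reasonable option.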

2. SQL

tutorial

beginner

SQL Tutorial - Essential SQL For The Beginners

8-10 hours

Why it matters:

Most production data lives in relational databases. SQL is the universal language for querying, filtering, aggregating, and transforming data at scale.

Whether extracting features for machine learning or generating reports, SQL fluency is essential for working with real-world data systems.

What to expect:

Learners will:

Write SELECT queries to retrieve and filter data
Use JOIN operations to combine data from multiple tables
Perform aggregations with GROUP BY and HAVING
Create, update, and delete records with INSERT, UPDATE, DELETE
Work with subqueries and common table expressions (CTEs)
Understand indexing and query optimization basics

By completing this section, learners will be able to extract and manipulate data directly from databases used in production environments.
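To make the query types above concrete, here is a small sketch that runs SQL from Python using the standard-library `sqlite3` module and an in-memory database. The `customers` and `orders` tables, their columns, and the 50.0 threshold are hypothetical, and production databases (PostgreSQL, MySQL, and others) may differ from SQLite in dialect details.

```python
import sqlite3

# In-memory database with two hypothetical tables, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders (customer_id, amount)
    VALUES (1, 40.0), (1, 60.0), (2, 15.0), (3, 80.0), (3, 30.0);
    """
)

# A CTE that JOINs the tables, aggregates with GROUP BY, and filters
# the aggregated groups with HAVING. The threshold is passed as a parameter.
query = """
WITH totals AS (
    SELECT c.name        AS name,
           COUNT(o.id)   AS n_orders,
           SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    HAVING SUM(o.amount) >= ?
)
SELECT name, n_orders, total
FROM totals
ORDER BY total DESC;
"""
rows = conn.execute(query, (50.0,)).fetchall()
conn.close()

for name, n_orders, total in rows:
    print(f"{name}: {n_orders} orders, total {total}")
```

Passing the threshold as a `?` parameter rather than formatting it into the query string is the usual way to avoid SQL injection when values come from user input.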

3. Software Engineering Tools

practice

beginner

Git Branching Game

2-3 hours

Why it matters:

Data work is collaborative and iterative. Without version control, experiments are lost, results are overwritten, and collaboration breaks down.

Git is the industry-standard tool for tracking changes, experimenting safely, and working in teams.

What to expect:

Learners will:

Understand repositories, commits, and branches
Save and version their work using Git
Push and pull changes from remote repositories
Merge branches and resolve simple conflicts
Develop safe workflows for experimentation

This section ensures learners can manage their code and data work reliably, both individually and in team settings.

Completion Checklist

Completed the Kaggle: Pandas Micro-Course, including all core lessons and exercises, and earned the official course certificate from Kaggle.
Completed the SQL Tutorial - Essential SQL For The Beginners and practiced all query types, including JOINs, aggregations, and subqueries.
Completed the W3Docs Git Basics Quiz with a score of at least 85% and earned the Git certificate.
