You log into Netflix. It remembers what you watched. Suggests new shows based on your preferences. Picks up exactly where you left off.
You check your bank account. Every transaction is there. Going back years. Perfectly accurate. Never lost.
You search for a product on Amazon. Millions of items. Thousands match your query. Results appear in milliseconds.
How?
Data. Massive amounts of data. Organized. Searchable. Reliable.
But it's not just stored randomly in files somewhere. It's in databases—carefully designed systems that store, organize, and retrieve information efficiently.
Understanding databases changes everything. You start to see why some apps are fast and others slow. Why good database design matters. Why companies pay database administrators well. Why data integrity is crucial.
This isn't just theory for computer scientists. This is the foundation of every modern application—from social media to banking to healthcare to e-commerce.
Let's explore how data is really organized.
Definition: A database schema is the structure that defines how data is organized in a database, including tables, fields, relationships, constraints, and data types.
Think of a schema as a blueprint. Before you build a house, you need blueprints showing where rooms go, how they connect, what materials to use.
Before you build a database, you need a schema showing what data you'll store, how it's organized, how pieces relate to each other.
Students table:
| Field | Type | Constraint |
|---|---|---|
| StudentID | integer | primary key |
| FirstName | text | |
| LastName | text | |
| DateOfBirth | date | |
| text |
Courses table:
| Field | Type | Constraint |
|---|---|---|
| CourseID | integer | primary key |
| CourseName | text | |
| Credits | integer | |
| Department | text |
Enrollments table:
| Field | Type | Constraint |
|---|---|---|
| EnrollmentID | integer | primary key |
| StudentID | integer | foreign key |
| CourseID | integer | foreign key |
| Grade | text | |
| Semester | text |
This schema shows three tables and how they relate. Students can enroll in courses. The Enrollments table connects them.
Definition: A relational database is a database that organizes data into tables (relations) with rows and columns, where relationships between tables are established through keys.
Invented by Edgar F. Codd in 1970. Changed everything about data storage.
Each table stores one type of entity. Students table stores students. Courses table stores courses. Not: Students_and_Courses_and_Everything_Else table. This separation is crucial. We'll see why when we discuss normalization.
Definition: A primary key is a field (or combination of fields) that uniquely identifies each record in a table.
Rules:
Example: In the Students table, StudentID is the primary key.
Every student has a unique StudentID. No two students share an ID. If StudentID = 12345,
there's exactly one student with that ID.
Why not use names? Two students might have the same name. Names can change (marriage). IDs don't change.
Definition: A foreign key is a field in one table that refers to the primary key of another table, creating a relationship between the tables.
Example: The Enrollments table has StudentID as a foreign key.
It references the primary key StudentID in the Students table. This says: "This
enrollment belongs to this specific student."
Referential integrity: The database ensures foreign keys always point to valid records. You can't enroll a student with ID 99999 if no such student exists.
One-to-One (1:1): One record in Table A relates to exactly one record in Table B. Example: Each employee has one passport. Each passport belongs to one employee.
One-to-Many (1:M): One record in Table A relates to many records in Table B. Example: One teacher teaches many courses. Each course has one teacher.
Many-to-Many (M:N): Many records in Table A relate to many records in Table B. Example: Students enroll in many courses. Courses have many students. Implementation: Requires a junction table (like Enrollments). Can't directly represent M:N in relational databases.
Definition: An Entity-Relationship Diagram is a visual representation of entities (tables), their attributes (fields), and the relationships between them.
Before writing code or creating tables, design the database visually.
Entities: Represented as rectangles. These become tables.
Attributes: Represented as ovals connected to entities. These become fields.
Relationships: Represented as diamonds connecting entities.
Cardinality: Shows relationship type (1:1, 1:M, M:N).
Crow's foot notation (common):
This shows:
Definition: Normalization is the process of organizing database tables to minimize redundancy and dependency, improving data integrity and reducing anomalies.
Problems with bad design:
Example of bad design — Students_and_Courses table:
| StudentID | StudentName | Course1 | Course2 | Course3 |
|---|---|---|---|---|
| 1 | Alice | Math | Physics | NULL |
| 2 | Bob | Math | Biology | Chemistry |
Problems:
Definition: A table is in 1NF if each field contains only atomic (indivisible) values and each record is unique.
Rules:
Bad (not 1NF):
| StudentID | StudentName | Courses |
|---|---|---|
| 1 | Alice | Math, Physics |
| 2 | Bob | Math, Bio, Chem |
"Courses" column contains multiple values. Violates 1NF.
Good (1NF):
| StudentID | StudentName | Course |
|---|---|---|
| 1 | Alice | Math |
| 1 | Alice | Physics |
| 2 | Bob | Math |
| 2 | Bob | Biology |
| 2 | Bob | Chemistry |
Each cell has a single value. Each row is a unique combination of student and course. Problem: Still redundant. "Alice" appears twice. If she changes name, update multiple rows.
Definition: A table is in 2NF if it's in 1NF and all non-key attributes are fully dependent on the entire primary key (no partial dependencies).
Only relevant for composite primary keys (keys made of multiple fields).
Example with composite key — Composite primary key: (StudentID, CourseID)
Bad (1NF but not 2NF):
| StudentID | CourseID | StudentName | CourseName | Grade |
|---|---|---|---|---|
| 1 | 101 | Alice | Math | A |
| 1 | 102 | Alice | Physics | B |
| 2 | 101 | Bob | Math | A |
Problems:
This is partial dependency—some attributes depend on part of the key, not the whole key.
Good (2NF) — Split into three tables:
Students:
| StudentID | StudentName |
|---|---|
| 1 | Alice |
| 2 | Bob |
Courses:
| CourseID | CourseName |
|---|---|
| 101 | Math |
| 102 | Physics |
Enrollments:
| StudentID | CourseID | Grade |
|---|---|---|
| 1 | 101 | A |
| 1 | 102 | B |
| 2 | 101 | A |
Now all non-key attributes fully depend on the entire primary key. Problem: Still possible redundancy if non-key attributes depend on other non-key attributes.
Definition: A table is in 3NF if it's in 2NF and no non-key attribute depends on another non-key attribute (no transitive dependencies).
Bad (2NF but not 3NF):
| StudentID | StudentName | AdvisorID | AdvisorName |
|---|---|---|---|
| 1 | Alice | 501 | Dr. Smith |
| 2 | Bob | 501 | Dr. Smith |
| 3 | Carol | 502 | Dr. Jones |
Primary key: StudentID. Problem: AdvisorName depends on AdvisorID (not on StudentID directly). This is transitive dependency: StudentID → AdvisorID → AdvisorName. If Dr. Smith changes name, must update multiple rows.
Good (3NF) — Split into two tables:
Students:
| StudentID | StudentName | AdvisorID |
|---|---|---|
| 1 | Alice | 501 |
| 2 | Bob | 501 |
| 3 | Carol | 502 |
Advisors:
| AdvisorID | AdvisorName |
|---|---|
| 501 | Dr. Smith |
| 502 | Dr. Jones |
Now if Dr. Smith changes name, update one row in the Advisors table. Students table doesn't change.
Benefits of Normalization:
When NOT to Normalize: Sometimes denormalization (intentionally breaking normalization rules) improves performance. Example: Data warehouses often denormalize for faster queries. Reading from one table is faster than joining many tables. Trade-off: Storage space and update complexity vs. query speed.
Definition: SQL (Structured Query Language) is a standardized programming language used to manage and manipulate relational databases.
SQL lets you: create tables, insert data, update data, delete data, and query data.
Basic syntax:
SELECT column1, column2
FROM table_name
WHERE condition;
Get all students:
SELECT * FROM Students;
Get specific columns:
SELECT FirstName, LastName FROM Students;
Filter with WHERE:
SELECT * FROM Students
WHERE LastName = 'Smith';
Multiple conditions:
SELECT * FROM Students
WHERE LastName = 'Smith' AND FirstName = 'John';
Pattern matching with LIKE:
SELECT * FROM Students
WHERE Email LIKE '%@gmail.com';
Sorting with ORDER BY:
SELECT * FROM Students
ORDER BY LastName ASC, FirstName ASC;
Limiting results:
SELECT * FROM Students
LIMIT 10;
Syntax:
INSERT INTO table_name (column1, column2)
VALUES (value1, value2);
Example:
INSERT INTO Students (StudentID, FirstName, LastName, Email)
VALUES (1, 'Alice', 'Johnson', 'alice@email.com');
Multiple rows:
INSERT INTO Students (StudentID, FirstName, LastName)
VALUES
(2, 'Bob', 'Smith'),
(3, 'Carol', 'Williams'),
(4, 'David', 'Brown');
Syntax:
UPDATE table_name
SET column1 = value1, column2 = value2
WHERE condition;
Example:
UPDATE Students
SET Email = 'new.email@gmail.com'
WHERE StudentID = 1;
-- DANGEROUS! Updates every studentUPDATE Students SET Email = 'same@email.com';
Syntax:
DELETE FROM table_name
WHERE condition;
Example:
DELETE FROM Students
WHERE StudentID = 1;
-- DANGEROUS! Deletes all studentsDELETE FROM Students;
Definition: A join is an SQL operation that combines rows from two or more tables based on a related column.
Why needed? Data is normalized across multiple tables. Joins bring it back together.
Definition: Returns only rows where there's a match in both tables.
SELECT columns
FROM table1
INNER JOIN table2
ON table1.column = table2.column;
Example — Students and their enrollments:
SELECT Students.FirstName, Students.LastName, Courses.CourseName
FROM Students
INNER JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
INNER JOIN Courses ON Enrollments.CourseID = Courses.CourseID;
Result: Only students who are enrolled in courses.
Definition: Returns all rows from the left table and matching rows from the right table. If no match, NULL for right table columns.
SELECT Students.FirstName, Courses.CourseName
FROM Students
LEFT JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
LEFT JOIN Courses ON Enrollments.CourseID = Courses.CourseID;
Result: All students. For students with no enrollments, CourseName shows NULL.
Definition: Returns all rows from the right table and matching rows from the left table. Less commonly used (can rewrite as LEFT JOIN by swapping tables).
Definition: Returns all rows from both tables. If no match, NULL for missing columns.
SELECT Students.FirstName, Courses.CourseName
FROM Students
FULL OUTER JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
FULL OUTER JOIN Courses ON Enrollments.CourseID = Courses.CourseID;
Definition: Aggregate functions perform calculations on multiple rows and return a single value.
Counts number of rows.
SELECT COUNT(*) FROM Students;
-- Returns total number of students
SELECT COUNT(*) FROM Students
WHERE LastName = 'Smith';
-- Returns number of students with last name Smith
Adds up values.
SELECT SUM(Credits) FROM Enrollments
WHERE StudentID = 1;
-- Total credits for student 1
Calculates average.
SELECT AVG(Grade) FROM Enrollments
WHERE CourseID = 101;
-- Average grade in course 101
Find minimum and maximum values.
SELECT MIN(DateOfBirth) FROM Students;
-- Oldest student (earliest birth date)
SELECT MAX(DateOfBirth) FROM Students;
-- Youngest student (latest birth date)
Groups rows that have the same values.
-- Count students in each course:
SELECT CourseID, COUNT(*) AS NumStudents
FROM Enrollments
GROUP BY CourseID;
-- Average grade per course:
SELECT Courses.CourseName, AVG(Enrollments.Grade) AS AvgGrade
FROM Enrollments
INNER JOIN Courses ON Enrollments.CourseID = Courses.CourseID
GROUP BY Courses.CourseName;
Filters groups (WHERE filters rows before grouping; HAVING filters groups after).
-- Courses with more than 10 students:
SELECT CourseID, COUNT(*) AS NumStudents
FROM Enrollments
GROUP BY CourseID
HAVING COUNT(*) > 10;
Definition: A view is a virtual table based on a SQL query that can be queried like a regular table but doesn't store data itself.
Why use views?
CREATE VIEW StudentCourses AS
SELECT Students.FirstName, Students.LastName, Courses.CourseName
FROM Students
INNER JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
INNER JOIN Courses ON Enrollments.CourseID = Courses.CourseID;
SELECT * FROM StudentCourses
WHERE LastName = 'Smith';
Much simpler than writing the full join query every time!
Definition: A transaction is a sequence of database operations treated as a single unit of work that either completes entirely or has no effect at all.
Transfer $100 from Account A to Account B.
BEGIN TRANSACTION;
UPDATE Accounts
SET Balance = Balance - 100
WHERE AccountID = 'A';
UPDATE Accounts
SET Balance = Balance + 100
WHERE AccountID = 'B';
COMMIT;
If anything goes wrong (power failure, error in second UPDATE), the entire transaction rolls back. Account A doesn't lose $100 without Account B gaining it.
BEGIN TRANSACTION — Start transactionCOMMIT — Make changes permanentROLLBACK — Undo all changes in transactionBEGIN TRANSACTION;
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A';
-- Check if sufficient funds
IF (SELECT Balance FROM Accounts WHERE AccountID = 'A') < 0 THEN
ROLLBACK;
-- Transaction cancelled
ELSE
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 'B';
COMMIT;
-- Transaction successful
END IF;
Definition: A data warehouse is a large, centralized repository of integrated data from multiple sources, optimized for analysis and reporting rather than transaction processing.
Online Transaction Processing. Handle day-to-day operations.
Online Analytical Processing. Handle analysis and reporting.
Definition: A star schema is a data warehouse design where a central fact table connects to multiple dimension tables, forming a star-like structure.
Example: Retail Sales Warehouse
Fact table — Sales:
| SaleID | ProductID | CustomerID | StoreID | DateID | Quantity | Revenue |
|---|---|---|---|---|---|---|
| … fact rows … | ||||||
Dimension tables:
| ProductID | ProductName | Category | Price |
|---|---|---|---|
| Product dimension | |||
| CustomerID | Name | City | Country |
|---|---|---|---|
| Customer dimension | |||
| StoreID | StoreName | City | Region |
|---|---|---|---|
| Store dimension | |||
| DateID | Date | Month | Quarter | Year |
|---|---|---|---|---|
| Date dimension | ||||
Query example — Total sales by category per month:
SELECT Product.Category, Date.Month, SUM(Sales.Revenue)
FROM Sales
JOIN Product ON Sales.ProductID = Product.ProductID
JOIN Date ON Sales.DateID = Date.DateID
GROUP BY Product.Category, Date.Month;
Why Denormalize in Warehouses? Joins are expensive. Complex analytics query many tables. Denormalizing—storing redundant data—makes queries faster. Trade-off: More storage space, harder to update vs. much faster queries. For warehouses: Fast queries matter more. Data changes infrequently (loaded in batches). Storage is cheap.
You started wondering where your data actually lives.
Now you understand.
Database schemas define structure—tables, fields, relationships, constraints. The blueprint before you build.
Relational databases organize data into tables connected by keys—primary keys uniquely identify records, foreign keys create relationships.
ERD diagrams visualize database design before implementation, showing entities, attributes, and relationships clearly.
Normalization eliminates redundancy and anomalies—1NF removes repeating groups, 2NF eliminates partial dependencies, 3NF removes transitive dependencies.
SQL is the language—SELECT retrieves data, INSERT adds it, UPDATE modifies it, DELETE removes it.
Joins combine tables—INNER JOIN for matches, LEFT JOIN for all from left table, joins reconstruct normalized data.
Aggregate functions summarize data—COUNT, SUM, AVG, MIN, MAX with GROUP BY and HAVING.
Views simplify complex queries, providing virtual tables without storing duplicate data.
Transactions ensure data integrity with ACID properties—all operations succeed or all fail, maintaining consistency.
Data warehouses support analytics, using denormalized star schemas for fast complex queries on historical data.
Every time you shop online, check your bank account, or search for information, databases work behind the scenes. Organized. Reliable. Fast.
Understanding databases changes how you think about data. You're no longer just a user. You understand the architecture beneath.