Database Normalization & Normal Forms

SharkYun
6 min readMay 22, 2023

--

  • What is Database Normalization
  • Understanding Normal Forms
  • Step-by-step Example with Full Explanation in 1NF, 2NF, 3NF, and BCNF
  • Summary & Key Takeaway Notes
Overview of UNF to 3 NF step by step example SharkYun

What is Database Normalization

The concept of normalization was introduced by E.F. Codd, the inventor of the relational database model. Normalization is the process of organizing data in a database to reduce redundancy, improve data integrity, and optimize data storage and retrieval. It involves breaking down a database into multiple related tables and defining relationships between them. The purpose of normalization in SQL is to eliminate redundant (repetitive) data and ensure data is stored logically.

Data redundancy refers to the duplication of data within a database.

Data integrity refers to the accuracy, consistency, and reliability of data. It ensures that data is correct and valid based on defined rules and constraints.

Understanding Normal Forms

Normal forms are a set of guidelines or rules used in database design to ensure data integrity, eliminate redundancy, and improve efficiency in database systems.

The normal forms (from least normalized to most normalized) are:

Picture Source

A Step by step Example in 1NF, 2NF and 3NF with Full Explanation

First Normal Form (1NF)

  • Each table cell should contain a single value.
  • Each record needs to be unique.

To conform to the 1NF, we need to ensure that each field in the table contains only one value from its attribute domain. This means that each attribute should hold a single value, and if there are repeating groups of attributes (e.g., multiple courses for a single student), we need to eliminate them by creating separate tables.

Consider the following case: In the example table ‘Student’, there are multiple courses for a single student, so we need to move these rows into new rows or create a separate table.

First Normal Form SharkYun

Second Normal Form (2NF)

  • The tables are in 1NF
  • There is no partial dependencies

To make the database into 2NF, it requires that Single Column Primary Key that does not functionally dependent on any subset of candidate key relation. To conform to 2NF and remove duplicities, every non-candidate-key attribute must depend on the whole candidate key, not just part of it.

2NF helps in reducing redundancy and ensuring data integrity by organizing the data into separate tables based on functional dependencies. By eliminating partial dependencies, we ensure that each attribute in the table is uniquely determined by the primary key.

In the example case: Since ‘StudentName’ and ‘Address’ is the composite key, there is no a Single Column Primary Key in the table yet. So adding a separate column for ‘StudentID’ to achieve the second normal form (2NF) in this case. In addition, we divide the original table into 2 tables, where Table 1 contains student information and Table 2 contains information on courses taken. By splitting the original table, we ensure that each table represents a single entity and there are no partial dependencies.

Second Normal Form SharkYun

Third Normal Form (3NF)

  • The tables are in 2NF
  • There is no transitive dependencies

To bring the table to 3NF, we need to eliminate any transitive dependencies. In this case, we can identify that changing the non-key column StudentName may change prefix. To normalize the table to 3NF, we can decompose the prefix into a separate table with prefixID. Hence, there are no transitive functional dependencies.

Third Normal Form SharkYun
Correction for Prefix table

Updated: The Prefix Table should include all possible Prefix.

BCNF (Boyce-Codd Normal Form)

  • The tables are in 3NF
  • X should be a superkey for every functional dependency (FD) X−>Y in a given relation.
  • There should be no non-trivial functional dependencies of non-prime attributes on candidate keys.

Even when a database is in 3rd Normal Form, still there would be anomalies resulted if it has more than one Candidate Key.

In the example case we had for 3NF:

For table 1, we have StudentID → StudentName, Address, prefixID

For table 2, we have StudentID → Course

For table 3, we have prefixID → prefix

In Table 1, all attributes (StudentName, Address, prefixID) are dependent on the primary key StudentID, and there are no non-trivial functional dependencies.

In Table 2, the Course attribute is dependent on the primary key StudentID, and there are no non-trivial functional dependencies.

In Table 3, the prefix attribute is dependent on the primary key prefixID, and there are no non-trivial functional dependencies.

4NF (Fourth Normal Form) Rules

  • The tables are in 3NF
  • There is no database table instance contains two or more, independent and multivalued data describing the relevant entity

5NF (Fifth Normal Form) Rules

  • The tables are in 4NF
  • It cannot be decomposed into any number of smaller tables without loss of data

6NF (Sixth Normal Form) Rules

  • The tables are in 5NF

Satisfying 6NF A simple and intuitive definition of the sixth normal form is that “a table is in 6NF when the row contains the Primary Key, and at most one other attribute”.

Therefore, If a table in 5NF has one primary key column and N attributes, representing the same information in 6NF will require N tables.

Summary

Each normal form builds upon the previous one, with higher normal forms providing more strict guidelines for data organization. By achieving higher normal forms, a database can minimize redundancy, improve data integrity, and facilitate efficient data management and querying.

The choice of which normal form to apply depends on the specific business requirements and logical rules governing the database.

Takeaway Note

1NF: remove duplicate attributes for attribute domain

2NF: remove partial dependencies for data integrity

3NF: remove transitive dependencies

BCNF: remove non-trivial dependencies between candidate key attributes

Partial dependencies: an attribute depends on only part of a multivalued key

Transitive dependency is one attribute in a table depends on the values of two or more other attributes.

Non-trivial dependencies is a direct relationship or dependency between two attributes without involving any other attributes.

Super key is a set of one or more attributes (columns) that can uniquely identify a tuple (row) in a database table.

Candidate key is a specific type of super key that is considered as a potential choice for the primary key of a table. It is a minimal super key, meaning that no subset of the candidate key can uniquely identify records in the table.

Composite key is a key that consists of two or more attributes or columns in a database table. It is used to uniquely identify a record in a table when no single attribute can uniquely identify it.

Data redundancy refers to the duplication of data within a database.

Data integrity refers to the accuracy, consistency, and reliability of data. It ensures that data is correct and valid based on defined rules and constraints.

References:

--

--

SharkYun

Data science notes and Personal experiences | UCLA 2023'