- What is Database Normalization
- Understanding Normal Forms
- Step-by-step Example with Full Explanation in 1NF, 2NF, 3NF, and BCNF
- Summary & Key Takeaway Notes
What is Database Normalization
The concept of normalization was introduced by E.F. Codd, the inventor of the relational database model. Normalization is the process of organizing data in a database to reduce redundancy, improve data integrity, and optimize data storage and retrieval. It involves breaking down a database into multiple related tables and defining relationships between them. The purpose of normalization in SQL is to eliminate redundant (repetitive) data and ensure data is stored logically.
Data redundancy refers to the duplication of data within a database.
Data integrity refers to the accuracy, consistency, and reliability of data. It ensures that data is correct and valid based on defined rules and constraints.
Understanding Normal Forms
Normal forms are a set of guidelines or rules used in database design to ensure data integrity, eliminate redundancy, and improve efficiency in database systems.
The normal forms (from least normalized to most normalized) are:
- UNF: Unnormalized form
- 1NF: First normal form
- 2NF: Second normal form
- 3NF: Third normal form
- EKNF: Elementary key normal form
- BCNF: Boyce–Codd normal form
- 4NF: Fourth normal form
- ETNF: Essential tuple normal form
- 5NF: Fifth normal form
- DKNF: Domain-key normal form
- 6NF: Sixth normal form
A Step by step Example in 1NF, 2NF and 3NF with Full Explanation
First Normal Form (1NF)
- Each table cell should contain a single value.
- Each record needs to be unique.
To conform to the 1NF, we need to ensure that each field in the table contains only one value from its attribute domain. This means that each attribute should hold a single value, and if there are repeating groups of attributes (e.g., multiple courses for a single student), we need to eliminate them by creating separate tables.
Consider the following case: In the example table ‘Student’, there are multiple courses for a single student, so we need to move these rows into new rows or create a separate table.
Second Normal Form (2NF)
- The tables are in 1NF
- There is no partial dependencies
To make the database into 2NF, it requires that Single Column Primary Key that does not functionally dependent on any subset of candidate key relation. To conform to 2NF and remove duplicities, every non-candidate-key attribute must depend on the whole candidate key, not just part of it.
2NF helps in reducing redundancy and ensuring data integrity by organizing the data into separate tables based on functional dependencies. By eliminating partial dependencies, we ensure that each attribute in the table is uniquely determined by the primary key.
In the example case: Since ‘StudentName’ and ‘Address’ is the composite key, there is no a Single Column Primary Key in the table yet. So adding a separate column for ‘StudentID’ to achieve the second normal form (2NF) in this case. In addition, we divide the original table into 2 tables, where Table 1 contains student information and Table 2 contains information on courses taken. By splitting the original table, we ensure that each table represents a single entity and there are no partial dependencies.
Third Normal Form (3NF)
- The tables are in 2NF
- There is no transitive dependencies
To bring the table to 3NF, we need to eliminate any transitive dependencies. In this case, we can identify that changing the non-key column StudentName may change prefix. To normalize the table to 3NF, we can decompose the prefix into a separate table with prefixID. Hence, there are no transitive functional dependencies.
Updated: The Prefix Table should include all possible Prefix.
BCNF (Boyce-Codd Normal Form)
- The tables are in 3NF
- X should be a superkey for every functional dependency (FD) X−>Y in a given relation.
- There should be no non-trivial functional dependencies of non-prime attributes on candidate keys.
Even when a database is in 3rd Normal Form, still there would be anomalies resulted if it has more than one Candidate Key.
In the example case we had for 3NF:
For table 1, we have StudentID → StudentName, Address, prefixID
For table 2, we have StudentID → Course
For table 3, we have prefixID → prefix
In Table 1, all attributes (StudentName, Address, prefixID) are dependent on the primary key StudentID, and there are no non-trivial functional dependencies.
In Table 2, the Course attribute is dependent on the primary key StudentID, and there are no non-trivial functional dependencies.
In Table 3, the prefix attribute is dependent on the primary key prefixID, and there are no non-trivial functional dependencies.
4NF (Fourth Normal Form) Rules
- The tables are in 3NF
- There is no database table instance contains two or more, independent and multivalued data describing the relevant entity
5NF (Fifth Normal Form) Rules
- The tables are in 4NF
- It cannot be decomposed into any number of smaller tables without loss of data
6NF (Sixth Normal Form) Rules
- The tables are in 5NF
Satisfying 6NF A simple and intuitive definition of the sixth normal form is that “a table is in 6NF when the row contains the Primary Key, and at most one other attribute”.
Therefore, If a table in 5NF has one primary key column and N attributes, representing the same information in 6NF will require N tables.
Summary
Each normal form builds upon the previous one, with higher normal forms providing more strict guidelines for data organization. By achieving higher normal forms, a database can minimize redundancy, improve data integrity, and facilitate efficient data management and querying.
The choice of which normal form to apply depends on the specific business requirements and logical rules governing the database.
Takeaway Note
1NF: remove duplicate attributes for attribute domain
2NF: remove partial dependencies for data integrity
3NF: remove transitive dependencies
BCNF: remove non-trivial dependencies between candidate key attributes
Partial dependencies: an attribute depends on only part of a multivalued key
Transitive dependency is one attribute in a table depends on the values of two or more other attributes.
Non-trivial dependencies is a direct relationship or dependency between two attributes without involving any other attributes.
Super key is a set of one or more attributes (columns) that can uniquely identify a tuple (row) in a database table.
Candidate key is a specific type of super key that is considered as a potential choice for the primary key of a table. It is a minimal super key, meaning that no subset of the candidate key can uniquely identify records in the table.
Composite key is a key that consists of two or more attributes or columns in a database table. It is used to uniquely identify a record in a table when no single attribute can uniquely identify it.
Data redundancy refers to the duplication of data within a database.
Data integrity refers to the accuracy, consistency, and reliability of data. It ensures that data is correct and valid based on defined rules and constraints.
References: