For OLTP (transactional) databases, the best practice is to keep all data in at least third normal form (3NF). With some hands-on experience, developers design tables in 3NF right away, but whenever that proves difficult, the same result can be reached iteration by iteration by following these rules:
1st Normal Form
First normal form (1NF) sets the very basic rules for an organized database:
- Define the data items required, because they become the columns in a table. Place related data items in a table.
- Ensure that there are no repeating groups of data.
- Ensure that there is a primary key.
First Rule of 1NF:
Defining the data items means looking at the data to be stored, organizing the data into columns, defining what type of data each column contains, and finally putting related columns into their own table. For example, you put all the columns relating to locations of meetings in the “location” table, those relating to members in the “memberDetails” table, and so on.
Second Rule of 1NF:
The next step is ensuring that there are no repeating groups of data. Suppose we have the following table:
CREATE TABLE customers (
    customerId      INT NOT NULL,
    customerName    VARCHAR(20) NOT NULL,
    customerAge     INT NOT NULL,
    customerAddress VARCHAR(25),
    orders          VARCHAR(155)
);
So if we populate this table for a single customer having multiple orders, then it would be something as follows:
| customerId | customerName | customerAge | customerAddress | orders         |
| 100        | Sachin       | 36          | Lower West Side | Cannon XL-200  |
| 100        | Sachin       | 36          | Lower West Side | Battery XL-200 |
| 100        | Sachin       | 36          | Lower West Side | Tripod Large   |
But as per 1NF, we need to ensure that there are no repeating groups of data. So let us break the above table into two parts and join them using a key as follows:
CREATE TABLE customers (
    customerId      INT NOT NULL,
    customerName    VARCHAR(20) NOT NULL,
    customerAge     INT NOT NULL,
    customerAddress VARCHAR(25),
    PRIMARY KEY (customerId)
);
This table would have the following record:
| customerId | customerName | customerAge | customerAddress |
| 100        | Sachin       | 36          | Lower West Side |
CREATE TABLE orders (
    orderId    INT NOT NULL,
    customerId INT NOT NULL,
    orders     VARCHAR(155),
    PRIMARY KEY (orderId)
);
This table would have the following records (the orderId values are arbitrary):
| orderId | customerId | orders         |
| 10      | 100        | Cannon XL-200  |
| 11      | 100        | Battery XL-200 |
| 12      | 100        | Tripod Large   |
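The decomposition can be verified with a quick sketch (SQLite is used here purely for illustration; the orderId values are invented sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The two 1NF tables described above.
cur.execute("""CREATE TABLE customers (
    customerId INT NOT NULL, customerName VARCHAR(20) NOT NULL,
    customerAge INT NOT NULL, customerAddress VARCHAR(25),
    PRIMARY KEY (customerId))""")
cur.execute("""CREATE TABLE orders (
    orderId INT NOT NULL, customerId INT NOT NULL,
    orders VARCHAR(155), PRIMARY KEY (orderId))""")

# One customer row instead of three duplicated ones...
cur.execute("INSERT INTO customers VALUES (100, 'Sachin', 36, 'Lower West Side')")
# ...and one row per order, linked by customerId.
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 100, 'Cannon XL-200'),
                 (11, 100, 'Battery XL-200'),
                 (12, 100, 'Tripod Large')])

# A join reproduces the original repeating rows on demand.
rows = cur.execute("""SELECT c.customerId, c.customerName, c.customerAge,
                             c.customerAddress, o.orders
                      FROM customers c
                      JOIN orders o ON o.customerId = c.customerId
                      ORDER BY o.orderId""").fetchall()
for row in rows:
    print(row)
```

The customer's details are now stored once, yet the join returns exactly the same three rows as the original denormalized table.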
Third Rule of 1NF:
The final rule of first normal form is to create a primary key for each table that has been created.
2nd Normal Form
Second normal form states that a table must meet all the rules of 1NF and have no partial dependencies of any column on the primary key.
Consider a customer-order relation where you want to store customer ID, customer name, order ID, order detail, and date of purchase:
CREATE TABLE customers (
    customerId   INT NOT NULL,
    customerName VARCHAR(20) NOT NULL,
    orderId      INT NOT NULL,
    orderDetail  VARCHAR(20) NOT NULL,
    saleDate     DATETIME,
    PRIMARY KEY (customerId, orderId)
);
This table is in first normal form: it obeys all the rules of 1NF. Its primary key consists of customerId and orderId; combined, they are unique, assuming the same customer would rarely place the same order twice.
However, the table is not in second normal form, because some columns depend on only part of the primary key. customerName depends on customerId alone, and there is no real link between a customer's name and what they purchased. Likewise, orderDetail and saleDate depend on orderId alone, not on customerId, because there is no link between a customerId and an orderDetail or its saleDate.
To make this table comply with second normal form, you need to separate the columns into three tables.
First, create a table to store the customer details as follows:
CREATE TABLE customers (
    customerId   INT NOT NULL,
    customerName VARCHAR(20) NOT NULL,
    PRIMARY KEY (customerId)
);
Next, create a table to store details of each order:
CREATE TABLE orders (
    orderId     INT NOT NULL,
    orderDetail VARCHAR(20) NOT NULL,
    PRIMARY KEY (orderId)
);
Finally, create a third table storing just customerId and orderId to keep track of all the orders for a customer:
CREATE TABLE customerOrders (
    customerId INT NOT NULL,
    orderId    INT NOT NULL,
    saleDate   DATETIME,
    PRIMARY KEY (customerId, orderId)
);
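The three-table decomposition can be sketched as follows (SQLite for illustration; the sample values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# The three 2NF tables described above.
cur.executescript("""
CREATE TABLE customers (customerId INT NOT NULL, customerName VARCHAR(20) NOT NULL,
                        PRIMARY KEY (customerId));
CREATE TABLE orders (orderId INT NOT NULL, orderDetail VARCHAR(20) NOT NULL,
                     PRIMARY KEY (orderId));
CREATE TABLE customerOrders (customerId INT NOT NULL, orderId INT NOT NULL,
                             saleDate DATETIME, PRIMARY KEY (customerId, orderId));
""")
# Illustrative sample data.
cur.execute("INSERT INTO customers VALUES (100, 'Sachin')")
cur.execute("INSERT INTO orders VALUES (10, 'Cannon XL-200')")
cur.execute("INSERT INTO customerOrders VALUES (100, 10, '2023-05-01')")

# Two joins reassemble the original wide row; every non-key column
# now depends on the whole key of its own table.
row = cur.execute("""SELECT c.customerId, c.customerName, o.orderId,
                            o.orderDetail, co.saleDate
                     FROM customerOrders co
                     JOIN customers c ON c.customerId = co.customerId
                     JOIN orders o ON o.orderId = co.orderId""").fetchone()
print(row)
```

Note that saleDate stays in customerOrders: it is a fact about the customer-order pair, so it depends on the full composite key.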
3rd Normal Form
A table is in third normal form when the following conditions are met:
- It is in second normal form.
- All non-key fields depend directly on the primary key, not on other non-key fields.
In other words, there must be no dependencies between non-key fields. For example, in the table below, city and state are determined by the zip code:
CREATE TABLE customers (
    customerId   INT NOT NULL,
    customerName VARCHAR(20) NOT NULL,
    DOB          DATE,
    street       VARCHAR(200),
    city         VARCHAR(100),
    state        VARCHAR(100),
    zip          VARCHAR(12),
    emailId      VARCHAR(256),
    PRIMARY KEY (customerId)
);
The dependency between zip code and address is called a transitive dependency. To comply with third normal form, all you need to do is move the street, city, and state fields into their own table, which you can call the address table:
CREATE TABLE address (
    addressId INT,
    zip       VARCHAR(12),
    street    VARCHAR(200),
    city      VARCHAR(100),
    state     VARCHAR(100),
    PRIMARY KEY (addressId)
);
Next, alter the customers table as follows:
CREATE TABLE customers (
    customerId   INT NOT NULL,
    customerName VARCHAR(20) NOT NULL,
    DOB          DATE,
    addressId    INT,
    emailId      VARCHAR(256),
    PRIMARY KEY (customerId)
);
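A customer's full address is still one join away (SQLite sketch; all sample values, including the email and address, are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# The two 3NF tables described above.
cur.executescript("""
CREATE TABLE address (addressId INT, zip VARCHAR(12), street VARCHAR(200),
                      city VARCHAR(100), state VARCHAR(100), PRIMARY KEY (addressId));
CREATE TABLE customers (customerId INT NOT NULL, customerName VARCHAR(20) NOT NULL,
                        DOB DATE, addressId INT, emailId VARCHAR(256),
                        PRIMARY KEY (customerId));
""")
# Illustrative sample data.
cur.execute("INSERT INTO address VALUES (1, '10014', 'Hudson St', 'New York', 'NY')")
cur.execute("INSERT INTO customers VALUES "
            "(100, 'Sachin', '1987-04-24', 1, 'sachin@example.com')")

# The customer row carries only the addressId; the join brings back
# street, city, state, and zip from the single address row.
row = cur.execute("""SELECT c.customerName, a.street, a.city, a.state, a.zip
                     FROM customers c
                     JOIN address a ON a.addressId = c.addressId""").fetchone()
print(row)
```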
The advantages of removing transitive dependencies are mainly twofold.
- The amount of data duplication is reduced and therefore your database becomes smaller.
- Data integrity. When duplicated data changes, there’s a big risk of updating only some of the data, especially if it’s spread out in a number of different places in the database. For example, if address and zip code data were stored in three or four different tables, then any changes in zip codes would need to ripple out to every record in those three or four tables.
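The integrity point can be demonstrated in miniature (SQLite sketch with invented sample data): two customers share one address row, so correcting the zip code is a single UPDATE in a single place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE address (addressId INT NOT NULL, zip VARCHAR(12), city VARCHAR(100),
                      state VARCHAR(100), PRIMARY KEY (addressId));
CREATE TABLE customers (customerId INT NOT NULL, customerName VARCHAR(20) NOT NULL,
                        addressId INT, PRIMARY KEY (customerId));
""")
cur.execute("INSERT INTO address VALUES (1, '10014', 'New York', 'NY')")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(100, 'Sachin', 1), (101, 'Rahul', 1)])

# The zip code lives in exactly one row, so one UPDATE fixes it everywhere.
cur.execute("UPDATE address SET zip = '10013' WHERE addressId = 1")

# Every customer linked to that address now sees the corrected value.
zips = [r[0] for r in cur.execute("""SELECT a.zip
                                     FROM customers c
                                     JOIN address a ON a.addressId = c.addressId
                                     ORDER BY c.customerId""")]
print(zips)
```

With the zip duplicated across customer rows, the same correction would need to touch every affected row in every table that stored it.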
Advantages of data normalization
Removed data redundancy
The higher the normalization level, the less data is stored redundantly in the database. Once duplicates are removed, the data is easier to change, since each fact is present in only one place.
Better understanding of the business data model
Putting entities in order requires considerable effort in communicating with users, managers, data owners, and other business people. But once all business requirements are collected, analyzed, and prioritized, the clean and sound data model built upon them always pays off.
Reduced data modification anomalies
Normalization splits an entity into a number of smaller entities, so you end up with more, smaller tables that are easier to sort, index, and search. Because each fact is stored in exactly one place, insert, update, and delete anomalies are avoided. A table can have only one clustered index, so the more tables you have, the more clustered indexes you can create, which keeps data sorted and simplifies and speeds up access to it.
Widely spread myths about disadvantages of data normalization
Multiple smaller tables require more complex joins in between those tables, which consumes more development time
We suggest hiring developers who don't say this in interviews and who demonstrate a clear understanding of the importance of keeping data clean and sound. A little extra code complexity here and there (normally hidden completely from the end user) is no reason to mess up the data the company works with every day. Databases tend to grow, and at some point unnormalized data becomes the application's ceiling. We saw a case where changing a user's last name after marriage required a full scan of multiple columns in approximately 700 tables and replacements in approximately 8,000 places, which took about 4 hours in total. Had the data been properly normalized, a single replacement in one place in one table would have sufficed.
Complex joins decrease performance
It certainly may happen, but only if a join is built incorrectly or used in a situation where it isn't needed at all. We saw a case where a developer built a very creative, super-complex query with a CASE operator and 12 levels of self-joins of a table, in order to pre-populate another table with 12 similar columns. The query worked while the table contained only 5 records; when it grew to a hundred, the query simply killed the server every time, because the dynamic data set ate up all available memory and disk resources. Rewriting the query to use 11 UNION operators instead of 12 joins solved the problem. If the same data is to be returned from the database to the user, why would a SELECT statement with joins be less efficient than one without? The same amount of data takes exactly the same amount of memory, and exactly the same time is required to transfer it from disk. In fact, denormalized data would probably take significantly longer to transfer from disk because of the duplicates.
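The exact query from that case isn't shown, but the general pattern of replacing self-joins with UNION ALL can be sketched in miniature (three similar columns instead of twelve; the table and column names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# A table with several "similar" columns holding the same kind of value.
cur.execute("CREATE TABLE src (id INT, q1 INT, q2 INT, q3 INT)")
cur.execute("INSERT INTO src VALUES (1, 10, 20, 30)")

# One pass per column combined with UNION ALL: no self-joins, so the
# intermediate result never multiplies the row count.
rows = cur.execute("""
    SELECT id, 1 AS quarter, q1 AS value FROM src
    UNION ALL SELECT id, 2, q2 FROM src
    UNION ALL SELECT id, 3, q3 FROM src
    ORDER BY quarter""").fetchall()
print(rows)
```

Each branch scans the source table once, so the cost grows linearly with the number of columns rather than exploding with each additional self-join.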
Complete normalization of a database requires clear and broad understanding of the business, it takes more time to analyze and understand the business
This is 100% true, but a better understanding of the business by developers always pays off, so it's not actually a disadvantage; it's just the price to pay for a good data model.
Denormalization should only be considered in special circumstances, such as data warehousing. A properly normalized OLTP database will outperform a denormalized one, but some other uses of the database, such as business analytics, can be much more convenient if the number of tables is minimized and the remaining tables are organized into a specific structure, with fact and dimension tables clearly separated from each other. For this purpose, data denormalization may be required.
Published at WebProfIT.Consulting