Data warehouse modelling with Snowflake Schema

 ABSTRACT : 

Snowflake is a data warehouse schema design where dimension tables are normalized on top of a star schema design. Snowflake schema is generally not recommended due to its performance overhead in joining the normalized dimensions tables. However, the snowflake schema can be extended in a way to improve performance for business analysis activities in business analysis, two environments are complementary and work together to provide effective business analytics. Firstly, the data warehouse environment transforms operational data into information. Secondly, the analytical environment delivers information to end users for further data analysis and decision making. The snowflake schema bridges a gap between two environments. This schema facilitates the mapping of wide dimension structures with many dimensions attributes to analytical processing hierarchies. The snowflake schema makes navigation along hierarchies easier and support flexible analysis such as drill down and rollup. Process changes transactional database into data warehouse could use some schemes, one of them is snowflake scheme. The use of snowflake scheme aims to yield many data those employ as the material of decision-making system. The different of this scheme with other schemes is splitting fact data into many dimensional data. After that, the next process is doing extracting, transformation, and loading (ETL) process. The stage of the population in ETL process has an important role because it determines the result of data those will be loaded to every dimensional data. To guarantee data those will be populated has a good integrity, so it should be done data cleansing namely erasing a noise data. The successful of data populating and loading would be yielded beneficiary of the managerial level to make a decision.

Keywords : Data warehouse, Snowflake design, Business intelligence,

Business analytics.

INTRODUCTION TO SNOWFLAKE SCHEMA :


Fig 1a. Datawarehouse of snowflake

In the information world, the most important asset is data. The existence of data in an organization is used as material to determine policies, build strategies, and take decisions. The available data then processed into information needed by the organization. Data processing could be done in every state, e.g. in the operational database, operational application, or data warehouse technology usage. Many stored data in the operational database are ineffective if it used as material to dig managerial information. The campus library is expected to have a data warehouse so that the process of extracting information to management levels does not interfere with the process of transactions made at any time. Utilization of data warehouse is basically the process that summarizes stored data in the database system done due to a certain time period but it doesn’t do at any time do. The existence of a data warehouse is also required as a space of transaction considering it will extend as increasing of students each year. Supervisors need to get summarized data of transaction data occurred on their departments or known as ‘transactional information system’ (TPS).

At the management information system level, usually, there are summarizing of information resulted by supervisors to synchronize with the supervisors. At the higher level, the information should be presented in such a way to support the decision-making process or this level known as decision support systems. The highest information is necessarily available for library manager needs and the summarized information are much detail in such a way and the availability of information provided in the mechanism named by executive information systems.

History of SnowFlake schema:

Schema is a logical description of the entire database. It includes the name and description of records of all record types including all associated data-items and aggregates. Much like a database, a data warehouse also requires to maintain a schema. A database uses relational model, while a data warehouse uses Star, Snowflake, and Fact Constellation schema. In this chapter, we will discuss the schemas used in a data warehouse.

Star Schema

·        Each dimension in a star schema is represented with only one-dimension table.

·        This dimension table contains the set of attributes.

·        The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.



·        There is a fact table at the centre. It contains the keys to each of four dimensions.

·        The fact table also contains the attributes, namely dollars sold and units sold.

Note − Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" both the cities are in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.


            Table 1: Difference between star and snowflake schema


What is snowflaking?

The snowflake design is the result of further expansion and normalized of the dimension table. In other words, a dimension table is said to be snow flaked if the low-cardinality attribute of the dimensions have been divided into separate normalized tables. These tables are then joined to the original dimension table with referential constrains(foreign key constrain). Generally, snowflaking is not recommended in the dimension table, as it hampers the understandability and performance of the dimension model as more tables would be required to be joined to satisfy the queries.

 

Characteristics of Snowflake Schema :

The dimension model of snowflake under the following conditions:

·         The snowflake schema uses small disk space.

·         It is easy to implement dimension is added to schema.

·         There are multiple tables, so performance is reduced.

·         The dimension table consists of two or more sets of attributes which define information at different grains.

·         The sets of attributes of the same dimension table are being populate by different source systems.

Structure of Datawarehouse:



Fig 3a.Datawarehouse structure

Data warehouse or also known as enterprise Datawarehouse is a set of subject-oriented data, integrated data, non-updateable, time-variant, historical information and support decision-making. The data warehouse is made to prevent the technical and business issues in the usage of data and information to make a decision. The data warehouse is also a system that aims to take and consolidate data periodically from a source system to the dimension table or storage data normalized. Generally, the data are renewed for such time not once transaction in the source system. The data warehouse is used for reporting and data analysis for such knowledge.

 

II)SNOWFLAKE SCHEMA:

Snowflake scheme method is a method of data modelling applied on the data warehouse. It is developed from method of star scheme. Those methods have trade-off in the implementation. The snowflake scheme is mixing of some data modeling of star scheme normalized. In the snowflake scheme, each dimension table may have another dimension sub-table. The data dimension has an information subject that is functioned as the material in making a decision because in each data dimension is possible to generate a description more detail. So that, the data source could be processed becomes information in a huge number and more detail. The snowflake scheme is shown in Figure.

 

III) ETL

Extraction, transformation, and loading (ETL) are a process to pull a data out from the source system and place it into the data warehouse. In the process of managing a data, the data should be filtered and fixed in order to get a good data quality. The invalid data should be cleaned. In the ETL processes, approaching of traditional is retrieving a data from a data source, and putting it in a staging area, thus transforming and loading it into a data warehouse.

 


Fig 3b.ETL Process in detail

 

IMPLEMENTATION:

          Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions.

            Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products. Multienterprise calendar dimensions because each organization has idiosyncratic fiscal periods, seasons, and holidays.

Some advantages and disadvantages of the use of snowflake scheme method could be written as below:

ADVANTAGES: 

1) Dimension table process divided into several subdimension that could cause stored media efficiency. This is very useful because data warehouse would contain a lot of data so fewer storage media could help data warehouse management process. 

2)A lot of sub-dimension could result in a lot of information. The higher point of view, the higher information will be gotten as needed.

3)It provides structured data which reduces the the problem of data integrity.

4)It uses small disk space because data are highly structured.

5) Normalized

6) Uses Complex joins

7) Occupies less space

 

DISADVANTAGES :

1)Query process would be complicated because it consists of many inter-correlated sub-dimension.

2) Loading data process should be managed an considered to prevent normal update or anomaly insert.

3) Snow-flaking reduces space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant.

4) Avoid snow-flaking or normalization of a dimension table, unless required and appropriate.

5)Do not snowflake hierarchies of one dimension table into separate tables. Hierarchies should belong to the dimension table only and should never be snowflake.

6) Multiple hierarchies can belong to the same dimension has been designed at the lowest possible detail. 

 Applications of Snowflake Schema:

·        Online Transaction Processing


In short order it was discovered that things other than database management could be done with IMS. Not only could IMS manage databases, but when coupled with a data communications (DC) component, IMS could do what is termed “online transaction processing.” This was a dramatic position for IBM. Prior to online transaction processing, the computer was able to enhance many business processes. But with the advent of online transaction processing, the computer could be woven into the day-to-day operations of the corporation. Never before had the computer been an essential ingredient to the running of the business. With online transaction processing, the computer took on a role of importance never before envisioned and what had been impossible. With online transaction processing, the organization was able to build reservation systems, such as airline car rental. With online transaction processing, there appeared online bank teller systems and ATMs. At this point IBM had a firm grip on the high ground of corporate processing.


·        Call Centre Information :

 

Most corporations have call centres. A call centre is a corporate function where the corporation staffs phone operators to have conversations with customers. With a call center, the consumer has a voice of the corporation with whom a conversation can be made. In many ways the call centre becomes the direct interface the consumer has to the corporation. The conversations that occur in the call centre are many and diverse:   

• Some people want to complain.

• Some people want to buy something.

• Some people want product information.

• Some people just want to talk.

There is then a wealth of information that transpires in the conversations that corporations with their customer or prospect base. So what does management of the corporation know about what takes place in their call centre? The answer is that management knows very little about what transpires in the call centre.


·        Medical Records :

       Call centre records are important and are at the centre of business value. But call centre records are hardly the only form of nonrepetitive records that are valuable. Another form of valuable nonrepetitive data is medical records. Medical records are written usually as a patient goes through a procedure or some event of medical care. The records, once written, are valuable to many people and organizations, including the physician, patient, hospital or provider, and research organizations. The challenge with medical records is that they contain narrative information. Narrative information is necessary and useful to the physician. But narrative information is not useful to the computer. In order to be used in analytical processing, the narrative information must be put into the form of a database in a standard database format. This is a classic case of nonrepetitive data being placed in the form of a database. What is needed is textual ETL. In order to see how textual ETL is used, consider a medical record.


 CONCLUSION :

Snowflake scheme has an advantage of efficiency in the use of storage media. Data modeling and transactional database system will ease in the data warehouse arrangement. Extract, transform, and loading (ETL) process has been done within 2 stages, namely data integrity testing and population process. In testing has been found data integrity issues because there was field foreign key that was inconsistency or null valued. So, to arrange data warehouse ought to eliminate noise data. In the ETL process, the amount of 489,119 data rows has been successfully populated into fact data. Reporting model has been arranged based on information needs using transact SQL commands and also employed as functional testing. Determining database object as source data should be concerned. In the next project, data warehouse development using Business Intelligence process will help to do analysis more over. And from that analysis could be used to build analysis service.




Authors :

Vishwakarma Institute of Technology, Pune.
Third-Year Students, Computer Department

  1. Kedar Uttarwar.
  2. Danish Tadvi.
  3. Amol Randhir.
  4. Sakshi Shinde



Comments

Popular posts from this blog

Timed Automata and it's Applications

Line coding and their types