Data warehouse modelling with Snowflake Schema
ABSTRACT:
Snowflake is a data warehouse schema design in which the dimension tables of a star schema are further normalized. The snowflake schema is generally not recommended because of the performance overhead of joining the normalized dimension tables. However, it can be extended to improve performance for business analysis activities. In business analysis, two environments are complementary and work together to provide effective business analytics. First, the data warehouse environment transforms operational data into information. Second, the analytical environment delivers that information to end users for further analysis and decision making. The snowflake schema bridges the gap between the two environments: it facilitates the mapping of wide dimension structures with many attributes onto analytical processing hierarchies, makes navigation along hierarchies easier, and supports flexible analysis such as drill-down and roll-up. The process of turning a transactional database into a data warehouse can use several schemas, one of which is the snowflake schema. The snowflake schema aims to yield rich data that serves as material for a decision-making system; what distinguishes it from other schemas is that it splits the data around the fact table into many dimension tables. The next step is the extraction, transformation, and loading (ETL) process. The population stage of ETL plays an important role because it determines the data that will be loaded into each dimension table. To guarantee that the populated data has good integrity, data cleansing should be performed, i.e. noisy data should be removed. Successful population and loading yields data that benefits the managerial level in decision making.
Keywords: business analytics.
INTRODUCTION TO SNOWFLAKE SCHEMA:
Fig 1a. Data warehouse with a snowflake schema
In the information world, the most important asset is data. The data an organization holds is used as material to determine policies, build strategies, and make decisions. The available data is then processed into the information the organization needs. Data processing can be done at every stage, e.g. in the operational database, in operational applications, or through data warehouse technology. Much of the data stored in an operational database is ineffective when used directly as material for mining managerial information. A campus library, for example, is expected to have a data warehouse so that the process of extracting information for management does not interfere with the transactions being made at any time. Utilization of a data warehouse is basically a process that summarizes data stored in the database system over a certain time period, rather than continuously. A data warehouse is also required because transaction volume will grow as student numbers increase each year. Supervisors need summarized data about the transactions occurring in their departments, known as a transaction processing system (TPS).
At the management information system level, the information produced by the TPS is usually summarized for supervisors. At a higher level, information should be presented in a way that supports the decision-making process; this level is known as decision support systems. The highest-level information must be available for the library manager's needs, with the summarized information presented at an appropriate level of detail through a mechanism known as executive information systems.
History of Snowflake Schema:
A schema is a logical description of the entire database. It includes the name and description of all record types, including their associated data items and aggregates. Much like a database, a data warehouse also requires a schema. A database uses the relational model, while a data warehouse uses the star, snowflake, or fact constellation schema. In this chapter, we will discuss the schemas used in a data warehouse.
Star Schema
· Each dimension in a star schema is represented with only one dimension table.
· This dimension table contains the set of attributes.
· The following diagram shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
· There is a fact table at the centre. It contains the keys to each of the four dimensions.
· The fact table also contains the measures, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia, so the entries for these cities repeat the same values in the attributes province_or_state and country.
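As a concrete sketch, the star schema described above can be reproduced in SQLite; the table and column names follow the note, while the street addresses are invented for illustration:

```python
import sqlite3

# In-memory database for a minimal star-schema sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One denormalized dimension table per dimension (star schema);
# only the location dimension is shown in full here.
cur.execute("""CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT)""")
cur.execute("""CREATE TABLE fact_sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER,
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL, units_sold INTEGER)""")

# "Vancouver" and "Victoria" both repeat the province and country
# values: this is the redundancy the note describes.
cur.executemany(
    "INSERT INTO dim_location VALUES (?, ?, ?, ?, ?)",
    [(1, "120 Main St", "Vancouver", "British Columbia", "Canada"),
     (2, "45 Bay St", "Victoria", "British Columbia", "Canada")])
conn.commit()
```

Because the location dimension is a single denormalized table, "British Columbia"/"Canada" is stored once per city row.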
Table 1: Difference between star and snowflake schema
What is snowflaking?
The snowflake design is the result of further expanding and normalizing the dimension tables. In other words, a dimension table is said to be snowflaked if its low-cardinality attributes have been moved into separate normalized tables. These tables are then joined to the original dimension table with referential constraints (foreign key constraints). Generally, snowflaking is not recommended for dimension tables, as it hampers the understandability and performance of the dimensional model: more tables must be joined to satisfy queries.
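Continuing the location example, a snowflaked version might split the low-cardinality attributes out into their own normalized tables. A minimal SQLite sketch (key and table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Snowflaked location dimension: the low-cardinality attributes
# (province, country) live in their own normalized tables,
# joined back via foreign keys.
cur.executescript("""
CREATE TABLE dim_province (
    province_key INTEGER PRIMARY KEY,
    province_or_state TEXT, country TEXT);
CREATE TABLE dim_city (
    city_key INTEGER PRIMARY KEY,
    city TEXT,
    province_key INTEGER REFERENCES dim_province(province_key));
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES dim_city(city_key));
""")

# "British Columbia"/"Canada" is now stored once and referenced twice.
cur.execute("INSERT INTO dim_province VALUES (1, 'British Columbia', 'Canada')")
cur.executemany("INSERT INTO dim_city VALUES (?, ?, ?)",
                [(1, "Vancouver", 1), (2, "Victoria", 1)])
cur.executemany("INSERT INTO dim_location VALUES (?, ?, ?)",
                [(1, "120 Main St", 1), (2, "45 Bay St", 2)])

# The cost: reassembling the flat dimension now takes two extra joins.
flat = cur.execute("""
    SELECT l.street, c.city, p.province_or_state, p.country
    FROM dim_location l
    JOIN dim_city c ON l.city_key = c.city_key
    JOIN dim_province p ON c.province_key = p.province_key
    ORDER BY l.location_key""").fetchall()
```

The extra joins needed to rebuild the flat dimension are exactly the performance and understandability cost described above.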
Characteristics of Snowflake Schema:
The dimensional model is snowflaked under the following conditions:
· The snowflake schema uses less disk space.
· It is easier to implement when a new dimension is added to the schema.
· Because there are multiple tables, query performance is reduced.
· The dimension table consists of two or more sets of attributes which define information at different grains.
· The sets of attributes of the same dimension table are populated by different source systems.
Structure of a Data Warehouse:
A data warehouse, also known as an enterprise data warehouse, is a set of subject-oriented, integrated, non-updateable, time-variant historical data that supports decision making. The data warehouse is built to prevent the technical and business issues that arise when operational data and information are used directly to make decisions. It is also a system that periodically extracts and consolidates data from source systems into dimension tables or normalized storage. Generally, the data is refreshed at intervals, not after every transaction in the source system. The data warehouse is used for reporting and data analysis to derive knowledge.
II) SNOWFLAKE SCHEMA:
The snowflake schema method is a data modelling method applied to the data warehouse. It was developed from the star schema method, and the two involve a trade-off in implementation. The snowflake schema combines star schema modelling with normalization: in a snowflake schema, each dimension table may have further dimension sub-tables. Each dimension holds a subject of information that serves as material for decision making, because each dimension can generate a more detailed description. In this way the source data can be processed into a larger amount of more detailed information. The snowflake schema is shown in the figure.
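The flexible analysis this structure supports can be sketched with a toy sales fact and a snowflaked city → province hierarchy (the names and figures below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_province (province_key INTEGER PRIMARY KEY, province TEXT);
CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city TEXT,
                       province_key INTEGER REFERENCES dim_province(province_key));
CREATE TABLE fact_sales (city_key INTEGER, dollars_sold REAL);
INSERT INTO dim_province VALUES (1, 'British Columbia');
INSERT INTO dim_city VALUES (1, 'Vancouver', 1), (2, 'Victoria', 1);
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 25.0);
""")

# Drill-down: totals at the finer city grain.
by_city = conn.execute("""
    SELECT c.city, SUM(f.dollars_sold)
    FROM fact_sales f JOIN dim_city c ON f.city_key = c.city_key
    GROUP BY c.city ORDER BY c.city""").fetchall()

# Roll-up: the same facts aggregated one hierarchy level higher,
# navigating the snowflaked city -> province tables.
by_province = conn.execute("""
    SELECT p.province, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_city c ON f.city_key = c.city_key
    JOIN dim_province p ON c.province_key = p.province_key
    GROUP BY p.province""").fetchall()
```

Drilling down and rolling up amount to grouping the same fact rows at different levels of the snowflaked hierarchy.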
III) ETL:
Extraction, transformation, and loading (ETL) is the process of pulling data out of the source system and placing it into the data warehouse. While managing the data, it should be filtered and corrected in order to obtain good data quality, and invalid data should be cleaned out. In the traditional approach to ETL, data is retrieved from a data source and placed in a staging area, then transformed and loaded into the data warehouse.
Fig 3b. ETL process in detail
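A minimal illustration of the staged extract → transform → load flow described above, assuming a hypothetical library loans table as the source system (the cleansing step drops rows whose foreign keys are null, the integrity issue mentioned in the conclusion):

```python
import sqlite3

# Source (operational) and target (warehouse) databases;
# an in-memory sketch of a traditional staged ETL flow.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE loans (loan_id INTEGER, member_id INTEGER, book_id INTEGER);
INSERT INTO loans VALUES (1, 10, 100), (2, NULL, 101), (3, 11, 102);
""")

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_loans (loan_id INTEGER, member_key INTEGER, book_key INTEGER)")

# Extract: pull rows out of the source system into a staging area.
staging = source.execute("SELECT loan_id, member_id, book_id FROM loans").fetchall()

# Transform / cleanse: drop noisy rows whose foreign keys are null,
# so only rows with good integrity are populated.
clean = [row for row in staging if None not in row]

# Load: populate the fact table in the warehouse.
warehouse.executemany("INSERT INTO fact_loans VALUES (?, ?, ?)", clean)
warehouse.commit()
```

In a real pipeline the staging area would be a separate database or file store rather than an in-memory list, but the three stages are the same.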
IMPLEMENTATION:
Snowflaking is worthwhile for large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions. It also suits financial product dimensions for banks, brokerage houses, and insurance companies, because each individual product has a host of special attributes not shared by other products, and multi-enterprise calendar dimensions, because each organization has idiosyncratic fiscal periods, seasons, and holidays.
Some advantages and disadvantages of the snowflake schema method can be summarized as follows:
ADVANTAGES:
1) The dimension table is divided into several sub-dimensions, which makes storage more efficient. This is very useful because a data warehouse contains a lot of data, so needing less storage media eases data warehouse management.
2) Many sub-dimensions can yield a lot of information: the higher the point of view, the more information can be obtained as needed.
3) It provides structured data, which reduces the problem of data integrity.
4) It uses little disk space because the data is highly structured.
5) The dimension tables are normalized.
6) Queries use complex joins.
7) It occupies less space.
DISADVANTAGES:
1) Query processing can be complicated because it involves many inter-correlated sub-dimensions.
2) The data loading process must be managed and considered carefully to prevent update or insert anomalies.
3) Snowflaking reduces the space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant.
4) Avoid snowflaking or normalization of a dimension table unless it is required and appropriate.
5) Do not snowflake the hierarchies of one dimension table into separate tables. Hierarchies should belong to the dimension table only and should never be snowflaked.
6) Multiple hierarchies can belong to the same dimension if the dimension has been designed at the lowest possible level of detail.
Applications of Snowflake Schema:
· Online Transaction Processing
In short order it was discovered that things other than database management could be done with IMS. Not only could IMS manage databases, but when coupled with a data communications (DC) component, IMS could do what is termed "online transaction processing." This was a dramatic position for IBM. Prior to online transaction processing, the computer was able to enhance many business processes, but with its advent the computer could be woven into the day-to-day operations of the corporation. Never before had the computer been an essential ingredient in the running of the business. With online transaction processing, the computer took on a role of importance never before envisioned, making possible what had previously been impossible. Organizations were able to build reservation systems, such as those for airlines and car rentals, and online bank teller systems and ATMs appeared. At this point IBM had a firm grip on the high ground of corporate processing.
· Call Centre Information:
Most corporations have call centres. A call centre is a corporate function where the corporation staffs phone operators to have conversations with customers. With a call centre, the consumer has a voice at the corporation with whom a conversation can be held. In many ways the call centre becomes the consumer's most direct interface to the corporation. The conversations that occur in the call centre are many and diverse:
• Some people want to
complain.
• Some people want to
buy something.
• Some people want
product information.
• Some people just
want to talk.
There is, then, a wealth of information in the conversations that corporations have with their customer or prospect base. So what does the management of the corporation know about what takes place in its call centre? The answer is that management knows very little about what transpires there.
· Medical Records :
Call centre records are important and are at the centre of business value, but they are hardly the only form of nonrepetitive records that are valuable. Another form of valuable nonrepetitive data is medical records. Medical records are usually written as a patient goes through a procedure or some event of medical care. Once written, the records are valuable to many people and organizations, including the physician, the patient, the hospital or provider, and research organizations. The challenge with medical records is that they contain narrative information. Narrative information is necessary and useful to the physician, but it is not useful to the computer. In order to be used in analytical processing, the narrative information must be put into a standard database format. This is a classic case of nonrepetitive data being placed in the form of a database, and what is needed is textual ETL. To see how textual ETL is used, consider a medical record.
CONCLUSION:
The snowflake schema has the advantage of efficient use of storage media. Data modelling of the transactional database system eases the data warehouse arrangement. The extract, transform, and load (ETL) process was done in two stages, namely data integrity testing and the population process. Testing uncovered data integrity issues because some foreign key fields were inconsistent or null; therefore, to arrange the data warehouse, this noisy data had to be eliminated. In the ETL process, 489,119 data rows were successfully populated into the fact data. A reporting model was arranged based on information needs using Transact-SQL commands, which also served as functional testing. The choice of database objects as source data should be considered carefully. In a future project, data warehouse development using a Business Intelligence process will help to perform further analysis, and that analysis could be used to build an analysis service.
Authors :
Third-Year Students, Computer Department