Data Warehouse Concepts - Data Warehouse Design
A data warehouse is a set of databases designed for query and analysis rather than for transaction processing. In a nutshell, data is brought into the data warehouse and transformed into information, which is then published to a set of subscribers. The data can come from a number of sources in a number of different formats; the data warehouse should import and combine it. Data quality issues should be addressed to promote trust in the published information.
Through the years Leopard has built many data warehouses for large corporations as well as SMEs, using a variety of technologies.
While there is no “one size fits all”, a common approach has ensured a “fit for purpose” final product. This document serves to explain some of the lessons learned and basic approach to building the warehouse.
~ Robust ~
The word “Robust” could mean many things depending on the reader’s point of view. Suffice it to say that the data warehouse should be:
- Free from corruption
- Performing adequately (i.e. responsive)
Robustness is affected by many things, but these three stand out:
- Technology (not covered in this paper, it is assumed the reader makes a wise choice)
- Hosting (not covered in this paper, it is assumed the reader makes a wise choice)
- Solution design - which is what this paper is all about
~ Secure ~
Security requires almost no explanation. Data should be secure from deliberate hacking as well as from normal reporting (i.e. a user running a report should not see data that should not be visible to them).
Sensitive data should be ring-fenced in two ways:
- Incoming feeds should not share a staging database with other data providers if at all possible.
- Published information should be secured in a variety of ways
In addition to the above, the data warehouse should have a method to log what was viewed, by whom, when and from where. Direct access by end users to tables is very rarely allowed; usually access is via some form of view or stored procedure.
Most data warehouses require the ability to enter certain data. This could be as elementary as a simple mapping table between data sources or a master table of countries. Data changes should be logged to promote trust and prevent malicious activity.
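The two points above, controlled access via views and logged data changes, can be sketched with an in-memory sqlite3 database. Every table, column, user and host name here is an illustrative assumption, not a prescribed schema:

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative schema: a simple mapping table plus an audit log recording
# what changed, by whom, when and from where (all names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country_map (source_code TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE change_log (
        changed_at TEXT, changed_by TEXT, source_host TEXT,
        table_name TEXT, key_value TEXT, old_value TEXT, new_value TEXT
    );
""")

def upsert_mapping(code, country, user, host):
    """Apply a mapping change and log who changed what, when and from where."""
    row = conn.execute(
        "SELECT country FROM country_map WHERE source_code = ?", (code,)
    ).fetchone()
    old = row[0] if row else None
    conn.execute("INSERT OR REPLACE INTO country_map VALUES (?, ?)",
                 (code, country))
    conn.execute(
        "INSERT INTO change_log VALUES (?, ?, ?, 'country_map', ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), user, host, code, old, country))
    conn.commit()

upsert_mapping("ZA", "South Africa", "alice", "10.0.0.5")
upsert_mapping("ZA", "RSA", "bob", "10.0.0.9")
```

In a real warehouse the same pattern would typically live in stored procedures or triggers, and end users would query views rather than the underlying tables.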
~ Flexible ~
It should be possible to make changes to the data warehouse in such a way as to keep disruption to a minimum and to ensure that existing information, data and functionality are not compromised.
It should also be possible to allow for flexibility in terms of providing different ways of getting to the information. Self-serve reporting environments, for example, offer a good return on investment. Specialist data analysts should also have access to information using their tool of choice.
~ Data Quality Assurance ~
Data quality deserves a mention because it is probably the main reason why subscribers stop using the information provided by the data warehouse.
Data quality is often neglected during project planning and implementation, and is covered only as an afterthought at some point in the project. It should be possible to:
- Understand how data was processed and aggregated (the unknown causes distrust, especially when the final output disagrees with the water-cooler discussions)
- Measure and report on data quality KPIs. Trust in results may be significantly increased if the KPIs show an improving trend over time
These are the main issues faced in terms of data quality:
- Missing feed. This is where a feed is entirely absent. In such cases it may be necessary to halt the publishing of any information.
- Too many rows. If a feed usually contains 3256 rows and then out of the blue there are 756586 rows, there may be an issue.
- Too few rows. This is self-explanatory.
- Increase in orphan data rows beyond an acceptable level. This may require manual intervention, such as an entry into a mapping table, or fixing and re-submitting the feed.
- Unacceptably high or low values of specific metrics. Some feeds should be rejected when certain values are outside the normal range. The limits could be based on existing data and/or entered by an administrator.
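The checks above can be sketched as a simple gate function run before publishing. The field names, thresholds and default ranges below are illustrative assumptions, not fixed rules:

```python
def check_feed(rows, expected_rows, known_customers,
               tolerance=0.5, metric_range=(0.0, 1_000_000.0),
               max_orphan_ratio=0.05):
    """Return a list of data-quality issues for one incoming feed.

    rows            -- list of dicts, e.g. [{"customer_id": "C1", "amount": 120.0}]
    expected_rows   -- typical row count for this feed (e.g. a trailing average)
    known_customers -- keys present in the mapping/master table
    """
    # Missing feed: halt publishing entirely.
    if not rows:
        return ["missing feed: halt publishing"]
    issues = []
    n = len(rows)
    # Too many / too few rows relative to the usual volume.
    if n > expected_rows * (1 + tolerance):
        issues.append(f"too many rows: {n} vs expected ~{expected_rows}")
    elif n < expected_rows * (1 - tolerance):
        issues.append(f"too few rows: {n} vs expected ~{expected_rows}")
    # Orphan rows: references that do not resolve against the mapping table.
    orphans = [r for r in rows if r["customer_id"] not in known_customers]
    if len(orphans) / n > max_orphan_ratio:
        issues.append(f"orphan rows above threshold: {len(orphans)} of {n}")
    # Metric values outside the acceptable band.
    low, high = metric_range
    out_of_range = [r for r in rows if not low <= r["amount"] <= high]
    if out_of_range:
        issues.append(f"{len(out_of_range)} rows with 'amount' outside [{low}, {high}]")
    return issues

feed = [{"customer_id": "C1", "amount": 120.0},
        {"customer_id": "C9", "amount": 5.0}]
print(check_feed(feed, expected_rows=2, known_customers={"C1"}))
```

In practice `expected_rows` and the metric limits would come from existing data and/or limits entered by an administrator, as described above.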
It may be necessary to publish data quality information separately from the normal information.
Some data quality metrics, such as when the data was last updated, would be published via the portal.
One Version of the Truth
Although “one version of the truth” is really about data quality, it is so important that it deserves a separate mention. Nothing reduces trust in information like multiple versions of the truth. The challenge is that different versions of the truth could be published by other systems.
~ Fit for Purpose ~
This is somewhat of a “catch-all” phrase, but it should be mentioned. The success (or failure) of a data warehouse project is often measured in terms of time and money. This article suggests that the main criterion for measuring the success of a data project is whether it is fit for purpose, in terms of data and information output:
- Does it provide the information required?
- Does it provide it on time?
- Is the information trusted?
- Is the data and information secure?
The Construction of the Data Warehouse
This article describes basic concepts for simple, average and complex data warehouses.
NOTE: complexity should be driven by requirements; a complex design should be avoided in favour of a “simple” one wherever possible.
Simple Data Warehouse
This is really a single-database solution. Everything goes into that one database, and either there is no sensitive data or the providers and subscribers have (or should have) access to “everything”.
Average or “Standard” Data Warehouse
Complexity has increased from the simple data warehouse and there could be a number of data providers and subscribers. Data providers could be ring-fenced in terms of security.
The standard data warehouse has three databases:
~ Staging ~
This is where data comes in from a number of different sources. It is possible to increase complexity by using multiple staging databases in order to stop some data providers from seeing the data fed from other data providers. Staging databases need to be located where they are accessible to the data providers. This could be inside or outside the firewall.
~ Foundation ~
This is where the data is processed and transformed. In many ways, this is where all the action takes place. Often this database would sit within the firewall because it may contain very sensitive data.
~ Publish ~
It is from this database that information is published to subscribers (i.e. this is where their reports draw the data from). Only when information is ready, in terms of availability and data quality, is it made available to subscribers in the publish database. In organisations with complex firewalls this database might be located on a different server outside the firewall (i.e. on the web).
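Assuming three separate databases, the staging-to-publish flow can be sketched with sqlite3. The table names, sample feed and the deliberately simplified quality gate are all illustrative assumptions:

```python
import sqlite3

# Three separate databases: staging receives raw feeds, foundation transforms,
# publish holds only quality-assured information for subscribers.
staging = sqlite3.connect(":memory:")
foundation = sqlite3.connect(":memory:")
publish = sqlite3.connect(":memory:")

# Staging: raw data arrives from the providers.
staging.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
staging.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                    [("North", 100.0), ("North", 50.0), ("South", 75.0)])

# Foundation: process and transform (here, aggregate by region).
foundation.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")
totals = staging.execute(
    "SELECT region, SUM(amount) FROM raw_sales GROUP BY region").fetchall()
foundation.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals)

# Publish: promote only once a (simplified) quality gate passes.
row_count = foundation.execute(
    "SELECT COUNT(*) FROM sales_by_region").fetchone()[0]
if row_count > 0:  # stand-in for the real data-quality checks
    publish.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")
    publish.executemany(
        "INSERT INTO sales_by_region VALUES (?, ?)",
        foundation.execute("SELECT region, total FROM sales_by_region"))
```

Subscribers would only ever query the publish database; the staging and foundation databases stay behind the firewall.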
The Complex / Big Data Warehouse
Complexity has increased yet again. There could be a number of data providers and subscribers, and data providers are ring-fenced in terms of security.
The complex or “Big” data warehouse could have any number of databases, depending on the complexity of the incoming data and published information (and data).
Incoming feeds could be from any source in any format. Generally a push approach is advised over pull: “going into” the other system to fetch the data should be avoided where possible.
~ Staging ~
This is where data comes in from different sources.
Staging databases need to be located where they are accessible to the data providers. This could be inside or outside the firewall.
Complex data warehouses cater for sensitive and non-sensitive incoming feeds, and there could be as many staging databases as required. Multiple staging databases are implemented in order to stop some data providers from seeing the data fed by other data providers. Although this can be achieved in other ways, multiple databases are simple and easy to administer.
In the diagram (above) there are two staging databases. One holds general data, which could be provided by multiple providers; the other holds sensitive data, which would probably come from a single provider.
~ Foundation ~
Generally only one foundation database is required. It is in this database that processing and transformation takes place.
Data is linked and quality assurance handled.
Data could be pre-processed and pre-aggregated. Links are often written permanently into tables to improve performance.
Indexes are constructed to improve performance of the transformation process.
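A small sqlite3 sketch of both techniques: indexing the join key to speed up the transformation, then writing the resolved link permanently into a pre-aggregated table. All table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (customer_code TEXT, amount REAL);
    CREATE TABLE customer_map (customer_code TEXT, customer_id INTEGER);
""")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("A1", 10.0), ("A1", 20.0), ("B2", 5.0)])
conn.executemany("INSERT INTO customer_map VALUES (?, ?)",
                 [("A1", 1), ("B2", 2)])

# Index the join key so the transformation itself runs faster.
conn.execute("CREATE INDEX ix_map_code ON customer_map(customer_code)")

# Write the resolved link permanently into a pre-aggregated table, so
# subscribers never pay the join or aggregation cost at query time.
conn.execute("""
    CREATE TABLE sales_by_customer AS
    SELECT m.customer_id, SUM(s.amount) AS total
    FROM sales s JOIN customer_map m USING (customer_code)
    GROUP BY m.customer_id
""")
```

The trade-off is storage and refresh cost in the foundation database in exchange for predictable query performance downstream.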
It is in the foundation database that data quality is handled. There may be a requirement for editors to maintain mapping tables, specific methods of aggregation, etc. Editors could be developed using a number of different technologies, the preferred being some form of web page.
~ Publish ~
It is from this database that information is published to subscribers. Only once the data (and information) is ready can it be published.
If there are multiple groups of subscribers the information could be placed in multiple publish databases. For example, there could be two groups of subscribers:
- Those allowed to see only basic sales and customer information
- Those allowed to see everything including bill of material and supplier information
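One way to realise such a split is to load each publish database only with the information sets its subscriber group is cleared for. The group names, datasets and policy below are illustrative assumptions:

```python
import sqlite3

# Two publish databases, one per subscriber group.
publishes = {"basic": sqlite3.connect(":memory:"),
             "full": sqlite3.connect(":memory:")}

# Which information sets each group may see (assumed policy).
allowed = {"basic": {"sales", "customers"},
           "full": {"sales", "customers", "suppliers"}}

# Illustrative datasets ready in the foundation database.
datasets = {
    "sales": ("CREATE TABLE sales (product TEXT, qty INTEGER)",
              "INSERT INTO sales VALUES ('widget', 3)"),
    "customers": ("CREATE TABLE customers (name TEXT)",
                  "INSERT INTO customers VALUES ('ACME')"),
    # Sensitive: bill-of-material / supplier information.
    "suppliers": ("CREATE TABLE suppliers (name TEXT)",
                  "INSERT INTO suppliers VALUES ('BoltCo')"),
}

# Load each publish database with only its permitted datasets.
for group, conn in publishes.items():
    for name in allowed[group]:
        ddl, dml = datasets[name]
        conn.execute(ddl)
        conn.execute(dml)
```

Because the restricted tables simply never exist in the basic publish database, even analysts with direct database access cannot reach them.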
It would be possible to limit access in some front ends by employing basic security (such as in a web page), but what if the subscribers include analysts using various analytical tools who require direct access to the database?
~ Subscribers ~
In this context a subscriber is a person or system that “reads” the information provided by the data warehouse. Many front ends could be considered; the example above includes Excel, Business Objects and various web browsers. The subscribers should also include outgoing feeds to other systems.