TPC-DS Schema Quick Overview

Brief Introduction to TPC-DS

TPC-DS is a widely used decision-support benchmark. Running the Hadoop-based TPC-DS setup scripts creates and populates a data warehouse in the Hadoop environment. More information about the benchmark is available at http://www.tpc.org/tpcds/

A TPC-DS dataset represents a retail business that sells products through several channels: stores, catalogs, and the Internet. The dataset also contains business promotions. TPC-DS models the decision-support side of this business; it does not benchmark the operational systems.

Purpose of this post

This post provides a high-level view of the TPC-DS data warehouse schema. Although the schema documentation is available from the TPC website, this article highlights the important points of the schema and serves as a quick reference guide.

Fundamentals of TPC-DS Schema

The following are the fundamental aspects of the TPC-DS schema.

  1. It is a multi-dimensional snowflake schema (i.e. dimension tables are normalized into multiple tables)
  2. It contains dimension tables and fact tables
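
To make the snowflake idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It uses three TPC-DS table names (store_sales, customer, customer_address) with only a small subset of their columns; the rows are made-up toy data, and the real tables have many more columns. The point is only to show a fact table reaching a normalized dimension (customer_address) through another dimension (customer).

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column subset only. Rows are invented purely for illustration.
conn.executescript("""
    CREATE TABLE customer_address (
        ca_address_sk INTEGER PRIMARY KEY,
        ca_state      TEXT
    );
    CREATE TABLE customer (
        c_customer_sk     INTEGER PRIMARY KEY,
        c_current_addr_sk INTEGER REFERENCES customer_address(ca_address_sk)
    );
    CREATE TABLE store_sales (
        ss_customer_sk INTEGER REFERENCES customer(c_customer_sk),
        ss_net_paid    REAL
    );
    INSERT INTO customer_address VALUES (1, 'CA'), (2, 'TX');
    INSERT INTO customer VALUES (10, 1), (11, 2), (12, 1);
    INSERT INTO store_sales VALUES (10, 20.0), (11, 5.0), (12, 7.5), (10, 2.5);
""")

# Fact -> dimension -> normalized sub-dimension: the "snowflake" hop.
rows = conn.execute("""
    SELECT ca.ca_state, SUM(ss.ss_net_paid) AS total_paid
    FROM store_sales ss
    JOIN customer c          ON ss.ss_customer_sk   = c.c_customer_sk
    JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
    GROUP BY ca.ca_state
    ORDER BY ca.ca_state
""").fetchall()

print(rows)
```

In a pure star schema the state would sit directly on the customer dimension; here it takes a second join to reach it, which is what "normalized into multiple tables" means in practice.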

Quick understanding of TPC-DS Facts Tables

Quick Byte: Remember "Sales" and "Returns" keywords
Quick Byte: Remember "Channels" - stores, catalogs, and the Internet (web)
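
The two keywords combine into the fact table names. As a small sketch of that mnemonic: the TPC-DS specification defines three sales channels (store, catalog, web), each with a sales table and a returns table, plus a separate inventory fact table. The function name below is a hypothetical helper, not part of any TPC-DS tooling.

```python
from itertools import product

# Channels and events as used in TPC-DS fact table names.
CHANNELS = ("store", "catalog", "web")
EVENTS = ("sales", "returns")

def fact_table_names():
    """Build the fact table names from channel x event, plus inventory."""
    names = [f"{channel}_{event}" for channel, event in product(CHANNELS, EVENTS)]
    names.append("inventory")  # the one fact table outside the channel pattern
    return names

print(fact_table_names())
```

Six of the seven fact tables follow the channel_event pattern, so remembering the channels and the two keywords is enough to recall almost all of them.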

To understand the TPC-DS schema tables, remember the image below.

TPC-DS Schema Quick Understanding Mindmap

Dimensions Tables

Facts Tables

Source Tables Schema