Streaming Architecture Development – Points to ponder

What is a Data Stream

Any data that is continuously flowing

Main Stages of Data Stream Management

Collection – Analyze – Consume

Architecture Components

  1. Collection (Data Integration)
  2. Storage
  3. Stream Processing / Analysis
  4. Consumption
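
As a minimal sketch, the four components above can be pictured as a chain of Python generators, with each event flowing through collection, storage, processing and consumption in turn. The function names, the in-memory list standing in for storage, and the rolling-average analysis are illustrative assumptions, not part of any particular streaming product.

```python
from collections import deque
from typing import Iterable, Iterator, List


def collect(raw_events: Iterable[dict]) -> Iterator[dict]:
    """Collection: ingest events one at a time from a source."""
    for event in raw_events:
        yield event


def store(events: Iterable[dict], log: List[dict]) -> Iterator[dict]:
    """Storage: append each event to a log (a plain list here), then pass it on."""
    for event in events:
        log.append(event)
        yield event


def process(events: Iterable[dict], window: int = 3) -> Iterator[dict]:
    """Processing: emit a rolling average over the last `window` values."""
    recent = deque(maxlen=window)
    for event in events:
        recent.append(event["value"])
        yield {"ts": event["ts"], "rolling_avg": sum(recent) / len(recent)}


def consume(results: Iterable[dict]) -> List[dict]:
    """Consumption: materialize the processed stream for a downstream reader."""
    return list(results)


# Hypothetical source: four readings with increasing timestamps.
source = [{"ts": t, "value": v} for t, v in enumerate([10, 20, 30, 40])]
event_log: List[dict] = []
output = consume(process(store(collect(source), event_log)))
```

Because generators are lazy, each event passes through all four stages before the next one is read, which is the row-at-a-time behaviour a streaming pipeline aims for.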

Architecture Characteristics

  1. Data Stream Availability (time pattern aspect fixed/pattern/intermittent)
    1. Real-Time (microseconds to milliseconds)
    2. Near Real-Time (seconds to minutes)
    3. Mini-Batch (hourly)
    4. Immutable Data Stream
  2. Payload / Type of Data
    1. In Bytes / KBs
    2. In MBs (Images, files)
  3. Storage & Data Stream Consumption Access Pattern
    1. Storage
      1. Time oriented (time series)
      2. Shard / Partition support
      3. Consumer Re-playability / Multiple re-reads
    2. Consumption Access Pattern
      1. Point to Point Consumption – store data as-is, exactly as received from the source
      2. Multiple consumers – canonical view (JSON, AVRO)
      3. Latest state of a particular type of data stream (the type could be keyed by a primary key, a composite key, or the latest state of a whole data source) – structured streaming
  4. Security
    1. Row level security
    2. Field level security
    3. Data Source level security
  5. Data Stream Processing
    1. Row oriented processing
    2. Mini-Batch processing
    3. Incremental and continuous processing
    4. Stateful and Stateless data stream management
    5. Serverless
    6. Chaining data stream processing
    7. Infrastructure as Code support to kick-start Data Stream Processing
  6. Source Data Stream & Processed Output Data – Schema Management and Registry
  7. Scalability, Failover and Accessibility
    1. Fault-tolerant system
    2. Highly Available
    3. Distributed data storage management and data stream processing
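
Several of the storage and consumption characteristics above (partition support, consumer re-playability, and the latest state per key) can be illustrated with a small in-memory sketch. The `PartitionedLog` class and its method names are hypothetical and exist only for this example; a real system would add durability, replication and security on top.

```python
from collections import defaultdict
from zlib import crc32


class PartitionedLog:
    """Toy append-only log, split into partitions by a hash of the record key."""

    def __init__(self, num_partitions: int = 2):
        self.partitions = [[] for _ in range(num_partitions)]
        # (consumer name, partition index) -> next offset to read
        self.offsets = defaultdict(int)

    def append(self, key: str, value: dict) -> int:
        """Append a record; the same key always lands in the same partition."""
        partition = crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, consumer: str, partition: int) -> list:
        """Return the records this consumer has not read yet and advance its offset."""
        start = self.offsets[(consumer, partition)]
        records = self.partitions[partition][start:]
        self.offsets[(consumer, partition)] = start + len(records)
        return records

    def rewind(self, consumer: str, partition: int, offset: int = 0) -> None:
        """Re-playability: move a consumer back so it can re-read old records."""
        self.offsets[(consumer, partition)] = offset

    def latest_by_key(self) -> dict:
        """Latest state per key: later records overwrite earlier ones."""
        latest = {}
        for partition in self.partitions:
            for key, value in partition:
                latest[key] = value
        return latest


log = PartitionedLog(num_partitions=2)
p_first = log.append("sensor-1", {"temp": 20})
p_second = log.append("sensor-1", {"temp": 22})
log.append("sensor-2", {"temp": 5})

first_read = log.poll("dashboard", p_first)   # reads both sensor-1 records
second_read = log.poll("dashboard", p_first)  # nothing new to read
log.rewind("dashboard", p_first)
replayed = log.poll("dashboard", p_first)     # same records again
```

Records are never modified after being appended, which is what makes replay and multiple independent consumers possible: each consumer only tracks its own offset.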


TPC-DS Schema Quick Overview

Brief Introduction to TPC-DS

TPC-DS is a popular benchmarking tool. Running the Hadoop-based TPC-DS setup scripts creates and populates a data warehouse in the Hadoop environment. More about the benchmark is available on the TPC website.

It is essential to understand that a TPC-DS dataset represents a business that sells products through various channels like stores and the Internet. The dataset also contains business promotions. The TPC-DS dataset does not benchmark the operational systems.

Purpose of this post

This post provides a high-level view of the TPC-DS data warehouse schema. Though the TPC-DS schema documentation is available from its website, this article aims to highlight the important points of the schema and to serve as a quick reference guide.

Fundamentals of TPC-DS Schema

The fundamental aspects of the TPC-DS schema are:

  1. A multi-dimensional Snowflake schema (i.e. dimension tables are normalized to multiple tables)
  2. Contains Dimensions and Facts tables
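
To make the snowflake idea concrete, the sketch below builds a tiny fact table and a dimension that is normalized into two tables, then aggregates sales by joining through both. It uses Python's built-in `sqlite3` module; the table and column names follow TPC-DS naming conventions but are heavily trimmed, and the rows are made-up sample data, so treat it as an illustration rather than the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Outer dimension: address details are split out of the customer dimension.
CREATE TABLE customer_address (
    ca_address_sk INTEGER PRIMARY KEY,
    ca_state      TEXT
);
-- Inner dimension: customer points at its address row (the snowflake link).
CREATE TABLE customer (
    c_customer_sk     INTEGER PRIMARY KEY,
    c_current_addr_sk INTEGER REFERENCES customer_address(ca_address_sk)
);
-- Fact table: one row per sale in the store channel.
CREATE TABLE store_sales (
    ss_customer_sk INTEGER REFERENCES customer(c_customer_sk),
    ss_net_paid    REAL
);
INSERT INTO customer_address VALUES (1, 'CA'), (2, 'TX');
INSERT INTO customer VALUES (10, 1), (11, 2), (12, 1);
INSERT INTO store_sales VALUES (10, 100.0), (11, 40.0), (12, 60.0), (10, 25.0);
""")

# Fact -> dimension -> normalized sub-dimension: two joins instead of one.
rows = conn.execute("""
    SELECT ca.ca_state, SUM(ss.ss_net_paid) AS total_paid
    FROM store_sales ss
    JOIN customer c          ON ss.ss_customer_sk = c.c_customer_sk
    JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
    GROUP BY ca.ca_state
    ORDER BY ca.ca_state
""").fetchall()
print(rows)
```

In a pure star schema the state would sit directly on the customer dimension and a single join would be enough; the extra hop through `customer_address` is what the normalization in a snowflake schema costs at query time.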

Quick understanding of TPC-DS Facts Tables

Quick Byte: Remember "Sales" and "Returns" keywords
Quick Byte: Remember "Channels" - Stores and the Internet

To understand the TPC-DS schema tables, remember the image below.

TPC-DS Schema Quick Understanding Mindmap

Dimensions Tables

Facts Tables

Source Tables Schema