Data Store Scaling: Databases, Sharding, and Consistent Hashing


What Data Does an E-commerce System Store?

In a comprehensive e-commerce platform like Amazon, various types of data are necessary for its functioning and to provide users with an optimal shopping experience:

  1. User/Customer Data: This includes information like names, addresses, and payment details. It forms the core of personalizing the user experience.
  2. Product Data: Details like product titles, descriptions, images, and prices form a critical part of the shopping journey.
  3. Order Data: Data on order IDs, product IDs, invoice details, item quantities, and delivery/shipping details helps in tracking and managing purchases.
  4. Inventory Data: Information on stock levels, prices, quantities, and availability ensures accurate product listings and facilitates effective inventory management.
  5. User Interaction Data: Capturing user search queries, browsing behavior, and reviews helps improve the user experience and supports informed business decisions.

How Do We Choose the Data Store?

Choosing the right data store is a critical aspect of designing an effective e-commerce system. Here are the key factors to consider:

  1. Data Model: Consider the structure and complexity of your data. A relational database may be a good fit for structured data, whereas NoSQL databases can handle more complex, unstructured data better.
  2. Scalability: Look at the system's throughput, latency, and volume requirements. Will your data store be able to scale as these grow?
  3. Consistency: Consistency is crucial for maintaining accurate and current data. Choose a data store that provides the right level of consistency for your needs.
  4. Availability: Your data store should be highly available to prevent any downtime that could impact the user experience.
  5. Cost: Keep in mind the costs associated with setting up and maintaining the data store.
  6. Performance Requirements: Identify whether your system is read-heavy or write-heavy, and select a data store that performs best under those conditions.
  7. Data Security & Compliance: Your data store must meet all necessary security standards and compliance requirements.
  8. Community and Support: The availability of robust community support can be invaluable when troubleshooting issues.
  9. Future Requirements: Always keep an eye on the future. As your system evolves, so will your data storage needs.

Different Types of Data Stores

| Type | Subtype | Description | Examples |
| --- | --- | --- | --- |
| Database | Relational | Organizes data into tables. Good for structured data and complex queries. | MySQL, PostgreSQL, OracleDB, SQLite, MariaDB |
| Database | Non-Relational: Document Stores | Stores data as documents. Ideal for storing semi-structured data. | MongoDB, CouchDB, RavenDB |
| Database | Non-Relational: Key-Value Stores | Stores data as a collection of key-value pairs. Very fast and simple. | Redis, DynamoDB, Riak |
| Database | Non-Relational: Columnar Databases | Stores data by columns instead of rows. Ideal for analytics and big data. | Apache Cassandra, HBase |
| Database | Non-Relational: Graph Databases | Stores data as nodes and edges. Suitable for interconnected data. | Neo4j, Amazon Neptune |
| Database | Non-Relational: Time Series Databases | Optimized for time-stamped data. Suitable for IoT, telemetry, etc. | InfluxDB, OpenTSDB |
| Data Store | In-Memory Databases | Store all their data in the main memory (RAM) of the server, offering very high speed and low latency; ideal for caching and real-time applications. | Redis, Memcached |
| Data Store | Distributed File Systems | File systems that store data across multiple machines but expose it as if it were on one; great for big data applications. | Hadoop Distributed File System (HDFS), Google File System (GFS) |
| Data Store | Content Delivery Networks (CDN) | Globally distributed networks of servers that provide fast delivery of internet content by caching static resources closer to users. | Cloudflare, Akamai, Amazon CloudFront |
| Data Store | Data Warehouses | Large repositories of data collected from different sources, designed to support business intelligence activities, particularly analytics; data is consolidated, transformed, and stored at a granular level. | Google BigQuery, Amazon Redshift, Snowflake |
| Data Store | Message Brokers | Receive incoming data from applications and route it to other applications for processing; act as a buffer for incoming data and help manage large data volumes where delivery time is not critical. | Apache Kafka, RabbitMQ, Google Pub/Sub |

SQL vs NoSQL

| Attribute | SQL Databases | NoSQL Databases |
| --- | --- | --- |
| Data Model | Relational; data is structured and organized into tables. | Non-relational; can handle structured, semi-structured, and unstructured data. |
| Consistency | ACID-compliant, with strong consistency. | May offer strong consistency or eventual consistency, often relying on asynchronous replication. |
| Schema | Fixed schema; data must adhere to the defined structure. | Schema-less; flexible enough to store diverse data models (key-value, document, column, graph). |
| Scalability | Scales vertically by increasing the horsepower (CPU, RAM, SSD) of an existing server. | Scales horizontally by adding more servers to handle more traffic. |
| Transactions | Supports ACID (Atomicity, Consistency, Isolation, Durability) transactions. | Many do not support ACID transactions (exceptions such as MongoDB exist); typically provide eventual consistency. |
| Query Language | Structured Query Language (SQL), which is powerful for complex queries. | No standard query language; data-manipulation methods vary by database. |
| Examples | MySQL, PostgreSQL, Oracle Database | MongoDB, Apache Cassandra, Redis, Amazon DynamoDB |

Horizontal Scaling vs Vertical Scaling

| Attribute | Horizontal Scaling (Scale Out) | Vertical Scaling (Scale Up) |
| --- | --- | --- |
| Definition | Adding more nodes (servers) to a system to increase capacity. | Adding more power (CPU, RAM, storage) to an existing machine. |
| Details | Data is distributed across multiple servers and the load is balanced across multiple nodes; complex to manage but highly scalable. | Performance is increased by adding more computing power, RAM, and disk space to the existing machine. |
| Benefits | High availability due to data distribution; easier to scale; potentially more cost-effective in the long run. | Simple to configure and manage; improved performance for single-threaded processes; consistency is easier to maintain. |
| Drawbacks | More complex to manage due to data distribution; potential consistency issues from data-propagation delay. | Limited by the maximum capacity of a single machine; potential downtime for upgrades; single point of failure. |

Database Partitioning

Breaking a large database into smaller databases is typically referred to as database partitioning. Partitioning can significantly improve the performance, availability, and manageability of large-scale systems.

  1. Horizontal Partitioning (Sharding): In horizontal partitioning, the database is divided into smaller parts or "shards" based on the rows of a table. Each shard contains a subset of the total rows and functions as a smaller independent database. The criteria used to partition the data could be a specific range of values, a list of values, or a hash function on a particular column. Horizontal partitioning is commonly used in distributed databases and helps improve query performance and scalability.
  2. Vertical Partitioning: In vertical partitioning, the database is divided based on the columns of a table. Each partition contains a subset of the table's columns along with the primary key to maintain a link to the original data row. This can improve performance by reducing the amount of data that needs to be loaded from disk for queries that only require a subset of a table's columns.
  3. Functional Partitioning: In functional partitioning, data is divided based on its function or use-case. For example, an e-commerce application might partition its data into separate databases for user accounts, product catalog, and orders. This can make each partition simpler and easier to manage but requires careful planning to ensure efficient data access across partitions.
  4. Directory-Based Partitioning: In this type of partitioning, a lookup service or directory keeps track of which data is stored where. The directory determines where data is located, helping route requests to the appropriate location.
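
To make the last scheme concrete, here is a minimal Python sketch of directory-based partitioning; the `directory` table and `route` helper are hypothetical names chosen for illustration:

```python
# A directory service maps each key (here, a user ID) to the partition that
# currently holds it, so data can be moved without changing client code:
# rebalancing is just an update to the directory entry.

directory = {
    "user:1001": "db_accounts_1",
    "user:1002": "db_accounts_2",
    "user:1003": "db_accounts_1",
}

def route(key: str) -> str:
    """Return the partition responsible for a key; fail loudly if unknown."""
    try:
        return directory[key]
    except KeyError:
        raise KeyError(f"no partition registered for {key!r}")

print(route("user:1002"))  # -> db_accounts_2
```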

Logical Sharding vs Physical Sharding

| Attribute | Logical Sharding | Physical Sharding |
| --- | --- | --- |
| Definition | Data is split into discrete shards based on a logical condition, but all shards may reside on the same physical server. | Each shard is stored on a separate physical server or machine. |
| Benefits | Easier to manage and more flexible; improves query efficiency. | Allows for greater scalability; can handle larger data volumes and higher load. |
| Drawbacks | Limited by the capacity and performance of a single machine. | More complex to manage; involves dealing with network latency, failures, and data consistency. |
| Best Used When | The database fits comfortably on a single server and the main goal is more efficient data management. | The database exceeds the storage capacity of a single machine, or the workload must be distributed across multiple machines. |

Algorithmic Sharding vs Dynamic Sharding

| Attribute | Algorithmic Sharding | Dynamic Sharding |
| --- | --- | --- |
| Definition | Uses a consistent algorithm to determine where data should go. | Adapts to changes in data distribution and load. |
| Examples | Range-based sharding, hash-based sharding | Directory-based sharding, geolocation-based sharding |
| Benefits | Straightforward and fast; easy to determine where data is located. | Flexible when adding or removing shards; better scalability. |
| Drawbacks | Inflexible; difficult to add or remove shards. | More complex; potential consistency issues. |
| Best Used When | The number of shards is stable and the distribution of data is relatively uniform. | The distribution of data changes frequently, or more granular control is required. |

Different Strategies for Sharding

| Strategy | Description | Example Use Case |
| --- | --- | --- |
| Range-based Sharding | Data is distributed based on a certain range of a key. | User IDs: users with IDs 1 to 1000 might go to Shard 1, IDs 1001 to 2000 to Shard 2, and so on. |
| Hash-based Sharding | A hash function is applied to a data key, and the output of the function determines the shard. | Usernames: a hash function applied to usernames distributes them evenly across shards. |
| Directory-based Sharding | A lookup table or service keeps track of where data is stored. | Document database: a lookup table records which documents live in which shards. |
| Attribute-based Sharding | Data is sharded based on specific attributes of the data. | Geographic regions: data is sharded by users' geographic region to optimize data locality. |
| Random Sharding | Data is distributed across shards randomly. | High write throughput: when writes must be spread evenly and quickly, and read speed or complex queries matter less. |
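
To make the first two strategies concrete, here is a small Python sketch of range-based and hash-based shard selection; the shard names, the 1,000-ID range width, and the use of MD5 are illustrative assumptions, not fixed conventions:

```python
import hashlib

SHARDS = ["shard_1", "shard_2", "shard_3", "shard_4"]

def range_based_shard(user_id: int) -> str:
    """Range-based: IDs 1-1000 -> shard_1, 1001-2000 -> shard_2, and so on."""
    index = (user_id - 1) // 1000
    return SHARDS[min(index, len(SHARDS) - 1)]  # last shard takes the overflow

def hash_based_shard(username: str) -> str:
    """Hash-based: a stable hash of the key, reduced modulo the shard count."""
    digest = hashlib.md5(username.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(range_based_shard(1500))    # -> shard_2
print(hash_based_shard("alice"))  # deterministic, spreads keys evenly
```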

Sharding Scenario: Adding a Database in a Hash-based Sharding Strategy

In this scenario, we start with 4 databases (DB1 to DB4) and use a hash-based sharding strategy. We apply a hash function to our data key (e.g., user ID), which yields a value between 0 and 399. We distribute the data across our databases as follows:

  • 0 to 99 -> DB1
  • 100 to 199 -> DB2
  • 200 to 299 -> DB3
  • 300 to 399 -> DB4

As our load increases, we decide to add a fifth database (DB5). The output range of our hash function remains the same (0 to 399), so we must redistribute the data across the new set of databases:

  • 0 to 79 -> DB1
  • 80 to 159 -> DB2
  • 160 to 239 -> DB3
  • 240 to 319 -> DB4
  • 320 to 399 -> DB5

Reorganization of Data

This addition requires us to reorganize a substantial portion of our data. Assume a uniform distribution and a hash function that spreads values evenly: with 400 records in total across DB1 to DB4, each database starts with 100 records. After adding DB5 and changing the ranges, we have:

  • DB1: Will lose 20 records (20% of 100) to DB2.
  • DB2: Will lose 40 records (40% of 100) to DB3, but will gain 20 records from DB1.
  • DB3: Will lose 60 records (60% of 100) to DB4, but will gain 40 records from DB2.
  • DB4: Will lose 80 records (80% of 100) to DB5, but will gain 60 records from DB3.
  • DB5: Will gain 80 records from DB4.

This leaves 400 records in total (the original count), now evenly distributed at 80 records per database. Getting there, however, required moving 200 of the 400 records, i.e., half of all data, just to add a single database; the short simulation below confirms this.
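
Here is a quick Python simulation of the scenario above, assuming one record per hash value in [0, 400):

```python
def assign(value: int, num_dbs: int, total_range: int = 400) -> int:
    """Map a hash value in [0, total_range) to a database index, using
    equal-width ranges like the ones listed above."""
    width = total_range // num_dbs
    return min(value // width, num_dbs - 1)

# One record per hash value, uniformly distributed over [0, 400).
values = range(400)
moved = sum(1 for v in values if assign(v, 4) != assign(v, 5))
print(f"{moved} of 400 records move")  # -> 200 of 400 records move (50%)
```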

Consistent Hashing

Consistent hashing is a method used to distribute data across multiple nodes in such a way that the reorganization of data is minimized when nodes are added or removed. It's particularly important for distributed systems to improve scalability and availability.

In consistent hashing, data points (or keys) and nodes are mapped to a circular numeric space often visualized as a ring, commonly referred to as the "hash ring". This ring represents the entire hash value range.

When a node is added, it is placed at a point on the ring determined by the hash of its identifier. When a key needs to be stored or retrieved, it is assigned to a node by hashing the key, using the same hash function, to find a point on the ring, and then walking the ring clockwise to find the first node encountered.

One of the primary benefits of consistent hashing is that adding or removing a node only affects the keys that fall between that node and its immediate neighbor on the ring, minimizing the amount of data that must be moved around the system.
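
The following is a minimal sketch of a consistent hash ring in Python; the `ConsistentHashRing` class and the choice of MD5 are illustrative assumptions, not a prescribed implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Bare-bones consistent hash ring: nodes and keys share one hash space,
    and a key belongs to the first node clockwise from its own position."""

    def __init__(self):
        self._positions = []  # sorted hash positions of the nodes
        self._nodes = {}      # hash position -> node name

    def _hash(self, value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        pos = self._hash(node)
        bisect.insort(self._positions, pos)
        self._nodes[pos] = node

    def remove_node(self, node: str) -> None:
        pos = self._hash(node)
        self._positions.remove(pos)
        del self._nodes[pos]

    def get_node(self, key: str) -> str:
        """First node whose position is >= the key's hash, wrapping around."""
        idx = bisect.bisect_left(self._positions, self._hash(key))
        idx %= len(self._positions)
        return self._nodes[self._positions[idx]]

ring = ConsistentHashRing()
for name in ("N1", "N2", "N3"):
    ring.add_node(name)
print(ring.get_node("user:42"))  # one of N1..N3, stable across calls
```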

Example

To illustrate this, let's consider a system with 3 nodes (N1, N2, N3) and 6 data keys (K1 to K6). For simplicity, let's assume that the hashes of these nodes and keys have the following values:

| Node/Key | Hash |
| --- | --- |
| N1 | 10 |
| N2 | 30 |
| N3 | 50 |
| K1 | 12 |
| K2 | 15 |
| K3 | 32 |
| K4 | 38 |
| K5 | 48 |
| K6 | 55 |

If we plot these on the hash ring and walk clockwise from each key to the first node it meets, the keys are distributed as follows:

| Node | Keys |
| --- | --- |
| N1 | K6 |
| N2 | K1, K2 |
| N3 | K3, K4, K5 |

K1 (12) and K2 (15) walk forward to N2 (30); K3 (32), K4 (38), and K5 (48) reach N3 (50); K6 (55) passes the end of the range and wraps around to N1 (10).

Now, if we add a new node, N4, with a hash value of 40:

| Node/Key | Hash |
| --- | --- |
| N1 | 10 |
| N2 | 30 |
| N3 | 50 |
| N4 | 40 |
| K1 | 12 |
| K2 | 15 |
| K3 | 32 |
| K4 | 38 |
| K5 | 48 |
| K6 | 55 |

The keys would be redistributed as follows:

| Node | Keys |
| --- | --- |
| N1 | K6 |
| N2 | K1, K2 |
| N3 | K5 |
| N4 | K3, K4 |

Only the keys K3 and K4 have to be moved, from N3 to the new node N4; K5 stays on N3, K6 on N1, and K1 and K2 on N2. This example demonstrates how adding a node under consistent hashing does not require the massive redistribution of keys we saw in the hash-based sharding scenario above.
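
Because the hash values in this example are given directly, we can verify the assignments with a short Python sketch; the `assign_keys` helper is invented for illustration and simply applies the clockwise rule to the fixed numbers above:

```python
import bisect

def assign_keys(node_hashes: dict, key_hashes: dict) -> dict:
    """Apply the clockwise rule: each key goes to the first node whose hash
    is >= the key's hash, wrapping around at the end of the ring."""
    positions = sorted(node_hashes.values())
    node_at = {h: n for n, h in node_hashes.items()}
    placement = {n: [] for n in node_hashes}
    for key, kh in sorted(key_hashes.items()):
        idx = bisect.bisect_left(positions, kh) % len(positions)
        placement[node_at[positions[idx]]].append(key)
    return placement

keys = {"K1": 12, "K2": 15, "K3": 32, "K4": 38, "K5": 48, "K6": 55}
print(assign_keys({"N1": 10, "N2": 30, "N3": 50}, keys))
# -> {'N1': ['K6'], 'N2': ['K1', 'K2'], 'N3': ['K3', 'K4', 'K5']}
print(assign_keys({"N1": 10, "N2": 30, "N3": 50, "N4": 40}, keys))
# -> {'N1': ['K6'], 'N2': ['K1', 'K2'], 'N3': ['K5'], 'N4': ['K3', 'K4']}
```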

The Problem

In a simple consistent hashing scheme, where each physical node corresponds to a single point on the hash ring, the distribution of data can be uneven. This can occur because the placement of nodes on the ring is based on the hash of their identifiers, which is essentially random. Therefore, some nodes might end up with a larger segment of the ring (and thus a larger share of the data and load), while others might end up with a smaller segment.

The Solution - Virtual Nodes

To address this issue, each physical node is represented on the hash ring by multiple virtual nodes (vnodes). Each vnode is assigned its own hash value, just as a physical node would be, resulting in multiple points on the ring that all map back to the same physical node.

By using a larger number of vnodes, the probability that the nodes will be evenly distributed around the hash ring increases. As a result, the data and load are more likely to be balanced among the physical nodes.
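
As a rough sketch of how virtual nodes can be layered onto a ring like the one above (the figure of 100 vnodes per node is an arbitrary illustrative choice):

```python
import bisect
import hashlib
from collections import Counter

VNODES_PER_NODE = 100  # more vnodes -> smoother balance (arbitrary choice)

def build_ring(nodes):
    """Place VNODES_PER_NODE points on the ring for each physical node."""
    ring = []  # sorted list of (hash position, physical node)
    for node in nodes:
        for i in range(VNODES_PER_NODE):
            pos = int(hashlib.md5(f"{node}#vn{i}".encode()).hexdigest(), 16)
            ring.append((pos, node))
    ring.sort()
    return ring

def get_node(ring, key):
    pos = int(hashlib.md5(key.encode()).hexdigest(), 16)
    idx = bisect.bisect_left(ring, (pos, "")) % len(ring)
    return ring[idx][1]  # the physical node behind the chosen vnode

ring = build_ring(["N1", "N2", "N3"])
load = Counter(get_node(ring, f"key-{i}") for i in range(10_000))
print(load)  # roughly one third of the keys land on each physical node
```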

Practical Implications

In addition to helping balance the data and load, using virtual nodes also improves the system's resilience and scalability:

  1. Resilience: When a physical node becomes unavailable, its vnodes and the data they are responsible for are spread out across multiple other nodes, not just dumped onto a single neighbor. This distribution of the load can prevent a single node from becoming a bottleneck, maintaining the system's overall performance.
  2. Scalability: When a new node is added, it takes over responsibility for some of the vnodes from several existing nodes. This gradual assimilation of data helps maintain system performance during scaling operations, and can be much more efficient than a large-scale data reorganization.

Quiz

  1. What is sharding in the context of distributed databases?
     • The process of encrypting data during transmission.
     • The practice of compressing data to reduce storage space.
     • The technique of splitting a database into smaller, independent fragments called shards.
     • The process of replicating data across multiple servers for redundancy.
     Answer: The technique of splitting a database into smaller, independent fragments called shards.

  2. What is the main challenge associated with sharding in a distributed database?
     • Ensuring strong consistency among all shards.
     • Managing data replication across multiple shards.
     • Handling queries that involve data from multiple shards.
     • Implementing data encryption for each shard.
     Answer: Handling queries that involve data from multiple shards.

  3. Which data model does a relational database follow?
     • Hierarchical data model
     • Key-value data model
     • Tabular data model
     • Document data model
     Answer: Tabular data model

  4. Which of the following are typically considered relational databases?
     • MongoDB
     • PostgreSQL
     • Cassandra
     • MySQL
     • Oracle Database
     Answer: PostgreSQL, MySQL, and Oracle Database

  5. Which of the following are typically considered non-relational databases?
     • MongoDB
     • PostgreSQL
     • Cassandra
     • MySQL
     • Redis
     Answer: MongoDB, Cassandra, and Redis

  6. Which of the following are advantages of using a non-relational database?
     • High scalability and horizontal partitioning of data.
     • Strict data consistency with ACID transactions.
     • Ability to handle unstructured and semi-structured data.
     • Better support for complex SQL queries and joins.
     Answer: High scalability and horizontal partitioning of data; ability to handle unstructured and semi-structured data.

  7. What types of data models do non-relational databases follow?
     • Hierarchical data model
     • Key-value data model
     • Graph data model
     • Document data model
     Answer: All of the above (hierarchical, key-value, graph, and document).

  8. In a consistent hashing scheme, which nodes are affected when a new node is added or removed?
     • Only the node that is being added or removed.
     • All nodes in the system.
     • A small subset of nodes around the new node.
     • The nodes with the highest load in the system.
     Answer: A small subset of nodes around the new node.

  9. What is the advantage of consistent hashing over traditional hashing techniques in distributed systems?
     • It requires less computational overhead for hashing.
     • It guarantees strong consistency and data integrity.
     • It reduces the number of nodes required in the system.
     • It enables efficient data redistribution when the system scales.
     Answer: It enables efficient data redistribution when the system scales.

  10. How does consistent hashing help minimize data movement when nodes are added or removed?
     • It redistributes all data across all nodes in the system.
     • It only remaps a small fraction of the data to the new nodes.
     • It replicates all data across multiple nodes for redundancy.
     • It encrypts all data to ensure data integrity during movement.
     Answer: It only remaps a small fraction of the data to the new nodes.

  11. What role does hashing play in a consistent hashing implementation?
     • Hashing converts data keys into numerical values to determine node placement.
     • Hashing encrypts data keys for secure storage.
     • Hashing compresses data keys to conserve storage space.
     • Hashing ensures strong consistency among all nodes.
     Answer: Hashing converts data keys into numerical values to determine node placement.

  12. In a consistent hashing implementation, how are nodes represented in the data structure?
     • As elements in an array or a linked list.
     • As key-value pairs in a hash table.
     • As nodes in a binary tree.
     • As entries in a circular linked list.
     Answer: As entries in a circular linked list.

  13. In a consistent hashing implementation, what prevents data keys from clustering around a few nodes?
     • Load-balancing algorithms.
     • Consistent hash ring rotation.
     • The use of virtual nodes.
     • Data replication.
     Answer: The use of virtual nodes.

  14. What is the purpose of adding virtual nodes in a consistent hashing implementation?
     • To represent nodes with higher processing capacity in the ring.
     • To simplify the data redistribution process during node addition or removal.
     • To create backup copies of data keys for fault tolerance.
     • To increase the size of the ring for scalability.
     Answer: To simplify the data redistribution process during node addition or removal.