cactus / DATA_SYNC_GUIDE.md
agalyaramadoss, 16 Feb ("added document", 17 KB)

Data Synchronization: Node Down and Recovery in Cube Database

Overview

When a node goes down and later comes back up, the Cube database uses three mechanisms to bring its replicas back in sync:

  1. Hinted Handoff - Store missed writes and replay them
  2. Read Repair - Fix inconsistencies during reads
  3. Anti-Entropy Repair - Periodic full synchronization (Phase 3 feature)

How It Works: Complete Flow

Scenario: 3-Node Cluster, Node 3 Goes Down

Initial State:
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │  │ Node 3  │
│  ALIVE  │  │  ALIVE  │  │  ALIVE  │
└─────────┘  └─────────┘  └─────────┘
     ✓            ✓            ✓

Phase 1: Node Goes Down

Node 3 crashes or loses network:
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │  │ Node 3  │
│  ALIVE  │  │  ALIVE  │  │  DOWN   │
└─────────┘  └─────────┘  └─────────┘
     ✓            ✓            ✗

What Happens to New Writes?

Write with CL=QUORUM (2 of 3 nodes)

// Client writes a new key
PUT user:alice "Alice Johnson" CL=QUORUM

Flow:
1. Coordinator (Node 1) receives write
2. Determines replicas: [Node1, Node2, Node3]
3. Sends write to all 3 nodes
4. Node1: ✓ Success (local write)
5. Node2: ✓ Success (network write)
6. Node3: ✗ Timeout (node is down)
7. Success count: 2/3
8. Required for QUORUM: 2/3
9. Result: ✓ WRITE SUCCEEDS

// Store hint for Node 3
10. Coordinator stores "hint" for Node3:
    - Key: user:alice
    - Value: "Alice Johnson"
    - Timestamp: 1234567890
    - Target: Node3
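The success-count logic in steps 4-9 can be sketched as follows. This is a minimal illustration with assumed names (`QuorumCheck`, `writeSucceeds`), not the actual Cube coordinator code:

```java
import java.util.List;

// Sketch of the coordinator's quorum check: count replica acks
// and compare against the quorum size for the replication factor.
public class QuorumCheck {
    // Quorum for a replication factor N is floor(N/2) + 1 (2 of 3 here).
    public static int quorumSize(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // acks.get(i) is true if replica i acknowledged the write in time.
    public static boolean writeSucceeds(List<Boolean> acks) {
        long successes = acks.stream().filter(a -> a).count();
        return successes >= quorumSize(acks.size());
    }
}
```

With Node 3 down, the ack list is `[true, true, false]`: 2 successes against a required quorum of 2, so the write succeeds even though one replica missed it.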

Hinted Handoff in Action:

// On Node 1 (coordinator)
hintedHandoff.storeHint(
    "node-3",              // Target node
    "user:alice",          // Key
    "Alice Johnson",       // Value
    timestamp              // When it was written
);

// Hint is persisted to disk:
// /var/lib/cube/hints/node-3/user_alice-1234567890.hint
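A hint record might look like the following minimal sketch. The field and method names are assumptions that mirror the replay loop shown later in this guide (`getKey`, `getValue`, `getTimestamp`, `isExpired`), not the actual Cube classes:

```java
// Minimal sketch of a hint record: the missed write plus enough
// metadata to replay it and to expire it after the hint window.
public class Hint {
    private final String targetNode;   // node that missed the write
    private final String key;
    private final String value;
    private final long timestamp;      // when the original write happened
    private final long hintWindowMs;   // how long the hint stays valid

    public Hint(String targetNode, String key, String value,
                long timestamp, long hintWindowMs) {
        this.targetNode = targetNode;
        this.key = key;
        this.value = value;
        this.timestamp = timestamp;
        this.hintWindowMs = hintWindowMs;
    }

    public String getTargetNode() { return targetNode; }
    public String getKey() { return key; }
    public String getValue() { return value; }
    public long getTimestamp() { return timestamp; }

    // A hint expires once it has been pending longer than the hint window.
    public boolean isExpired(long nowMs) {
        return nowMs - timestamp > hintWindowMs;
    }

    public boolean isExpired() {
        return isExpired(System.currentTimeMillis());
    }
}
```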

Phase 2: While Node 3 is Down

Multiple writes happen:

Time T1: PUT user:bob "Bob Smith" CL=QUORUM
  → Writes to Node1, Node2
  → Hint stored for Node3

Time T2: PUT user:charlie "Charlie Brown" CL=QUORUM
  → Writes to Node1, Node2
  → Hint stored for Node3

Time T3: UPDATE user:alice SET value="Alice J." CL=QUORUM
  → Writes to Node1, Node2
  → Hint stored for Node3

State:
Node1: alice=Alice J., bob=Bob Smith, charlie=Charlie Brown
Node2: alice=Alice J., bob=Bob Smith, charlie=Charlie Brown
Node3: [DOWN - has old data]

Hints for Node3:
  - user:alice (v2, timestamp T3)
  - user:bob (timestamp T1)
  - user:charlie (timestamp T2)

Phase 3: Node 3 Comes Back Online

Node 3 recovers:
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │  │ Node 3  │
│  ALIVE  │  │  ALIVE  │  │  ALIVE  │
└─────────┘  └─────────┘  └─────────┘
     ✓            ✓            ✓ (recovered)

Automatic Hint Replay

// HintedHandoffManager detects Node3 is alive
// Automatically triggers replay

for (Hint hint : hintsForNode3) {
    if (!hint.isExpired()) {
        // Send hint to Node3
        boolean success = sendHintToNode(
            node3,
            hint.getKey(),
            hint.getValue(),
            hint.getTimestamp()
        );
        
        if (success) {
            // Delete hint file
            deleteHint(hint);
        }
    }
}

Replay Process:

Step 1: Node 1 detects Node 3 is alive
  → Heartbeat received from Node 3

Step 2: Trigger hint replay
  → hintedHandoff.replayHintsForNode("node-3")

Step 3: Replay each hint in order
  Hint 1: user:alice = "Alice J." (timestamp T3)
    → Node3.put("user:alice", "Alice J.", T3)
    → ✓ Success

  Hint 2: user:bob = "Bob Smith" (timestamp T1)
    → Node3.put("user:bob", "Bob Smith", T1)
    → ✓ Success

  Hint 3: user:charlie = "Charlie Brown" (timestamp T2)
    → Node3.put("user:charlie", "Charlie Brown", T2)
    → ✓ Success

Step 4: Delete replayed hints
  → All hints successfully replayed and deleted
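One way to get the "in order" replay of Step 3 is to sort pending hints by their original write timestamp, so older writes land first and the newest value wins. A sketch with assumed names (not the Cube API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: order pending hints by original write timestamp before replay,
// so replaying them sequentially converges on the newest value per key.
public class HintOrdering {
    public record PendingHint(String key, String value, long timestamp) {}

    public static List<PendingHint> replayOrder(List<PendingHint> hints) {
        List<PendingHint> ordered = new ArrayList<>(hints);
        ordered.sort(Comparator.comparingLong(PendingHint::timestamp));
        return ordered;
    }
}
```

For the hints above (alice at T3, bob at T1, charlie at T2), replay order becomes bob, charlie, alice.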

After Hint Replay:

Node1: alice=Alice J., bob=Bob Smith, charlie=Charlie Brown
Node2: alice=Alice J., bob=Bob Smith, charlie=Charlie Brown
Node3: alice=Alice J., bob=Bob Smith, charlie=Charlie Brown
       ✓ NOW IN SYNC!

Read Repair: Catching Missed Updates

Even with hinted handoff, some updates can be missed: hints can expire, replays can fail, or the hint store can fill up. Read repair catches these cases during normal reads:

Example: Read After Node Recovery

// Client reads user:alice with CL=QUORUM
GET user:alice CL=QUORUM

Flow:
1. Coordinator reads from Node1, Node2, Node3
2. Receives responses:
   Node1: "Alice J." (timestamp T3)
   Node2: "Alice J." (timestamp T3)
   Node3: "Alice"    (timestamp T0) ← OLD DATA!

3. Read Repair detects inconsistency
   - Canonical value: "Alice J." (newest timestamp T3)
   - Stale replica: Node3

4. Async repair triggered:
   → Send "Alice J." to Node3
   → Node3 updates to latest value

5. Return to client:
   → Value: "Alice J."
   → Repair performed in background

Read Repair Implementation:

// In ReadRepairManager
List<ReadResponse> responses = new ArrayList<>();
responses.add(new ReadResponse(node1, "user:alice", "Alice J.", T3));
responses.add(new ReadResponse(node2, "user:alice", "Alice J.", T3));
responses.add(new ReadResponse(node3, "user:alice", "Alice", T0)); // Stale

// Detect conflicts
ReadRepairResult result = readRepair.performReadRepair(responses, 
    (node, key, value, timestamp) -> {
        // Repair Node3
        node.put(key, value, timestamp);
        return true;
    }
);

// Result:
// - Canonical value: "Alice J."
// - Repaired nodes: 1 (Node3)
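The conflict-detection step above boils down to picking the response with the newest timestamp and flagging everything older as stale. A sketch of that selection logic, with assumed names rather than the real `ReadRepairManager` internals:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of canonical-value selection for read repair: the response
// with the newest timestamp wins; older replicas are marked stale.
public class CanonicalValue {
    public record Response(String node, String value, long timestamp) {}

    public static Response canonical(List<Response> responses) {
        return responses.stream()
                .max(Comparator.comparingLong(Response::timestamp))
                .orElseThrow();
    }

    // Every replica whose timestamp is older than the canonical
    // timestamp needs an async repair write.
    public static List<Response> staleReplicas(List<Response> responses) {
        long newest = canonical(responses).timestamp();
        return responses.stream()
                .filter(r -> r.timestamp() < newest)
                .toList();
    }
}
```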

Complete Synchronization Flow Diagram

┌──────────────────────────────────────────────────────────────┐
│  WRITE PHASE (Node 3 is DOWN)                               │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Client → PUT user:alice "Alice" CL=QUORUM                  │
│     │                                                        │
│     ▼                                                        │
│  ┌──────────┐         ┌──────────┐         ┌──────────┐    │
│  │  Node 1  │ ──────→ │  Node 2  │    ✗    │  Node 3  │    │
│  │ (Coord)  │         │          │         │  (DOWN)  │    │
│  └──────────┘         └──────────┘         └──────────┘    │
│     │ ✓ Write            │ ✓ Write            │ ✗ Timeout  │
│     │                    │                    │             │
│     ▼                    ▼                    │             │
│  alice=Alice         alice=Alice              │ (no data)   │
│     │                                         │             │
│     └─────────────────────────────────────────┘             │
│                    │                                        │
│                    ▼                                        │
│            Store Hint for Node3                            │
│            /hints/node-3/alice.hint                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│  RECOVERY PHASE (Node 3 comes back)                         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Node 3 sends heartbeat                                     │
│     │                                                        │
│     ▼                                                        │
│  ┌──────────┐                          ┌──────────┐         │
│  │  Node 1  │  ─── Replay Hints ────→  │  Node 3  │         │
│  │          │                          │          │         │
│  └──────────┘                          └──────────┘         │
│     │                                      │                │
│     │ Load hints from disk                │ ✓ Receive       │
│     │ /hints/node-3/alice.hint            │   alice=Alice   │
│     │                                      │                │
│     └──────────────────────────────────────┘                │
│                                                              │
│  Result: Node 3 now has alice=Alice                         │
│                                                              │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│  READ REPAIR PHASE (if hint missed or failed)               │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Client → GET user:alice CL=QUORUM                          │
│     │                                                        │
│     ▼                                                        │
│  Read from all replicas:                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│  │  Node 1  │    │  Node 2  │    │  Node 3  │              │
│  │alice=v2  │    │alice=v2  │    │alice=v1  │ ← STALE!     │
│  │t=T2      │    │t=T2      │    │t=T1      │              │
│  └──────────┘    └──────────┘    └──────────┘              │
│     │                │                │                     │
│     └────────────────┴────────────────┘                     │
│                      │                                      │
│                      ▼                                      │
│              Compare responses                              │
│              Newest: v2, t=T2                              │
│                      │                                      │
│                      ▼                                      │
│              Async repair Node3                            │
│              Node3.put(alice, v2, T2)                      │
│                      │                                      │
│                      ▼                                      │
│              ✓ All nodes in sync!                          │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Implementation Details

1. Hinted Handoff Configuration

// Initialize hinted handoff manager
HintedHandoffManager hintedHandoff = new HintedHandoffManager(
    "/var/lib/cube/hints",  // Hints directory
    10000,                   // Max 10K hints per node
    3600000                  // 1 hour hint window
);

// Hints are automatically:
// - Persisted to disk
// - Replayed when node recovers
// - Deleted after successful replay
// - Expired after hint window

2. Automatic Hint Replay

// Background thread checks for node recovery
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
scheduler.scheduleAtFixedRate(() -> {
    for (ClusterNode node : cluster.getNodes()) {
        if (node.isAlive() && hintedHandoff.getHintCount(node.getId()) > 0) {
            // Node is alive and has pending hints
            hintedHandoff.replayHintsForNode(node.getId(), hint -> {
                // Send hint to recovered node
                return sendToNode(node, hint.getKey(), hint.getValue());
            });
        }
    }
}, 10, 10, TimeUnit.SECONDS); // Check every 10 seconds

3. Read Repair Probability

// Configure read repair probability
ReadRepairManager readRepair = new ReadRepairManager(10); // 10% chance

// On every read:
if (readRepair.shouldPerformReadRepair()) {
    // Perform read repair
    compareResponses();
    repairStaleReplicas();
}

Handling Different Failure Scenarios

Scenario 1: Short Outage (Minutes)

Timeline:
T0: Node3 goes down
T1-T5: Writes happen (hints stored)
T6: Node3 comes back (5 minutes later)

Recovery:
✓ Hinted handoff replays all missed writes
✓ Node3 catches up in seconds
✓ Read repair handles any edge cases

Scenario 2: Medium Outage (Hours)

Timeline:
T0: Node3 goes down
T1-T100: Many writes happen
T101: Node3 comes back (2 hours later)

Recovery:
✓ Hinted handoff replays accumulated hints
✓ May take a few minutes depending on hint count
✓ Read repair fixes any missed updates

Scenario 3: Long Outage (Days)

Timeline:
T0: Node3 goes down
T1-T1000: Massive number of writes
T1001: Node3 comes back (3 days later)

Issues:
✗ Hints may have expired (hint window: 1 hour default)
✗ Too many hints to store (max hints: 10K per node)

Recovery:
⚠ Hinted handoff may be incomplete
✓ Read repair will fix data over time
✓ Manual repair may be needed (nodetool repair)

Scenario 4: Network Partition

Initial:
DC1: Node1, Node2 (can communicate)
DC2: Node3 (isolated)

Writes in DC1:
✓ CL=QUORUM succeeds (2/3 nodes)
✓ Hints stored for Node3

When partition heals:
✓ Hints replayed to Node3
✓ Read repair fixes inconsistencies
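If the same key was somehow written on both sides of the partition, the heal resolves the conflict by timestamp: the newer write wins, which is the same rule read repair applies. A last-write-wins merge in sketch form (illustrative names only):

```java
// Sketch of last-write-wins conflict resolution after a partition heals:
// for a key written on both sides, keep the version with the newer timestamp.
public class LastWriteWins {
    public record Version(String value, long timestamp) {}

    public static Version merge(Version a, Version b) {
        return a.timestamp() >= b.timestamp() ? a : b;
    }
}
```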

Configuration Best Practices

Hint Window Configuration

// Short-lived outages (default)
HintedHandoffManager(hintsDir, 10000, 3600000);  // 1 hour

// Medium-lived outages
HintedHandoffManager(hintsDir, 50000, 7200000);  // 2 hours

// Long-lived outages (not recommended)
HintedHandoffManager(hintsDir, 100000, 86400000); // 24 hours

Recommendation: 1-3 hours is optimal. Longer windows risk:

  • Too much disk space for hints
  • Slower replay times
  • Stale hints that may conflict

Read Repair Probability

// Low traffic, strong consistency needed
ReadRepairManager(100);  // 100% - all reads trigger repair

// Balanced (recommended)
ReadRepairManager(10);   // 10% - good balance

// High traffic, eventual consistency OK
ReadRepairManager(1);    // 1% - minimal overhead
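The percentage gate behind shouldPerformReadRepair can be sketched as a simple random check. This is an assumed implementation, shown only to make the probability semantics concrete:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a probabilistic read-repair gate: repair roughly
// `percent` out of every 100 reads. 0 never repairs; 100 always does.
public class ReadRepairGate {
    private final int percent; // 0..100

    public ReadRepairGate(int percent) {
        this.percent = percent;
    }

    public boolean shouldPerformReadRepair() {
        return ThreadLocalRandom.current().nextInt(100) < percent;
    }
}
```

At 10%, roughly one read in ten pays the cost of comparing all replica responses; the rest return as soon as the consistency level is satisfied.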

Monitoring and Verification

Check Hint Status

// Via shell
cube> STATS

Output:
Pending Hints:         45
Hints for node-3:      45

// Via API
GET /api/v1/replication/hints

Response:
{
  "totalHints": 45,
  "nodeHints": {
    "node-3": 45
  }
}

Verify Synchronization

// Read from all replicas
GET user:alice CL=ALL

// If all nodes return same value:
✓ Nodes are in sync

// If values differ:
⚠ Read repair will fix automatically

Manual Repair (if needed)

// Force full synchronization
POST /api/v1/replication/repair

{
  "keyspace": "users",
  "table": "profiles"
}

Summary

Data Synchronization Mechanisms:

  1. Hinted Handoff (Primary)

    • Stores missed writes while node is down
    • Automatically replays when node recovers
    • Fast and efficient for short outages
  2. Read Repair (Secondary)

    • Fixes inconsistencies during reads
    • Probabilistic (configurable %)
    • Catches anything hints missed
  3. Anti-Entropy Repair (Tertiary - Phase 3)

    • Full table scan and comparison
    • Expensive but comprehensive
    • Used for long outages

Recovery Timeline:

Node Down
   ↓
Writes continue (hints stored)
   ↓
Node Recovers
   ↓
Hints replayed (seconds to minutes)
   ↓
Read repair fixes edge cases (ongoing)
   ↓
Fully Synchronized ✓

Key Points:

  • Automatic - No manual intervention needed for short outages
  • Fast - Hints replay in seconds for normal outages
  • Reliable - Multiple layers ensure eventual consistency
  • Configurable - Tune hint windows and read repair probability

Your data stays synchronized even when nodes fail! 🎉