Multi-Server Cluster Deployment¶
This guide covers deploying Hafiz across multiple physical servers with shared PostgreSQL metadata.
Table of Contents¶
- Architecture Overview
- Quick Start: Adding a Second Node (Native/Bare Metal)
- Quick Start: Dual Cluster with Docker Compose
- Single-Network Cluster
- Adding Servers to Existing Cluster
- Cross-Network Replication
- What is Air-Gap?
- Unidirectional Replication (One-Way Sync)
- Air-Gapped System Replication
- Failover and Recovery
- Node Management API
Architecture Overview¶
┌─────────────────┐
│ Load Balancer │
│ (HAProxy/Nginx)│
│ dev-hafiz.e2e.lab
└────────┬────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Server 1 │ │ Server 2 │ │ Server 3 │
│ Hafiz Node │ │ Hafiz Node │ │ Hafiz Node │
│ 192.168.1.10 │ │ 192.168.1.11 │ │ 192.168.1.12 │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌───────▼───────┐
│ PostgreSQL │
│ 192.168.1.5 │
└───────────────┘
Quick Start: Adding a Second Node (Native/Bare Metal)¶
This section covers the simplest cluster setup: adding a second Hafiz node to an existing primary node, using native binaries (no Docker) and SQLite for metadata.
Two-Node Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Primary Node │ │ Secondary Node │ │
│ │ (Read/Write) │ ──Replication──▶│ (Replica) │ │
│ │ │ │ │ │
│ │ dev-hafiz.e2e.lab:9000 │ dev-hafiz-node1.e2e.lab:9000│
│ │ SQLite (local) │ │ SQLite (local) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Prerequisites¶
- Primary Hafiz node already running
- Secondary server with:
- Linux (Rocky Linux 8/9, RHEL 8/9, Ubuntu 20.04+)
- Network connectivity to primary node
- Port 9000 accessible (firewall opened)
- DNS A record pointing to the server (recommended)
Step 1: Copy the Binary to Secondary Node¶
On the primary node, copy the hafiz-server binary:
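For example, push it over SSH (adjust the remote user and host; the source path matches the pull variant below):
# On primary node
scp /opt/hafiz/target/release/hafiz-server user@secondary-node:/opt/hafiz/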
Or on the secondary node, pull from primary:
# On secondary node
mkdir -p /opt/hafiz
scp user@primary-node:/opt/hafiz/target/release/hafiz-server /opt/hafiz/
chmod +x /opt/hafiz/hafiz-server
Step 2: Run the Setup Script¶
Hafiz includes a setup script for secondary nodes:
# Download and customize the script
curl -O https://raw.githubusercontent.com/shellnoq/hafiz/main/scripts/setup-node2.sh
# Edit configuration variables at the top of the script
vim setup-node2.sh
# Run the script
chmod +x setup-node2.sh
sudo ./setup-node2.sh
The script will:
1. Create required directories
2. Generate the configuration file
3. Create a systemd service
4. Configure the firewall
5. Optionally start the service
Step 3: Manual Setup (Alternative)¶
If you prefer manual setup instead of using the script:
3a. Create Configuration File¶
mkdir -p /opt/hafiz/data
cat > /opt/hafiz/data/hafiz.toml << 'EOF'
[server]
bind_address = "0.0.0.0"
port = 9000
admin_port = 9001
workers = 0
max_connections = 10000
request_timeout_secs = 300
[storage]
data_dir = "/opt/hafiz/data"
temp_dir = "/tmp/hafiz"
max_object_size = 5497558138880
[database]
url = "sqlite:///opt/hafiz/data/hafiz.db?mode=rwc"
max_connections = 100
min_connections = 5
[auth]
enabled = true
root_access_key = "hafizadmin"
root_secret_key = "hafizadmin"
[logging]
level = "info"
format = "pretty"
[cluster]
enabled = true
name = "hafiz-cluster"
advertise_endpoint = "http://your-secondary-node.example.com:9000"
cluster_port = 9001
seed_nodes = ["http://your-primary-node.example.com:9000"]
heartbeat_interval_secs = 5
node_timeout_secs = 30
default_replication_mode = "async"
default_replication_factor = 2
cluster_tls_enabled = false
EOF
3b. Create Systemd Service¶
cat > /etc/systemd/system/hafiz.service << 'EOF'
[Unit]
Description=Hafiz S3 Storage Server
After=network.target
Documentation=https://github.com/shellnoq/hafiz
[Service]
Type=simple
User=your-user
Group=your-group
Environment="HAFIZ_CONFIG_FILE=/opt/hafiz/data/hafiz.toml"
ExecStart=/opt/hafiz/hafiz-server
Restart=always
RestartSec=5
WorkingDirectory=/opt/hafiz
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
3c. Open Firewall Ports¶
sudo firewall-cmd --add-port=9000/tcp --permanent
sudo firewall-cmd --add-port=9001/tcp --permanent
sudo firewall-cmd --reload
3d. Start the Service¶
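Assuming the unit file above was installed as hafiz.service, enable and start it:
sudo systemctl enable --now hafiz
sudo systemctl status hafiz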
Step 4: Register Node with Primary Cluster¶
After the secondary node is running, register it with the primary:
curl -X POST http://primary-node:9000/api/v1/cluster/nodes \
-H "Content-Type: application/json" \
-d '{
"endpoint": "http://secondary-node:9000",
"name": "secondary-node",
"role": "replica"
}'
Step 5: Create Replication Rules¶
Set up replication for your buckets:
# Create a replication rule for a bucket
curl -X POST http://primary-node:9000/api/v1/cluster/replication/rules \
-H "Content-Type: application/json" \
-d '{
"source_bucket": "my-bucket",
"destination_endpoint": "http://secondary-node:9000",
"mode": "async"
}'
Step 6: Verify Cluster and Replication¶
# Check cluster nodes
curl -s http://primary-node:9000/api/v1/cluster/nodes | jq .
# Check replication stats
curl -s http://primary-node:9000/api/v1/cluster/replication/stats | jq .
# Test replication - upload to primary
curl -X PUT http://primary-node:9000/my-bucket/test.txt -d "Hello World"
# Wait a moment, then verify on secondary
curl http://secondary-node:9000/my-bucket/test.txt
Troubleshooting¶
| Issue | Solution |
|---|---|
| Connection refused to secondary | Check firewall: firewall-cmd --list-ports |
| Config parse error on start | Ensure all required fields in hafiz.toml (see example config) |
| Permission denied on start | Check file ownership matches User in systemd service |
| Replication not working | Ensure bucket exists on secondary: curl -X PUT http://secondary:9000/bucket-name |
| Node not showing in cluster | Register manually via API (Step 4 above) |
Useful Commands¶
# View service logs
sudo journalctl -u hafiz -f
# Restart service
sudo systemctl restart hafiz
# Check if service is enabled for boot
sudo systemctl is-enabled hafiz
# Test connectivity from primary to secondary
curl http://secondary-node:9000/
# View cluster status
curl http://primary-node:9000/api/v1/cluster/status | jq .
Prerequisites¶
- 3+ servers with Rocky Linux 8/9 or RHEL 8/9
- PostgreSQL 13+ on a dedicated server (or managed service)
- Network connectivity between all nodes
- Domain name with DNS configured (e.g., dev-hafiz.e2e.lab)
Quick Start: Dual Cluster with Docker Compose¶
This section provides a complete step-by-step guide for deploying two Hafiz clusters with automatic synchronization.
Dual Cluster Architecture¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLUSTER A (Primary) │
│ Server: 192.168.1.100 │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ hafiz-node1 │ │ hafiz-node2 │ │ hafiz-node3 │ │ PostgreSQL │ │
│ │ :9000 │ │ :9010 │ │ :9020 │ │ :5432 │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘ │
│ └─────────────────┴─────────────────┴─────────────────┘ │
│ │ │
│ ┌───────┴───────┐ │
│ │ HAProxy │ │
│ │ :80/:443 │ │
│ └───────────────┘ │
└─────────────────────────────────────┬───────────────────────────────────────┘
│
┌──────────┴──────────┐
│ Network / VPN / │
│ Internet Link │
└──────────┬──────────┘
│
┌─────────────────────────────────────┴───────────────────────────────────────┐
│ CLUSTER B (Secondary) │
│ Server: 192.168.2.100 │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ hafiz-node1 │ │ hafiz-node2 │ │ hafiz-node3 │ │ PostgreSQL │ │
│ │ :9000 │ │ :9010 │ │ :9020 │ │ :5432 │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘ │
│ └─────────────────┴─────────────────┴─────────────────┘ │
│ │ │
│ ┌───────┴───────┐ │
│ │ HAProxy │ │
│ │ :80/:443 │ │
│ └───────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Step 1: Deploy Cluster A (Primary)¶
On the first server (192.168.1.100):
# Clone the repository
git clone https://github.com/shellnoq/hafiz.git
cd hafiz
# Create environment file
cat > .env << 'EOF'
POSTGRES_PASSWORD=cluster_a_password_here
HAFIZ_ROOT_ACCESS_KEY=hafizadmin
HAFIZ_ROOT_SECRET_KEY=your_secret_key_here
EOF
# Build the Docker image
docker build -t hafiz:latest .
# Start Cluster A
docker compose -f docker-compose.cluster.yml up -d
# Verify all containers are running
docker ps
Expected output:
CONTAINER ID IMAGE COMMAND STATUS PORTS
xxxx hafiz:latest "/usr/bin/tini -- ha…" Up (healthy) 0.0.0.0:9000->9000/tcp
xxxx hafiz:latest "/usr/bin/tini -- ha…" Up (healthy) 0.0.0.0:9010->9000/tcp
xxxx hafiz:latest "/usr/bin/tini -- ha…" Up (healthy) 0.0.0.0:9020->9000/tcp
xxxx postgres:16-alpine "docker-entrypoint.s…" Up (healthy) 0.0.0.0:5432->5432/tcp
xxxx haproxy:2.9-alpine "docker-entrypoint.s…" Up 0.0.0.0:80->80/tcp
Step 2: Verify Cluster A is Working¶
# Check health endpoint
curl http://localhost:9000/health
# Access admin panel
echo "Admin Panel: http://192.168.1.100:9000/admin"
# Create a test bucket
aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket
# Upload a test file
echo "Hello from Cluster A" > test.txt
aws --endpoint-url http://localhost:9000 s3 cp test.txt s3://test-bucket/
# List objects
aws --endpoint-url http://localhost:9000 s3 ls s3://test-bucket/
Step 3: Deploy Cluster B (Secondary)¶
On the second server (192.168.2.100):
# Clone the repository
git clone https://github.com/shellnoq/hafiz.git
cd hafiz
# Create environment file with DIFFERENT credentials
cat > .env << 'EOF'
POSTGRES_PASSWORD=cluster_b_password_here
HAFIZ_ROOT_ACCESS_KEY=hafizadmin
HAFIZ_ROOT_SECRET_KEY=your_secret_key_here
EOF
# Build the Docker image
docker build -t hafiz:latest .
# Start Cluster B
docker compose -f docker-compose.cluster.yml up -d
# Verify all containers are running
docker ps
Step 4: Configure Cluster Synchronization¶
There are three options for synchronizing data between clusters:
Option A: PostgreSQL Logical Replication (Recommended for Metadata)¶
This keeps metadata (buckets, users, policies) synchronized in real-time.
On Cluster A (Primary) - PostgreSQL Container:
# Connect to PostgreSQL
docker exec -it hafiz-postgres psql -U hafiz -d hafiz
-- Enable logical replication (requires a restart to take effect)
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;
-- Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'repl_secure_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;
-- Create publication for all tables
CREATE PUBLICATION hafiz_replication FOR ALL TABLES;
-- Exit psql
\q
# Restart PostgreSQL so the wal_level change takes effect
docker restart hafiz-postgres
On Cluster B (Secondary) - PostgreSQL Container:
# Connect to PostgreSQL
docker exec -it hafiz-postgres psql -U hafiz -d hafiz
-- Create subscription to Cluster A
CREATE SUBSCRIPTION hafiz_subscription
CONNECTION 'host=192.168.1.100 port=5432 dbname=hafiz user=replicator password=repl_secure_password'
PUBLICATION hafiz_replication;
-- Verify subscription is active
SELECT * FROM pg_stat_subscription;
\q
Option B: Object Data Synchronization with rclone¶
For synchronizing actual object files between clusters:
# Install rclone on both servers
curl https://rclone.org/install.sh | sudo bash
# Configure Cluster A as source
rclone config create cluster_a s3 \
provider=Other \
endpoint=http://192.168.1.100:9000 \
access_key_id=hafizadmin \
secret_access_key=your_secret_key_here
# Configure Cluster B as target
rclone config create cluster_b s3 \
provider=Other \
endpoint=http://192.168.2.100:9000 \
access_key_id=hafizadmin \
secret_access_key=your_secret_key_here
# Test sync (dry run first)
rclone sync cluster_a:test-bucket cluster_b:test-bucket --dry-run
# Run actual sync
rclone sync cluster_a:test-bucket cluster_b:test-bucket --progress
# Verify on Cluster B
aws --endpoint-url http://192.168.2.100:9000 s3 ls s3://test-bucket/
Automated Sync with Cron:
# Create sync script
cat > /opt/hafiz/sync-clusters.sh << 'EOF'
#!/bin/bash
# Sync all buckets from Cluster A to Cluster B
LOG_FILE="/var/log/hafiz-sync.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$TIMESTAMP] Starting cluster sync..." >> $LOG_FILE
# Get list of buckets from Cluster A
BUCKETS=$(rclone lsd cluster_a: | awk '{print $5}')
for bucket in $BUCKETS; do
echo "[$TIMESTAMP] Syncing bucket: $bucket" >> $LOG_FILE
rclone sync cluster_a:$bucket cluster_b:$bucket \
--checksum \
--transfers 4 \
--checkers 8 \
2>> $LOG_FILE
done
echo "[$TIMESTAMP] Sync completed." >> $LOG_FILE
EOF
chmod +x /opt/hafiz/sync-clusters.sh
# Add to crontab (every 5 minutes)
(crontab -l 2>/dev/null; echo "*/5 * * * * /opt/hafiz/sync-clusters.sh") | crontab -
Option C: Real-Time Bidirectional Sync¶
For active-active clusters where both can accept writes:
# Create bidirectional sync script
cat > /opt/hafiz/bidirectional-sync.sh << 'EOF'
#!/bin/bash
# Bidirectional sync between clusters
LOG_FILE="/var/log/hafiz-bisync.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$TIMESTAMP] Starting bidirectional sync..." >> $LOG_FILE
# Get all buckets from both clusters
BUCKETS_A=$(rclone lsd cluster_a: 2>/dev/null | awk '{print $5}')
BUCKETS_B=$(rclone lsd cluster_b: 2>/dev/null | awk '{print $5}')
ALL_BUCKETS=$(echo -e "$BUCKETS_A\n$BUCKETS_B" | sort -u)
for bucket in $ALL_BUCKETS; do
echo "[$TIMESTAMP] Bidirectional sync: $bucket" >> $LOG_FILE
# Sync A -> B (new files from A)
rclone copy cluster_a:$bucket cluster_b:$bucket \
--checksum --update 2>> $LOG_FILE
# Sync B -> A (new files from B)
rclone copy cluster_b:$bucket cluster_a:$bucket \
--checksum --update 2>> $LOG_FILE
done
echo "[$TIMESTAMP] Bidirectional sync completed." >> $LOG_FILE
EOF
chmod +x /opt/hafiz/bidirectional-sync.sh
# Run every minute for near real-time sync
(crontab -l 2>/dev/null; echo "* * * * * /opt/hafiz/bidirectional-sync.sh") | crontab -
Step 5: Verify Synchronization¶
Test metadata sync (if using PostgreSQL replication):
# Create user on Cluster A
curl -X POST http://192.168.1.100:9000/api/v1/users \
-H "Content-Type: application/json" \
-d '{"name": "testuser", "email": "test@example.com"}'
# Verify user appears on Cluster B (should sync within seconds)
curl http://192.168.2.100:9000/api/v1/users
Test object sync:
# Upload to Cluster A
aws --endpoint-url http://192.168.1.100:9000 s3 cp myfile.txt s3://test-bucket/
# Wait for sync (based on your cron interval)
sleep 60
# Verify on Cluster B
aws --endpoint-url http://192.168.2.100:9000 s3 ls s3://test-bucket/
aws --endpoint-url http://192.168.2.100:9000 s3 cp s3://test-bucket/myfile.txt -
Step 6: Set Up Load Balancer for Failover¶
Configure HAProxy or nginx to route traffic with automatic failover:
# /etc/haproxy/haproxy-global.cfg
global
log /dev/log local0
maxconn 4096
defaults
mode http
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
option httpchk GET /health
frontend hafiz_frontend
bind *:9000
default_backend hafiz_clusters
backend hafiz_clusters
balance roundrobin
option httpchk GET /health
# Cluster A nodes (primary)
server cluster_a_node1 192.168.1.100:9000 check weight 100
server cluster_a_node2 192.168.1.100:9010 check weight 100
# Cluster B nodes (backup)
server cluster_b_node1 192.168.2.100:9000 check backup
server cluster_b_node2 192.168.2.100:9010 check backup
Step 7: Monitoring and Health Checks¶
Create health check script:
cat > /opt/hafiz/health-check.sh << 'EOF'
#!/bin/bash
CLUSTERS=("192.168.1.100:9000" "192.168.2.100:9000")
WEBHOOK_URL="https://your-slack-webhook-url"
for cluster in "${CLUSTERS[@]}"; do
STATUS=$(curl -sf "http://$cluster/health" | jq -r '.status' 2>/dev/null)
if [ "$STATUS" != "healthy" ]; then
echo "ALERT: Cluster $cluster is unhealthy!"
# Send alert (Slack example)
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"⚠️ Hafiz cluster $cluster is unhealthy!\"}" \
$WEBHOOK_URL
fi
done
EOF
chmod +x /opt/hafiz/health-check.sh
# Check every minute
(crontab -l 2>/dev/null; echo "* * * * * /opt/hafiz/health-check.sh") | crontab -
Troubleshooting Dual Cluster Setup¶
| Issue | Solution |
|---|---|
| PostgreSQL replication lag | Check network latency, increase wal_sender_timeout |
| Objects not syncing | Verify rclone config with rclone lsd cluster_a: |
| Connection refused | Check firewall rules: ports 5432, 9000 |
| Authentication failed | Verify credentials in .env files match |
| Containers unhealthy | Check logs: docker logs hafiz-node1 |
View sync logs:
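The log paths below are the ones written by the sync scripts above:
tail -f /var/log/hafiz-sync.log      # one-way sync script
tail -f /var/log/hafiz-bisync.log    # bidirectional sync script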
Check PostgreSQL replication status:
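Run these against the subscriber (Cluster B) and publisher (Cluster A) respectively, using the container name from the steps above:
# On Cluster B (subscriber)
docker exec -it hafiz-postgres psql -U hafiz -d hafiz \
  -c "SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time FROM pg_stat_subscription;"
# On Cluster A (publisher)
docker exec -it hafiz-postgres psql -U hafiz -d hafiz \
  -c "SELECT client_addr, state, replay_lsn FROM pg_stat_replication;"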
Single-Network Cluster¶
Step 1: PostgreSQL Setup¶
On the PostgreSQL server (192.168.1.5):
# Install PostgreSQL 16 (PGDG packages; requires the PGDG yum repository to be enabled)
sudo dnf install -y postgresql16-server postgresql16
# Initialize and start
sudo /usr/pgsql-16/bin/postgresql-16-setup initdb
sudo systemctl enable --now postgresql-16
# Create database and user
sudo -u postgres psql << 'EOF'
CREATE USER hafiz WITH PASSWORD 'your_strong_password';
CREATE DATABASE hafiz OWNER hafiz;
GRANT ALL PRIVILEGES ON DATABASE hafiz TO hafiz;
EOF
# Allow remote connections - edit pg_hba.conf
echo "host hafiz hafiz 192.168.1.0/24 md5" | sudo tee -a /var/lib/pgsql/16/data/pg_hba.conf
# Edit postgresql.conf
sudo sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/" /var/lib/pgsql/16/data/postgresql.conf
# Restart PostgreSQL
sudo systemctl restart postgresql-16
# Open firewall
sudo firewall-cmd --permanent --add-port=5432/tcp
sudo firewall-cmd --reload
Step 2: Deploy Hafiz Nodes¶
On each Hafiz server (192.168.1.10, .11, .12):
# Download deployment script
curl -O https://raw.githubusercontent.com/shellnoq/hafiz/main/deploy/rocky/deploy.sh
chmod +x deploy.sh
# Set environment variables
export HAFIZ_DATABASE_URL="postgresql://hafiz:your_strong_password@192.168.1.5:5432/hafiz"
export HAFIZ_ROOT_ACCESS_KEY="hafizadmin"
export HAFIZ_ROOT_SECRET_KEY="your_secret_key"
export HAFIZ_S3_BIND="0.0.0.0"
export HAFIZ_S3_PORT="9000"
# Run single node (not cluster compose)
sudo ./deploy.sh single
Or with Docker directly:
docker run -d \
--name hafiz \
--restart unless-stopped \
-p 9000:9000 \
-v /data/hafiz:/data \
-e HAFIZ_DATABASE_URL="postgresql://hafiz:your_password@192.168.1.5:5432/hafiz" \
-e HAFIZ_ROOT_ACCESS_KEY="hafizadmin" \
-e HAFIZ_ROOT_SECRET_KEY="your_secret_key" \
hafiz:latest
Step 3: Load Balancer Setup¶
Option A: HAProxy (Recommended)¶
On the load balancer server:
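If HAProxy is not installed yet (package name as on Rocky/RHEL):
sudo dnf install -y haproxy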
Edit /etc/haproxy/haproxy.cfg:
global
log /dev/log local0
maxconn 4096
user haproxy
group haproxy
daemon
defaults
mode http
log global
option httplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
frontend hafiz_frontend
bind *:80
bind *:443 ssl crt /etc/haproxy/certs/hafiz.pem
redirect scheme https code 301 if !{ ssl_fc }
default_backend hafiz_backend
backend hafiz_backend
balance roundrobin
option httpchk GET /metrics
http-check expect status 200
server node1 192.168.1.10:9000 check
server node2 192.168.1.11:9000 check
server node3 192.168.1.12:9000 check
# Stats page
listen stats
bind *:8404
stats enable
stats uri /stats
stats auth admin:your_stats_password
sudo systemctl enable --now haproxy
sudo firewall-cmd --permanent --add-service=http --add-service=https
sudo firewall-cmd --reload
Option B: Nginx¶
upstream hafiz_cluster {
least_conn;
server 192.168.1.10:9000 weight=1;
server 192.168.1.11:9000 weight=1;
server 192.168.1.12:9000 weight=1;
}
server {
listen 80;
server_name dev-hafiz.e2e.lab;
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
server_name dev-hafiz.e2e.lab;
ssl_certificate /etc/nginx/ssl/hafiz.crt;
ssl_certificate_key /etc/nginx/ssl/hafiz.key;
client_max_body_size 5G;
location / {
proxy_pass http://hafiz_cluster;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# For large file uploads
proxy_request_buffering off;
proxy_buffering off;
}
}
Step 4: SSL Certificate¶
Let's Encrypt (Production)¶
sudo dnf install -y certbot
sudo certbot certonly --standalone -d dev-hafiz.e2e.lab
# For HAProxy, combine cert and key (write as root so the redirect succeeds)
sudo mkdir -p /etc/haproxy/certs
sudo sh -c 'cat /etc/letsencrypt/live/dev-hafiz.e2e.lab/fullchain.pem \
    /etc/letsencrypt/live/dev-hafiz.e2e.lab/privkey.pem \
    > /etc/haproxy/certs/hafiz.pem'
Self-Signed (Development)¶
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout /etc/haproxy/certs/hafiz.key \
-out /etc/haproxy/certs/hafiz.crt \
-subj "/CN=dev-hafiz.e2e.lab"
cat /etc/haproxy/certs/hafiz.crt /etc/haproxy/certs/hafiz.key > /etc/haproxy/certs/hafiz.pem
Step 5: DNS Configuration¶
Add DNS records:
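For example, a single A record pointing the service name at the load balancer (the IP is a placeholder for your environment):
dev-hafiz.e2e.lab.    IN    A    <load-balancer-ip>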
Or add to /etc/hosts on client machines:
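Substitute the load balancer's IP address:
echo "<load-balancer-ip> dev-hafiz.e2e.lab" | sudo tee -a /etc/hosts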
Verification¶
# Test S3 API
aws --endpoint-url https://dev-hafiz.e2e.lab s3 ls
# Create bucket
aws --endpoint-url https://dev-hafiz.e2e.lab s3 mb s3://mybucket
# Upload file
aws --endpoint-url https://dev-hafiz.e2e.lab s3 cp test.txt s3://mybucket/
# Access Admin UI
open https://dev-hafiz.e2e.lab/admin
Monitoring¶
All nodes export Prometheus metrics at /metrics:
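Spot-check one node directly:
curl -s http://192.168.1.10:9000/metrics | head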
Add to Prometheus:
scrape_configs:
- job_name: 'hafiz'
static_configs:
- targets:
- '192.168.1.10:9000'
- '192.168.1.11:9000'
- '192.168.1.12:9000'
Troubleshooting¶
Nodes not syncing¶
All nodes must connect to the same PostgreSQL database. Verify:
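One way to compare, assuming the container is named hafiz as in the deployment examples above:
# Run on each node and compare the output
docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' hafiz | grep HAFIZ_DATABASE_URL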
Connection refused¶
Check firewall rules:
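On each Hafiz node (and on the PostgreSQL server for port 5432):
sudo firewall-cmd --list-ports
sudo firewall-cmd --list-all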
Load balancer health checks failing¶
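From the load balancer host, test the health-check path configured with option httpchk above; it must return HTTP 200:
curl -i http://192.168.1.10:9000/metrics
Also check the HAProxy stats page (http://<load-balancer>:8404/stats) to see which backends are marked DOWN.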
Adding Servers to Existing Cluster¶
When scaling your Hafiz cluster, you can add new servers at any time. All nodes share the same PostgreSQL database, so new nodes automatically have access to all metadata.
Step 1: Prepare the New Server¶
On the new server (e.g., 192.168.1.13):
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo systemctl enable --now docker
# Open firewall ports
sudo firewall-cmd --permanent --add-port=9000/tcp
sudo firewall-cmd --reload
Step 2: Test PostgreSQL Connectivity¶
# Test connection to shared PostgreSQL
psql postgresql://hafiz:your_password@192.168.1.5:5432/hafiz -c "SELECT 1"
Step 3: Deploy Hafiz Node¶
# Deploy with the same configuration as existing nodes
docker run -d \
--name hafiz \
--restart unless-stopped \
-p 9000:9000 \
-v /data/hafiz:/data \
-e HAFIZ_DATABASE_URL="postgresql://hafiz:your_password@192.168.1.5:5432/hafiz" \
-e HAFIZ_ROOT_ACCESS_KEY="hafizadmin" \
-e HAFIZ_ROOT_SECRET_KEY="your_secret_key" \
-e HAFIZ_STORAGE_BASE_PATH="/data/objects" \
hafiz:latest
Step 4: Update Load Balancer¶
Add the new server to HAProxy:
backend hafiz_backend
balance roundrobin
option httpchk GET /metrics
http-check expect status 200
server node1 192.168.1.10:9000 check
server node2 192.168.1.11:9000 check
server node3 192.168.1.12:9000 check
server node4 192.168.1.13:9000 check # New node
Reload HAProxy:
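Validate the configuration first, then reload without dropping existing connections:
sudo haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl reload haproxy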
Step 5: Verify the New Node¶
# Check node health
curl http://192.168.1.13:9000/metrics
# Verify in HAProxy stats
curl http://localhost:8404/stats
Adding Multiple Nodes Simultaneously¶
You can add multiple servers at once using a deployment script:
#!/bin/bash
# deploy-nodes.sh
NODES=("192.168.1.14" "192.168.1.15" "192.168.1.16")
POSTGRES_URL="postgresql://hafiz:your_password@192.168.1.5:5432/hafiz"
ACCESS_KEY="hafizadmin"
SECRET_KEY="your_secret_key"
for node in "${NODES[@]}"; do
echo "Deploying to $node..."
ssh root@$node "docker run -d \
--name hafiz \
--restart unless-stopped \
-p 9000:9000 \
-v /data/hafiz:/data \
-e HAFIZ_DATABASE_URL='$POSTGRES_URL' \
-e HAFIZ_ROOT_ACCESS_KEY='$ACCESS_KEY' \
-e HAFIZ_ROOT_SECRET_KEY='$SECRET_KEY' \
-e HAFIZ_STORAGE_BASE_PATH='/data/objects' \
hafiz:latest"
done
echo "All nodes deployed!"
Cross-Network Replication¶
Connect Hafiz clusters across different networks (data centers, cloud regions, or office locations) for disaster recovery and geographic distribution.
Architecture: Cross-Network Setup¶
┌─────────────────────────────────────────────────────────────────────┐
│ Network A (Primary) │
│ (192.168.1.0/24) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Node A1 │ │ Node A2 │ │ PostgreSQL │ │
│ │ 192.168.1.10│ │ 192.168.1.11│ │ 192.168.1.5 │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └───────────────────┴───────────────────┘ │
│ │ │
└──────────────────────────────┼───────────────────────────────────────┘
│
┌──────────┴──────────┐
│ VPN / WireGuard │
│ or Site-to-Site │
└──────────┬──────────┘
│
┌──────────────────────────────┼───────────────────────────────────────┐
│ │ │
│ ┌──────────────────────────┴──────────────────────────┐ │
│ │ │ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Node B1 │ │ Node B2 │ │PostgreSQL │ │
│ │ 10.0.1.10 │ │ 10.0.1.11 │ │ (Replica) │ │
│ └─────────────┘ └─────────────┘ │ 10.0.1.5 │ │
│ └─────────────┘ │
│ │
│ Network B (Secondary) │
│ (10.0.1.0/24) │
└─────────────────────────────────────────────────────────────────────┘
Option 1: PostgreSQL Logical Replication¶
Use PostgreSQL's built-in logical replication for real-time metadata sync.
Primary Site (Network A)¶
# Edit postgresql.conf
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
# Create replication user
sudo -u postgres psql << 'EOF'
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'repl_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
CREATE PUBLICATION hafiz_pub FOR ALL TABLES;
EOF
# Update pg_hba.conf for remote access
echo "host hafiz replication_user 10.0.1.0/24 md5" | sudo tee -a /var/lib/pgsql/16/data/pg_hba.conf
# Restart PostgreSQL
sudo systemctl restart postgresql-16
Secondary Site (Network B)¶
# Create the database, then create the subscription inside it.
# Note: logical replication does not copy schema - the Hafiz tables must already
# exist on the subscriber (for example, restore a schema-only dump from the primary first).
sudo -u postgres psql << 'EOF'
CREATE DATABASE hafiz;
\c hafiz
CREATE SUBSCRIPTION hafiz_sub
CONNECTION 'host=192.168.1.5 port=5432 dbname=hafiz user=replication_user password=repl_password'
PUBLICATION hafiz_pub;
EOF
Deploy Secondary Nodes¶
# On secondary nodes, connect to local PostgreSQL replica
docker run -d \
--name hafiz \
--restart unless-stopped \
-p 9000:9000 \
-v /data/hafiz:/data \
-e HAFIZ_DATABASE_URL="postgresql://hafiz:password@10.0.1.5:5432/hafiz" \
-e HAFIZ_ROOT_ACCESS_KEY="hafizadmin" \
-e HAFIZ_ROOT_SECRET_KEY="your_secret_key" \
hafiz:latest
Option 2: Active-Active with HAProxy¶
For active-active setup with both sites serving traffic:
Global HAProxy Configuration¶
global
log /dev/log local0
maxconn 4096
defaults
mode http
timeout connect 5000
timeout client 30000
timeout server 30000
# Primary site backend
backend site_a
balance roundrobin
option httpchk GET /metrics
server a1 192.168.1.10:9000 check
server a2 192.168.1.11:9000 check
# Secondary site backend
backend site_b
balance roundrobin
option httpchk GET /metrics
server b1 10.0.1.10:9000 check
server b2 10.0.1.11:9000 check
# Active-active frontend with failover
frontend hafiz_global
bind *:9000
# Health-based routing
acl site_a_up nbsrv(site_a) gt 0
acl site_b_up nbsrv(site_b) gt 0
# Prefer primary site
use_backend site_a if site_a_up
use_backend site_b if site_b_up !site_a_up
default_backend site_a
Option 3: S3 Bucket Replication¶
Replicate object data between clusters using S3-compatible replication:
#!/bin/bash
# sync-clusters.sh - Run via cron every 5 minutes
SOURCE_ENDPOINT="https://hafiz-primary.example.com"
TARGET_ENDPOINT="https://hafiz-secondary.example.com"
BUCKETS=("important-data" "backups" "archives")
for bucket in "${BUCKETS[@]}"; do
echo "Syncing $bucket..."
aws --endpoint-url $SOURCE_ENDPOINT s3 sync \
s3://$bucket /tmp/sync-$bucket/ \
--delete
aws --endpoint-url $TARGET_ENDPOINT s3 sync \
/tmp/sync-$bucket/ s3://$bucket \
--delete
rm -rf /tmp/sync-$bucket/
done
For near-real-time replication, use rclone with checksum verification:
# Install rclone
curl https://rclone.org/install.sh | sudo bash
# Configure endpoints
rclone config create primary s3 \
provider=Other \
endpoint=https://hafiz-primary.example.com \
access_key_id=hafizadmin \
secret_access_key=your_secret
rclone config create secondary s3 \
provider=Other \
endpoint=https://hafiz-secondary.example.com \
access_key_id=hafizadmin \
secret_access_key=your_secret
# Sync with checksum verification (schedule via cron, or trigger from a file watcher, for near real-time)
rclone sync primary:mybucket secondary:mybucket --checksum --progress
Network Requirements¶
| Protocol | Port | Direction | Purpose |
|---|---|---|---|
| TCP | 5432 | Bidirectional | PostgreSQL replication |
| TCP | 9000 | Bidirectional | S3 API traffic |
| UDP | 51820 | Bidirectional | WireGuard VPN |
| TCP | 22 | Bidirectional | SSH management |
WireGuard VPN Setup¶
For secure cross-network connectivity:
Site A (Primary)¶
# Install WireGuard
sudo dnf install -y wireguard-tools
# Generate keys
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
# Configure /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <site_a_private_key>
Address = 10.200.0.1/24
ListenPort = 51820
[Peer]
PublicKey = <site_b_public_key>
AllowedIPs = 10.200.0.2/32, 10.0.1.0/24
Endpoint = site-b-public-ip:51820
PersistentKeepalive = 25
Site B (Secondary)¶
# Configure /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <site_b_private_key>
Address = 10.200.0.2/24
ListenPort = 51820
[Peer]
PublicKey = <site_a_public_key>
AllowedIPs = 10.200.0.1/32, 192.168.1.0/24
Endpoint = site-a-public-ip:51820
PersistentKeepalive = 25
What is Air-Gap?¶
An air-gapped network is one that is physically isolated from all other networks, including the internet: there is no wired or wireless connection between the air-gapped system and any other network.
Why Use Air-Gap?¶
| Use Case | Description |
|---|---|
| Classified Networks | Military, intelligence, and government systems handling sensitive data |
| Critical Infrastructure | Power grids, water treatment, nuclear facilities |
| Financial Systems | High-security trading systems, core banking |
| Healthcare | Patient data isolation, medical device networks |
| Research Labs | Protecting intellectual property and sensitive research |
| Disaster Recovery | Offline backup sites immune to ransomware attacks |
How Hafiz Supports Air-Gap¶
Hafiz provides complete air-gap support through:
- Export/Import Tools: Scripts to export all data (metadata + objects) to physical media
- Checksum Verification: SHA-256 checksums at every level for data integrity
- Incremental Sync: Export only changed objects since last sync
- Encrypted Media Support: Works with LUKS-encrypted USB drives
- Audit Trail: Complete logging of all export/import operations
Air-Gap Workflow Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ AIR-GAP WORKFLOW │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐│
│ │ Source │───▶│ Export │───▶│ Physical │───▶│ Import ││
│ │ Cluster │ │ Server │ │ Transfer │ │ Server ││
│ │ (Read-Write) │ │ │ │ (USB/Tape) │ │ ││
│ └──────────────┘ └──────────────┘ └──────────────┘ └─────┬──────┘│
│ │ │
│ ▼ │
│ ┌────────────┐ │
│ │ Target │ │
│ │ Cluster │ │
│ │ (Read-Only)│ │
│ └────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Unidirectional Replication (One-Way Sync)¶
Hafiz supports unidirectional replication where one cluster is read-write (primary) and the other is read-only (replica). This is ideal for:
- Disaster Recovery: Main site writes, DR site receives copies
- Content Distribution: Central site publishes, edge sites consume
- Regulatory Compliance: Write to secure location, replicate to reporting systems
- Air-Gapped Backup: Secure network writes, isolated network receives
Replication Direction Modes¶
| Mode | Source | Destination | Use Case |
|---|---|---|---|
| Bidirectional | Read-Write | Read-Write | Active-active clusters |
| SourceToDestination | Read-Write | Read-Only | Primary/replica setup |
| DestinationToSource | Read-Only | Read-Write | Reverse flow setup |
Configuring Unidirectional Replication via API¶
Create a One-Way Replication Rule¶
# Create replication rule: Primary → Replica (one-way)
curl -X POST http://localhost:9000/admin/cluster/replication/rules \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)" \
-H "Content-Type: application/json" \
-d '{
"id": "primary-to-dr",
"source_bucket": "production-data",
"destination_bucket": "production-data",
"destination_endpoint": "http://dr-site.example.com:9000",
"destination_access_key": "dr_access_key",
"destination_secret_key": "dr_secret_key",
"direction": "SourceToDestination",
"status": "enabled",
"priority": 1,
"filter_prefix": ""
}'
Check Replication Rule¶
curl -X GET http://localhost:9000/admin/cluster/replication/rules/primary-to-dr \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)"
Response:
{
"id": "primary-to-dr",
"source_bucket": "production-data",
"destination_bucket": "production-data",
"destination_endpoint": "http://dr-site.example.com:9000",
"direction": "SourceToDestination",
"status": "enabled",
"priority": 1
}
Configuring Read-Only Replica¶
On the destination cluster, configure the bucket to reject writes:
Option 1: User-Level Permission (Recommended)¶
Create users on the replica site with read-only bucket access:
# Create read-only user for replica bucket
curl -X POST http://dr-site:9000/admin/users \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)" \
-H "Content-Type: application/json" \
-d '{
"name": "replica-reader",
"description": "Read-only user for replicated data",
"bucket_access": [
{"bucket": "production-data", "permission": "read"}
]
}'
Response:
{
"name": "replica-reader",
"access_key": "AKIAXXXXXXXXXXXXXXXX",
"secret_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"description": "Read-only user for replicated data",
"bucket_access": [
{"bucket": "production-data", "permission": "read"}
]
}
Option 2: Replication Service Account¶
The replication service uses a dedicated account with write access, while all other users have read-only access:
# Replication service account (has write for sync)
curl -X POST http://dr-site:9000/admin/users \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)" \
-H "Content-Type: application/json" \
-d '{
"name": "replication-service",
"description": "Internal replication service - write access for sync only",
"bucket_access": [
{"bucket": "production-data", "permission": "write"}
]
}'
# Regular users get read-only
curl -X POST http://dr-site:9000/admin/users \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)" \
-H "Content-Type: application/json" \
-d '{
"name": "app-user",
"description": "Application user - read only on DR site",
"bucket_access": [
{"bucket": "production-data", "permission": "read"}
]
}'
Architecture: Unidirectional Replication¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ PRIMARY SITE (Read-Write) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ App Server │────▶│ Hafiz │────▶│ PostgreSQL │ │
│ │ (Writes) │ │ Primary │ │ Primary │ │
│ └─────────────┘ └──────┬──────┘ └─────────────┘ │
│ │ │
│ │ Replication Events │
│ ▼ │
│ ┌────────────────┐ │
│ │ Replication │ │
│ │ Service │ │
│ │ (direction: │ │
│ │ source→dest) │ │
│ └────────┬───────┘ │
└──────────────────────────────┼──────────────────────────────────────────────┘
│
Objects + Metadata
│
▼
┌──────────────────────────────┼──────────────────────────────────────────────┐
│ │ │
│ ┌────────▼───────┐ │
│ │ Replication │ │
│ │ Receiver │ │
│ │ (write-only │ │
│ │ service) │ │
│ └────────┬───────┘ │
│ │ │
│ ┌─────────────┐ ┌──────▼──────┐ ┌─────────────┐ │
│ │ App Server │────▶│ Hafiz │────▶│ PostgreSQL │ │
│ │ (Read Only) │ │ Replica │ │ Replica │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ REPLICA SITE (Read-Only for Users) │
└─────────────────────────────────────────────────────────────────────────────┘
Monitoring Unidirectional Replication¶
# Check replication lag
curl http://primary:9000/admin/cluster/replication/stats \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)"
Response:
{
"events_processed": 15420,
"successful": 15418,
"failed": 2,
"pending": 5,
"in_progress": 1,
"bytes_replicated": 1073741824,
"avg_latency_ms": 45.2,
"rules": [
{
"id": "primary-to-dr",
"direction": "SourceToDestination",
"pending_events": 5,
"last_sync": "2024-01-15T12:30:00Z"
}
]
}
Failover to Read-Only Replica¶
If you need to promote the replica to primary:
# 1. Stop replication rule
curl -X DELETE http://primary:9000/admin/cluster/replication/rules/primary-to-dr \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)"
# 2. Update user permissions on replica to allow writes
curl -X PUT http://dr-site:9000/admin/users/app-user/buckets \
-H "Authorization: Basic $(echo -n 'hafizadmin:secret' | base64)" \
-H "Content-Type: application/json" \
-d '{
"bucket_access": [
{"bucket": "production-data", "permission": "readwrite"}
]
}'
# 3. Update DNS to point to replica
# 4. Start accepting traffic on replica
Air-Gapped System Replication¶
For environments with no network connectivity (classified networks, secure facilities, disaster recovery sites), Hafiz supports offline data transfer. See What is Air-Gap? for more context.
Architecture: Air-Gapped Setup¶
┌─────────────────────────────────────────────────────────────────────┐
│ Secure Network (Source) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Hafiz A1 │ │ Hafiz A2 │ │ PostgreSQL │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ └───────────────────┴───────────────────┘ │
│ │ │
│ ┌─────────▼─────────┐ │
│ │ Export Server │ │
│ │ (Data Diode Out) │ │
│ └─────────┬─────────┘ │
└──────────────────────────────┼───────────────────────────────────────┘
│
┌──────────▼──────────┐
│ Physical Media │
│ (USB/Tape/Drive) │
└──────────┬──────────┘
│
┌──────────────────────────────┼───────────────────────────────────────┐
│ ┌─────────▼─────────┐ │
│ │ Import Server │ │
│ │ (Data Diode In) │ │
│ └─────────┬─────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ Hafiz B1 │ │ Hafiz B2 │ │ PostgreSQL │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Air-Gapped Network (Target) │
└─────────────────────────────────────────────────────────────────────┘
Export Data from Source Cluster¶
1. Create Export Script¶
#!/bin/bash
# hafiz-export.sh - Export Hafiz data for air-gapped transfer
set -e
EXPORT_DIR="/mnt/export/hafiz-$(date +%Y%m%d-%H%M%S)"
POSTGRES_URL="postgresql://hafiz:password@localhost:5432/hafiz"
S3_ENDPOINT="http://localhost:9000"
mkdir -p "$EXPORT_DIR"/{metadata,objects,checksums}
echo "=== Hafiz Air-Gapped Export ==="
echo "Export directory: $EXPORT_DIR"
echo "Timestamp: $(date -Iseconds)"
# 1. Export PostgreSQL metadata
echo "Exporting PostgreSQL metadata..."
pg_dump "$POSTGRES_URL" \
--format=custom \
--file="$EXPORT_DIR/metadata/hafiz.dump"
# Export as SQL for verification
pg_dump "$POSTGRES_URL" \
--format=plain \
--file="$EXPORT_DIR/metadata/hafiz.sql"
# 2. Export bucket list
echo "Exporting bucket list..."
aws --endpoint-url $S3_ENDPOINT s3 ls > "$EXPORT_DIR/metadata/buckets.txt"
# 3. Export objects for each bucket
echo "Exporting objects..."
# `aws s3 ls` prints "<date> <time> <bucket-name>"
while read -r _ _ bucket; do
echo " Bucket: $bucket"
mkdir -p "$EXPORT_DIR/objects/$bucket"
# Sync all objects
aws --endpoint-url $S3_ENDPOINT s3 sync \
"s3://$bucket" "$EXPORT_DIR/objects/$bucket/" \
--no-progress
# Create manifest
find "$EXPORT_DIR/objects/$bucket" -type f -exec sha256sum {} \; \
> "$EXPORT_DIR/checksums/$bucket.sha256"
done < "$EXPORT_DIR/metadata/buckets.txt"
# 4. Create master checksum (relative paths so it can be verified after transfer)
echo "Creating master checksums..."
(cd "$EXPORT_DIR" && find . -type f ! -name CHECKSUMS.sha256 -exec sha256sum {} \; > CHECKSUMS.sha256)
# 5. Create export manifest
cat > "$EXPORT_DIR/MANIFEST.json" << EOF
{
"export_type": "hafiz_airgap_export",
"version": "1.0",
"timestamp": "$(date -Iseconds)",
"source_cluster": "$(hostname)",
"postgres_version": "$(psql --version | head -1)",
"total_buckets": $(wc -l < "$EXPORT_DIR/metadata/buckets.txt"),
"total_size_bytes": $(du -sb "$EXPORT_DIR" | cut -f1)
}
EOF
# 6. Create archive (optional)
echo "Creating archive..."
cd "$(dirname $EXPORT_DIR)"
tar -cvf "$(basename $EXPORT_DIR).tar" "$(basename $EXPORT_DIR)"
# 7. Calculate final checksums
sha256sum "$(basename $EXPORT_DIR).tar" > "$(basename $EXPORT_DIR).tar.sha256"
echo ""
echo "=== Export Complete ==="
echo "Archive: $EXPORT_DIR.tar"
echo "Checksum: $EXPORT_DIR.tar.sha256"
echo "Size: $(du -sh $EXPORT_DIR.tar | cut -f1)"
2. Run Export¶
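Make the script executable and run it; it writes under /mnt/export, as configured at the top of the script:
chmod +x hafiz-export.sh
sudo ./hafiz-export.sh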
3. Transfer to Removable Media¶
# Mount encrypted USB drive
sudo cryptsetup luksOpen /dev/sdb1 secure_usb
sudo mount /dev/mapper/secure_usb /mnt/usb
# Copy export
cp /mnt/export/hafiz-*.tar /mnt/usb/
cp /mnt/export/hafiz-*.sha256 /mnt/usb/
# Unmount and lock
sudo umount /mnt/usb
sudo cryptsetup luksClose secure_usb
Import Data to Target Cluster¶
1. Create Import Script¶
#!/bin/bash
# hafiz-import.sh - Import Hafiz data from air-gapped transfer
set -e
IMPORT_FILE="$1"
POSTGRES_URL="postgresql://hafiz:password@localhost:5432/hafiz"
S3_ENDPOINT="http://localhost:9000"
if [ -z "$IMPORT_FILE" ]; then
echo "Usage: $0 <hafiz-export.tar>"
exit 1
fi
echo "=== Hafiz Air-Gapped Import ==="
echo "Import file: $IMPORT_FILE"
echo "Timestamp: $(date -Iseconds)"
# 1. Verify checksum
echo "Verifying archive checksum..."
if ! sha256sum -c "$IMPORT_FILE.sha256"; then
echo "ERROR: Checksum verification failed!"
exit 1
fi
# 2. Extract archive
IMPORT_DIR="/tmp/hafiz-import-$$"
mkdir -p "$IMPORT_DIR"
tar -xvf "$IMPORT_FILE" -C "$IMPORT_DIR"
EXPORT_DIR=$(ls "$IMPORT_DIR")
# 3. Verify internal checksums
echo "Verifying internal checksums..."
cd "$IMPORT_DIR/$EXPORT_DIR"
if ! sha256sum -c CHECKSUMS.sha256; then
echo "ERROR: Internal checksum verification failed!"
exit 1
fi
# 4. Import PostgreSQL metadata
echo "Importing PostgreSQL metadata..."
# Note: This overwrites existing data - backup first!
pg_restore \
--dbname="$POSTGRES_URL" \
--clean \
--if-exists \
--no-owner \
"$IMPORT_DIR/$EXPORT_DIR/metadata/hafiz.dump"
# 5. Import objects
echo "Importing objects..."
for bucket_dir in "$IMPORT_DIR/$EXPORT_DIR/objects/"*/; do
bucket=$(basename "$bucket_dir")
echo " Bucket: $bucket"
# Create bucket if not exists
aws --endpoint-url $S3_ENDPOINT s3 mb "s3://$bucket" 2>/dev/null || true
# Sync objects
aws --endpoint-url $S3_ENDPOINT s3 sync \
"$bucket_dir" "s3://$bucket/" \
--no-progress
done
# 6. Verify import
echo "Verifying import..."
aws --endpoint-url $S3_ENDPOINT s3 ls
# 7. Cleanup
rm -rf "$IMPORT_DIR"
echo ""
echo "=== Import Complete ==="
echo "Imported from: $IMPORT_FILE"
echo "Timestamp: $(date -Iseconds)"
2. Run Import¶
# Mount USB
sudo cryptsetup luksOpen /dev/sdb1 secure_usb
sudo mount /dev/mapper/secure_usb /mnt/usb
# Verify and import
chmod +x hafiz-import.sh
sudo ./hafiz-import.sh /mnt/usb/hafiz-20240115-120000.tar
# Cleanup
sudo umount /mnt/usb
sudo cryptsetup luksClose secure_usb
Incremental Air-Gapped Sync¶
For ongoing synchronization, export only changes since last sync:
#!/bin/bash
# hafiz-incremental-export.sh
LAST_SYNC_FILE="/var/lib/hafiz/last_airgap_sync"
LAST_SYNC=$(cat "$LAST_SYNC_FILE" 2>/dev/null || echo "1970-01-01")
EXPORT_DIR="/mnt/export/hafiz-incremental-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$EXPORT_DIR"/{metadata,objects}
# Export changed objects only
POSTGRES_URL="postgresql://hafiz:password@localhost:5432/hafiz"
# Get modified objects since last sync
psql "$POSTGRES_URL" -t -A -c "
SELECT bucket, key FROM objects
WHERE updated_at > '$LAST_SYNC'
ORDER BY bucket, key
" > "$EXPORT_DIR/metadata/changed_objects.txt"
# Export changed objects
while IFS='|' read -r bucket key; do
mkdir -p "$EXPORT_DIR/objects/$bucket/$(dirname $key)"
aws --endpoint-url http://localhost:9000 s3 cp \
"s3://$bucket/$key" \
"$EXPORT_DIR/objects/$bucket/$key"
done < "$EXPORT_DIR/metadata/changed_objects.txt"
# Update last sync timestamp
date -Iseconds > "$LAST_SYNC_FILE"
# Create archive
tar -cvf "$EXPORT_DIR.tar" -C "$(dirname $EXPORT_DIR)" "$(basename $EXPORT_DIR)"
sha256sum "$EXPORT_DIR.tar" > "$EXPORT_DIR.tar.sha256"
Security Considerations for Air-Gapped Systems¶
- Media Handling
  - Use hardware-encrypted USB drives
  - Implement chain-of-custody procedures
  - Scan media for malware before import
- Data Integrity
  - Always verify checksums before import
  - Use multiple checksum algorithms (SHA-256, SHA-512)
  - Keep export logs for audit
- Access Control
  - Limit export/import permissions to authorized personnel
  - Log all export/import operations
  - Implement two-person rule for sensitive data
- Automation
Failover and Recovery¶
Automatic Failover with Keepalived¶
# Install on load balancer nodes
sudo dnf install -y keepalived
# /etc/keepalived/keepalived.conf (Primary)
vrrp_instance HAFIZ_VIP {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass hafiz_secret
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
chk_haproxy
}
}
vrrp_script chk_haproxy {
script "killall -0 haproxy"
interval 2
weight 2
}
Manual Failover Procedure¶
# 1. Stop primary cluster nodes
for node in 192.168.1.{10,11,12}; do
ssh root@$node "docker stop hafiz"
done
# 2. Promote secondary PostgreSQL (if using streaming replication)
ssh root@10.0.1.5 "sudo -u postgres psql -c 'SELECT pg_promote()'"
# 3. Update DNS or VIP to point to secondary
# 4. Start secondary nodes if not running
# 5. Verify service
aws --endpoint-url https://hafiz-secondary.example.com s3 ls
Node Management API¶
Hafiz provides REST APIs for managing cluster nodes programmatically.
Get Cluster Status¶
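Query the status endpoint (same path as used in the quick start, with the bearer-token auth used throughout this section):
curl -X GET http://localhost:9000/api/v1/cluster/status \
  -H "Authorization: Bearer <token>"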
Response:
{
"enabled": true,
"cluster_name": "production",
"local_node": {
"id": "node-1",
"name": "Node 1",
"endpoint": "http://192.168.1.10:9000",
"role": "primary",
"status": "healthy"
},
"stats": {
"total_nodes": 3,
"healthy_nodes": 3,
"total_objects": 15420,
"total_storage_bytes": 1073741824,
"pending_replications": 0,
"replication_lag_secs": 0
}
}
List Cluster Nodes¶
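List all registered nodes and their status:
curl -X GET http://localhost:9000/api/v1/cluster/nodes \
  -H "Authorization: Bearer <token>"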
Drain a Node (Maintenance Mode)¶
Draining a node gracefully stops it from accepting new writes and completes pending replications:
curl -X POST http://localhost:9000/api/v1/cluster/nodes/<node-id>/drain \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{"graceful": true, "timeout_secs": 300}'
Response:
{
"status": "draining",
"node_id": "node-2",
"message": "Node drain initiated. The node will stop accepting writes and finish pending replications."
}
Remove a Node from Cluster¶
Remove a node permanently from the cluster:
curl -X DELETE http://localhost:9000/api/v1/cluster/nodes/<node-id> \
-H "Authorization: Bearer <token>"
Response:
{
"status": "removed",
"node_id": "node-2",
"message": "Node removed from cluster. Data rebalancing may be needed if the node held unique data."
}
Replication Statistics¶
curl -X GET http://localhost:9000/api/v1/cluster/replication/stats \
-H "Authorization: Bearer <token>"
Response:
{
"events_processed": 15420,
"successful": 15418,
"failed": 2,
"pending": 0,
"in_progress": 0,
"bytes_replicated": 1073741824,
"avg_latency_ms": 12.5
}
Maintenance Workflow¶
For planned maintenance on a node:
# 1. Drain the node (stop new writes, complete pending work)
curl -X POST http://localhost:9000/api/v1/cluster/nodes/node-2/drain \
-H "Authorization: Bearer <token>" \
-d '{"graceful": true}'
# 2. Wait for drain to complete (check status)
curl -X GET http://localhost:9000/api/v1/cluster/nodes/node-2 \
-H "Authorization: Bearer <token>"
# Check that status is "draining" or "drained"
# 3. Perform maintenance on the node
ssh root@node-2 "systemctl stop hafiz && dnf update -y && systemctl start hafiz"
# 4. Verify node rejoins cluster
curl -X GET http://localhost:9000/api/v1/cluster/nodes \
-H "Authorization: Bearer <token>"