Shardlyn Threat Model
Overview
This document describes the security architecture, potential threats, and mitigations in the Shardlyn platform. It follows the STRIDE methodology for threat classification.
Examples in this document often use game server workloads because they are a common high-risk/Internet-exposed use case, but the same model applies to web apps, databases, and other workloads managed by Shardlyn.
Related
- Architecture — System design and component overview
- API Reference — Authentication and authorization endpoints
- Deployment Guide — Production hardening steps
System Boundaries
┌────────────────────────────────────────────────────────────────────────────┐
│                              TRUST BOUNDARY 1                              │
│                             (Public Internet)                              │
│                                                                            │
│   ┌─────────────┐         ┌─────────────┐         ┌─────────────┐          │
│   │    Admin    │         │   Player    │         │  Attacker   │          │
│   │   Browser   │         │   Client    │         │             │          │
│   └──────┬──────┘         └──────┬──────┘         └──────┬──────┘          │
└──────────┼───────────────────────┼───────────────────────┼─────────────────┘
           │ HTTPS                 │ Game Protocol         │ Various
           │                       │                       │
┌──────────┼───────────────────────┼───────────────────────┼─────────────────┐
│          ▼                       │                       ▼                 │
│   ┌─────────────┐                │                ┌─────────────┐          │
│   │   Web UI    │                │                │  Firewall   │          │
│   │   (React)   │                │                │ (iptables)  │          │
│   └──────┬──────┘                │                └─────────────┘          │
│          │                       │                                         │
│          │ REST API              │                                         │
│          ▼                       │                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                           TRUST BOUNDARY 2                           │  │
│  │                         (Control Plane DMZ)                          │  │
│  │                                                                      │  │
│  │        ┌────────────────────────────────────────────────────┐        │  │
│  │        │                   Control Plane                    │        │  │
│  │        │    ┌─────────┐     ┌─────────┐     ┌─────────┐     │        │  │
│  │        │    │   API   │     │  Auth   │     │Reconcile│     │        │  │
│  │        │    │ Handler │     │  (JWT)  │     │  Loop   │     │        │  │
│  │        │    └─────────┘     └─────────┘     └─────────┘     │        │  │
│  │        └─────────────────────────┬──────────────────────────┘        │  │
│  │                                  │                                   │  │
│  └──────────────────────────────────┼───────────────────────────────────┘  │
│                                     │                                      │
│  ┌──────────────────────────────────┼───────────────────────────────────┐  │
│  │                 TRUST BOUNDARY 3 │ (Database Zone)                   │  │
│  │                                  ▼                                   │  │
│  │                       ┌─────────────────────┐                        │  │
│  │                       │     PostgreSQL      │                        │  │
│  │                       │    (credentials,    │                        │  │
│  │                       │    state, specs)    │                        │  │
│  │                       └─────────────────────┘                        │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                           TRUST BOUNDARY 4                           │  │
│  │                           (Agent Network)                            │  │
│  │                                                                      │  │
│  │   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐            │  │
│  │   │   Agent 1   │     │   Agent 2   │     │   Agent N   │            │  │
│  │   │  ┌──────┐   │     │  ┌──────┐   │     │  ┌──────┐   │            │  │
│  │   │  │Docker│◄──┼─────┼──┤ Game ├───┼─────┼─►│      │   │            │  │
│  │   │  │      ├───┼─────┼─►│ Port │◄──┼─────┼──┤      │   │            │  │
│  │   │  └──────┘   │     │  └──────┘   │     │  └──────┘   │            │  │
│  │   └──────┬──────┘     └──────┬──────┘     └──────┬──────┘            │  │
│  │          │                   │                   │                   │  │
│  └──────────┼───────────────────┼───────────────────┼───────────────────┘  │
│             │ Heartbeat         │                   │ Player               │
│             │ (HTTPS)           │                   │ Connections          │
│             ▼                   │                   │                      │
│       Control Plane ◄───────────┘                   │                      │
│                                                     │                      │
└─────────────────────────────────────────────────────┼──────────────────────┘
                                                      │
                                                Game Traffic
                                                      │
                                                      ▼
                                             Players (Internet)
Assets
Critical Assets
| Asset | Description | Confidentiality | Integrity | Availability |
|---|---|---|---|---|
| User Credentials | Passwords, API tokens | HIGH | HIGH | MEDIUM |
| JWT Signing Key | Signs all auth tokens | HIGH | HIGH | HIGH |
| Agent Auth Tokens | Authenticate agents | HIGH | HIGH | HIGH |
| Database | All platform state | HIGH | HIGH | HIGH |
| Workload Data | Game worlds, app uploads, databases, configs | MEDIUM | HIGH | HIGH |
| TF State | Cloud credentials | HIGH | HIGH | MEDIUM |
Secondary Assets
| Asset | Description | Confidentiality | Integrity | Availability |
|---|---|---|---|---|
| Workload Specs | Container definitions | LOW | MEDIUM | MEDIUM |
| Metrics Data | Monitoring info | LOW | LOW | LOW |
| Audit Logs | Action history | MEDIUM | HIGH | MEDIUM |
Threat Analysis (STRIDE)
Spoofing
T1: User Authentication Bypass
Threat: Attacker impersonates legitimate user without credentials.
Vectors:
- Brute force password guessing
- Credential stuffing from leaked databases
- Session hijacking
Mitigations:
- [x] Bcrypt password hashing (cost=10)
- [x] JWT with expiration (1 hour)
- [x] Rate limiting on login (configurable)
- [x] Account lockout after failed attempts (configurable)
- [ ] Multi-factor authentication (future)
Residual Risk: MEDIUM
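The account lockout mitigation listed above can be sketched as follows. This is a minimal in-memory illustration, not Shardlyn's implementation: `LockoutTracker`, its methods, the threshold of 3 failures, and the 15-minute lock are all hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// exampleStart is a fixed instant so the demo is deterministic.
var exampleStart = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// LockoutTracker counts failed logins per account and locks the account for
// a fixed duration once a threshold is reached. Hypothetical type; the real
// thresholds are configurable and not specified in this document.
type LockoutTracker struct {
	mu          sync.Mutex
	maxFailures int
	lockFor     time.Duration
	failures    map[string]int
	lockedUntil map[string]time.Time
}

func NewLockoutTracker(maxFailures int, lockFor time.Duration) *LockoutTracker {
	return &LockoutTracker{
		maxFailures: maxFailures,
		lockFor:     lockFor,
		failures:    make(map[string]int),
		lockedUntil: make(map[string]time.Time),
	}
}

// Locked reports whether the account is locked at time now.
func (t *LockoutTracker) Locked(user string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return now.Before(t.lockedUntil[user])
}

// RecordFailure registers a failed login and starts a lock when the
// failure count reaches the threshold.
func (t *LockoutTracker) RecordFailure(user string, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[user]++
	if t.failures[user] >= t.maxFailures {
		t.lockedUntil[user] = now.Add(t.lockFor)
		t.failures[user] = 0
	}
}

// RecordSuccess clears the failure counter after a successful login.
func (t *LockoutTracker) RecordSuccess(user string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, user)
}

func main() {
	tracker := NewLockoutTracker(3, 15*time.Minute)
	for i := 0; i < 3; i++ {
		tracker.RecordFailure("alice", exampleStart)
	}
	fmt.Println("locked immediately:", tracker.Locked("alice", exampleStart))
	fmt.Println("locked after 16m:", tracker.Locked("alice", exampleStart.Add(16*time.Minute)))
}
```

Passing `now` explicitly keeps the logic testable without sleeping. A multi-instance control plane would need shared state (for example in the database) rather than a process-local map.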
T2: Agent Impersonation
Threat: Attacker registers rogue agent to receive workload specs or report false states.
Vectors:
- Stolen registration token
- Man-in-the-middle during registration
- Leaked agent auth token
Mitigations:
- [x] One-time registration tokens (can only be used once)
- [x] Token expiration (default 24h)
- [x] Unique agent auth token per agent
- [x] TLS required for agent communication (configurable)
- [ ] Agent certificate pinning (future)
Residual Risk: MEDIUM
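A minimal sketch of one-time, expiring registration tokens, assuming an in-memory store that keeps only a SHA-256 hash of each token. `RegistrationTokens`, `Issue`, and `Consume` are hypothetical names; how Shardlyn actually persists tokens is not described in this document.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

// exampleStart is a fixed instant so the demo is deterministic.
var exampleStart = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

var (
	ErrUnknownToken = errors.New("unknown or already used registration token")
	ErrExpiredToken = errors.New("registration token expired")
)

// RegistrationTokens is a hypothetical in-memory token store.
type RegistrationTokens struct {
	mu      sync.Mutex
	ttl     time.Duration
	expires map[[32]byte]time.Time // SHA-256(token) -> expiry
}

func NewRegistrationTokens(ttl time.Duration) *RegistrationTokens {
	return &RegistrationTokens{ttl: ttl, expires: make(map[[32]byte]time.Time)}
}

// Issue creates a random token valid until now+ttl. Only its hash is stored.
func (s *RegistrationTokens) Issue(now time.Time) (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	token := hex.EncodeToString(raw)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expires[sha256.Sum256([]byte(token))] = now.Add(s.ttl)
	return token, nil
}

// Consume validates a token and removes it in the same critical section,
// so a second use of the same token always fails.
func (s *RegistrationTokens) Consume(token string, now time.Time) error {
	key := sha256.Sum256([]byte(token))
	s.mu.Lock()
	defer s.mu.Unlock()
	exp, ok := s.expires[key]
	if !ok {
		return ErrUnknownToken
	}
	delete(s.expires, key)
	if now.After(exp) {
		return ErrExpiredToken
	}
	return nil
}

func main() {
	store := NewRegistrationTokens(24 * time.Hour) // 24h matches the documented default
	token, err := store.Issue(exampleStart)
	if err != nil {
		panic(err)
	}
	fmt.Println("first use:", store.Consume(token, exampleStart.Add(time.Hour)))
	fmt.Println("second use:", store.Consume(token, exampleStart.Add(time.Hour)))
}
```

Deleting the entry under the same lock as the lookup is what makes the token single-use even when two registrations race.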
Tampering
T3: Workload Spec Injection
Threat: Attacker modifies workload spec to run malicious containers.
Vectors:
- SQL injection in spec storage
- Malicious spec via API
- Spec modification in transit
Mitigations:
- [x] Parameterized SQL queries (pgx)
- [x] JSON Schema validation of specs
- [x] RBAC on workload creation
- [ ] Image allowlist (future)
- [ ] Spec signing (future)
Residual Risk: MEDIUM
T4: Container Escape
Threat: Malicious container escapes isolation to compromise host.
Vectors:
- Kernel exploits
- Docker socket access
- Privileged containers
- Host path mounts
Mitigations:
- [x] No privileged containers by default
- [x] No host network mode
- [x] Volume mounts restricted to data directory
- [ ] Seccomp profiles (future)
- [ ] AppArmor/SELinux (future)
- [ ] User namespaces (future)
Residual Risk: MEDIUM-HIGH (Docker inherent risk)
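The three implemented checks (no privileged containers, no host network mode, mounts confined to the data directory) could be enforced at spec validation time roughly as below. The `Spec` struct, `ValidateSpec`, and the `dataRoot` path are assumptions made for illustration; the real spec schema and data directory are not given here.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// dataRoot is an assumed location; the real data directory is deployment-specific.
const dataRoot = "/var/lib/shardlyn/data"

// Spec is a hypothetical, heavily reduced workload spec.
type Spec struct {
	Image       string
	Privileged  bool
	NetworkMode string
	HostMounts  []string // host-side paths of bind mounts
}

// ValidateSpec rejects specs that would weaken container isolation.
func ValidateSpec(s Spec) error {
	if s.Privileged {
		return errors.New("privileged containers are not allowed")
	}
	if s.NetworkMode == "host" {
		return errors.New("host network mode is not allowed")
	}
	for _, m := range s.HostMounts {
		if !insideRoot(dataRoot, m) {
			return fmt.Errorf("mount %q is outside %s", m, dataRoot)
		}
	}
	return nil
}

// insideRoot reports whether path, after lexical cleaning, lies within root.
// It does not resolve symlinks, so it is only a first line of defence.
func insideRoot(root, path string) bool {
	if !filepath.IsAbs(path) {
		return false
	}
	rel, err := filepath.Rel(root, filepath.Clean(path))
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	ok := Spec{Image: "example/game:1", HostMounts: []string{dataRoot + "/world1"}}
	bad := Spec{Image: "example/game:1", HostMounts: []string{dataRoot + "/../../../run/docker.sock"}}
	fmt.Println("ok spec:", ValidateSpec(ok))
	fmt.Println("bad spec:", ValidateSpec(bad))
}
```

Comparing cleaned paths with `filepath.Rel` rather than a raw string prefix avoids accepting `..` traversal or sibling directories such as `/var/lib/shardlyn/data-evil`.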
T5: Database Tampering
Threat: Attacker modifies database directly to escalate privileges or corrupt state.
Vectors:
- SQL injection
- Direct database access
- Backup restoration of old data
Mitigations:
- [x] Parameterized queries throughout
- [x] Database network isolation (docker network)
- [x] Strong database password
- [ ] Database encryption at rest (production)
- [ ] Regular integrity checks (future)
Residual Risk: LOW
Repudiation
T6: Action Denial
Threat: User denies performing destructive action (e.g., deleting instances).
Vectors:
- Shared accounts
- Session hijacking
- Insider threats
Mitigations:
- [x] Audit logging with user ID, timestamp, action
- [x] Correlation IDs for request tracing
- [ ] Immutable audit log storage (future)
- [ ] Audit log integrity verification (future)
Residual Risk: LOW
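As an illustration of what an audit record with user ID, timestamp, action, and correlation ID might look like, here is a small sketch that serializes one entry as a JSON line. The `AuditEntry` type and its field names are hypothetical, not Shardlyn's actual log schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// exampleStart is a fixed instant so the demo is deterministic.
var exampleStart = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// AuditEntry is a hypothetical audit record; field names are illustrative.
type AuditEntry struct {
	Time          time.Time `json:"time"`
	UserID        string    `json:"user_id"`
	Action        string    `json:"action"`
	Resource      string    `json:"resource"`
	CorrelationID string    `json:"correlation_id"`
}

// FormatAudit renders one entry as a single JSON line, suitable for
// appending to a log stream.
func FormatAudit(e AuditEntry) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := FormatAudit(AuditEntry{
		Time:          exampleStart,
		UserID:        "u-42",
		Action:        "instance.delete",
		Resource:      "instance/7",
		CorrelationID: "req-abc",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Carrying the correlation ID in every entry lets an investigator join the audit record to the request trace for the same operation.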
Information Disclosure
T7: Credential Exposure
Threat: Sensitive credentials leaked through logs, APIs, or storage.
Vectors:
- Credentials in error messages
- Debug logging of secrets
- Insecure storage
Mitigations:
- [x] Password hashes excluded from API responses (json:"-")
- [x] TF state excluded from API responses
- [x] Structured logging (no credential interpolation)
- [x] Secrets stored as hashes where possible
- [ ] Secret scanning in CI (future)
Residual Risk: LOW
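The `json:"-"` mitigation relies on Go's `encoding/json` skipping any field tagged that way. A minimal demonstration, using a hypothetical `User` struct (the real model's fields are not shown in this document):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// User is a hypothetical model. PasswordHash carries the json:"-" tag, so
// encoding/json omits it from every serialized response.
type User struct {
	ID           string `json:"id"`
	Email        string `json:"email"`
	Role         string `json:"role"`
	PasswordHash string `json:"-"`
}

// MarshalUser returns the JSON form that an API handler would send.
func MarshalUser(u User) (string, error) {
	b, err := json.Marshal(u)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := MarshalUser(User{
		ID:           "1",
		Email:        "admin@example.com",
		Role:         "admin",
		PasswordHash: "$2a$10$placeholder",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // the hash does not appear in the output
}
```

The tag protects only JSON encoding of this struct. Formatting the struct with `%+v` in a log line, or copying the hash into another type, would still expose it.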
T8: Workload Spec Leakage
Threat: Workload specs whose environment variables contain secrets are exposed.
Vectors:
- Unauthorized API access
- Log exposure
- Database dump
Mitigations:
- [x] RBAC on workload access
- [ ] Environment variable encryption at rest (future)
- [ ] Secret management integration (Vault) (future)
Residual Risk: MEDIUM
T9: Network Sniffing
Threat: Attacker captures sensitive data from network traffic.
Vectors:
- Unencrypted control plane traffic
- Unencrypted agent heartbeats
- Man-in-the-middle attacks
Mitigations:
- [x] TLS enforcement in production (configurable)
- [ ] Certificate validation (TODO)
- [x] Sensitive data not sent in query parameters
Residual Risk: HIGH (without TLS)
Denial of Service
T10: API Exhaustion
Threat: Attacker overwhelms control plane with requests.
Vectors:
- Login brute force
- Instance creation spam
- WebSocket connection flood
Mitigations:
- [x] Rate limiting per IP/user (configurable)
- [ ] Request size limits
- [x] Database connection pooling (prevents exhaustion)
- [x] WebSocket connection limits (configurable)
Residual Risk: MEDIUM
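The controls summary describes rate limiting as a token bucket per IP or user. A minimal sketch of that technique, with a hypothetical `RateLimiter` type and illustrative rate and burst values:

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// exampleStart is a fixed instant so the demo is deterministic.
var exampleStart = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

type bucket struct {
	tokens float64
	last   time.Time
}

// RateLimiter keeps one token bucket per key (client IP or user ID).
// Hypothetical type; buckets are never evicted in this sketch.
type RateLimiter struct {
	mu      sync.Mutex
	rate    float64 // tokens added per second
	burst   float64 // bucket capacity
	buckets map[string]*bucket
}

func NewRateLimiter(ratePerSec, burst float64) *RateLimiter {
	return &RateLimiter{rate: ratePerSec, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow reports whether a request for key may proceed at time now,
// consuming one token if so.
func (l *RateLimiter) Allow(key string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[key] = b
	}
	// Refill in proportion to elapsed time, capped at the burst size.
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(l.burst, b.tokens+elapsed*l.rate)
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	limiter := NewRateLimiter(1, 3) // 1 request/second, burst of 3
	for i := 1; i <= 4; i++ {
		fmt.Printf("request %d allowed: %v\n", i, limiter.Allow("203.0.113.7", exampleStart))
	}
	later := exampleStart.Add(2 * time.Second)
	fmt.Println("after 2s allowed:", limiter.Allow("203.0.113.7", later))
}
```

A long-running service would also need to expire idle buckets, otherwise an attacker rotating source addresses could grow the map without bound.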
T11: Resource Exhaustion
Threat: Malicious workload consumes all node resources.
Vectors:
- CPU bomb
- Memory exhaustion
- Disk fill
Mitigations:
- [x] Resource limits enforced (CPU, memory)
- [x] Volume size limits in spec
- [ ] Per-user resource quotas (future)
- [ ] Automatic workload eviction (future)
Residual Risk: MEDIUM
T12: Agent Starvation
Threat: The control plane is unavailable, so agents cannot receive desired state.
Vectors:
- Control plane crash
- Network partition
- Database failure
Mitigations:
- [x] Agents continue running existing containers
- [x] Idempotent operations (safe retries)
- [ ] Control plane HA (future)
- [ ] Agent local caching of last known state (future)
Residual Risk: MEDIUM
Elevation of Privilege
T13: RBAC Bypass
Threat: User gains admin privileges without authorization.
Vectors:
- JWT manipulation
- Role escalation bugs
- Horizontal privilege escalation
Mitigations:
- [x] Role stored in JWT, validated server-side
- [x] RBAC checks on all admin endpoints
- [x] User can only view own profile (unless admin)
- [ ] Regular RBAC audit (operational)
Residual Risk: LOW
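"Role stored in JWT, validated server-side" means the role claim is trusted only after the signature and expiry have been checked. The sketch below shows that order of operations with a hand-rolled HS256 token built from the standard library, purely to stay self-contained. `Claims`, `Issue`, `Verify`, and `RequireAdmin` are hypothetical names, HS256 is an assumed algorithm, and production code would normally use a maintained JWT library instead.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// exampleStart is a fixed instant so the demo is deterministic.
var exampleStart = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// Claims is a hypothetical claim set; Expires is in Unix seconds.
type Claims struct {
	Subject string `json:"sub"`
	Role    string `json:"role"`
	Expires int64  `json:"exp"`
}

var b64 = base64.RawURLEncoding

func sign(key []byte, signingInput string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(signingInput))
	return b64.EncodeToString(mac.Sum(nil))
}

// Issue builds an HS256-signed token for the given claims.
func Issue(key []byte, c Claims) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	header := b64.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	input := header + "." + b64.EncodeToString(payload)
	return input + "." + sign(key, input), nil
}

// Verify checks signature and expiry before returning any claim. The
// algorithm is fixed by the server; the token's "alg" header is never
// used to choose how to verify.
func Verify(key []byte, token string, now time.Time) (Claims, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return Claims{}, errors.New("malformed token")
	}
	want := sign(key, parts[0]+"."+parts[1])
	if !hmac.Equal([]byte(want), []byte(parts[2])) {
		return Claims{}, errors.New("invalid signature")
	}
	payload, err := b64.DecodeString(parts[1])
	if err != nil {
		return Claims{}, err
	}
	var c Claims
	if err := json.Unmarshal(payload, &c); err != nil {
		return Claims{}, err
	}
	if now.Unix() >= c.Expires {
		return Claims{}, errors.New("token expired")
	}
	return c, nil
}

// RequireAdmin is the server-side RBAC check for admin endpoints.
func RequireAdmin(key []byte, token string, now time.Time) error {
	c, err := Verify(key, token, now)
	if err != nil {
		return err
	}
	if c.Role != "admin" {
		return errors.New("admin role required")
	}
	return nil
}

func main() {
	key := []byte("demo-signing-key")
	exp := exampleStart.Add(time.Hour).Unix()
	userTok, _ := Issue(key, Claims{Subject: "u1", Role: "user", Expires: exp})
	forged, _ := Issue([]byte("attacker-key"), Claims{Subject: "u1", Role: "admin", Expires: exp})
	fmt.Println("user token:", RequireAdmin(key, userTok, exampleStart))
	fmt.Println("forged token:", RequireAdmin(key, forged, exampleStart))
}
```

Because the role lives in the token, a demoted user keeps the old role until the token expires, which is one reason short token lifetimes matter for this threat.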
T14: Agent Privilege Escalation
Threat: Compromised agent gains control plane access.
Vectors:
- Agent credential reuse
- Control plane API exposure to agents
Mitigations:
- [x] Agents have separate auth mechanism (X-Agent-Token)
- [x] Agents cannot access user management APIs
- [x] Agent tokens scoped to specific agent ID
- [ ] Network segmentation (agents in separate VLAN)
Residual Risk: MEDIUM
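Scoping an agent token to a specific agent ID amounts to looking the token up under the claimed ID rather than accepting any valid token. A minimal sketch, assuming tokens are stored as SHA-256 hashes keyed by agent ID; `AgentAuth` and its methods are hypothetical, and how the agent ID reaches the control plane alongside the `X-Agent-Token` header is not specified in this document.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// AgentAuth maps agent ID -> SHA-256 of that agent's token.
// Hypothetical type; not safe for concurrent writes as written.
type AgentAuth struct {
	tokenHash map[string][32]byte
}

func NewAgentAuth() *AgentAuth {
	return &AgentAuth{tokenHash: make(map[string][32]byte)}
}

// Register stores the hash of the token issued to agentID.
func (a *AgentAuth) Register(agentID, token string) {
	a.tokenHash[agentID] = sha256.Sum256([]byte(token))
}

// Authenticate accepts the token only for the agent it was issued to.
// The comparison is constant-time to avoid leaking hash prefixes.
func (a *AgentAuth) Authenticate(agentID, token string) bool {
	want, ok := a.tokenHash[agentID]
	if !ok {
		return false
	}
	got := sha256.Sum256([]byte(token))
	return subtle.ConstantTimeCompare(want[:], got[:]) == 1
}

func main() {
	auth := NewAgentAuth()
	auth.Register("agent-1", "token-for-agent-1")
	auth.Register("agent-2", "token-for-agent-2")

	fmt.Println("agent-1 with own token:", auth.Authenticate("agent-1", "token-for-agent-1"))
	fmt.Println("agent-2 with agent-1's token:", auth.Authenticate("agent-2", "token-for-agent-1"))
}
```

With this shape, a token stolen from one node cannot be replayed to fetch another agent's desired state or report state on its behalf.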
Security Controls Summary
Implemented
| Control | Description | Threats Mitigated |
|---|---|---|
| Password Hashing | bcrypt cost=10 | T1 |
| JWT Authentication | Signed, expiring tokens (15m access) | T1, T13 |
| One-Time Tokens | Agent registration tokens | T2 |
| Parameterized Queries | SQL injection prevention | T3, T5 |
| JSON Schema Validation | Spec format enforcement | T3 |
| RBAC | Role-based access control | T3, T8, T13 |
| Audit Logging | Action tracking | T6 |
| Field Exclusion | Secrets excluded from JSON | T7 |
| Resource Limits | Container CPU/memory caps | T11 |
| Connection Pooling | DB connection management | T10 |
| Rate Limiting | Token bucket per IP/user | T1, T10 |
| Account Lockout | Login lockout after failed attempts | T1 |
| TLS Enforcement | Configurable HTTPS requirement | T2, T9 |
| WebSocket Connection Limits | Cap concurrent WS sessions | T10 |
Planned (TODO)
| Control | Priority | Threats Mitigated |
|---|---|---|
| Image Allowlist | MEDIUM | T3 |
| Secret Management | MEDIUM | T8 |
| Seccomp Profiles | LOW | T4 |
| Control Plane HA | LOW | T12 |
Not Planned (Accepted Risk)
| Control | Reason |
|---|---|
| Container Signing | Complexity vs. risk for MVP |
| Network Policies | Kubernetes-only feature |
| Hardware Security Modules | Cost prohibitive for target users |
Attack Scenarios
Scenario 1: Malicious Admin
Attacker: Insider with admin access
Goal: Exfiltrate workload data or disrupt service
Attack Path:
- Create malicious workload with data exfiltration script
- Deploy to node with valuable workload data
- Container mounts data volume, exfiltrates via network
Mitigations:
- Audit logging tracks who created workload
- Volume mounts restricted to shardlyn data directory
- Network monitoring can detect unusual egress
Scenario 2: Compromised Agent
Attacker: External with agent node access
Goal: Pivot to control plane or other agents
Attack Path:
- Compromise agent node (e.g., via vulnerable public workload such as a game server)
- Extract agent auth token from filesystem
- Attempt to use token for broader access
Mitigations:
- Agent token scoped to single agent
- Cannot access admin APIs
- Control plane validates agent ID matches token
Scenario 3: Credential Stuffing
Attacker: External with leaked credential database
Goal: Gain user/admin access
Attack Path:
- Obtain leaked email/password combinations
- Automated login attempts against Shardlyn
- Successful login with reused credentials
Mitigations:
- Rate limiting on login endpoint (implemented, configurable)
- Account lockout after failed attempts (implemented, configurable)
- Breach detection notifications (TODO)
Security Recommendations
For Operators
- Enable TLS for all production deployments
- Rotate secrets (JWT key, database password) regularly
- Monitor audit logs for suspicious activity
- Keep components updated for security patches
- Segment the network between the control plane and agents
- Encrypt backups of the database and TF state
For Users
- Use strong, unique passwords
- Don't share accounts
- Review workload specs before deployment
- Monitor resource usage for anomalies
- Report suspicious activity to admins
Incident Response
Detection
- Monitor shardlyn_http_requests_total{status=~"4.."} for auth failures
- Alert on shardlyn_instances_by_state{state="error"} spikes
- Review audit logs for unusual patterns
Containment
- Revoke compromised user/agent tokens
- Isolate affected nodes (network/firewall)
- Stop suspicious instances
Recovery
- Rotate all secrets (JWT key, passwords)
- Regenerate agent tokens
- Restore from known-good backup if needed
- Review audit logs for full scope
Post-Incident
- Document timeline and actions
- Update threat model with new vectors
- Implement additional controls as needed