More hardening and migration from Drone to Woodpecker

2026-02-01 19:11:32 -03:00
parent a0f148c3ef
commit 5035ed118d
12 changed files with 558 additions and 112 deletions
--- a/docs/CI-CD.md
+++ b/docs/CI-CD.md
@@ -0,0 +1,39 @@
+# CI/CD — Woodpecker
+
+This repo uses [Woodpecker CI](https://woodpecker-ci.org/) for build and deploy. Pipelines were migrated from Drone.
+
+## Workflows
+
+| Workflow    | Trigger              | Description |
+|------------|----------------------|-------------|
+| **build**  | Push to `main`       | Build and push Docker image to `git.mifi.dev/mifi-holdings/mail-autoconfig` (with `auto_tag`), then send status to lab webhook. |
+| **production** | Deployment to `production` | No clone; POST to Portainer stack webhook, then send status to lab webhook. |
+
+Workflows are defined in [.woodpecker/build.yaml](../.woodpecker/build.yaml) and [.woodpecker/production.yaml](../.woodpecker/production.yaml).
+
+## Woodpecker setup
+
+1. **Add the repo** in Woodpecker and point it at your forge (Gitea/GitHub/etc.).
+
+2. **Secrets** (repository secrets):
+   - **Build workflow:** `gitea_registry_username`, `gitea_package_token` — Registry credentials for `git.mifi.dev`.
+   - **Production workflow:** `portainer_webhook_url` — Portainer stack webhook URL (do not commit; anyone with this URL can trigger a redeploy). `discord_webhook_url` — Discord incoming webhook for deploy notifications.
+
+3. **Trusted project**: In the repo’s Woodpecker project settings, enable **Trusted** so the Docker build step can run with `privileged: true`.
+
+4. **Production deploy**: To deploy to production, trigger a **deployment** in Woodpecker with target **production** (e.g. from the pipeline UI after a successful build).
+
+## Migration from Drone
+
+| Drone                         | Woodpecker |
+|------------------------------|------------|
+| `.drone.yml` (two pipelines) | `.woodpecker/build.yaml` + `.woodpecker/production.yaml` |
+| `plugins/docker`             | `woodpeckerci/plugin-docker-buildx` |
+| `plugins/webhook`            | `curlimages/curl` with inline commands |
+| `promote` → target `production` | Deployment event with `CI_PIPELINE_DEPLOY_TARGET == "production"` |
+| `from_secret: gitea_package_token` | Same secret name in Woodpecker; add `gitea_registry_username` for registry login |
+
+- **Build pipeline**: Runs on push to `main`; builds and pushes the image with `auto_tag` (e.g. `latest`, branch, semver on tags).
+- **Production pipeline**: No clone; runs when you trigger a deployment to **production**; calls Portainer webhook and lab webhook.
+
+You can remove [.drone.yml](../.drone.yml) after switching to Woodpecker, or keep it for reference.
--- a/docs/SECURITY-AUDIT.md
+++ b/docs/SECURITY-AUDIT.md
@@ -0,0 +1,236 @@
+# Security Audit Report — Mail Autoconfig
+
+**Audit date:** 2025-02-01  
+**Scope:** Full application and Docker deployment hardening; prevention of host/container compromise.
+
+**Re-audit (Linode / past compromise):** 2025-02-01 — Additional hardening applied after reported server takeovers; dependency CVEs and bind-address issues addressed.
+
+---
+
+## Executive Summary
+
+The application already has solid security measures (host validation, domain sanitization, rate limiting, non-root container, read-only filesystem). This audit identified **one critical deployment bug**, several **medium-priority** hardening gaps, and **low-priority** improvements. Addressing the critical and medium items will significantly reduce risk of abuse and container/host compromise.
+
+---
+
+## Critical
+
+### C1. Docker Compose — Wrong Backend Port (Traefik)
+
+**Location:** `docker-compose.yml`  
+**Issue:** Several Traefik service labels use `loadbalancer.server.port=808080` instead of `8080`. The app listens on `127.0.0.1:8080` (Dockerfile). With port `808080`, Traefik cannot reach the app; those hosts will get connection errors.
+
+**Affected labels (port 808080):**
+- mailconfig-mifi-holdings
+- mailconfig-mifi-com-br
+- mailconfig-mifi-dev
+- mailconfig-mifi-ventures
+- mailconfig-mifi-vix-br
+- mailconfig-mifi-me
+
+**Impact:** Service broken for those domains; no direct security exploit, but must be fixed for correct and consistent deployment.
+
+**Remediation:** Change all `server.port=808080` to `server.port=8080` in `docker-compose.yml`.
+
+---
+
+## High
+
+### H1. X-Forwarded-For Spoofing / Rate Limit Bypass
+
+**Location:** `app.py` — `rate_limit()`  
+**Issue:** Client IP is taken from `X-Forwarded-For` (first value). If Traefik does not overwrite this header with the real client IP, any client can send arbitrary `X-Forwarded-For` values. That allows:
+- Bypassing rate limits by rotating spoofed IPs.
+- Concentrating rate-limit “cost” on a victim IP (abuse/DoS).
+
+**Remediation:**
+1. Prefer **one** trusted proxy header (e.g. `X-Real-IP` or the rightmost/last `X-Forwarded-For` after your proxy) and configure Traefik to set it from the real client.
+2. If you must use `X-Forwarded-For`, ensure Traefik overwrites it (or appends the real client) and that the app only trusts one proxy hop (e.g. via Werkzeug `ProxyFix` with `x_for=1`), then use `request.remote_addr` for rate limiting so the app uses the proxy’s notion of client IP, not raw headers.
+3. Document the chosen header and proxy configuration so future changes don’t re-open spoofing.
+
+### H2. Request Body Size When Content-Length Is Missing
+
+**Location:** `app.py` — `validate_request_size()`  
+**Issue:** Only `request.content_length` is checked. If the client omits `Content-Length` (or uses chunked encoding), the check is skipped. A malicious client could send a very large body and increase memory/CPU usage (DoS).
+
+**Remediation:**
+- For POST (e.g. Autodiscover): require `Content-Length` and reject (e.g. 411 or 400) when missing, or enforce a hard body read limit in WSGI/Gunicorn.
+- Optionally use a small `max_content_length` on the Flask app and/or Gunicorn body size limits so oversized bodies are rejected even when `Content-Length` is missing or wrong.
+
+### H3. Jinja2 Autoescape May Not Apply to `.j2` Templates
+
+**Location:** `app.py` — `jinja2.Environment(autoescape=jinja2.select_autoescape(['xml']))`  
+**Issue:** `select_autoescape(['xml'])` typically matches template **filename extension**. Templates are named `config-v1.1.xml.j2` and `Autodiscover.xml.j2`; the extension used by Jinja2 may be `.j2`, not `.xml`, so XML autoescape might not be enabled. Output is then safe only because `sanitize_domain()` restricts to `[a-zA-Z0-9.-]`.
+
+**Remediation:** Rely on explicit escaping for defense in depth: use `autoescape=True` for this environment, or a custom predicate that returns `True` for these template names (e.g. names containing `xml` or ending in `.xml.j2`). Ensures XML-sensitive characters are escaped if sanitization is ever relaxed or bypassed.
+
+---
+
+## Medium
+
+### M1. /ping Endpoint Does Not Validate Host
+
+**Location:** `app.py` — `ping()`  
+**Issue:** `/ping` has no `@validate_host`. Any `Host` can be used. Impact is low (only reveals that the service is up), but it weakens defense in depth and allows probing behind the proxy.
+
+**Remediation:** Add `@validate_host` to `/ping`, or restrict `/ping` to internal use (e.g. only when request comes from loopback or from Traefik health checks) and document that it must not be exposed to the internet.
+
+### M2. No Explicit Capability Dropping in Container
+
+**Location:** `docker-compose.yml`  
+**Issue:** `security_opt: no-new-privileges:true` is set, but capabilities are not explicitly dropped. Default Docker capabilities (e.g. NET_RAW, SETPCAP) are still available; dropping unneeded ones reduces attack surface.
+
+**Remediation:** Add:
+```yaml
+cap_drop:
+  - ALL
+```
+If the process or healthcheck needs specific capabilities (unlikely for this app), add them back with `cap_add` only as needed.
+
+### M3. Dependency Versions Not Pinned
+
+**Location:** `Dockerfile` — `pip install Flask Jinja2 gunicorn`  
+**Issue:** No `requirements.txt` with versions. Builds can pull different versions over time, leading to supply-chain and compatibility risk and making security patches harder to track.
+
+**Remediation:** Add `requirements.txt` with pinned versions (e.g. `Flask==3.x.x`, `Jinja2==3.x.x`, `gunicorn==21.x.x`), use `pip install -r requirements.txt` in the Dockerfile, and review/update versions in a controlled way.
+
+### M4. Rate Limit Storage Unbounded Growth
+
+**Location:** `app.py` — `defaultdict(deque)`  
+**Issue:** Old entries are removed per-IP when that IP makes a new request, but IPs that stop requesting are never cleaned. Under heavy scanning (many distinct IPs), memory can grow without bound.
+
+**Remediation:** Use a TTL-based structure (e.g. Redis with expiry), or a background task / periodic cleanup that drops keys older than `RATE_LIMIT_WINDOW`. For in-memory, a single global cleanup on each request (e.g. prune keys whose last request is older than the window) limits growth.
+
+---
+
+## Low
+
+### L1. Unused Import
+
+**Location:** `app.py` — `import html`  
+**Issue:** `html` is never used. Minor hygiene and static-analysis noise.
+
+**Remediation:** Remove `import html`.
+
+### L2. Host Header in HTML Links
+
+**Location:** `app.py` — `index()`  
+**Status:** `host` is taken from the request but only after `@validate_host`, so it is one of `ALLOWED_HOSTS`. Using it in links is safe. No change required; noted for completeness.
+
+### L3. Flask Secret Key
+
+**Location:** `app.py`  
+**Status:** No sessions or signed cookies are used. No `SECRET_KEY` is required for current functionality. If you add session or cookie-based features later, set a strong `app.secret_key` (from env, not hardcoded).
+
+---
+
+## Remediation Plan
+
+| Priority | ID   | Action | Status |
+|----------|------|--------|--------|
+| Critical | C1  | Fix `docker-compose.yml`: change all `server.port=808080` to `8080`. | **Done** |
+| High     | H1  | Use trusted proxy (ProxyFix) and rate-limit by `request.remote_addr`; document proxy config. | **Done** |
+| High     | H2  | Enforce body size: require Content-Length for POST; set `MAX_CONTENT_LENGTH`. | **Done** |
+| High     | H3  | Set Jinja2 `autoescape=True` so XML templates are always escaped. | **Done** |
+| Medium   | M1  | Add `@validate_host` to `/ping`; healthcheck must send allowed Host. | **Done** |
+| Medium   | M2  | Add `cap_drop: [ALL]` in `docker-compose.yml`. | **Done** |
+| Medium   | M3  | Add `requirements.txt` with pinned versions; use it in Dockerfile. | **Done** |
+| Medium   | M4  | Add periodic cleanup of stale rate-limit keys when storage > 1000. | **Done** |
+| Low      | L1  | Remove unused `import html`. | **Done** |
+| Optional | O1  | Gunicorn `--limit-request-line` / `--limit-request-fields`. | **Done** |
+| Optional | O2  | Add `Permissions-Policy` header. | **Done** |
+
+**Deployment note (H1):** Ensure Traefik overwrites `X-Forwarded-For` with the real client IP (or prepends it). With `ProxyFix(x_for=1)`, the app trusts the leftmost value as the client IP; if the proxy does not set it, clients could spoof it.
+
+---
+
+## Already Strong
+
+- **Host validation:** Strict allowlist and `@validate_host` on sensitive routes.
+- **Domain sanitization:** `sanitize_domain()` restricts to `[a-zA-Z0-9.-]` and length; used before any template render.
+- **Request size limits:** Per-route limits (with the Content-Length gap noted above).
+- **Security headers:** X-Content-Type-Options, X-Frame-Options, CSP, etc.
+- **Container:** Non-root user, bind to 127.0.0.1, read-only filesystem, tmpfs for /tmp, resource limits, `no-new-privileges`.
+- **No user-controlled paths or commands:** No path traversal or command injection surfaces found.
+
+Implementing the critical and high items, then the medium items, will bring the service to a hardened state suitable for production and minimize risk to the host the Docker container runs on.
+
+---
+
+## Verification (Current Codebase)
+
+A follow-up review confirms the following in the current implementation:
+
+| ID | Finding | Verified |
+|----|---------|----------|
+| C1 | All Traefik `loadbalancer.server.port` values are `8080` in `docker-compose.yml`. | ✅ |
+| H1 | `ProxyFix(app.wsgi_app, x_for=1, x_proto=1)` and rate limit uses `request.remote_addr`. | ✅ |
+| H2 | POST without `Content-Length` returns 411; `MAX_CONTENT_LENGTH = 4096` set. | ✅ |
+| H3 | Jinja2 `Environment(autoescape=True)`. | ✅ |
+| M1 | `/ping` has `@validate_host`. | ✅ |
+| M2 | `cap_drop: [ALL]` in `docker-compose.yml`. | ✅ |
+| M3 | `requirements.txt` with pinned Flask, Jinja2, gunicorn; Dockerfile uses `pip install -r requirements.txt`. | ✅ |
+| M4 | `_prune_stale_rate_limits()` when storage > 1000; called on each request. | ✅ |
+| L1 | No `import html` in `app.py`. | ✅ |
+
+No path traversal, command execution, `send_file`, or user-controlled file/path usage. Autodiscover POST body is never read or parsed, so malicious XML does not reach the app. Domain and Host are either allowlisted or sanitized before use in templates or HTML.
+
+---
+
+## Container Escape / Host Compromise Resistance
+
+The deployment is structured to limit impact of a compromised app or container:
+
+1. **Non-root**: Process runs as `appuser`; no root inside container.
+2. **Capabilities**: `cap_drop: ALL`; no `CAP_SYS_ADMIN`, `CAP_NET_RAW`, etc., reducing kernel-level escape options.
+3. **Privilege escalation**: `no-new-privileges:true` prevents gaining new privileges via setuid or similar.
+4. **Filesystem**: `read_only: true` with `tmpfs` only for `/tmp`; no persistent writable host paths; no volume mounts from host.
+5. **Network**: App binds to `127.0.0.1:8080`; only Traefik on the same Docker network can reach it; no `network_mode: host`.
+6. **Resources**: Memory and CPU limits reduce resource-exhaustion and some DoS impact.
+7. **No host access**: No Docker socket, no host PID/filesystem mounts, no `privileged` mode.
+
+An attacker who gains code execution in the app can only affect the container and its tmpfs; they cannot access the host filesystem, other containers, or the host network without a separate host or orchestration vulnerability.
+
+---
+
+## Additional Optional Hardening
+
+| ID | Suggestion | Priority | Status |
+|----|------------|----------|--------|
+| O1 | **Gunicorn request limits** | Low | **Done** — `--limit-request-line 4094` and `--limit-request-fields 100` in Dockerfile CMD. |
+| O2 | **Permissions-Policy header** | Low | **Done** — Added in `add_security_headers()`. |
+| O3 | **Dependency updates** | Operational | Run `pip-audit` or Dependabot periodically; refresh pinned versions in `requirements.txt` after testing. |
+
+O1 and O2 are implemented; O3 is an ongoing operational practice.
+
+---
+
+## Linode / Past Compromise — Additional Hardening (2025-02-01)
+
+After reported takeovers of the Linode server, a second pass identified and fixed the following.
+
+### Critical Fixes Applied
+
+| ID | Issue | Fix |
+|----|--------|-----|
+| **L1** | **Bind 127.0.0.1 made app unreachable by Traefik** | Traefik runs in a *separate* container. Binding to `127.0.0.1:8080` inside this container means only processes *inside this container* can connect; Traefik cannot. Service was either broken or was previously bound to `0.0.0.0`. **Fix:** Bind to `0.0.0.0:8080`. Exposure is limited to the Docker network only (Traefik and any other container on that network). The host and internet cannot connect directly to this port. |
+| **L2** | **Jinja2 3.1.4 had known CVEs** | CVE-2024-56201 (sandbox breakout), CVE-2024-56326, CVE-2025-27516 (|attr filter). Even though this app only passes sanitized `DOMAIN` and fixed template names, defense in depth requires a patched library. **Fix:** Upgraded to Jinja2 3.1.6 in `requirements.txt`. |
+| **L3** | **No HSTS** | Browsers could be downgraded to HTTP without a strict directive. **Fix:** Added `Strict-Transport-Security: max-age=31536000; includeSubDomains; preload` in `add_security_headers()`. |
+| **L4** | **Unsafe HTTP methods not rejected** | TRACE, CONNECT, etc. could be used for probing or abuse. **Fix:** `@app.before_request` rejects any method not in `GET, POST, HEAD, OPTIONS` with 405. |
+| **L5** | **Default error pages could leak info** | Flask’s default 404/500 might expose stack traces or paths in some configs. **Fix:** Custom `@app.errorhandler(404|500|413)` returning minimal JSON only. |
+| **L6** | **Explicit DEBUG/TESTING off** | Ensures debug mode is never enabled by env or mistake. **Fix:** `app.config['DEBUG'] = False` and `app.config['TESTING'] = False`. |
+| **L7** | **Gunicorn worker temp dir** | With read-only rootfs, workers need a writable temp dir. **Fix:** `--worker-tmp-dir /tmp` in Dockerfile CMD (tmpfs). |
+| **L8** | **Network isolation documentation** | Untrusted containers on the same Docker network could reach this service. **Fix:** Comment in `docker-compose.yml`: only attach trusted containers to the `traefik` network. |
+
+### If Your Linode Was Compromised — Checklist
+
+1. **Assume full compromise:** Rotate all secrets (SSH keys, API keys, DB passwords, Traefik certs, any env vars). Revoke and reissue.
+2. **Reimage the host** if you cannot guarantee persistence was removed (kernel modules, cron, authorized_keys, other backdoors).
+3. **Audit every service on the host:** SSH (disable password auth, use keys only), Traefik, Docker socket access, any other containers or processes. Ensure no container runs `privileged` or mounts Docker socket unless strictly required.
+4. **Firewall:** On the Linode, allow only required ports (e.g. 22, 80, 443). Block direct access to Docker daemon and to backend ports from the internet.
+5. **Docker network:** Only attach trusted containers (e.g. Traefik) to the `traefik` network. Do not run untrusted or third-party containers on that network.
+6. **Dependencies:** Run `pip-audit` (or equivalent) in CI and after any `requirements.txt` change. Rebuild and redeploy this image after upgrading Jinja2 and any other pinned deps.
+7. **This image:** Rebuild from current Dockerfile and `requirements.txt` (Jinja2 3.1.6, bind 0.0.0.0, all hardening above). Do not reuse old images that may have had vulnerable Jinja2 or different config.
+8. **Traefik:** Ensure it overwrites/sets `X-Forwarded-For` (or equivalent) with the real client IP so this app’s rate limiting and logging see the correct IP.
+
+Implementing the critical and high items, then the medium items, and the Linode-specific fixes above, brings the service to a fully hardened state and reduces risk of host takeover from this container.
--- a/docs/SECURITY.md
+++ b/docs/SECURITY.md
@@ -0,0 +1,110 @@
+# Security Hardening Summary
+
+## Critical Vulnerabilities Fixed
+
+### 1. ✅ Container Security
+**Issue**: Container running as root user
+**Fix**: 
+- Created non-root user `appuser` in Dockerfile
+- Container now runs with limited privileges
+- Added `no-new-privileges:true` security option
+
+### 2. ✅ Host Header Injection
+**Issue**: Unvalidated `request.host` usage
+**Fix**:
+- Added whitelist of allowed hosts
+- Implemented `@validate_host` decorator
+- All routes now validate Host header before processing
+
+### 3. ✅ Input Sanitization
+**Issue**: Unvalidated domain input in templates
+**Fix**:
+- Added `sanitize_domain()` function with regex validation
+- Domain length and format validation
+- Prevents injection attacks via domain parameter
+
+### 4. ✅ Network Security
+**Issue**: Binding to all interfaces (0.0.0.0)
+**Fix**:
+- Application now binds to localhost only (127.0.0.1:8080)
+- External access through Traefik reverse proxy only
+- Updated all Traefik labels to use port 8080
+
+### 5. ✅ Security Headers
+**Issue**: Missing security headers
+**Fix**:
+- Added comprehensive security headers middleware
+- X-Content-Type-Options: nosniff
+- X-Frame-Options: DENY
+- X-XSS-Protection: 1; mode=block
+- Content-Security-Policy
+- Referrer-Policy
+
+### 6. ✅ Rate Limiting
+**Issue**: No rate limiting or request validation
+**Fix**:
+- Implemented rate limiting per IP address
+- Different limits for different endpoints:
+  - Main page: 50 requests/hour
+  - Health check: 10 requests/minute
+  - Config endpoints: 20 requests/hour
+- Request size validation (512B-2KB depending on endpoint)
+
+### 7. ✅ Container Hardening
+**Issue**: Overprivileged container
+**Fix**:
+- Read-only filesystem with tmpfs for /tmp
+- Resource limits (256MB RAM, 0.5 CPU)
+- Security options preventing privilege escalation
+
+## Security Features Added
+
+### Input Validation
+- Host header validation against whitelist
+- Domain sanitization with regex patterns
+- Request size limits per endpoint
+- Content-Type validation
+
+### Rate Limiting
+- Per-IP rate limiting with sliding window
+- Configurable limits per endpoint type
+- Automatic cleanup of old request records
+
+### Network Security
+- Localhost-only binding
+- Reverse proxy required for external access
+- Updated health checks for new port
+
+### Container Security
+- Non-root user execution
+- Read-only filesystem
+- Resource constraints
+- No new privileges policy
+
+## Deployment Notes
+
+1. **Rebuild the Docker image** after these changes
+2. **Update docker-compose.yml** with the new configuration
+3. **Test all endpoints** to ensure functionality
+4. **Monitor logs** for any security-related errors
+5. **Consider adding Redis** for production rate limiting
+
+## Monitoring Recommendations
+
+- Monitor for 403 (Forbidden host) responses
+- Watch for 429 (Rate limit exceeded) responses
+- Log any invalid domain attempts
+- Monitor resource usage within limits
+
+## Security Audit (2025-02-01)
+
+A full security audit was performed; findings and remediations are documented in **SECURITY-AUDIT.md**. Critical and high-priority items have been addressed in code and deployment. **Important:** Ensure your reverse proxy (e.g. Traefik) overwrites `X-Forwarded-For` with the real client IP so rate limiting and logging use the correct client.
+
+## Additional Security Considerations
+
+For production deployment, consider:
+- Using Redis for distributed rate limiting
+- Implementing proper logging and monitoring
+- Adding WAF (Web Application Firewall) rules
+- Regular security audits and dependency updates
+- Implementing request signing for sensitive endpoints