# guenther

A streaming anomaly detection pipeline for Managed-File-Transfer (MFT) infrastructure. guenther ingests system metrics and application logs in real time, extracts structured feature vectors per time window, and scores them with an ensemble of unsupervised detectors — without any labelled training data.

---

## How it works

```
┌─────────────────────────────────────────────────────────────┐
│  Ingestion                                                  │
│  MetricCollector (/proc)   LogCollector (inotify + Drain3)  │
│  SystemctlCollector (service states)                        │
└────────────────────┬────────────────────────────────────────┘
                     │  channels (backpressure)
┌────────────────────▼────────────────────────────────────────┐
│  Transformation                                             │
│  TransformEngine – 30 s tumbling windows via DuckDB         │
│  45 base features + N Drain3 parameter aggregates           │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│  Detection                                                  │
│  EnsembleDetector (RRCF fast/mid/slow · COPOD · MAD)        │
│  SEAD online weight adaptation · auto-scaling (3 stages)    │
└────────────────────┬────────────────────────────────────────┘
                     │
              anomalies.jsonl
```

### Packages

| Path                 | Responsibility                                                                   |
| -------------------- | -------------------------------------------------------------------------------- |
| `cmd/pipeline`       | Entry point, wiring, graceful shutdown                                           |
| `internal/collector` | `MetricCollector` (`/proc`), `LogCollector` (inotify), `SystemctlCollector`      |
| `internal/transform` | `TransformEngine` — DuckDB windowed aggregation                                  |
| `internal/detect`    | `EnsembleDetector`, RRCF, COPOD, MAD, IsolationForest, SEAD, `ScalingController` |
| `internal/drain3`    | Masking / parameter extraction wrapper around Drain3                             |
| `internal/config`    | YAML config loading and regex compilation                                        |
| `internal/health`    | `HealthMonitor` — per-stage counters                                             |
| `pkg/types`          | Shared types: `LogEvent`, `MetricSnapshot`, `FeatureVector`, `AnomalyResult`     |

---

## Requirements

| Dependency      | Notes                                                        |
| --------------- | ------------------------------------------------------------ |
| Docker          | Required for the containerised build (recommended)           |
| Go ≥ 1.25       | Only needed for local builds                                 |
| gcc / libc6-dev | CGO is required by `go-duckdb`                               |
| Linux           | Metric collection reads `/proc`; not supported on other OSes |

---

## Building

### Docker (recommended — no local toolchain needed)

```bash
make build
```

The binary is written to `build/guenther`.

### Local (requires Go + gcc)

```bash
make build-local
```

---

## Running

```bash
./build/guenther -config configs/default.yaml
```

guenther shuts down cleanly on `SIGINT` or `SIGTERM`.

---

## Testing

```bash
make test
```

---

## Configuration

guenther is configured via a single YAML file (default: `configs/default.yaml`).

```yaml
ingestion:
  log_path: "/path/to/log/file/transfer.log" # file to tail
  net_interface: "ens4" # interface for /proc/net/dev
  disk_device: "vda1" # device for /proc/diskstats
  systemctl_services:
    - service1.service
    - service2.service

transformation:
  window_size: "30s" # tumbling window length
  db_path: "data/pipeline.duckdb" # DuckDB file (use :memory: for ephemeral)
  drain:
    depth: 4
    sim_threshold: 0.4
    max_children: 100
    max_clusters: 1000
  masking_patterns: # applied in order before template mining
    - name: "uuid"
      pattern: '\b[0-9a-fA-F]{8}-...\b'
      replace: ""
      type: "string"
    # ... see configs/default.yaml for the full set

detector:
  method: "ensemble" # fallback when ensemble.enabled = false
  ensemble:
    enabled: true
    method: "sead" # avg | max | median | sead
    contamination: 0.15
    sead:
      eta: 0.1
      lambda: 0.01
    auto_scaling:
      enabled: true
      high_threshold: 75.0 # CPU % → switch to mid detector
      critical_threshold: 90.0 # CPU % → switch to fast detector
      down_threshold: 50.0
      high_duration: 90.0 # seconds load must persist before scaling
      critical_duration: 120.0
      down_duration: 120.0
    rrcf_variants:
      fast: { num_trees: 50, tree_size: 32, threshold_percentile: 0.85 }
      mid: { num_trees: 150, tree_size: 64, threshold_percentile: 0.85 }
      slow: { num_trees: 200, tree_size: 128, threshold_percentile: 0.85 }
    copod:
      buffer_size: 50
      threshold: 0.3
    mad:
      threshold: 3.5
      calibration_size: 50

output:
  feature_log_path: "logs/features.jsonl"
  anomaly_log_path: "logs/anomalies.jsonl"
```

### Masking pattern types

Patterns with `type: float` extract a named parameter into `FeatureVector.ParamAvg`; patterns with `type: string` replace the match in-place before template mining. Named patterns (`name != ""`) are aggregated as features per window.

---

## Output

**`logs/anomalies.jsonl`** — one JSON object per scored window:

```json
{
  "timestamp": "2026-01-15T14:32:00Z",
  "score": 0.8721,
  "is_anomaly": true,
  "confidence": 0.91,
  "method": "sead_ensemble",
  "details": "rrcf_slow=0.91 copod=0.83 mad=0.78"
}
```

**`logs/features.jsonl`** — raw feature vectors for offline analysis (optional).

---

## Project layout

```
guenther/
├── cmd/
│   └── pipeline/
│       └── main.go
├── internal/
│   ├── collector/
│   ├── config/
│   ├── detect/
│   ├── drain3/
│   ├── health/
│   └── transform/
├── pkg/
│   └── types/
├── configs/
│   └── default.yaml
├── build/          # created by `make build`
├── Makefile
└── README.md
```

---

## License

This project was developed as part of a Bachelor's thesis.