guenther
A streaming anomaly detection pipeline for Managed-File-Transfer (MFT) infrastructure. guenther ingests system metrics and application logs in real time, extracts a structured feature vector for each time window, and scores it with an ensemble of unsupervised detectors, so no labelled training data is required.
How it works
┌─────────────────────────────────────────────────────────────┐
│  Ingestion                                                  │
│  MetricCollector (/proc)   LogCollector (inotify + Drain3)  │
│  SystemctlCollector (service states)                        │
└────────────────────┬────────────────────────────────────────┘
                     │ channels (backpressure)
┌────────────────────▼────────────────────────────────────────┐
│  Transformation                                             │
│  TransformEngine – 30 s tumbling windows via DuckDB         │
│  45 base features + N Drain3 parameter aggregates           │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│  Detection                                                  │
│  EnsembleDetector (RRCF fast/mid/slow · COPOD · MAD)        │
│  SEAD online weight adaptation · auto-scaling (3 stages)    │
└────────────────────┬────────────────────────────────────────┘
                     │
              anomalies.jsonl
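The "channels (backpressure)" link between stages can be illustrated with a bounded Go channel: once the buffer is full, a producer either blocks or has to decide what to do with the event. The sketch below is only an illustration; the `event` type and `tryEmit` helper are hypothetical and not taken from the guenther source.

```go
package main

import "fmt"

// event is a hypothetical stand-in for the pipeline's LogEvent and
// MetricSnapshot types.
type event struct{ seq int }

// tryEmit attempts a non-blocking send and reports whether the buffer
// had room. A real collector would more likely block on `ch <- ev`,
// which is what pushes backpressure upstream; the non-blocking form
// is used here only to make the full-buffer condition observable.
func tryEmit(ch chan<- event, ev event) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

func main() {
	// Bounded buffer between two stages; nothing is consuming yet.
	ch := make(chan event, 2)
	for i := 0; i < 3; i++ {
		fmt.Println(i, tryEmit(ch, event{seq: i}))
	}
}
```

With a capacity of 2 and no consumer, the first two sends succeed and the third reports a full buffer. A blocking send in the same position would simply pause the collector until the transformation stage catches up.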
Packages
| Path | Responsibility |
|---|---|
| cmd/pipeline | Entry point, wiring, graceful shutdown |
| internal/collector | MetricCollector (/proc), LogCollector (inotify), SystemctlCollector |
| internal/transform | TransformEngine: DuckDB windowed aggregation |
| internal/detect | EnsembleDetector, RRCF, COPOD, MAD, IsolationForest, SEAD, ScalingController |
| internal/drain3 | Masking / parameter extraction wrapper around Drain3 |
| internal/config | YAML config loading and regex compilation |
| internal/health | HealthMonitor: per-stage counters |
| pkg/types | Shared types: LogEvent, MetricSnapshot, FeatureVector, AnomalyResult |
Requirements
| Dependency | Notes |
|---|---|
| Docker | Required for the containerised build (recommended) |
| Go ≥ 1.25 | Only needed for local builds |
| gcc / libc6-dev | CGO is required by go-duckdb |
| Linux | Metric collection reads /proc; not supported on other OSes |
Building
Docker (recommended — no local toolchain needed)
make build
The binary is written to build/guenther.
Local (requires Go + gcc)
make build-local
Running
./build/guenther -config configs/default.yaml
guenther shuts down cleanly on SIGINT or SIGTERM.
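One common way to get this behaviour in Go is `signal.NotifyContext`, which cancels a context when a signal arrives so that each stage can stop and close its output channel. The sketch below assumes that pattern; `runStage` is a hypothetical stage and not the actual wiring in cmd/pipeline.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runStage is a hypothetical pipeline stage. It forwards values from in
// to out until stop is closed or in is closed, then closes out so the
// downstream stage can finish draining.
func runStage(stop <-chan struct{}, in <-chan int, out chan<- int) {
	defer close(out)
	for {
		select {
		case <-stop:
			return
		case v, ok := <-in:
			if !ok {
				return
			}
			select {
			case out <- v:
			case <-stop:
				return
			}
		}
	}
}

func main() {
	// The context is cancelled on SIGINT (Ctrl-C) or SIGTERM.
	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer cancel()
	// Demo only: also stop after one second so the example ends on its own.
	ctx, timeoutCancel := context.WithTimeout(ctx, time.Second)
	defer timeoutCancel()

	in, out := make(chan int), make(chan int)
	go runStage(ctx.Done(), in, out)
	go func() { // hypothetical producer emitting one value every 100 ms
		t := time.NewTicker(100 * time.Millisecond)
		defer t.Stop()
		for i := 0; ; i++ {
			select {
			case <-t.C:
			case <-ctx.Done():
				return
			}
			select {
			case in <- i:
			case <-ctx.Done():
				return
			}
		}
	}()

	n := 0
	for range out { // ends when runStage closes out
		n++
	}
	fmt.Println("stage stopped after", n, "events; shutting down")
}
```

Closing the output channel on the way out lets the consumer loop terminate by itself, so shutdown propagates stage by stage without extra coordination.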
Testing
make test
Configuration
guenther is configured via a single YAML file (default: configs/default.yaml).
ingestion:
log_path: "/path/to/log/file/transfer.log" # file to tail
net_interface: "ens4" # interface for /proc/net/dev
disk_device: "vda1" # device for /proc/diskstats
systemctl_services:
- service1.service
- service2.service
transformation:
window_size: "30s" # tumbling window length
db_path: "data/pipeline.duckdb" # DuckDB file (use :memory: for ephemeral)
drain:
depth: 4
sim_threshold: 0.4
max_children: 100
max_clusters: 1000
masking_patterns: # applied in order before template mining
- name: "uuid"
pattern: '\b[0-9a-fA-F]{8}-...\b'
replace: "<UUID>"
type: "string"
# ... see configs/default.yaml for the full set
detector:
method: "ensemble" # fallback when ensemble.enabled = false
ensemble:
enabled: true
method: "sead" # avg | max | median | sead
contamination: 0.15
sead:
eta: 0.1
lambda: 0.01
auto_scaling:
enabled: true
high_threshold: 75.0 # CPU % → switch to mid detector
critical_threshold: 90.0 # CPU % → switch to fast detector
down_threshold: 50.0
high_duration: 90.0 # seconds load must persist before scaling
critical_duration: 120.0
down_duration: 120.0
rrcf_variants:
fast: { num_trees: 50, tree_size: 32, threshold_percentile: 0.85 }
mid: { num_trees: 150, tree_size: 64, threshold_percentile: 0.85 }
slow: { num_trees: 200, tree_size: 128, threshold_percentile: 0.85 }
copod:
buffer_size: 50
threshold: 0.3
mad:
threshold: 3.5
calibration_size: 50
output:
feature_log_path: "logs/features.jsonl"
anomaly_log_path: "logs/anomalies.jsonl"
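The auto_scaling settings combine CPU thresholds with hold durations, which amounts to a small state machine with hysteresis. The sketch below is one possible reading of those settings, not the actual ScalingController: it assumes the slow variant is the default, that load must stay past a threshold for the full duration before a switch, and that scaling down moves one stage at a time. The `controller`, `desired`, and `holdTime` names are hypothetical.

```go
package main

import "fmt"

// Values copied from the auto_scaling section of the sample configuration.
const (
	highThreshold     = 75.0  // CPU %: load that points to the mid detector
	criticalThreshold = 90.0  // CPU %: load that points to the fast detector
	downThreshold     = 50.0  // CPU %: load below this allows scaling back
	highDuration      = 90.0  // seconds
	criticalDuration  = 120.0 // seconds
	downDuration      = 120.0 // seconds
)

type stage int

const (
	slow stage = iota // largest forest, assumed to be the default
	mid
	fast // smallest forest, cheapest to run
)

func (s stage) String() string { return [...]string{"slow", "mid", "fast"}[s] }

// controller is a hypothetical stand-in for ScalingController.
type controller struct {
	cur     stage   // stage currently in use
	pending stage   // stage the recent load has been pointing to
	since   float64 // time in seconds at which pending was first seen
}

// desired returns the stage the current CPU load points to.
func desired(cur stage, cpu float64) stage {
	switch {
	case cpu >= criticalThreshold:
		return fast
	case cpu >= highThreshold && cur < mid:
		return mid
	case cpu < downThreshold && cur > slow:
		return cur - 1 // assumption: step back one stage at a time
	default:
		return cur
	}
}

// holdTime returns how long the load must persist before switching.
func holdTime(from, to stage) float64 {
	switch {
	case to < from:
		return downDuration
	case to == fast:
		return criticalDuration
	default:
		return highDuration
	}
}

// observe feeds one CPU sample taken at time now (in seconds) and
// returns the stage to use afterwards.
func (c *controller) observe(now, cpu float64) stage {
	want := desired(c.cur, cpu)
	switch {
	case want == c.cur:
		c.pending = c.cur // load is back in range: reset the timer
	case want != c.pending:
		c.pending, c.since = want, now // new direction: start timing
	case now-c.since >= holdTime(c.cur, want):
		c.cur = want // load persisted long enough: switch
	}
	return c.cur
}

func main() {
	var c controller
	samples := []struct{ t, cpu float64 }{{0, 80}, {30, 80}, {90, 80}, {100, 40}, {220, 40}}
	for _, s := range samples {
		fmt.Printf("t=%3.0fs cpu=%2.0f%% -> %s\n", s.t, s.cpu, c.observe(s.t, s.cpu))
	}
}
```

The separate down_threshold and the hold durations keep the detector from flapping between variants when CPU load hovers around a single threshold.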
Masking pattern types
Each masking pattern has a type. Patterns with type: float extract the matched value as a named numeric parameter into FeatureVector.ParamAvg;
patterns with type: string replace the match in place before template mining.
Named patterns (name != "") are aggregated as features per window.
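Because patterns are applied in order, a specific pattern has to come before a more general one that would otherwise consume part of its match. The sketch below illustrates that ordering together with float extraction. The `maskPattern` struct, the three example regexes, and the choice to also replace float matches are assumptions for illustration; they are not the patterns shipped in configs/default.yaml.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// maskPattern mirrors one entry of masking_patterns. Field names follow
// the YAML keys; the struct itself is hypothetical.
type maskPattern struct {
	Name    string
	Re      *regexp.Regexp
	Replace string
	Type    string // "string" or "float"
}

// examplePatterns returns illustrative patterns, ordered from most to
// least specific. They are not the project's real pattern set.
func examplePatterns() []maskPattern {
	return []maskPattern{
		{Name: "ip", Re: regexp.MustCompile(`\b\d{1,3}(?:\.\d{1,3}){3}\b`), Replace: "<IP>", Type: "string"},
		{Name: "duration", Re: regexp.MustCompile(`\b\d+\.\d+\b`), Replace: "<FLOAT>", Type: "float"},
		{Name: "", Re: regexp.MustCompile(`\b\d+\b`), Replace: "<NUM>", Type: "string"},
	}
}

// mask applies every pattern in order. For float-typed patterns it
// first collects the matched values under the pattern's name, then
// (as an assumption of this sketch) replaces them like any other match.
func mask(line string, patterns []maskPattern) (string, map[string][]float64) {
	params := make(map[string][]float64)
	for _, p := range patterns {
		if p.Type == "float" {
			for _, m := range p.Re.FindAllString(line, -1) {
				if v, err := strconv.ParseFloat(m, 64); err == nil {
					params[p.Name] = append(params[p.Name], v)
				}
			}
		}
		line = p.Re.ReplaceAllLiteralString(line, p.Replace)
	}
	return line, params
}

func main() {
	out, params := mask("sent 42 files to 10.0.0.7 in 1.5 s", examplePatterns())
	fmt.Println(out)
	fmt.Println(params)
}
```

If the generic number pattern ran first, the IP address would be split into four separate number tokens and the IP pattern would never match, which is why the order in the config matters.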
Output
logs/anomalies.jsonl: one JSON object per scored window, written one per line (pretty-printed here for readability):
{
"timestamp": "2026-01-15T14:32:00Z",
"score": 0.8721,
"is_anomaly": true,
"confidence": 0.91,
"method": "sead_ensemble",
"details": "rrcf_slow=0.91 copod=0.83 mad=0.78"
}
logs/features.jsonl — raw feature vectors for offline analysis (optional).
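For offline analysis, the anomaly log can be read line by line and decoded into a struct. The sketch below takes its field names from the sample record above; the `anomalyRecord` type, the `flagged` and `parseDetails` helpers, and the second sample record are made up for illustration, and the `name=score` layout of `details` is assumed from that single example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// anomalyRecord mirrors the keys of the sample record (hypothetical type).
type anomalyRecord struct {
	Timestamp  time.Time `json:"timestamp"`
	Score      float64   `json:"score"`
	IsAnomaly  bool      `json:"is_anomaly"`
	Confidence float64   `json:"confidence"`
	Method     string    `json:"method"`
	Details    string    `json:"details"`
}

// sample holds two JSONL lines; the second record is invented test data.
const sample = `{"timestamp":"2026-01-15T14:32:00Z","score":0.8721,"is_anomaly":true,"confidence":0.91,"method":"sead_ensemble","details":"rrcf_slow=0.91 copod=0.83 mad=0.78"}
{"timestamp":"2026-01-15T14:32:30Z","score":0.12,"is_anomaly":false,"confidence":0.4,"method":"sead_ensemble","details":"rrcf_slow=0.10 copod=0.15 mad=0.11"}`

// flagged decodes JSONL text and returns only records marked anomalous.
func flagged(jsonl string) ([]anomalyRecord, error) {
	var out []anomalyRecord
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var rec anomalyRecord
		if err := json.Unmarshal([]byte(line), &rec); err != nil {
			return nil, fmt.Errorf("decode record: %w", err)
		}
		if rec.IsAnomaly {
			out = append(out, rec)
		}
	}
	return out, sc.Err()
}

// parseDetails splits "name=score" pairs into a map, skipping anything
// that does not fit that shape.
func parseDetails(details string) map[string]float64 {
	scores := make(map[string]float64)
	for _, field := range strings.Fields(details) {
		name, value, ok := strings.Cut(field, "=")
		if !ok {
			continue
		}
		if v, err := strconv.ParseFloat(value, 64); err == nil {
			scores[name] = v
		}
	}
	return scores
}

func main() {
	recs, err := flagged(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range recs {
		fmt.Println(r.Timestamp.Format(time.RFC3339), r.Score, parseDetails(r.Details))
	}
}
```

To read the real file, the string reader can be swapped for the handle returned by `os.Open("logs/anomalies.jsonl")`.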
Project layout
guenther/
├── cmd/
│   └── pipeline/
│       └── main.go
├── internal/
│   ├── collector/
│   ├── config/
│   ├── detect/
│   ├── drain3/
│   ├── health/
│   └── transform/
├── pkg/
│   └── types/
├── configs/
│   └── default.yaml
├── build/              # created by `make build`
├── Makefile
└── README.md
License
This project was developed as part of a Bachelor's thesis.