guenther

A streaming anomaly detection pipeline for Managed-File-Transfer (MFT) infrastructure. guenther ingests system metrics and application logs in real time, extracts structured feature vectors per time window, and scores them with an ensemble of unsupervised detectors — without any labelled training data.


How it works

┌─────────────────────────────────────────────────────────────┐
│  Ingestion                                                  │
│  MetricCollector (/proc)  LogCollector (inotify + Drain3)   │
│  SystemctlCollector (service states)                        │
└────────────────────┬────────────────────────────────────────┘
                     │ channels (backpressure)
┌────────────────────▼────────────────────────────────────────┐
│  Transformation                                             │
│  TransformEngine    30 s tumbling windows via DuckDB        │
│  45 base features + N Drain3 parameter aggregates           │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│  Detection                                                  │
│  EnsembleDetector  (RRCF fast/mid/slow · COPOD · MAD)       │
│  SEAD online weight adaptation · auto-scaling (3 stages)    │
└────────────────────┬────────────────────────────────────────┘
                     │
              anomalies.jsonl
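
The "channels (backpressure)" hand-off between stages can be illustrated with a bounded Go channel. This is a minimal sketch, not the project's wiring: the `event` type, `produce`, `consume`, and the buffer size are hypothetical names and values chosen for the example.

```go
package main

import "fmt"

// event is a hypothetical stand-in for whatever a collector emits.
type event struct{ seq int }

// produce pushes n events into out and then closes it. Each send blocks
// while the channel buffer is full, so a slow consumer throttles the producer.
func produce(n int, out chan<- event) {
	defer close(out)
	for i := 0; i < n; i++ {
		out <- event{seq: i}
	}
}

// consume drains in and reports how many events arrived and whether they
// arrived in the order they were sent.
func consume(in <-chan event) (count int, ordered bool) {
	ordered = true
	for ev := range in {
		if ev.seq != count {
			ordered = false
		}
		count++
	}
	return count, ordered
}

func main() {
	ch := make(chan event, 4) // assumed buffer size: at most 4 events in flight
	go produce(100, ch)
	n, ok := consume(ch)
	fmt.Println(n, ok)
}
```

Because the buffer is bounded, memory use stays constant under load: the upstream stage simply waits instead of queueing without limit.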

Packages

| Path | Responsibility |
| --- | --- |
| cmd/pipeline | Entry point, wiring, graceful shutdown |
| internal/collector | MetricCollector (/proc), LogCollector (inotify), SystemctlCollector |
| internal/transform | TransformEngine: DuckDB windowed aggregation |
| internal/detect | EnsembleDetector, RRCF, COPOD, MAD, IsolationForest, SEAD, ScalingController |
| internal/drain3 | Masking / parameter extraction wrapper around Drain3 |
| internal/config | YAML config loading and regex compilation |
| internal/health | HealthMonitor: per-stage counters |
| pkg/types | Shared types: LogEvent, MetricSnapshot, FeatureVector, AnomalyResult |

Requirements

| Dependency | Notes |
| --- | --- |
| Docker | Required for the containerised build (recommended) |
| Go ≥ 1.25 | Only needed for local builds |
| gcc / libc6-dev | CGO is required by go-duckdb |
| Linux | Metric collection reads /proc; not supported on other OSes |

Building

Containerised (requires Docker, recommended)

make build

The binary is written to build/guenther.

Local (requires Go + gcc)

make build-local

Running

./build/guenther -config configs/default.yaml

guenther shuts down cleanly on SIGINT or SIGTERM.


Testing

make test

Configuration

guenther is configured via a single YAML file (default: configs/default.yaml).

ingestion:
  log_path: "/path/to/log/file/transfer.log" # file to tail
  net_interface: "ens4" # interface for /proc/net/dev
  disk_device: "vda1" # device for /proc/diskstats
  systemctl_services:
    - service1.service
    - service2.service

transformation:
  window_size: "30s" # tumbling window length
  db_path: "data/pipeline.duckdb" # DuckDB file (use :memory: for ephemeral)

drain:
  depth: 4
  sim_threshold: 0.4
  max_children: 100
  max_clusters: 1000
  masking_patterns: # applied in order before template mining
    - name: "uuid"
      pattern: '\b[0-9a-fA-F]{8}-...\b'
      replace: "<UUID>"
      type: "string"
    # ... see configs/default.yaml for the full set

detector:
  method: "ensemble" # fallback when ensemble.enabled = false
  ensemble:
    enabled: true
    method: "sead" # avg | max | median | sead
    contamination: 0.15
    sead:
      eta: 0.1
      lambda: 0.01
  auto_scaling:
    enabled: true
    high_threshold: 75.0 # CPU % → switch to mid detector
    critical_threshold: 90.0 # CPU % → switch to fast detector
    down_threshold: 50.0
    high_duration: 90.0 # seconds load must persist before scaling
    critical_duration: 120.0
    down_duration: 120.0
  rrcf_variants:
    fast: { num_trees: 50, tree_size: 32, threshold_percentile: 0.85 }
    mid: { num_trees: 150, tree_size: 64, threshold_percentile: 0.85 }
    slow: { num_trees: 200, tree_size: 128, threshold_percentile: 0.85 }
  copod:
    buffer_size: 50
    threshold: 0.3
  mad:
    threshold: 3.5
    calibration_size: 50

output:
  feature_log_path: "logs/features.jsonl"
  anomaly_log_path: "logs/anomalies.jsonl"

Masking pattern types

Patterns with type: float extract a named parameter into FeatureVector.ParamAvg; patterns with type: string replace the match in place before template mining. Patterns with a non-empty name (name != "") are aggregated as per-window features.
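
The two pattern types can be sketched with Go's regexp package. Everything here is illustrative: `maskPattern`, `mask`, `avg`, and the two demo patterns are hypothetical, and the sketch assumes that float patterns are masked as well as extracted, which the description above does not state.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// maskPattern mirrors one entry of drain.masking_patterns.
type maskPattern struct {
	name    string
	re      *regexp.Regexp
	replace string
	kind    string // "string" or "float"
}

// mask applies the patterns in order. Every match is replaced by the pattern's
// placeholder; for "float" patterns the matched text is additionally parsed
// and collected under the pattern's name.
func mask(line string, patterns []maskPattern) (string, map[string][]float64) {
	params := map[string][]float64{}
	for _, p := range patterns {
		if p.kind == "float" && p.name != "" {
			for _, m := range p.re.FindAllString(line, -1) {
				if v, err := strconv.ParseFloat(m, 64); err == nil {
					params[p.name] = append(params[p.name], v)
				}
			}
		}
		line = p.re.ReplaceAllLiteralString(line, p.replace)
	}
	return line, params
}

// avg returns the mean of vs, or 0 for an empty slice.
func avg(vs []float64) float64 {
	if len(vs) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range vs {
		sum += v
	}
	return sum / float64(len(vs))
}

// demoPatterns returns two made-up patterns; they are not from configs/default.yaml.
func demoPatterns() []maskPattern {
	return []maskPattern{
		{"ip", regexp.MustCompile(`\b\d{1,3}(?:\.\d{1,3}){3}\b`), "<IP>", "string"},
		{"duration", regexp.MustCompile(`\b\d+\.\d+\b`), "<NUM>", "float"},
	}
}

func main() {
	out, params := mask("transfer from 10.0.0.5 took 12.5 s", demoPatterns())
	fmt.Println(out)
	fmt.Println(avg(params["duration"]))
}
```

Order matters in this sketch: the IP pattern runs first, so the digits inside the address are already masked when the float pattern is applied.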


Output

logs/anomalies.jsonl contains one JSON object per scored window, one per line. The record below is pretty-printed for readability:

{
  "timestamp": "2026-01-15T14:32:00Z",
  "score": 0.8721,
  "is_anomaly": true,
  "confidence": 0.91,
  "method": "sead_ensemble",
  "details": "rrcf_slow=0.91 copod=0.83 mad=0.78"
}

logs/features.jsonl — raw feature vectors for offline analysis (optional).


Project layout

guenther/
├── cmd/
│   └── pipeline/
│       └── main.go
├── internal/
│   ├── collector/
│   ├── config/
│   ├── detect/
│   ├── drain3/
│   ├── health/
│   └── transform/
├── pkg/
│   └── types/
├── configs/
│   └── default.yaml
├── build/              # created by `make build`
├── Makefile
└── README.md

License

This project was developed as part of a Bachelor's thesis.