# guenther
A streaming anomaly detection pipeline for Managed File Transfer (MFT) infrastructure.
guenther ingests system metrics and application logs in real time, extracts a structured
feature vector per time window, and scores each vector with an ensemble of unsupervised
detectors. No labelled training data is required.

---
## How it works
```
┌─────────────────────────────────────────────────────────────┐
│  Ingestion                                                  │
│  MetricCollector (/proc)   LogCollector (inotify + Drain3)  │
│  SystemctlCollector (service states)                        │
└────────────────────┬────────────────────────────────────────┘
                     │  channels (backpressure)
┌────────────────────▼────────────────────────────────────────┐
│  Transformation                                             │
│  TransformEngine   30 s tumbling windows via DuckDB         │
│  45 base features + N Drain3 parameter aggregates           │
└────────────────────┬────────────────────────────────────────┘
┌────────────────────▼────────────────────────────────────────┐
│  Detection                                                  │
│  EnsembleDetector (RRCF fast/mid/slow · COPOD · MAD)        │
│  SEAD online weight adaptation · auto-scaling (3 stages)    │
└────────────────────┬────────────────────────────────────────┘
              anomalies.jsonl
```
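The "channels (backpressure)" edge in the diagram can be illustrated with a bounded Go channel: once the buffer is full, the producer blocks on send, so a slow downstream stage throttles the stage above it. This is a minimal sketch, not guenther's actual wiring; `Event`, `produce`, `consume`, and the buffer size are hypothetical names and values chosen for illustration.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for the items passed between stages.
type Event struct{ Seq int }

// produce sends n events into out and closes it when done.
// Each send blocks while the buffer is full; that blocking is the backpressure.
func produce(n int, out chan<- Event) {
	defer close(out)
	for i := 0; i < n; i++ {
		out <- Event{Seq: i}
	}
}

// consume drains in until it is closed and returns the number of events seen.
func consume(in <-chan Event) int {
	count := 0
	for range in {
		count++
	}
	return count
}

func main() {
	// Small buffer (assumed size): the producer stalls as soon as the
	// consumer falls more than four events behind.
	ch := make(chan Event, 4)
	go produce(10, ch)
	fmt.Println("received:", consume(ch))
}
```

A bounded buffer trades a little latency for bounded memory: nothing is dropped, but ingestion can never run arbitrarily far ahead of transformation.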
### Packages
| Path | Responsibility |
| -------------------- | -------------------------------------------------------------------------------- |
| `cmd/pipeline` | Entry point, wiring, graceful shutdown |
| `internal/collector` | `MetricCollector` (`/proc`), `LogCollector` (inotify), `SystemctlCollector` |
| `internal/transform` | `TransformEngine` — DuckDB windowed aggregation |
| `internal/detect` | `EnsembleDetector`, RRCF, COPOD, MAD, IsolationForest, SEAD, `ScalingController` |
| `internal/drain3` | Masking / parameter extraction wrapper around Drain3 |
| `internal/config` | YAML config loading and regex compilation |
| `internal/health` | `HealthMonitor` — per-stage counters |
| `pkg/types` | Shared types: `LogEvent`, `MetricSnapshot`, `FeatureVector`, `AnomalyResult` |
---
## Requirements
| Dependency | Notes |
| --------------- | ------------------------------------------------------------ |
| Docker | Required for the containerised build (recommended) |
| Go ≥ 1.25 | Only needed for local builds |
| gcc / libc6-dev | CGO is required by `go-duckdb` |
| Linux | Metric collection reads `/proc`; not supported on other OSes |
---
## Building
### Docker (recommended — no local toolchain needed)
```bash
make build
```
The binary is written to `build/guenther`.
### Local (requires Go + gcc)
```bash
make build-local
```
---
## Running
```bash
./build/guenther -config configs/default.yaml
```
guenther shuts down cleanly on `SIGINT` or `SIGTERM`.

---
## Testing
```bash
make test
```
---
## Configuration
guenther is configured via a single YAML file (default: `configs/default.yaml`).
```yaml
ingestion:
  log_path: "/path/to/log/file/transfer.log"  # file to tail
  net_interface: "ens4"                       # interface for /proc/net/dev
  disk_device: "vda1"                         # device for /proc/diskstats
  systemctl_services:
    - service1.service
    - service2.service

transformation:
  window_size: "30s"                # tumbling window length
  db_path: "data/pipeline.duckdb"   # DuckDB file (use :memory: for ephemeral)
  drain:
    depth: 4
    sim_threshold: 0.4
    max_children: 100
    max_clusters: 1000
  masking_patterns:                 # applied in order before template mining
    - name: "uuid"
      pattern: '\b[0-9a-fA-F]{8}-...\b'
      replace: "<UUID>"
      type: "string"
    # ... see configs/default.yaml for the full set

detector:
  method: "ensemble"                # fallback when ensemble.enabled = false
  ensemble:
    enabled: true
    method: "sead"                  # avg | max | median | sead
    contamination: 0.15
    sead:
      eta: 0.1
      lambda: 0.01
    auto_scaling:
      enabled: true
      high_threshold: 75.0          # CPU % → switch to mid detector
      critical_threshold: 90.0      # CPU % → switch to fast detector
      down_threshold: 50.0
      high_duration: 90.0           # seconds load must persist before scaling
      critical_duration: 120.0
      down_duration: 120.0
    rrcf_variants:
      fast: { num_trees: 50, tree_size: 32, threshold_percentile: 0.85 }
      mid: { num_trees: 150, tree_size: 64, threshold_percentile: 0.85 }
      slow: { num_trees: 200, tree_size: 128, threshold_percentile: 0.85 }
    copod:
      buffer_size: 50
      threshold: 0.3
    mad:
      threshold: 3.5
      calibration_size: 50

output:
  feature_log_path: "logs/features.jsonl"
  anomaly_log_path: "logs/anomalies.jsonl"
```
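The `auto_scaling` thresholds and durations can be read as a small hysteresis controller: a stage change only happens after the CPU load has stayed beyond a threshold for the configured number of seconds. The sketch below is one possible reading, not guenther's `ScalingController`. It assumes that `slow` is the default stage and that scaling down moves one stage at a time; `Scaler`, `Observe`, and `hold` are hypothetical names.

```go
package main

import "fmt"

// Stage names mirror the rrcf_variants keys in the config.
type Stage int

const (
	Slow Stage = iota // assumed default stage
	Mid
	Fast
)

func (s Stage) String() string { return [...]string{"slow", "mid", "fast"}[s] }

// Scaler is a hypothetical controller; its fields follow the auto_scaling keys.
type Scaler struct {
	High, Critical, Down float64 // CPU % thresholds
	HighDuration         float64 // seconds
	CriticalDuration     float64 // seconds
	DownDuration         float64 // seconds

	stage         Stage
	aboveHigh     float64 // seconds CPU has stayed >= High
	aboveCritical float64 // seconds CPU has stayed >= Critical
	belowDown     float64 // seconds CPU has stayed <= Down
}

// hold accumulates dt while cond is true and resets to zero otherwise.
func hold(acc float64, cond bool, dt float64) float64 {
	if cond {
		return acc + dt
	}
	return 0
}

// Observe feeds one CPU sample covering dt seconds and returns the active stage.
func (s *Scaler) Observe(cpu, dt float64) Stage {
	s.aboveHigh = hold(s.aboveHigh, cpu >= s.High, dt)
	s.aboveCritical = hold(s.aboveCritical, cpu >= s.Critical, dt)
	s.belowDown = hold(s.belowDown, cpu <= s.Down, dt)

	switch {
	case s.aboveCritical >= s.CriticalDuration:
		s.stage = Fast
	case s.aboveHigh >= s.HighDuration && s.stage < Mid:
		s.stage = Mid
	case s.belowDown >= s.DownDuration && s.stage > Slow:
		s.stage-- // assumption: step down one stage at a time
		s.belowDown = 0
	}
	return s.stage
}

func main() {
	s := &Scaler{High: 75, Critical: 90, Down: 50,
		HighDuration: 90, CriticalDuration: 120, DownDuration: 120}
	for i := 0; i < 4; i++ { // four 30 s windows at 80 % CPU
		fmt.Println(s.Observe(80, 30))
	}
}
```

Requiring the load to persist avoids flapping between detectors on short CPU spikes, at the cost of reacting up to `high_duration` or `critical_duration` seconds late.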
### Masking pattern types
The `type` of a masking pattern controls what happens to its matches.
Patterns with `type: float` extract the match as a named parameter into `FeatureVector.ParamAvg`.
Patterns with `type: string` replace the match in place before template mining.
Named patterns (`name != ""`) are aggregated as features per window.

---
## Output
**`logs/anomalies.jsonl`** — one JSON object per scored window:
```json
{
"timestamp": "2026-01-15T14:32:00Z",
"score": 0.8721,
"is_anomaly": true,
"confidence": 0.91,
"method": "sead_ensemble",
"details": "rrcf_slow=0.91 copod=0.83 mad=0.78"
}
```
**`logs/features.jsonl`** — raw feature vectors for offline analysis (optional).

---
## Project layout
```
guenther/
├── cmd/
│   └── pipeline/
│       └── main.go
├── internal/
│   ├── collector/
│   ├── config/
│   ├── detect/
│   ├── drain3/
│   ├── health/
│   └── transform/
├── pkg/
│   └── types/
├── configs/
│   └── default.yaml
├── build/          # created by `make build`
├── Makefile
└── README.md
```
---
## License
This project was developed as part of a Bachelor's thesis.