Automatic service recovery with OS-level zombie process prevention. Four layers of protection ensure your BizNode stays online with zero manual intervention — even after crashes, reboots, or runaway processes.
BizNode uses a four-layer defense system to guarantee uptime. Each layer watches the one below it, creating a chain of supervision that eliminates single points of failure. If one layer goes down, the layer above it brings it back automatically.
Monitors all services every 30 seconds. Detects failures via HTTP, TCP, and process checks. Auto-restarts downed services in dependency order.
Watches the watchdog itself every 5 minutes. If the watchdog daemon crashes or stalls, the guardian restarts it — ensuring the monitor is always monitored.
OS-level file locks prevent duplicate processes. No more zombie watchdogs or double bot instances — the OS enforces exactly one copy of each service.
Boots all services in dependency order at system logon. PostgreSQL first, then Qdrant, Ollama, Dashboard, and finally the Bot — every time, reliably.
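The first layer above (the 30-second watchdog) can be sketched as a minimal polling loop. This is an illustrative sketch, not BizNode's actual code — `run_checks`, the service-table structure, and the `check`/`restart` callables are hypothetical names:

```python
import time

def run_checks(services):
    """One watchdog pass: restart every service whose health check fails.
    Returns the names of the services that were restarted."""
    restarted = []
    for name, svc in services.items():
        if not svc["check"]():
            svc["restart"]()
            restarted.append(name)
    return restarted

def watchdog_loop(services, interval=30):
    """Layer 1: poll forever at a fixed interval (30 s in BizNode)."""
    while True:
        run_checks(services)
        time.sleep(interval)
```

Keeping the single pass (`run_checks`) separate from the infinite loop makes the recovery logic testable in isolation — the guardian layer can supervise the loop itself.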
Services depend on each other in a strict chain. The self-healing system understands this dependency order and uses it for both startup sequencing and failure recovery. When a service goes down, the system identifies what failed, categorizes the failure, and takes the right corrective action.
Database layer — must start first. Health checked via TCP connection on port 5432.
Vector database for knowledge base and semantic search. Health checked via HTTP on port 6333.
Local LLM inference engine. Health checked via HTTP on port 11434. Requires Qdrant for embedding storage.
FastAPI server on port 7777. Health checked via HTTP. Requires database and LLM to be running.
Telegram bot — the user-facing interface. Starts last because it depends on all upstream services.
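The startup sequencing described above amounts to a topological sort over the dependency chain. A minimal sketch — the `DEPENDS_ON` map mirrors the services listed here, but the names and structure are illustrative, not BizNode's internal representation:

```python
# Hypothetical dependency map mirroring the chain described above
DEPENDS_ON = {
    "postgresql": [],
    "qdrant": [],
    "ollama": ["qdrant"],
    "dashboard": ["postgresql", "ollama"],
    "bot": ["postgresql", "qdrant", "ollama", "dashboard"],
}

def startup_order(deps):
    """Topological sort: each service is started only after everything
    it depends on has already been started."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps[name]:
            visit(dep)  # start dependencies first
        order.append(name)

    for name in deps:
        visit(name)
    return order
```

The same ordering is useful in reverse for failure recovery: when a service dies, everything downstream of it is a candidate for a restart.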
| Method | Used For | How It Works |
|---|---|---|
| HTTP | Dashboard, Ollama, Qdrant | GET request to health endpoint — expects 200 OK within timeout |
| TCP | PostgreSQL | Socket connection to port — success means the service is accepting connections |
| Process | Bot, Watchdog | Checks if the process ID is alive and the singleton lock file is held |
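The three probe types in the table can each be written in a few lines of standard-library Python. A sketch, not BizNode's actual probes — the `/health` path and function names are illustrative, and the process probe shown uses the POSIX signal-0 idiom (on Windows a different liveness API would be needed, which is one reason BizNode also checks the singleton lock file):

```python
import http.client
import os
import socket

def http_check(host, port, path="/health", timeout=5):
    """HTTP probe: GET a health endpoint, expect 200 OK within the timeout."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except OSError:
        return False

def tcp_check(host, port, timeout=5):
    """TCP probe: success means the service is accepting connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def process_check(pid):
    """Process probe (POSIX idiom): signal 0 tests liveness without
    actually signalling the process."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False
```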
One of the most common problems in service management on Windows is zombie processes. Task Scheduler fires, a service restarts, or a user double-clicks a launcher — and suddenly there are two copies of the same service running, fighting over the same ports and database connections.
Windows Task Scheduler and service restarts can spawn duplicate processes. Two watchdog daemons means double alerts and conflicting restart commands. Two bot instances means duplicate message handling and database corruption. Traditional PID-file checks are unreliable — the process can crash without cleaning up its PID file, leaving a stale lock that blocks the real process from starting.
BizNode uses OS-level file locking — msvcrt.locking() on Windows,
fcntl.flock() on Linux — to enforce exactly one instance of each critical service.
```python
# Acquire an exclusive lock on a service-specific file (Windows)
import msvcrt

lock_file = open("watchdog/watchdog.lock", "w")
msvcrt.locking(lock_file.fileno(), msvcrt.LK_NBLCK, 1)

# If another instance is already running, locking raises OSError
# → the duplicate process exits immediately
# If the process crashes, the OS automatically releases the lock
# → no stale PID files, no manual cleanup, no race conditions
```
msvcrt on Windows, fcntl on Linux, same behavior on both.

Only one watchdog instance runs at a time. Prevents duplicate health checks and conflicting restart commands.
The persistent background agent that manages task queues and orchestration. Singleton lock prevents task duplication.
The watchdog's watchdog. Protected by its own singleton lock to prevent guardian duplication.
The boot orchestrator. Lock ensures only one startup sequence runs, even if Task Scheduler fires twice.
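A cross-platform version of the lock acquisition that each of these services performs might look like the following. This is a minimal sketch under the behavior described above — the function name and lock-file path are illustrative:

```python
import sys

def acquire_singleton_lock(path):
    """Take an exclusive OS-level lock on `path`. Raises OSError if
    another process already holds it. The OS releases the lock
    automatically when the holder exits — even after a crash."""
    lock_file = open(path, "w")
    if sys.platform == "win32":
        import msvcrt
        msvcrt.locking(lock_file.fileno(), msvcrt.LK_NBLCK, 1)
    else:
        import fcntl
        fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    return lock_file  # keep this object alive; closing it releases the lock
```

Each service would call this once at startup with its own lock-file path and exit immediately if `OSError` is raised, since that means another instance is already running.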
The Agent Runtime is a persistent background service that starts at system logon and runs continuously. It serves as the bridge between BizNode's task management layer and the AI orchestration engine.
The Agent Runtime is protected by its own singleton lock, ensuring exactly one instance runs at all times. It coordinates between the runtime execution layer (task processing, queue management) and the AI orchestration layer (LLM calls, decision-making, autonomous actions).
```text
System Logon
 ↓ Task Scheduler triggers startup sequence
 ↓ Singleton lock acquired for agent_runtime.lock
 ↓ Task queue initialized
 ↓ Node orchestrator connects to mesh network
 ↓ Reputation service starts tracking
 ↓ Runtime enters main loop (process tasks, respond to events)
 ...
Process Exit / Crash
 ↓ OS releases singleton lock automatically
 ↓ Guardian detects agent runtime is down
 ↓ Guardian restarts agent runtime
```
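The crash-recovery step in that lifecycle — the OS releasing the singleton lock automatically — can be demonstrated with a small POSIX sketch. `fcntl.flock` stands in for `msvcrt.locking` on Windows, and the temporary lock path is purely illustrative:

```python
import fcntl
import os
import subprocess
import sys
import tempfile

fd, lock_path = tempfile.mkstemp(suffix=".lock")
os.close(fd)

# Child process: grab the lock, then die abruptly with no cleanup at all.
child_code = f"""
import fcntl, os
f = open({lock_path!r}, "w")
fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
os._exit(1)  # simulated crash: the lock is never explicitly released
"""
subprocess.run([sys.executable, "-c", child_code], check=False)

# Parent: the lock is immediately available again, because the OS
# released it when the child exited. No stale PID file to clean up.
f = open(lock_path, "w")
fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
```

This is exactly why OS-level locks beat PID files: a crashed holder can never leave a stale lock behind.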
While the self-healing system works autonomously, you have full visibility into what it is doing. The BizNode dashboard provides real-time service status, manual controls, and failure history.
See the health of every service at a glance — green for running, red for down, amber for degraded. Updated every 30 seconds by the watchdog.
Manual start, stop, and restart controls for each service. Override the watchdog when you need to take a service down for maintenance.
Full log of health checks, failures, and recovery actions. See when services went down, why, and how long recovery took.
Instant notifications to the node owner when a service fails or recovers. Includes failure category, affected service, and recovery action taken.
Download BizNode to get self-healing infrastructure out of the box — watchdog, guardian, singleton locks, and intelligent recovery, all preconfigured.
Download BizNode →