Automatic service recovery with OS-level zombie process prevention. Four layers of protection ensure your BizNode stays online with zero manual intervention — even after crashes, reboots, or runaway processes.
BizNode uses a four-layer defense system to guarantee uptime. Each layer watches the one below it, creating a chain of supervision that eliminates single points of failure. If one layer goes down, the layer above it brings it back automatically.
Monitors all services every 30 seconds. Detects failures via HTTP, TCP, and process checks. Auto-restarts downed services in dependency order.
Watches the watchdog itself every 5 minutes. If the watchdog daemon crashes or stalls, the guardian restarts it — ensuring the monitor is always monitored.
OS-level file locks prevent duplicate processes. No more zombie watchdogs or double bot instances — the OS enforces exactly one copy of each service.
Boots all services in dependency order at system logon. PostgreSQL first, then Qdrant, Ollama, Dashboard, and finally the Bot — every time, reliably.
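The first layer above (the 30-second watchdog) can be sketched as a minimal polling loop. This is an illustrative sketch, not BizNode's actual code — `run_checks`, the service-table structure, and the `check`/`restart` callables are hypothetical names:

```python
import time

def run_checks(services):
    """One watchdog pass: restart every service whose health check fails.
    Returns the names of the services that were restarted."""
    restarted = []
    for name, svc in services.items():
        if not svc["check"]():
            svc["restart"]()
            restarted.append(name)
    return restarted

def watchdog_loop(services, interval=30):
    """Layer 1: poll forever at a fixed interval (30 s in BizNode)."""
    while True:
        run_checks(services)
        time.sleep(interval)
```

Keeping the single pass (`run_checks`) separate from the infinite loop makes the recovery logic testable in isolation — the guardian layer can supervise the loop itself.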
Services depend on each other in a strict chain. The self-healing system understands this dependency order and uses it for both startup sequencing and failure recovery. When a service goes down, the system identifies what failed, categorizes the failure, and takes the right corrective action.
Database layer — must start first. Health checked via TCP connection on port 5432.
Vector database for knowledge base and semantic search. Health checked via HTTP on port 6333.
Local LLM inference engine. Health checked via HTTP on port 11434. Requires Qdrant for embedding storage.
FastAPI server on port 7777. Health checked via HTTP. Requires database and LLM to be running.
Telegram bot — the user-facing interface. Starts last because it depends on all upstream services.
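The startup sequencing described above amounts to a topological sort over the dependency chain. A minimal sketch — the `DEPENDS_ON` map mirrors the services listed here, but the names and structure are illustrative, not BizNode's internal representation:

```python
# Hypothetical dependency map mirroring the chain described above
DEPENDS_ON = {
    "postgresql": [],
    "qdrant": [],
    "ollama": ["qdrant"],
    "dashboard": ["postgresql", "ollama"],
    "bot": ["postgresql", "qdrant", "ollama", "dashboard"],
}

def startup_order(deps):
    """Topological sort: each service is started only after everything
    it depends on has already been started."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps[name]:
            visit(dep)  # start dependencies first
        order.append(name)

    for name in deps:
        visit(name)
    return order
```

The same ordering is useful in reverse for failure recovery: when a service dies, everything downstream of it is a candidate for a restart.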
| Method | Used For | How It Works |
|---|---|---|
| HTTP | Dashboard, Ollama, Qdrant | GET request to health endpoint — expects 200 OK within timeout |
| TCP | PostgreSQL | Socket connection to port — success means the service is accepting connections |
| Process | Bot, Watchdog | Checks if the process ID is alive and the singleton lock file is held |
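The three probe types in the table can each be written in a few lines of standard-library Python. A sketch, not BizNode's actual probes — the `/health` path and function names are illustrative, and the process probe shown uses the POSIX signal-0 idiom (on Windows a different liveness API would be needed, which is one reason BizNode also checks the singleton lock file):

```python
import http.client
import os
import socket

def http_check(host, port, path="/health", timeout=5):
    """HTTP probe: GET a health endpoint, expect 200 OK within the timeout."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except OSError:
        return False

def tcp_check(host, port, timeout=5):
    """TCP probe: success means the service is accepting connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def process_check(pid):
    """Process probe (POSIX idiom): signal 0 tests liveness without
    actually signalling the process."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False
```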
One of the most common problems in service management on Windows is zombie processes. Task Scheduler fires, a service restarts, or a user double-clicks a launcher — and suddenly there are two copies of the same service running, fighting over the same ports and database connections.
Windows Task Scheduler and service restarts can spawn duplicate processes. Two watchdog daemons means double alerts and conflicting restart commands. Two bot instances means duplicate message handling and database corruption. Traditional PID-file checks are unreliable — the process can crash without cleaning up its PID file, leaving a stale lock that blocks the real process from starting.
BizNode uses OS-level file locking — msvcrt.locking() on Windows,
fcntl.flock() on Linux — to enforce exactly one instance of each critical service.
```python
# Acquire an exclusive lock on a service-specific file (Windows)
import msvcrt

lock_file = open("watchdog/watchdog.lock", "w")
msvcrt.locking(lock_file.fileno(), msvcrt.LK_NBLCK, 1)

# If another instance is already running, locking raises OSError
# → the duplicate process exits immediately
# If the process crashes, the OS automatically releases the lock
# → no stale PID files, no manual cleanup, no race conditions
```
msvcrt on Windows, fcntl on Linux, same behavior on both.

Only one watchdog instance runs at a time. Prevents duplicate health checks and conflicting restart commands.
The persistent background agent that manages task queues and orchestration. Singleton lock prevents task duplication.
The watchdog's watchdog. Protected by its own singleton lock to prevent guardian duplication.
The boot orchestrator. Lock ensures only one startup sequence runs, even if Task Scheduler fires twice.
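A cross-platform version of the lock acquisition that each of these services performs might look like the following. This is a minimal sketch under the behavior described above — the function name and lock-file path are illustrative:

```python
import sys

def acquire_singleton_lock(path):
    """Take an exclusive OS-level lock on `path`. Raises OSError if
    another process already holds it. The OS releases the lock
    automatically when the holder exits — even after a crash."""
    lock_file = open(path, "w")
    if sys.platform == "win32":
        import msvcrt
        msvcrt.locking(lock_file.fileno(), msvcrt.LK_NBLCK, 1)
    else:
        import fcntl
        fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    return lock_file  # keep this object alive; closing it releases the lock
```

Each service would call this once at startup with its own lock-file path and exit immediately if `OSError` is raised, since that means another instance is already running.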
The Agent Runtime is a persistent background service that starts at system logon and runs continuously. It serves as the bridge between BizNode's task management layer and the AI orchestration engine.
The Agent Runtime is protected by its own singleton lock, ensuring exactly one instance runs at all times. It coordinates between the runtime execution layer (task processing, queue management) and the AI orchestration layer (LLM calls, decision-making, autonomous actions).
```text
System Logon
 ↓ Task Scheduler triggers startup sequence
 ↓ Singleton lock acquired for agent_runtime.lock
 ↓ Task queue initialized
 ↓ Node orchestrator connects to mesh network
 ↓ Reputation service starts tracking
 ↓ Runtime enters main loop (process tasks, respond to events)
 ...
Process Exit / Crash
 ↓ OS releases singleton lock automatically
 ↓ Guardian detects agent runtime is down
 ↓ Guardian restarts agent runtime
```
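The crash-recovery step in that lifecycle — the OS releasing the singleton lock automatically — can be demonstrated with a small POSIX sketch. `fcntl.flock` stands in for `msvcrt.locking` on Windows, and the temporary lock path is purely illustrative:

```python
import fcntl
import os
import subprocess
import sys
import tempfile

fd, lock_path = tempfile.mkstemp(suffix=".lock")
os.close(fd)

# Child process: grab the lock, then die abruptly with no cleanup at all.
child_code = f"""
import fcntl, os
f = open({lock_path!r}, "w")
fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
os._exit(1)  # simulated crash: the lock is never explicitly released
"""
subprocess.run([sys.executable, "-c", child_code], check=False)

# Parent: the lock is immediately available again, because the OS
# released it when the child exited. No stale PID file to clean up.
f = open(lock_path, "w")
fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
```

This is exactly why OS-level locks beat PID files: a crashed holder can never leave a stale lock behind.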
While the self-healing system works autonomously, you have full visibility into what it is doing. The BizNode dashboard provides real-time service status, manual controls, and failure history.
See the health of every service at a glance — green for running, red for down, amber for degraded. Updated every 30 seconds by the watchdog.
Manual start, stop, and restart controls for each service. Override the watchdog when you need to take a service down for maintenance.
Full log of health checks, failures, and recovery actions. See when services went down, why, and how long recovery took.
Instant notifications to the node owner when a service fails or recovers. Includes failure category, affected service, and recovery action taken.
Download BizNode to get self-healing infrastructure out of the box — watchdog, guardian, singleton locks, and intelligent recovery, all preconfigured.
Download BizNode →