Tuvastada juurpõhjus, miks telegram_listener.py cron-loop jätab process_state_log (PSL) read 'running' last_heartbeat vana, AGA agent_runs (AR) vastav rida saab kuidagi 'failed'/'completed'. Koord 22 leitud 63 orphan PSL (61× eirik_os.telegram_listener + 2× claude_code.session).
| Hüpotees | Seisukord | Roll juurpõhjuses |
|---|---|---|
A — Mitte-graceful shutdown (Windows TerminateProcess / SIGBREAK / OS reboot / VS Code crash). telegram_listener käsitleb ainult SIGINT+SIGTERM, mitte atexit ega SIGBREAK. |
✅ KINNITATUD | PRIMARY (koord22 kontekstis) — VBScript detached spawn pattern (Shell.Run False) lõi 60+ paralleelseid daemon-i, mille puhul puhul state.finish() ei käivitunud üldse. Mass-cleanup SQL UPDATE 19:13:17 (kõik 61 AR error_message='killed_ghost_koord22_idempotency_fix') möödas state.finish() lifecycle'ist. |
B — heartbeat_watcher.kill_stale_run non-atomic (PSL UPDATE step 2 try/except log.warning ainult, AR juba updated step 1-s). |
⚠️ KINNITATUD ARHITEKTUURIS, AGA mitte juurpõhjus koord22 | Sekundaarne — Vahimees killed_stale_heartbeat kasutati AINULT 1 telegram_listener juhtum (AR id=48 päev varem 2026-05-26). Koord22 mass-orphan PSL ei tulnud Vahimehe kaudu. R4 find_orphan_psl_zombies + kill_orphan_psl (koord22 lisand) puhastas 28+ orphan PSL 21:39:26 — see palch, mitte parandus. |
C — eirik_os.agent_state.AgentRunState.finish() non-atomic — 2-step sequential UPDATE (PSL, siis AR) ilma rollback'ita. |
✅ KINNITATUD ARHITEKTUURIS | TULEVANE PROBLEEM — kui state.finish()-i ajal step 1 (PSL UPDATE) saab HTTP timeout / võrgu-erroriga, step 2 (AR UPDATE) JÄTKAB ja AR sulgub 'completed' kuigi PSL jääb 'running'. Migration 0060 atomic RPC olemas (kasutatud tier_2_5_helper-s), AGA agent_state.py ei kasuta seda. |
VERDICT: Kombinatsioon A + C — A = koord22 juurpõhjus, C = tuleviku probleemi-allikas.
2026-05-27 19:13:17.525053+00 (SAMA HETK) + error_message = killed_ghost_koord22_idempotency_fix. See viitab ühele SQL UPDATE-le (mitte Vahimees R3 zombie kill cron'ile). Grep killed_ghost_koord22_idempotency_fix kogu repos → 0 faili. Manuaalne SQL koord22 koordinaator-sessioonis (vs_eirik) sõidetud.| AR status | error_message | PSL status | cnt | Mõõte |
|---|---|---|---|---|
| failed | killed_ghost_koord22_idempotency_fix | killed | 61 | Mass-cleanup AR UPDATE 19:13:17 + Vahimees R4 PSL cleanup 21:39:26 (~2h 26min hiljem) |
| failed | killed_ghost_koord22_idempotency_fix | completed | 2 | AR id=111+113 — PSL juba enne sulgenud 14:19:01 (graceful shutdown state.finish() töötas mõlemad updated edukalt, AR jäi 'running' kuna ID-d olid juba mass-cleanup'i ootel) |
| failed | killed_stale_heartbeat (Vahimees M1) | failed | 1 | Eraldi juhtum 2026-05-26 — Vahimees M1 normaalne kill |
| failed | TaskStop SIGKILL — replaced by Task Scheduler daemon id=48 | failed | 1 | Eraldi juhtum 2026-05-26 — manuaalne replace |
| failed | killed_zombie | killed | 1 | Eraldi juhtum 2026-05-27 — Vahimees R3 60min zombie kill |
| running | NULL | running | 1 | Praegune LIVE daemon (AR id=268, PSL id=256) — käivitas 2026-05-27 19:19:02, last_heartbeat 21:55:35 (21 sek tagasi). TERVE. |
AR id=111: started 2026-05-27 13:55:13 → PSL finished 14:19:01 (graceful) → AR mass-cleanup 19:13:17
run_duration: 5h 18min · psl_close_delay: -4h 54min (PSL sulgus AR-ist enne)
AR id=113: started 2026-05-27 14:00:13 → PSL finished 14:19:01 (graceful) → AR mass-cleanup 19:13:17
run_duration: 5h 13min · psl_close_delay: -4h 54min
See on vastupidine orphan PSL-le: PSL sulgus, AR jäi 'running'. Tõenäoselt mõlemad daemon-id said SIGINT (Ctrl+C-iga 14:19:01) → state.finish(status='completed') sõidetud → mõlemad UPDATE-id õnnestusid. AGA hiljem mass-cleanup SQL UPDATE puudutas mõlemaid AR-i (filter WHERE stage='telegram_listener' AND status='failed' OR 'running' VÕI lihtsalt sai veerul stage=telegram_listener mass-kasvataja).
NOW UTC: 2026-05-27 21:55:56 open_psl_telegram: 1 (PSL id=256, last_heartbeat 21:55:35 — TERVE) open_ar_telegram: 1 (AR id=268, started 19:19:02)
✅ Orphan PSL count = 0 — koord22 R4 puhastus + _is_already_running() idempotency-fix ennetab uut kasvu.
run_hidden.vbsShell.Run "python telegram_listener.py", 0, False — False = async, ei ootastate.finish() ei käivitu kunagi (signal_handler ei tule, while-loop ei lõpe)UPDATE agent_runs SET status='failed', error_message='killed_ghost_koord22_idempotency_fix' WHERE stage='telegram_listener' AND status='running' → 61 AR sulgesimetaskkill Windowstelegram_listener.py + _is_already_running() idempotency-check (read 472-499) — uus daemon-tellija peatub kohe kui leiab AR runningShell.Run sync (False→True)heartbeat_watcher.py R4 find_orphan_psl_zombies + kill_orphan_psl — 21:39:26 puhastas 28+ orphan PSL status='killed'agent_state.AgentRunState.finish() migreerima atomic RPC-lePõhjus: Tulevikus võib mistahes daemon (mitte ainult telegram_listener) langeda Hypothesis C all — kui shutdown'i ajal step 1 (PSL UPDATE) saab võrgu-erroriga, AR sulgub edukalt aga PSL jääb 'running' = uus orphan PSL.
Lahendus-kandidaat: eirik_os/agent_state.py read 437-505 finish() meetod → kutsub tier_2_5_helper.finish()-le mille all on Migration 0060 atomic RPC tier_2_5_finish (LIVE kinnitatud information_schema.routines kontrolliga).
def finish(self, status="completed", exit_code=0, counts=None, summary=None, error=None, cost_usd=0.0) -> None:
if not self.enabled or self.run_id is None or self.process_state_id is None:
return
try:
from eirik_os.tier_2_5_helper import finish as _atomic_finish
_atomic_finish(
state={"agent_run_id": self.run_id, "process_state_id": self.process_state_id, "agent_id": self.agent_id},
status=status,
summary_merge=summary,
exit_code=exit_code,
error=error,
release_locks=True,
)
except Exception as e:
log.warning("Atomic finish ebaõnnestus — fallback non-atomic: %s", e)
# ... olemasolev 2-step kood ...
Tee: Edit 07_skriptid/eirik_os/agent_state.py + register_edit + pos_deep_memory category='otsus' ENNE muudatust + Erki kinnitus (PR #111 strateegiline — mõjutab 5+ agendi-rolli K0).
telegram_listener.py robustsem shutdownPõhjus: Windows TerminateProcess / Task Scheduler "End task" bypassi SIGTERM. atexit + SIGBREAK lisamine ennetab no-finish-call juhtumit.
import atexit
def _final_finish():
"""atexit hook — viimase rea kindlustaja kui SIGTERM/SIGINT ei tulnud."""
global state
if state and state.run_id:
try:
state.finish(status="partial", error="atexit_no_signal")
except Exception:
pass
atexit.register(_final_finish)
# Lisaks signal_handler-i SIGBREAK (Windows)
if hasattr(signal, "SIGBREAK"):
signal.signal(signal.SIGBREAK, _signal_handler)
heartbeat_watcher.kill_stale_run atomicPõhjus: 4-step sequential UPDATE (read 219-273) — kui step 2 (PSL UPDATE) fail, log.warning ainult AGA AR juba 'failed'. Eraldi orphan PSL allikas.
Kandidaat-SQL: Migration 0064+ — kill_stale_run_atomic(p_run_id, p_kill_reason, p_release_locks) RPC stiilis tier_2_5_finish:
CREATE OR REPLACE FUNCTION kill_stale_run_atomic(
p_run_id BIGINT, p_kill_reason TEXT, p_release_locks BOOL DEFAULT TRUE
) RETURNS JSONB LANGUAGE plpgsql AS $$
DECLARE
v_psl_id BIGINT;
v_agent_id TEXT;
v_locks_released INT := 0;
BEGIN
SELECT process_state_id, agent_id INTO v_psl_id, v_agent_id
FROM agent_runs WHERE id = p_run_id;
UPDATE agent_runs SET status='failed', finished_at=NOW(), error_message=p_kill_reason
WHERE id = p_run_id;
IF v_psl_id IS NOT NULL THEN
UPDATE process_state_log SET status='killed', finished_at=NOW()
WHERE id = v_psl_id;
END IF;
IF p_release_locks THEN
WITH del AS (
DELETE FROM agent_task_locks WHERE locked_run_id = p_run_id RETURNING 1
) SELECT COUNT(*) INTO v_locks_released FROM del;
END IF;
RETURN jsonb_build_object(
'run_id', p_run_id,
'psl_id', v_psl_id,
'agent_id', v_agent_id,
'locks_released', v_locks_released
);
END $$;
Esitatakse Erkile agent_messages priority=2 + pos_deep_memory category='otsus' järgmise sessiooni kandidaadina.
| Vahimees moodul | Reegel / kontroll | Eesmärk |
|---|---|---|
| M3 ComplianceAuditor | Regex SQL-päring iga 6h: orphan PSL count by component (komponent ↔ AR.status ↔ PSL.status muster) | Vali komponent kus orphan PSL > 5 / 24h → alert priority=2 Erkile |
| M6 CodeAudit | AST-skann: state.finish kutse → kontrolli kas atexit + SIGBREAK handler olemas (Windows-host) | Iga uus daemon-skript peab katma 3 shutdown-rada (SIGINT/SIGTERM/atexit) |
| M6 CodeAudit | Regex Bash/PS1/VBS: Shell.Run.*False või Start-Process -NoWait ilma -Wait flagita → CRITICAL warning | Detached spawn pattern keelustada — sub-second restart cron loob orphan AR+PSL |
| M1 HeartbeatWatcher R4 | Jätkata find_orphan_psl_zombies + kill_orphan_psl cron'iga (LIVE) | Fallback, kui algne kõik kaitsed ei tööta. Hetkel teeb selle 5min cron'iga. |
| Unit-test (CI) | tests/test_agent_state_finish_atomic.py — mock PSL UPDATE fail (HTTP 500) → assert AR EI ole 'completed' (rollback) | Migration 0060 RPC integratsioon kindlustatud testidega |
tier_2_5_helper-i (mitte vana eirik_os.agent_state-i), siis Hypothesis C ei oleks võimalik. Migration 0060 LIVE (kontrollitud information_schema.routines), AGA agent_state.py ei kasuta seda. Konsolideerumine vajalik.telegram_listener.py, heartbeat_watcher.py, agent_state.py, tier_2_5_helper.py) — ühtegi muudatust ei tehtud. read-only mode.system_edit_registry.jsonl