Agent Registry Architecture
The Agent Registry is a three-layered system for managing agent lifecycle, discovery, and health monitoring across the platform.
🗄 Registry Management Layer
DynamicAgentRegistry
In-Memory Maps
DynamoDB Persistence
EventBridge Events
🔎 Discovery & Routing Layer
AgentRegistry
DynamicAgentLoader
HotReloadManager
ManifestLoader
📊 Monitoring & Capability Layer
HealthMonitor
CapabilityIndex
HeartbeatPublisher
ServiceRegistry
Registered Agents: 55+
Discovery Sources: 4
Index Dimensions: 4
Hot Reload Interval: 30s
🏗
Key Design Patterns
Production-grade agent orchestration
💾 Dual Storage
Hot (memory) + Cold (DynamoDB)
🔍 Multi-Index
4 views of capability data
📡 Event-Driven
Reactive status propagation
💚 Health-Aware
Route only to healthy agents
💡 Why a Registry?
The registry enables dynamic agent orchestration: agents can be added, removed, or updated without redeploying the platform. It provides fast capability lookup, health-aware routing, and configuration updates that take effect within one hot-reload polling interval (30 seconds).
Agent Registration
Agents register via two approaches: programmatic registration at runtime, or manifest-based discovery from configuration files.
📝
Programmatic Registration
Runtime agent registration
Agent instances register themselves with the registry at startup. The registry stores the instance, persists metadata, and publishes lifecycle events.
const registry = new DynamicAgentRegistry(config);
await registry.registerAgent(agentInstance);

// Internally:
// 1. Store in Map<agentId, AgentBase>
// 2. Create RegisteredAgent metadata
// 3. Persist to DynamoDB
// 4. Publish 'AgentRegistered' event
// 5. Set up status/capacity listeners
// 6. Initialize the agent
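The numbered steps in the comment above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual DynamicAgentRegistry: `SketchRegistry`, `AgentLike`, `AgentRecord`, `persist`, and `publish` are hypothetical names, and the persistence and event calls are injected stubs rather than real DynamoDB or EventBridge clients.

```typescript
// Hypothetical minimal agent shape; the real AgentBase is not shown in this document.
interface AgentLike {
  id: string;
  name: string;
  capabilities: string[];
  initialize(): Promise<void>;
}

// Reduced stand-in for the RegisteredAgent metadata.
interface AgentRecord {
  id: string;
  name: string;
  status: 'available' | 'unavailable' | 'degraded';
  capabilities: string[];
}

// Injected side effects (assumed signatures) so the sketch runs without AWS access.
type Persist = (record: AgentRecord) => Promise<void>;
type Publish = (detailType: string, detail: AgentRecord) => Promise<void>;

class SketchRegistry {
  private readonly agents = new Map<string, AgentLike>();
  private readonly records = new Map<string, AgentRecord>();
  private readonly persist: Persist;
  private readonly publish: Publish;

  constructor(persist: Persist, publish: Publish) {
    this.persist = persist;
    this.publish = publish;
  }

  async registerAgent(agent: AgentLike): Promise<AgentRecord> {
    this.agents.set(agent.id, agent);               // 1. hot (in-memory) storage
    const record: AgentRecord = {                   // 2. metadata
      id: agent.id,
      name: agent.name,
      status: 'available',
      capabilities: agent.capabilities.slice(),
    };
    this.records.set(agent.id, record);
    await this.persist(record);                     // 3. cold storage
    await this.publish('AgentRegistered', record);  // 4. lifecycle event
    // 5. status/capacity listeners are omitted from this sketch
    await agent.initialize();                       // 6. initialize the agent
    return record;
  }

  getAgent(id: string): AgentLike | undefined {
    return this.agents.get(id);
  }

  get(id: string): AgentRecord | undefined {
    return this.records.get(id);
  }
}
```

Injecting `persist` and `publish` keeps the ordering of the steps visible and lets the flow be exercised in a unit test without any cloud dependencies.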
📄
Manifest-Based Registration
Declarative agent configuration
Agents can also be discovered via manifests that declare capabilities, dependencies, configuration, and monitoring requirements.
Field | Type | Description
id | string | Unique agent identifier
name | string | Human-readable name
type | enum | core | specialist | coordinator | utility
capabilities | object | Core capabilities + enhancements
dependencies | object | Required services and agents
configuration | object | Environment, features, limits
monitoring | object | Metrics and health checks
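As an illustration of the manifest fields listed above, the sketch below types them and runs a basic shape check. It assumes every field is required and treats the nested objects as opaque, since their inner structure is not described here; `AgentManifestSketch` and `validateManifest` are hypothetical names.

```typescript
type AgentType = 'core' | 'specialist' | 'coordinator' | 'utility';

// Nested objects are left opaque because their inner shape is not documented here.
interface AgentManifestSketch {
  id: string;
  name: string;
  type: AgentType;
  capabilities: Record<string, unknown>;
  dependencies: Record<string, unknown>;
  configuration: Record<string, unknown>;
  monitoring: Record<string, unknown>;
}

const AGENT_TYPES: readonly string[] = ['core', 'specialist', 'coordinator', 'utility'];

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns a list of problems; an empty list means the manifest passed these checks.
function validateManifest(input: unknown): string[] {
  if (!isPlainObject(input)) return ['manifest must be a JSON object'];
  const errors: string[] = [];
  for (const key of ['id', 'name']) {
    const value = input[key];
    if (typeof value !== 'string' || value.length === 0) {
      errors.push(key + ' must be a non-empty string');
    }
  }
  const type = input['type'];
  if (typeof type !== 'string' || AGENT_TYPES.indexOf(type) === -1) {
    errors.push('type must be one of: ' + AGENT_TYPES.join(' | '));
  }
  // Assumption: all four object fields are required.
  for (const key of ['capabilities', 'dependencies', 'configuration', 'monitoring']) {
    if (!isPlainObject(input[key])) errors.push(key + ' must be an object');
  }
  return errors;
}
```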
📊
Registered Agent Metadata
What gets stored for each agent
interface RegisteredAgent {
  id: string;
  name: string;
  type: string;                // e.g. "supervisor", "general-agent"
  status: 'available' | 'unavailable' | 'degraded';
  capabilities: string[];       // Array of capability IDs
  config: AgentConfig;
  capacity: {
    maxConcurrent: number;
    current: number;
  };
  metrics: {
    requestsHandled: number;
    avgResponseTime: number;
    errorRate: number;
  };
  lastHealthCheck: Date;
  responseQueue: AgentQueueMetadata;
}
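The `status` and `capacity` fields above are what make health-aware routing possible. The sketch below shows one way a router might choose among candidates. The least-loaded selection policy and the `pickAgent` name are assumptions for illustration; the document does not specify how the platform breaks ties.

```typescript
// Subset of RegisteredAgent that routing needs.
interface RoutableAgent {
  id: string;
  status: 'available' | 'unavailable' | 'degraded';
  capacity: { maxConcurrent: number; current: number };
}

// Returns the available agent with the lowest load ratio, or undefined if none qualifies.
function pickAgent(candidates: RoutableAgent[]): RoutableAgent | undefined {
  let best: RoutableAgent | undefined;
  let bestLoad = Infinity;
  for (const agent of candidates) {
    if (agent.status !== 'available') continue;          // skip unavailable and degraded agents
    const { maxConcurrent, current } = agent.capacity;
    if (maxConcurrent <= 0 || current >= maxConcurrent) continue; // no spare capacity
    const load = current / maxConcurrent;
    if (load < bestLoad) {
      best = agent;
      bestLoad = load;
    }
  }
  return best;
}
```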
Capability Index
The CapabilityIndex maintains four bidirectional indices for fast lookup of agents by capability, domain, or action.
4 Bidirectional Indices
Capability → Agents
scheduling → calendar, task
Agent → Capabilities
calendar-agent → scheduling, analysis
Domain → Agents
calendar → calendar, scheduling
Action → Agents
schedule-meeting → calendar, meeting
🔎
Query Operations
Fast capability searches
// Find agents with single capability
findByCapability('scheduling'): string[]

// Find agents with ALL capabilities (intersection)
findByCapabilities(['scheduling', 'email']): string[]

// Complex search with modes
search({
  capabilities: ['travel-planning', 'booking'],
  domains: ['weather'],
  mode: 'any'  // 'any' = union, 'all' = intersection
}): string[]

// Get index statistics
getStats(): {
  totalAgents, totalCapabilities,
  totalActions, totalDomains,
  capabilityDistribution, actionDistribution
}
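O(1) Lookups
All indices are backed by Map<string, Set<string>> structures, so a single-key lookup takes constant time on average. Multi-key queries (intersections and unions) cost time proportional to the sizes of the sets being combined. The multi-index design allows queries from any dimension: capability, action, domain, or agent ID.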
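To make the Map<string, Set<string>> layout concrete, here is a reduced sketch covering three of the four indices (the action index would follow the same pattern). `CapabilityIndexSketch` and `addAgent` are hypothetical names, and the `search` semantics, where each capability or domain term contributes one agent set that is then unioned or intersected, are an assumption about how the modes combine.

```typescript
type IndexMap = Map<string, Set<string>>;

function addTo(index: IndexMap, key: string, value: string): void {
  let bucket = index.get(key);
  if (!bucket) {
    bucket = new Set<string>();
    index.set(key, bucket);
  }
  bucket.add(value);
}

class CapabilityIndexSketch {
  private readonly byCapability: IndexMap = new Map();
  private readonly byAgent: IndexMap = new Map();
  private readonly byDomain: IndexMap = new Map();

  addAgent(agentId: string, capabilities: string[], domains: string[] = []): void {
    for (const capability of capabilities) {
      addTo(this.byCapability, capability, agentId);
      addTo(this.byAgent, agentId, capability);
    }
    for (const domain of domains) addTo(this.byDomain, domain, agentId);
  }

  capabilitiesOf(agentId: string): string[] {
    const bucket = this.byAgent.get(agentId);
    return bucket ? Array.from(bucket).sort() : [];
  }

  findByCapability(capability: string): string[] {
    const bucket = this.byCapability.get(capability);
    return bucket ? Array.from(bucket).sort() : [];
  }

  // Agents that have ALL of the given capabilities.
  findByCapabilities(capabilities: string[]): string[] {
    return this.search({ capabilities, mode: 'all' });
  }

  search(query: { capabilities?: string[]; domains?: string[]; mode: 'any' | 'all' }): string[] {
    const sets: Set<string>[] = [];
    for (const c of query.capabilities ?? []) sets.push(this.byCapability.get(c) ?? new Set<string>());
    for (const d of query.domains ?? []) sets.push(this.byDomain.get(d) ?? new Set<string>());
    const union = new Set<string>();
    sets.forEach((s) => s.forEach((id) => union.add(id)));
    const ids = Array.from(union);
    // 'any' keeps the union; 'all' keeps only agents present in every term's set.
    const matched = query.mode === 'any' ? ids : ids.filter((id) => sets.every((s) => s.has(id)));
    return matched.sort();
  }
}
```

Results are sorted only to make the output deterministic. With three agents registered under hypothetical capabilities, an 'any' search over two capabilities and one domain returns all three, while the same search in 'all' mode returns none unless a single agent satisfies every term.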
🧪
Example: Travel Query
Finding the right agents for a complex task
Query
"Plan my trip to Rome with flights and weather"
Capability Search
search({ capabilities: ['travel-planning', 'booking'], domains: ['weather'], mode: 'any' })
Matched Agents
travel-integrator mobility-agent weather-agent
Health Monitoring
A three-tier health-checking system validates Lambda functions, HTTP endpoints, and container services.
1
Lambda Function Health
Checks function state via AWS SDK
State = 'Active', LastUpdateStatus = 'Successful'
2
HTTP Endpoint Health
GET request to /health endpoint
Status 200, timeout 10s
3
Container (ECS) Health
DescribeServicesCommand validation
running > 0, pending = 0, status = 'ACTIVE'
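The pass criteria for the three tiers can be expressed as small predicates. The sketch below covers only the evaluation step and leaves out the AWS SDK and HTTP calls. The field names follow the AWS response shapes (`State` and `LastUpdateStatus` on a Lambda function configuration; `status`, `runningCount`, and `pendingCount` on an ECS service), while the function names are hypothetical.

```typescript
// Relevant fields from a Lambda function configuration response.
interface LambdaState {
  State?: string;
  LastUpdateStatus?: string;
}

// Relevant fields from an ECS service description.
interface EcsServiceState {
  status?: string;
  runningCount?: number;
  pendingCount?: number;
}

function isLambdaHealthy(fn: LambdaState): boolean {
  return fn.State === 'Active' && fn.LastUpdateStatus === 'Successful';
}

// Healthy only on HTTP 200 received within the timeout (10 s by default).
function isHttpHealthy(statusCode: number, elapsedMs: number, timeoutMs: number = 10_000): boolean {
  return statusCode === 200 && elapsedMs <= timeoutMs;
}

// Healthy when the service is ACTIVE, has running tasks, and has none pending.
function isEcsServiceHealthy(service: EcsServiceState): boolean {
  return (
    service.status === 'ACTIVE' &&
    (service.runningCount ?? 0) > 0 &&
    (service.pendingCount ?? 0) === 0
  );
}
```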
Healthy
2+ consecutive successes
Degraded
Partial failures detected
Unhealthy
3+ consecutive failures
📊
Health Status Tracking
Per-agent health history
interface HealthStatus {
  agentId: string;
  consecutiveFailures: number;
  consecutiveSuccesses: number;
  lastCheck: HealthCheckResult;
  history: HealthCheckResult[];  // Last 10 checks
}

interface HealthCheckResult {
  agentId: string;
  healthy: boolean;
  timestamp: number;
  responseTime?: number;
  error?: string;
  details?: Record<string, any>;
}
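The counters and bounded history above lend themselves to a small pure update function. The sketch below applies the stated thresholds (2+ consecutive successes for healthy, 3+ consecutive failures for unhealthy). Treating every in-between state as degraded is an assumption, and `recordCheck` and `classify` are hypothetical names.

```typescript
type Health = 'healthy' | 'degraded' | 'unhealthy';

interface CheckResult {
  agentId: string;
  healthy: boolean;
  timestamp: number;
}

interface Tracked {
  agentId: string;
  consecutiveFailures: number;
  consecutiveSuccesses: number;
  lastCheck: CheckResult;
  history: CheckResult[];
}

const HISTORY_LIMIT = 10;   // "Last 10 checks"
const HEALTHY_AFTER = 2;    // consecutive successes
const UNHEALTHY_AFTER = 3;  // consecutive failures

// Returns a new status object; a success resets the failure streak and vice versa.
function recordCheck(prev: Tracked | undefined, result: CheckResult): Tracked {
  const history = (prev ? prev.history : []).concat(result).slice(-HISTORY_LIMIT);
  return {
    agentId: result.agentId,
    consecutiveSuccesses: result.healthy ? (prev ? prev.consecutiveSuccesses : 0) + 1 : 0,
    consecutiveFailures: result.healthy ? 0 : (prev ? prev.consecutiveFailures : 0) + 1,
    lastCheck: result,
    history,
  };
}

function classify(status: Tracked): Health {
  if (status.consecutiveFailures >= UNHEALTHY_AFTER) return 'unhealthy';
  if (status.consecutiveSuccesses >= HEALTHY_AFTER) return 'healthy';
  return 'degraded'; // assumption: anything between the two thresholds
}
```

Requiring a streak in each direction gives the status some hysteresis, so a single slow or failed check does not immediately pull an agent out of rotation.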
💓
Heartbeat Publishing
Agent liveness signals to DynamoDB
Agents publish heartbeats to the registry table, enabling distributed health tracking and CloudWatch metrics emission.
interface HeartbeatPayload {
  agentId: string;
  agentType: string;
  status: 'healthy' | 'unhealthy' | 'degraded';
  health?: {
    status: string;
    lastCheck?: string;
    checks?: Array<{ name: string; status: string; message?: string }>;  // field types assumed
  };
  metrics?: Record<string, any>;
  ttlSeconds?: number;
}
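The `ttlSeconds` field suggests that heartbeat rows expire on their own. A sketch of how a payload might be turned into a table item follows. DynamoDB TTL expects a numeric attribute holding a Unix timestamp in seconds; the attribute names `lastHeartbeatAt` and `expiresAt` and the 90-second default are assumptions, not values taken from the platform.

```typescript
interface Heartbeat {
  agentId: string;
  agentType: string;
  status: 'healthy' | 'unhealthy' | 'degraded';
  ttlSeconds?: number;
}

interface HeartbeatItem {
  agentId: string;
  agentType: string;
  status: string;
  lastHeartbeatAt: string; // ISO 8601 timestamp
  expiresAt: number;       // Unix epoch seconds, the format DynamoDB TTL requires
}

const DEFAULT_TTL_SECONDS = 90; // assumption: roughly three missed 30 s heartbeats

function toHeartbeatItem(heartbeat: Heartbeat, nowMs: number): HeartbeatItem {
  const ttl = heartbeat.ttlSeconds ?? DEFAULT_TTL_SECONDS;
  return {
    agentId: heartbeat.agentId,
    agentType: heartbeat.agentType,
    status: heartbeat.status,
    lastHeartbeatAt: new Date(nowMs).toISOString(),
    expiresAt: Math.floor(nowMs / 1000) + ttl,
  };
}
```

Passing the current time in as a parameter keeps the function deterministic; a caller would typically supply `Date.now()`.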
Agent Discovery
The DynamicAgentLoader loads configurations from four sources in the order listed below. Later sources override earlier ones, so Priority 4 has the highest precedence.
📦
S3 Manifests
Priority 1
JSON manifest files stored in an S3 bucket. Primary source for production agent configurations.
s3://agent-manifests/manifests/{agentId}/manifest.json
📁
Filesystem
Priority 2
Agent implementations directory with pre-registered agent types for local development.
agents/implementations/{agentType}/
🗃
DynamoDB
Priority 3
Registry table scan for enabled agents. Includes heartbeat and runtime status.
ai-pa-agent-registry-{env}
🌐
External API
Priority 4
Remote registry API endpoint for cross-region or external agent discovery.
$AGENT_REGISTRY_API_ENDPOINT
🔄 Merge Strategy
Configurations are merged in load order, and each later source overrides the ones before it: the API overrides DynamoDB, DynamoDB overrides the filesystem, and the filesystem overrides S3. Runtime overrides can therefore be applied without changing the base manifests.
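The override order can be sketched as a fold over the sources. This example assumes a shallow, per-agent merge of top-level keys; whether the real loader merges deeply is not stated. `mergeSources`, `SourceName`, and `AgentConfigMap` are hypothetical names.

```typescript
type SourceName = 's3' | 'filesystem' | 'dynamodb' | 'api';
type AgentConfigMap = Record<string, Record<string, unknown>>;

// Load order: each later source overrides the ones before it.
const LOAD_ORDER: SourceName[] = ['s3', 'filesystem', 'dynamodb', 'api'];

function mergeSources(sources: Partial<Record<SourceName, AgentConfigMap>>): AgentConfigMap {
  const merged: AgentConfigMap = {};
  for (const name of LOAD_ORDER) {
    const configs = sources[name];
    if (!configs) continue; // source not configured or returned nothing
    for (const agentId of Object.keys(configs)) {
      // Assumption: shallow merge, so a later source replaces whole top-level keys.
      merged[agentId] = { ...(merged[agentId] ?? {}), ...configs[agentId] };
    }
  }
  return merged;
}
```

With this shape, a value set in the S3 manifest survives unless a later source sets the same key, which is what allows a runtime override in DynamoDB or the API to change one setting without restating the whole manifest.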
📡
EventBridge Integration
Registry lifecycle events
AgentRegistered: New agent joined the registry
AgentUpdated: Agent config or status changed
AgentUnregistered: Agent removed from registry
AgentStatusUpdated: Health status transition
AgentMetricsUpdated: Performance metrics published
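For illustration, each lifecycle event above could be shaped into an EventBridge PutEvents entry as sketched below. `Source`, `DetailType`, `Detail`, and `EventBusName` are the entry fields EventBridge defines, with `Detail` carried as a JSON string. The source value `agent.registry`, the payload layout, and the `toEventEntry` helper are assumptions.

```typescript
type RegistryEventType =
  | 'AgentRegistered'
  | 'AgentUpdated'
  | 'AgentUnregistered'
  | 'AgentStatusUpdated'
  | 'AgentMetricsUpdated';

// Mirrors the fields of an EventBridge PutEvents request entry used here.
interface PutEventsEntry {
  Source: string;
  DetailType: string;
  Detail: string; // JSON-encoded string
  EventBusName?: string;
}

const EVENT_SOURCE = 'agent.registry'; // hypothetical source name

function toEventEntry(
  type: RegistryEventType,
  agentId: string,
  payload: Record<string, unknown> = {},
  eventBusName?: string,
): PutEventsEntry {
  const entry: PutEventsEntry = {
    Source: EVENT_SOURCE,
    DetailType: type,
    // agentId is written last so a payload key cannot overwrite it.
    Detail: JSON.stringify({ ...payload, agentId }),
  };
  if (eventBusName !== undefined) entry.EventBusName = eventBusName;
  return entry;
}
```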
Hot Reload
The HotReloadManager polls S3 every 30 seconds for manifest changes, so configuration updates take effect without a restart, typically within one polling interval.
1
List Manifests
List all manifest files in the S3 bucket, capturing ETags for change detection
2
Compare Snapshots
Compare current ETags against previous poll to detect added, updated, or removed files
3
Fetch Changes
Download content for new or updated manifests from S3
4
Emit Events
Publish 'added', 'updated', or 'removed' events for each detected change
5
Update Registry
Registry listeners react to events, updating in-memory state and capability indices
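Step 2 of the cycle above, comparing snapshots, can be sketched as a pure function over two key-to-ETag maps. The S3 listing and download steps are left out, and `Snapshot`, `SnapshotDiff`, and `diffSnapshots` are hypothetical names.

```typescript
type Snapshot = Map<string, string>; // S3 object key -> ETag

interface SnapshotDiff {
  added: string[];
  updated: string[];
  removed: string[];
}

function diffSnapshots(previous: Snapshot, current: Snapshot): SnapshotDiff {
  const diff: SnapshotDiff = { added: [], updated: [], removed: [] };
  current.forEach((etag, key) => {
    const before = previous.get(key);
    if (before === undefined) diff.added.push(key);      // new manifest
    else if (before !== etag) diff.updated.push(key);    // content changed
  });
  previous.forEach((_etag, key) => {
    if (!current.has(key)) diff.removed.push(key);       // manifest deleted
  });
  return diff;
}
```

Only the keys in `added` and `updated` need to be downloaded in step 3, which keeps each poll cheap when nothing has changed.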
📢
Change Events
EventEmitter interface
// Subscribe to hot reload events
hotReloadManager.on('added', (change) => {
  console.log(`New agent: ${change.agentId}`);
});

hotReloadManager.on('updated', (change) => {
  console.log(`Updated: ${change.agentId}`);
  console.log(`Previous: ${change.previousVersion}`);
  console.log(`New: ${change.newVersion}`);
});

hotReloadManager.on('removed', (change) => {
  console.log(`Removed: ${change.agentId}`);
});

interface HotReloadChange {
  type: 'added' | 'updated' | 'removed';
  agentId: string;
  manifest?: AgentManifestContent;
  previousVersion?: string;
  newVersion?: string;
  timestamp: Date;
}
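A registry listener reacting to these change events might update its in-memory state along the following lines. This is a sketch only: `applyChange` is a hypothetical helper, `ManifestContent` stands in for AgentManifestContent, and rebuilding the capability indices is omitted.

```typescript
type ManifestContent = Record<string, unknown>; // stand-in for AgentManifestContent

interface ChangeSketch {
  type: 'added' | 'updated' | 'removed';
  agentId: string;
  manifest?: ManifestContent;
}

// Applies one change to the in-memory manifest map; returns true if state changed.
function applyChange(manifests: Map<string, ManifestContent>, change: ChangeSketch): boolean {
  if (change.type === 'removed') {
    return manifests.delete(change.agentId);
  }
  if (change.manifest === undefined) {
    return false; // 'added' or 'updated' without content: nothing to apply
  }
  manifests.set(change.agentId, change.manifest);
  return true;
}
```

The same function could be passed to each of the three `on(...)` subscriptions, since the `type` field already distinguishes the cases.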
Poll Interval: 30s
Change Detection: ETag
Downtime: 0
🚀 Zero-Downtime Updates
Hot reload enables configuration changes without restarting the platform. Update an agent's capabilities, add new agents, or remove deprecated ones, all by modifying S3 manifests. Changes are picked up on the next poll, so they typically propagate within about 30 seconds.
📊
Status Tracking
Monitor hot reload health
getStatus(): {
  enabled: boolean;
  isPolling: boolean;
  lastPollAt?: Date;
  lastChangeAt?: Date;
  totalAgents: number;
  pollCount: number;
  errorCount: number;
}
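One possible use of this status object is a watchdog that flags a stalled poller. The sketch below is an assumption about how such a check could look, not part of the HotReloadManager: `isReloadStalled`, the three-interval tolerance, and the decision to treat a never-polled manager as stalled are all hypothetical.

```typescript
// Subset of the getStatus() result that the check needs.
interface ReloadStatus {
  enabled: boolean;
  isPolling: boolean;
  lastPollAt?: Date;
}

// Returns true when hot reload is enabled but no poll has completed recently.
function isReloadStalled(
  status: ReloadStatus,
  now: Date,
  pollIntervalMs: number = 30_000,
  tolerance: number = 3, // assumption: allow up to three missed intervals
): boolean {
  if (!status.enabled) return false;                     // disabled is not a fault
  if (!status.isPolling || !status.lastPollAt) return true;
  return now.getTime() - status.lastPollAt.getTime() > pollIntervalMs * tolerance;
}
```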