Quality Monitoring & Drift Detection
Three complementary systems ensure data quality, goal decomposition quality,
and persona consistency across multi-turn conversations.
🐶
Data Validator
"Data Dog"
→
🏆
Decomposition Evaluator
Quality Gates
→
🧬
Persona Drift
Consistency
6
Validation Checks
5
Quality Criteria
5
Persona Traits
3
EventBridge Events
Three Monitoring Systems
Each addresses a different quality dimension
Data Validator Agent
Production real-time validation
Validates tracking data before presentation. Detects duplicates, anomalies, calculation errors, gaps, and freshness issues.
Decomposition Evaluator
Goal breakdown quality
Evaluates task decomposition across 5 criteria. Uses Canvas-based collaboration for iterative refinement.
Persona Drift Analyzer
Research/testing tool
Measures persona consistency across multi-turn conversations. Tracks 5 traits with statistical analysis.
💡
LLM + Heuristic Hybrid
All three systems use a hybrid approach: fast heuristic checks run first for quick feedback,
then LLM-based evaluation provides deeper analysis. This balances speed (~300ms for heuristics)
with accuracy (LLM catches edge cases).
Data Validator Agent
The "Data Dog" validates tracking data before presenting it to users.
Six modular checks run in parallel for performance, producing a confidence score.
📊
Tracking Data
Up to 1,000 entries
→
🐶
Data Validator
Parallel checks
→
✅
Confidence Level
high/medium/low
Calculation Check
Verifies arithmetic: unit price × quantity = total, plus sum validations
Weight: 2.0x
Duplicate Check
Finds duplicate or similar entries with similarity scoring
Weight: 1.5x
Anomaly Check
Z-score statistical analysis for unusual patterns
Weight: 1.2x
Completeness Check
Detects missing data gaps in date ranges
Weight: 1.0x
Categorization Check
Validates category confidence and consistency
Weight: 1.0x
Freshness Check
Detects stale data, future dates, sync issues
Weight: 0.8x
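Because the checks are modular and run in parallel, each one can be modeled as a small unit with a name, a weight, and an async run method. The sketch below is illustrative only; the type and field names (ValidationCheck, CheckResult, TrackingEntry, ValidationContext) are assumptions, not the codebase's actual interfaces.

// Illustrative contract for a modular check (all names assumed).
type TrackingEntry = Record<string, unknown>;     // placeholder for the real entry type
type ValidationContext = Record<string, unknown>; // placeholder for the real context type

interface CheckIssue {
  severity: 'error' | 'warning' | 'info';
  message: string;
}

interface CheckResult {
  checkName: string;
  score: number;       // 0-1; feeds the weighted aggregate
  issues: CheckIssue[];
}

interface ValidationCheck {
  name: string;
  weight: number;      // e.g. 2.0 for calculation, 0.8 for freshness
  run(entries: TrackingEntry[], context: ValidationContext): Promise<CheckResult>;
}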
Validation Modes
Choose speed vs. thoroughness
Full Validation
All 6 checks, detailed results
~1-2s
Quick Check
Duplicate + Calculation only
~300-500ms
Single Check
Run specific check in isolation
~100-200ms
// Parallel execution of all checks
const checkResults = await Promise.allSettled([
  completenessCheck.run(entries, context),
  duplicateCheck.run(entries, context),
  categorizationCheck.run(entries, context),
  calculationCheck.run(entries, context),
  freshnessCheck.run(entries, context),
  anomalyCheck.run(entries, context),
]);

// Weighted score aggregation
const weights = {
  calculation: 2.0,    // Math errors are serious
  duplicate: 1.5,      // Duplicates affect totals
  anomaly: 1.2,        // Unusual patterns need attention
  completeness: 1.0,   // Standard importance
  categorization: 1.0,
  freshness: 0.8,      // Less critical
};
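The three validation modes reduce to a check-selection step ahead of that parallel run. A minimal sketch, reusing the ValidationCheck shape sketched earlier; the mode names mirror the cards above, but the function itself is hypothetical:

// Hypothetical mode dispatch: pick which checks to run before the
// Promise.allSettled call above.
type ValidationMode = 'full' | 'quick' | { single: string };

function selectChecks(mode: ValidationMode, all: ValidationCheck[]): ValidationCheck[] {
  if (mode === 'full') return all;                  // all 6 checks, ~1-2s
  if (mode === 'quick')
    return all.filter(c => c.name === 'duplicate' || c.name === 'calculation'); // ~300-500ms
  return all.filter(c => c.name === mode.single);   // isolated check, ~100-200ms
}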
Decomposition Evaluator
Evaluates goal decomposition quality across 5 criteria.
Uses Canvas-based collaboration for iterative refinement until quality thresholds are met.
0.88
Granularity
15min-4hr tasks
0.92
Realism
Feasible estimates
0.75
Coverage
Achieves goal
0.85
Progression
Learning curve
0.90
Actionability
Clear steps
Quality Thresholds
Default evaluation criteria
minOverallScore
0.7
Minimum 70% quality
maxHoursPerSubtask
4
No task longer than 4 hours
minMinutesPerSubtask
15
No task shorter than 15 minutes
minEstimatesCoverage
0.8
80% of tasks need time estimates
maxParallelSubtasks
3
Maximum 3 concurrent tasks
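Collected into one place, the defaults above amount to a small configuration object. The values mirror the table exactly; the interface shape and names are assumptions:

// Default evaluation thresholds, mirroring the values above
// (interface shape is assumed).
interface EvaluationThresholds {
  minOverallScore: number;
  maxHoursPerSubtask: number;
  minMinutesPerSubtask: number;
  minEstimatesCoverage: number;
  maxParallelSubtasks: number;
}

const defaultThresholds: EvaluationThresholds = {
  minOverallScore: 0.7,      // minimum 70% quality
  maxHoursPerSubtask: 4,     // no task longer than 4 hours
  minMinutesPerSubtask: 15,  // no task shorter than 15 minutes
  minEstimatesCoverage: 0.8, // 80% of tasks need time estimates
  maxParallelSubtasks: 3,    // maximum 3 concurrent tasks
};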
Issue Types Detected
Heuristic + LLM detection
compressed_time
vague_definition
missing_progression
too_large
too_small
missing_dependencies
incomplete_coverage
unrealistic_parallel
missing_milestone
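In TypeScript, these detected issue types fall out naturally as a string-literal union (the type name itself is an assumption):

// Issue types the evaluator can report (type name assumed).
type DecompositionIssueType =
  | 'compressed_time'
  | 'vague_definition'
  | 'missing_progression'
  | 'too_large'            // exceeds maxHoursPerSubtask (4h)
  | 'too_small'            // under minMinutesPerSubtask (15min)
  | 'missing_dependencies'
  | 'incomplete_coverage'
  | 'unrealistic_parallel' // exceeds maxParallelSubtasks (3)
  | 'missing_milestone';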
🔄
Iterative Refinement via Canvas
When quality falls below threshold, the evaluator posts a
QUESTION annotation
to the Canvas requesting refinement. The TaskAgent sees the feedback, revises the decomposition,
and the evaluator re-evaluates. This repeats for up to 5 iterations until
quality passes, at which point an AGENT_INSIGHT approval is posted.
// Evaluation pipeline
async evaluate(decomposition: TaskDecomposition): Promise<EvaluationResult> {
  // Fast heuristics first (~50ms)
  const heuristicIssues = this.runHeuristicChecks(decomposition);

  // LLM evaluation for deeper analysis (~500ms)
  const llmAssessment = await this.llmEvaluate(decomposition, {
    model: modelTiers.getModelId('reasoning'),
    temperature: 0.3, // Consistent evaluation
  });

  // Merge results and determine action
  return this.mergeAssessments(heuristicIssues, llmAssessment);
}
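The surrounding Canvas loop can be sketched as below. Everything here is illustrative: the canvas and taskAgent handles, postAnnotation, awaitRevision, and summarizeIssues are assumed names, not the actual Canvas API.

// Illustrative refinement loop (Canvas API names are assumptions).
const MAX_ITERATIONS = 5;

async function refineUntilApproved(decomposition: TaskDecomposition) {
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const result = await evaluator.evaluate(decomposition);

    if (result.overallScore >= thresholds.minOverallScore) {
      // Quality passed: post approval for downstream consumers
      await canvas.postAnnotation({ type: 'AGENT_INSIGHT', body: 'Decomposition approved' });
      return result;
    }

    // Below threshold: ask the TaskAgent to refine via a QUESTION annotation
    await canvas.postAnnotation({ type: 'QUESTION', body: summarizeIssues(result.issues) });
    decomposition = await taskAgent.awaitRevision();
  }
  // Behavior after 5 failed iterations is not specified in this section
  return null;
}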
Persona Drift Analyzer
A research/testing tool that measures persona consistency across multi-turn conversations.
Tracks 5 personality traits with statistical analysis to detect drift from baseline.
Warmth
Reserved
Warm
Formality
Casual
Formal
Brevity
Verbose
Concise
Proactiveness
Passive
Proactive
Empathy
Neutral
Empathetic
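A compact representation of those five bipolar scales (field names and the 0-1 scoring convention are assumptions):

// Trait scales tracked per turn (representation assumed).
const TRAIT_SCALES = {
  warmth:        ['Reserved', 'Warm'],
  formality:     ['Casual', 'Formal'],
  brevity:       ['Verbose', 'Concise'],
  proactiveness: ['Passive', 'Proactive'],
  empathy:       ['Neutral', 'Empathetic'],
} as const;

type TraitName = keyof typeof TRAIT_SCALES;

// One score per trait per turn; 0 = left pole, 1 = right pole (assumed).
type TraitScores = Record<TraitName, number>;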
Multi-Turn Drift Analysis
1
Baseline
1.00
3
Turn
0.92
5
Turn
0.88
7
Turn
0.82
Drift threshold: 0.85 (15% tolerance)
Drift Detection Metrics
Statistical analysis per trait
Drift from Baseline
Absolute difference from turn 1
Standard Deviation
Consistency measure per trait
Maximum Drift
Worst-case deviation observed
Overall Consistency
0-1 score across all traits
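A minimal sketch of those per-trait statistics, assuming trait scores in [0, 1] with turn 1 as the baseline (the exact overall-consistency aggregation is not specified here):

// Per-trait drift statistics (a sketch; assumes scores in [0, 1]).
function analyzeTrait(scoresByTurn: number[]) {
  const baseline = scoresByTurn[0]; // turn 1 is the baseline
  const drifts = scoresByTurn.map(s => Math.abs(s - baseline));

  const mean = scoresByTurn.reduce((a, b) => a + b, 0) / scoresByTurn.length;
  const variance =
    scoresByTurn.reduce((a, s) => a + (s - mean) ** 2, 0) / scoresByTurn.length;

  return {
    driftFromBaseline: drifts[drifts.length - 1], // latest |score - baseline|
    maxDrift: Math.max(...drifts),                // worst-case deviation observed
    stdDev: Math.sqrt(variance),                  // consistency measure per trait
  };
}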
📝
Reinforcement Recommendations
Based on drift analysis, the system recommends persona reinforcement frequency:
- No drift: "Current implementation sufficient"
- Moderate drift (8-15%): "Reinforce every 5 turns"
- Significant drift (>15%): "Reinforce every 3 turns"
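Those bands translate directly into a small mapping from measured drift to a recommendation (a sketch of that mapping):

// Map measured drift (as a percentage) onto the recommendation bands above.
function recommendReinforcement(maxDriftPct: number): string {
  if (maxDriftPct > 15) return 'Reinforce every 3 turns';  // significant drift
  if (maxDriftPct >= 8) return 'Reinforce every 5 turns';  // moderate drift
  return 'Current implementation sufficient';              // no drift
}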
Confidence Scoring
Each system produces a confidence level through weighted aggregation of check results.
The Data Validator uses a particularly sophisticated weighting scheme.
Overall Confidence
HIGH
Score: 0.91
HIGH
≥ 0.9
No errors
≤ 3 warnings
MEDIUM
0.7 - 0.9
Score below 0.9
OR > 3 warnings
LOW
< 0.7
Any errors
OR score below 0.7
Weighted Aggregation
Check importance varies by impact
Calculation
2.0x
Duplicate
1.5x
Anomaly
1.2x
Completeness
1.0x
Categorization
1.0x
Freshness
0.8x
Why different weights? Math errors (Calculation) directly affect totals shown to users.
Duplicates inflate counts. Anomalies need investigation but may be legitimate.
Freshness is less critical than accuracy.
// Confidence level determination
function determineConfidence(results: CheckResults): ConfidenceLevel {
  const errorCount = results.filter(r => r.severity === 'error').length;
  const warningCount = results.filter(r => r.severity === 'warning').length;
  const weightedScore = calculateWeightedScore(results);

  // Any errors = low confidence
  if (errorCount > 0) return 'low';

  // Score below threshold = low confidence
  if (weightedScore < 0.7) return 'low';

  // Many warnings or medium score = medium confidence
  if (warningCount > 3 || weightedScore < 0.9) return 'medium';

  // High score with few issues = high confidence
  return 'high';
}
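calculateWeightedScore is referenced above but not shown. A plausible implementation is a weight-normalized average of per-check scores; the sketch below assumes the CheckResult shape from earlier and may differ from the real code:

// Plausible weight-normalized aggregation (the real code may differ).
const CHECK_WEIGHTS: Record<string, number> = {
  calculation: 2.0, duplicate: 1.5, anomaly: 1.2,
  completeness: 1.0, categorization: 1.0, freshness: 0.8,
};

function calculateWeightedScore(results: CheckResult[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const r of results) {
    const weight = CHECK_WEIGHTS[r.checkName] ?? 1.0;
    weightedSum += r.score * weight; // r.score assumed to be in [0, 1]
    totalWeight += weight;
  }
  return totalWeight > 0 ? weightedSum / totalWeight : 0;
}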
Events & Alerts
All monitoring systems integrate with EventBridge for real-time alerting
and downstream processing. Critical issues trigger immediate notifications.
validation.started
Emitted when validation begins. Includes trackingId, userId, entryCount, checkTypes.
validation.completed
Emitted when validation finishes. Includes confidence level, overall score, issue count, duration.
validation.issue.detected
Fired for each critical issue (severity='error'). Used for alerting and remediation workflows.
Event Payload Structure
Standardized for downstream processing
interface ValidationEvent {
  source: 'data-validator-agent';
  detailType: 'validation.completed';
  detail: {
    trackingId: string;
    userId: string;
    correlationId: string;
    confidence: 'high' | 'medium' | 'low';
    overallScore: number;
    issueCount: {
      errors: number;
      warnings: number;
    };
    checksRun: string[];
    durationMs: number;
  };
}
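Publishing such an event with the AWS SDK for JavaScript v3 looks roughly like this; the bus name, client wiring, and payload values are placeholders:

import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

// Sketch: publish a validation.completed event (bus name is a placeholder).
const eventBridge = new EventBridgeClient({});

await eventBridge.send(new PutEventsCommand({
  Entries: [{
    EventBusName: 'default',          // placeholder; actual bus name may differ
    Source: 'data-validator-agent',
    DetailType: 'validation.completed',
    Detail: JSON.stringify({
      trackingId: 'trk_123',          // placeholder values
      userId: 'usr_456',
      correlationId: 'corr_789',
      confidence: 'high',
      overallScore: 0.91,
      issueCount: { errors: 0, warnings: 1 },
      checksRun: ['calculation', 'duplicate', 'anomaly'],
      durationMs: 1240,
    }),
  }],
}));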
Alerting Integration
SNS topics for notifications
SNS Topic
ai-pa-{env}-synthetics-alerts
Subscribe
./ops alerts subscribe --env dev
Alert Triggers:
Low confidence result
Critical calculation error
Data freshness > 24h
Duplicate rate > 10%
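As a sketch, the four triggers above might be evaluated against the validation event plus a couple of upstream-computed rates (the function and its inputs are hypothetical):

// Hypothetical trigger evaluation against the four conditions above.
function shouldAlert(
  detail: ValidationEvent['detail'],
  duplicateRate: number, // fraction of entries flagged as duplicates
  staleHours: number,    // age of the newest data point
): boolean {
  return (
    detail.confidence === 'low' ||  // low confidence result
    detail.issueCount.errors > 0 || // critical errors, e.g. calculation
    staleHours > 24 ||              // data freshness > 24h
    duplicateRate > 0.10            // duplicate rate > 10%
  );
}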
📊
Observability
All three systems use structured correlation logging:
- Agent name and ID for filtering
- Correlation ID for request tracing
- Operation context and performance metrics
- CloudWatch Metrics namespace: DataValidator
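Emitting into that namespace with the AWS SDK v3 might look like the sketch below; the metric and dimension names are illustrative assumptions:

import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

// Sketch: publish a duration metric into the DataValidator namespace
// (metric and dimension names assumed).
const cloudwatch = new CloudWatchClient({});

await cloudwatch.send(new PutMetricDataCommand({
  Namespace: 'DataValidator',
  MetricData: [{
    MetricName: 'ValidationDurationMs', // assumed metric name
    Value: 1240,
    Unit: 'Milliseconds',
    Dimensions: [
      { Name: 'AgentName', Value: 'data-validator-agent' },
      { Name: 'Environment', Value: 'dev' },
    ],
  }],
}));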
Key File Locations
Where to find these systems
Data Validator
agents/implementations/data-validator/
Decomposition Evaluator
agents/implementations/decomposition-evaluator/
Persona Drift
backend/src/services/persona/__tests__/persistence/
API Integration
backend/src/services/tracking/TrackingAggregationService.ts