CMDB & Discovery
Discovery Failures
CMDB discovery failures leave CI data inaccurate, which degrades impact analysis, change risk scoring, and AI triage accuracy. Most discovery failures stem from network connectivity issues, credential problems, or cloud API rate limiting. For error details, check the Discovery Logs at CMDB → Discovery → Discovery Logs.
Before troubleshooting, confirm the following access and configuration:
- CloudWatch Logs: /aws/lambda/StackFlowCloudDiscovery and /aws/lambda/StackFlowNeptuneCMDBSeeder log groups
- DynamoDB: StackFlow_DiscoveryJob table accessible, to check the last discovery run's status and errors
- Neptune: graph query endpoint reachable from diagnostic tooling within the VPC
- IAM: StackFlowDiscoveryRole cross-account trust active in all connected accounts
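Because most failures fall into the three causes named above, error text pulled from the logs can be bucketed before deeper investigation. The sketch below is a minimal keyword classifier, assuming plain-text error messages. The AWS error codes in the patterns (ThrottlingException, AccessDenied, ExpiredToken, and so on) are real, but the bucket names and the exact StackFlow log wording are assumptions, so extend the patterns to match what your logs actually contain.

```python
import re

# Hypothetical patterns: the AWS error codes are real, but StackFlow's own
# log wording is not documented here, so extend these to match your logs.
_CATEGORY_PATTERNS = [
    ("rate_limiting", re.compile(
        r"ThrottlingException|TooManyRequestsException|RequestLimitExceeded"
        r"|Rate exceeded|\b429\b", re.IGNORECASE)),
    ("credentials_or_permissions", re.compile(
        r"AccessDenied|ExpiredToken|InvalidClientTokenId|UnrecognizedClientException"
        r"|\b401\b|\b403\b", re.IGNORECASE)),
    ("network", re.compile(
        r"timed? ?out|Connection refused|Could not connect|EndpointConnectionError"
        r"|Name or service not known", re.IGNORECASE)),
]


def classify_discovery_error(message: str) -> str:
    """Return the first matching failure bucket; rate limits are checked first."""
    for category, pattern in _CATEGORY_PATTERNS:
        if pattern.search(message):
            return category
    return "unknown"


print(classify_discovery_error(
    "An error occurred (ThrottlingException) when calling the "
    "DescribeInstances operation: Rate exceeded"))  # rate_limiting
```

Rate limiting is checked first because throttled requests are often retried and can surface alongside secondary timeout messages; an "unknown" result is the cue to read the full log entry.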
Agent Issues
| Symptom | Likely Cause | Diagnostic Step | Resolution |
|---|---|---|---|
| Agent last heartbeat > 30 min | Agent service stopped or network issue | systemctl status stackflow-agent on target host | Restart agent service; check firewall rules for HTTPS outbound |
| Agent showing wrong hostname | Hostname changed after agent install | Check /etc/stackflow-agent/config.yaml for override | Update hostname_override in config; restart agent |
| Agent token expired | Token revoked or rotated | Check token status in Admin → Discovery → Agent Tokens | Generate new token, update config.yaml on target host |
| Hardware inventory incomplete | Agent lacks root/admin access | Check agent logs for "Permission denied" errors | Run agent as root (Linux) or SYSTEM (Windows) |
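When many hosts run the agent, the heartbeat check in the first row can be applied in bulk. A minimal sketch, assuming you have already exported each agent's last-heartbeat timestamp as a hostname-to-datetime mapping (that input shape is hypothetical; this guide does not describe an export mechanism):

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# Threshold taken from the "Agent last heartbeat > 30 min" row above.
HEARTBEAT_THRESHOLD = timedelta(minutes=30)


def stale_agents(last_heartbeats: dict[str, datetime],
                 now: datetime | None = None) -> list[str]:
    """Return hosts whose last heartbeat is more than 30 minutes old, oldest first.

    last_heartbeats maps hostname -> timezone-aware UTC timestamp
    (hypothetical input shape).
    """
    now = now or datetime.now(timezone.utc)
    stale = [(ts, host) for host, ts in last_heartbeats.items()
             if now - ts > HEARTBEAT_THRESHOLD]
    return [host for _, host in sorted(stale)]
```

For each host returned, continue with the diagnostic step from the table (systemctl status stackflow-agent on the target host).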
Cloud Discovery Issues
| Symptom | Likely Cause | Diagnostic Step | Resolution |
|---|---|---|---|
| AWS account shows 0 CIs | Cross-account role trust policy wrong | Test role assumption: aws sts assume-role --role-arn arn:aws:iam::{account}:role/StackFlowDiscoveryRole --role-session-name test --external-id {external-id} | Fix the sts:ExternalId condition in the role's trust policy |
| Azure discovery fails with 401 | Azure service principal secret expired | Check expiry in Azure Portal → App Registrations → Certificates & Secrets | Rotate client secret, update Secrets Manager entry |
| Partial AWS discovery (some services missing) | Missing IAM permissions for specific services | Check Lambda CloudWatch logs for AccessDenied during discovery | Add missing read permissions to StackFlowDiscoveryPolicy |
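For the 0-CI case, the trust policy itself can be inspected offline. In the connected account, aws iam get-role --role-name StackFlowDiscoveryRole --query Role.AssumeRolePolicyDocument prints it as JSON. The sketch below checks that some Allow statement grants sts:AssumeRole to a given principal with a matching sts:ExternalId. It is a simplified check under stated assumptions, not a full IAM evaluator: it only reads StringEquals conditions and ignores wildcards, Deny statements, and other condition operators. The principal ARN and external ID values are hypothetical.

```python
import json


def trust_policy_allows(policy_json: str, principal_arn: str, external_id: str) -> bool:
    """Return True if an Allow statement lets principal_arn call sts:AssumeRole
    with the given sts:ExternalId (StringEquals conditions only)."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be an object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {})
        aws = principal.get("AWS", []) if isinstance(principal, dict) else []
        aws = [aws] if isinstance(aws, str) else aws
        if principal_arn not in aws:
            continue
        ext = stmt.get("Condition", {}).get("StringEquals", {}).get("sts:ExternalId")
        ext = [ext] if isinstance(ext, str) else (ext or [])
        if external_id in ext:
            return True
    return False
```

A False result for the principal and external ID StackFlow actually uses points at the trust policy rather than at discovery itself.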
Discovery activity is also logged to the CloudWatch log group /aws/lambda/StackFlowAPI; filter on discovery to isolate it. Each discovery run logs its start, per-service counts, any errors, and the total CI count at completion.
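Those four pieces of information are enough to tell a complete run from a partial one. Matching lines can be fetched with aws logs filter-log-events --log-group-name /aws/lambda/StackFlowAPI --filter-pattern discovery. The sketch below then summarizes one run, assuming a hypothetical one-JSON-object-per-line schema (the event names start, service_count, error, and complete and their fields are invented for illustration) and assuming the total equals the sum of per-service counts.

```python
import json


def summarize_discovery_run(log_lines):
    """Summarize one discovery run from JSON log lines (hypothetical schema)."""
    per_service, errors = {}, []
    started, total = False, None
    for line in log_lines:
        event = json.loads(line)
        kind = event.get("event")
        if kind == "start":
            started = True
        elif kind == "service_count":
            per_service[event["service"]] = event["count"]
        elif kind == "error":
            errors.append(event.get("message", ""))
        elif kind == "complete":
            total = event["total"]
    return {
        "started": started,
        "completed": total is not None,
        "per_service": per_service,
        "errors": errors,
        # A missing completion event or a count mismatch suggests a partial run.
        "counts_consistent": total is not None and total == sum(per_service.values()),
    }
```

A run that started but never completed, or whose counts disagree, is a candidate for the "Partial AWS discovery" row above.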
Neptune Sync Issues
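Invoke the seeder with a dry-run delta payload and inspect the result:

```shell
# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to send a raw JSON payload (omit on CLI v1)
aws lambda invoke --function-name StackFlowNeptuneCMDBSeeder --cli-binary-format raw-in-base64-out --payload '{"mode": "delta", "dry_run": true}' --region us-east-1 output.json
cat output.json
```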
If the seeder Lambda is failing, check its CloudWatch log group, /aws/lambda/StackFlowNeptuneCMDBSeeder. Common issues include Neptune connection timeouts (check the Neptune cluster status), Gremlin serialization errors caused by malformed CI data, and Lambda timeouts when syncing very large CI sets (reduce the batch size in the seeder configuration).
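Reducing the batch size helps most when each invocation also stops cleanly before the Lambda timeout. The sketch below shows that time-budgeted batching pattern, assuming the seeder can resume from a cursor (not confirmed by this guide). upsert_batch is a hypothetical stand-in for whatever writes a batch to Neptune, SAFETY_MARGIN_MS is an illustrative value, and in a real Python handler remaining_ms would be context.get_remaining_time_in_millis.

```python
from typing import Callable, Sequence

SAFETY_MARGIN_MS = 30_000  # hypothetical: stop with 30 s of Lambda time left


def sync_in_batches(cis: Sequence[dict],
                    upsert_batch: Callable[[Sequence[dict]], None],
                    remaining_ms: Callable[[], int],
                    batch_size: int = 100,
                    start: int = 0) -> int:
    """Upsert CIs in fixed-size batches until done or the time budget runs low.

    Returns the index of the next unsynced CI (len(cis) when finished), so a
    follow-up invocation can resume from that cursor.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    i = start
    while i < len(cis):
        if remaining_ms() < SAFETY_MARGIN_MS:
            break  # leave time to return the cursor instead of timing out
        batch = cis[i:i + batch_size]
        upsert_batch(batch)
        i += len(batch)
    return i
```

Smaller batches make each remaining-time check more frequent, which is why lowering the batch size reduces timeouts on very large CI sets.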
CMDB Data Quality
Data quality issues manifest as incorrect impact analysis results, wrong CI counts in dashboards, or duplicate CI records. The CMDB health dashboard at CMDB → Health shows: duplicate detection results, orphaned CIs (no relationships), stale CIs (no update in 30+ days), and missing mandatory attributes. Address these proactively to maintain AI analysis quality.
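The four dashboard checks can be reproduced against an exported CI list to confirm what the dashboard reports. A minimal sketch under stated assumptions: the CI dict fields, the MANDATORY attribute names, and the duplicate key (same CI class and serial number) are all hypothetical, since this guide does not document the dashboard's actual matching rules. The 30-day staleness threshold comes from the text above.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)            # "no update in 30+ days"
MANDATORY = ("name", "ci_class", "owner")   # hypothetical mandatory attributes


def cmdb_health(cis, relationships, now=None):
    """Return duplicate, orphaned, stale, and missing-attribute findings."""
    now = now or datetime.now(timezone.utc)
    related = set()
    for src, dst in relationships:
        related.update((src, dst))
    by_key = defaultdict(list)
    for ci in cis:
        # Hypothetical duplicate key: same class and same serial number.
        key = (ci.get("ci_class"), ci.get("serial_number"))
        if key[1]:
            by_key[key].append(ci["id"])
    missing = {}
    for ci in cis:
        absent = [attr for attr in MANDATORY if not ci.get(attr)]
        if absent:
            missing[ci["id"]] = absent
    return {
        "duplicates": [ids for ids in by_key.values() if len(ids) > 1],
        "orphaned": [ci["id"] for ci in cis if ci["id"] not in related],
        "stale": [ci["id"] for ci in cis if now - ci["updated_at"] >= STALE_AFTER],
        "missing_mandatory": missing,
    }
```

Running this on a periodic export and tracking the counts over time is one way to act on these findings before they distort impact analysis.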