AWS Infrastructure
Infrastructure Summary
StackFlow's complete AWS infrastructure is deployed in us-east-1 under account 373544523367. All components run within a private VPC with no public IP exposure except through CloudFront and API Gateway endpoints. Every data store is encrypted with CMK mrk-bd842691514c4d74a02992b8dc11fe16.
- Lambda: StackFlowAPI (nodejs22.x, arm64, 1792 MB, 300s) in VPC vpc-0c4e3c18734dee8f7
- Aurora PostgreSQL 16: stackflow-main-prod Multi-AZ cluster with IAM auth and encryption enabled
- Neptune 1.4.7: stackflow-knowledge-graph cluster with IAM auth, Serverless 1-8 NCU
- ElastiCache: stackflow-redis-prod (cache.t4g.micro) with TLS, auth token, Multi-AZ
- KMS CMK: mrk-bd842691514c4d74a02992b8dc11fe16 enabled, with a key policy allowing all StackFlow roles
| Resource | Identifier | Purpose |
|---|---|---|
| VPC | vpc-0c4e3c18734dee8f7 | Network isolation for all StackFlow resources |
| Subnet 1a | subnet-05eae5f255dec054f | us-east-1a private subnet |
| Subnet 1b | subnet-03ab773ce82d704d1 | us-east-1b private subnet |
| Security Group | sg-0ada825cda6a75ed6 | StackFlow application tier SG |
| API Gateway | uazcuhdus2 | REST API entry point |
| CloudFront | E1UTZ9SVSR2WGV | Docs site CDN |
Lambda Functions
| Function Name | Runtime | Memory | Purpose |
|---|---|---|---|
| StackFlowAPI | nodejs22.x, arm64 | 1792 MB | Main API handler (300s timeout) |
| StackFlowCacheWarmer | python3.12, arm64 | 512 MB | Redis + Bedrock cache pre-warming |
| StackFlowNeptuneCMDBSeeder | nodejs22.x | 512 MB | Sync CMDB to Neptune graph |
| StackFlowGitHubSync | nodejs22.x | 512 MB | GitHub webhook handler |
| StackFlowSecretsRotation | python3.12 | 256 MB | Rotate StackFlow-specific secrets |
| StackFlowGenericSecretRotation | python3.12 | 256 MB | Rotate external API keys |
| StackFlowFieldKeyRotator | python3.12 | 512 MB | Rotate field-level encryption keys |
| StackFlowPatcher | python3.12 | 512 MB | Apply schema and data patches |
Database Layer
Aurora PostgreSQL 16 (Main)
Endpoint: stackflow-main-prod.cluster-c6pq0smgmlri.us-east-1.rds.amazonaws.com
Database: stackflow | Port: 5432
Aurora PostgreSQL 17 (Requirements)
Endpoint: stackflow-req-prod.cluster-c6pq0smgmlri.us-east-1.rds.amazonaws.com
Port: 5432
Neptune (Knowledge Graph)
Endpoint: stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com
Port: 8182 (WebSocket/Gremlin)
ElastiCache Redis (Cache)
Endpoint: master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com
Port: 6379 | TLS: Yes | Auth: Token
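As a quick first check before running the fuller diagnostics, the endpoints listed above can be probed for plain TCP reachability from a host inside the VPC. The following is a minimal sketch using only the Python standard library; the names `ENDPOINTS`, `tcp_reachable`, and `check_all` are illustrative and not part of StackFlow. It confirms only that a port accepts connections, not that TLS, IAM auth, or credentials are valid.

```python
import socket

# Hostnames and ports copied from the Database Layer listing in this document.
ENDPOINTS = {
    "aurora-main": ("stackflow-main-prod.cluster-c6pq0smgmlri.us-east-1.rds.amazonaws.com", 5432),
    "aurora-req": ("stackflow-req-prod.cluster-c6pq0smgmlri.us-east-1.rds.amazonaws.com", 5432),
    "neptune": ("stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com", 8182),
    "redis": ("master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com", 6379),
}


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_all(endpoints=ENDPOINTS, timeout=3.0):
    """Probe every endpoint and return a {name: reachable} mapping."""
    return {
        name: tcp_reachable(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

Calling `check_all()` from inside the VPC returns a dict keyed by the short names above. A `False` value usually points at a security group, routing, or DNS issue rather than a database-level problem.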
Networking
All StackFlow traffic flows through CloudFront → API Gateway → Lambda. The Lambda functions use VPC configuration to reach Aurora, Neptune, and Redis within the private subnets. No data leaves the VPC except through the NAT Gateway (for external API calls) and VPC Interface Endpoints (for AWS service API calls without internet exposure).
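One way to confirm that a function really is attached to the private network described here is to inspect the `VpcConfig` block of its configuration. The sketch below is a hedged illustration, not an existing StackFlow tool: `vpc_findings` is a hypothetical helper, and it assumes every function should use the application tier security group from the resource table, which this document does not state explicitly.

```python
# Identifiers copied from the resource table in this document.
EXPECTED_VPC = "vpc-0c4e3c18734dee8f7"
EXPECTED_SUBNETS = {"subnet-05eae5f255dec054f", "subnet-03ab773ce82d704d1"}
# Assumption: all functions share the application tier security group.
EXPECTED_SG = "sg-0ada825cda6a75ed6"


def vpc_findings(config):
    """Return a list of problems found in one Lambda function configuration.

    `config` is a dict shaped like the response of Lambda's
    GetFunctionConfiguration API; only the VpcConfig key is read.
    """
    vpc = config.get("VpcConfig") or {}
    subnets = set(vpc.get("SubnetIds") or [])
    findings = []
    if vpc.get("VpcId") != EXPECTED_VPC:
        findings.append(f"unexpected VpcId: {vpc.get('VpcId')!r}")
    if not subnets:
        findings.append("no subnets attached")
    extra = subnets - EXPECTED_SUBNETS
    if extra:
        findings.append(f"subnets outside the private set: {sorted(extra)}")
    if EXPECTED_SG not in (vpc.get("SecurityGroupIds") or []):
        findings.append(f"missing security group {EXPECTED_SG}")
    return findings
```

The input can come from the JSON printed by `aws lambda get-function-configuration` or from the equivalent boto3 call. An empty list means the function matches the expected VPC, subnets, and security group.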
Queue Architecture
Four SQS FIFO queues handle different priority tiers of event processing: StackFlow-Events-Ingestion.fifo for general ITSM events (email-to-ticket, webhook ingestion), StackFlow-Remediation-P1.fifo for critical auto-remediation actions, StackFlow-Remediation-Standard.fifo for non-critical remediation, and StackFlow-Remediation-DLQ.fifo for failed messages requiring manual review. All queues use server-side encryption with the CMK.
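To illustrate how a producer might choose among these queues, here is a small sketch. The event fields `kind` and `priority` and the helpers `pick_queue` and `build_message` are hypothetical, since the real event schema and routing rules are not described in this document. The queue names are the ones listed above, and the message fields follow the SQS requirement that FIFO messages carry a `MessageGroupId`.

```python
import hashlib
import json

# Queue names copied from the paragraph above.
QUEUES = {
    "ingestion": "StackFlow-Events-Ingestion.fifo",
    "remediation_p1": "StackFlow-Remediation-P1.fifo",
    "remediation_standard": "StackFlow-Remediation-Standard.fifo",
    # Not selected by pick_queue; assumed to be filled by a redrive policy.
    "dlq": "StackFlow-Remediation-DLQ.fifo",
}


def pick_queue(event):
    """Choose a queue name for an event dict.

    The 'kind' and 'priority' fields are hypothetical placeholders.
    """
    if event.get("kind") != "remediation":
        return QUEUES["ingestion"]
    if event.get("priority") == "P1":
        return QUEUES["remediation_p1"]
    return QUEUES["remediation_standard"]


def build_message(event, group_id):
    """Build keyword arguments for an SQS SendMessage call on a FIFO queue.

    QueueUrl is omitted because it must be looked up per account and region.
    """
    body = json.dumps(event, sort_keys=True)
    return {
        "MessageBody": body,
        # FIFO queues require a MessageGroupId; ordering is kept per group.
        "MessageGroupId": group_id,
        # Explicit deduplication ID derived from the body, so an identical
        # event resent within the deduplication window is dropped.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

Sorting keys before hashing means two events with the same content produce the same deduplication ID regardless of key order. If the queues have content-based deduplication enabled, the explicit ID could be dropped.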
Diagnostic Scripts
#!/usr/bin/env python3
"""Test Aurora PostgreSQL connectivity from within VPC."""
import json

import boto3
import psycopg2


def get_secret(secret_id):
    """Fetch a secret from Secrets Manager and decode its JSON payload."""
    sm = boto3.client('secretsmanager', region_name='us-east-1')
    return json.loads(sm.get_secret_value(SecretId=secret_id)['SecretString'])


creds = get_secret('stackflow/aurora-db-credentials')
try:
    conn = psycopg2.connect(
        host='stackflow-main-prod.cluster-c6pq0smgmlri.us-east-1.rds.amazonaws.com',
        database='stackflow', user=creds['username'], password=creds['password'],
        port=5432, connect_timeout=10
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            print("Aurora version:", cur.fetchone()[0])
            # Sessions grouped by state (active, idle, ...); state is NULL for background processes.
            cur.execute("SELECT count(*), state FROM pg_stat_activity GROUP BY state;")
            print("\nConnections by state:")
            for row in cur.fetchall():
                print(f"  {row[1]}: {row[0]} connections")
    finally:
        # Close the connection even if one of the queries fails.
        conn.close()
    print("\nAurora connectivity: OK")
except Exception as e:
    print(f"Aurora connectivity FAILED: {e}")
# Invoke StackFlowAPI Lambda with a test health check payload.
# With --cli-binary-format raw-in-base64-out (AWS CLI v2), the payload is passed as raw JSON.
aws lambda invoke \
  --function-name StackFlowAPI \
  --cli-binary-format raw-in-base64-out \
  --payload '{"path":"/prod/api/health","httpMethod":"GET","headers":{"Authorization":"Bearer test"}}' \
  --region us-east-1 \
  /tmp/lambda-response.json
python3 -m json.tool /tmp/lambda-response.json
# Check Lambda configuration
aws lambda get-function-configuration \
--function-name StackFlowAPI \
--query '{Runtime:Runtime,MemorySize:MemorySize,Timeout:Timeout,Arch:Architectures}' \
--region us-east-1
# CloudWatch Insights -- API errors by path (last 1 hour)
# Run in CloudWatch Insights console against /aws/lambda/StackFlowAPI
fields @timestamp, @message
| filter @message like /ERROR|statusCode.*[45][0-9][0-9]/
| parse @message '"path":"*"' as path
| parse @message '"statusCode":*,' as statusCode
| stats count() as errorCount by path, statusCode
| sort errorCount desc
| limit 20
# Lambda cold starts (last 24 hours)
fields @timestamp, @initDuration
| filter @type = "REPORT" and ispresent(@initDuration)
| stats avg(@initDuration) as avgColdStart, max(@initDuration) as maxColdStart, count(*) as coldStarts