Production Broke. Now What?
Stay calm. Follow a systematic approach.
Step 1: Acknowledge and Communicate
"We're aware of [issue]. Investigating now."
Don't speculate. Update when you know more.
Step 2: Check the Logs
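# Follow the error log live (starts with the last 100 lines)
tail -n 100 -f /var/log/app/error.log
# Search all app logs for a specific error (swap "error" for the exact message)
grep -i "error" /var/log/app/*.log | tail -n 50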
# Laravel logs
tail -f storage/logs/laravel.log
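Raw tails get noisy during an incident. As a rough sketch, the helper below groups ERROR lines by message so the most frequent failure surfaces first. It assumes Laravel's default log line shape (`[2024-01-15 14:30:12] production.ERROR: message`); `top_errors` is a hypothetical helper name, and messages that embed per-request data (IDs, JSON context) would need extra normalization before they group well.

```shell
# top_errors LOG_FILE [LIMIT]
# Prints "count<TAB>message" for the most frequent ERROR messages.
# Assumes lines look like: [2024-01-15 14:30:12] production.ERROR: message
top_errors() {
  local log_file="$1" limit="${2:-5}"
  awk '
    # Keep only log headers at ERROR level; stack-trace lines are skipped.
    $0 ~ /^\[/ && $3 ~ /\.ERROR:$/ {
      msg = $4
      for (i = 5; i <= NF; i++) msg = msg " " $i
      count[msg]++
    }
    END { for (m in count) printf "%d\t%s\n", count[m], m }
  ' "$log_file" | sort -k1,1nr | head -n "$limit"
}
```

For example, `top_errors storage/logs/laravel.log 10` would list the ten most common error messages in the Laravel log.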
Step 3: Reproduce the Issue
- Get exact steps from user/alert
- Check specific user, URL, or data
- Look for patterns (time, user type, feature)
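One way to spot a time pattern is to bucket errors by minute and see when the spike began. The sketch below assumes Laravel's default timestamp prefix (`[YYYY-MM-DD HH:MM:SS]`) followed by `environment.ERROR:`; `errors_per_minute` is a hypothetical helper, not part of any tool.

```shell
# errors_per_minute LOG_FILE
# Prints "YYYY-MM-DD HH:MM count" for each minute that logged an ERROR.
errors_per_minute() {
  local log_file="$1"
  awk '
    $0 ~ /^\[/ && $3 ~ /\.ERROR:$/ {
      minute = substr($0, 2, 16)   # "YYYY-MM-DD HH:MM" from the timestamp
      count[minute]++
    }
    END { for (m in count) print m, count[m] }
  ' "$log_file" | sort
}
```

A sudden jump in the per-minute count usually narrows the search to whatever changed just before that minute.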
Step 4: Identify the Change
# What changed recently? (last 20 commits on the current branch)
git log --oneline -20
# Diff between versions
git diff v1.2.3..v1.2.4 --stat
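Config and schema changes are frequent culprits, so it can help to narrow the diff to those paths first. The sketch below assumes a standard Laravel layout (`config/` and `database/migrations/`) and must be run from the repository root; `risky_changes` is a hypothetical helper name, and the paths should be adjusted for other projects.

```shell
# risky_changes FROM_REF TO_REF
# Lists files under config/ and database/migrations/ that differ
# between two refs. The paths are assumptions, not a fixed rule.
risky_changes() {
  local from="$1" to="$2"
  git diff --name-only "$from" "$to" -- config database/migrations
}
```

With the tags from the example above, the call would be `risky_changes v1.2.3 v1.2.4`.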
Step 5: Fix Forward or Rollback
Rollback if:
- Issue is severe
- Fix isn't obvious
- More investigation needed
# Revert the latest commit (adds a new commit that undoes it)
git revert HEAD
# or redeploy the previous release
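A release often contains more than one commit, in which case reverting only `HEAD` leaves part of the change in place. As a minimal sketch, `git revert` also accepts a commit range. The helper name `revert_release` is hypothetical, and the sketch assumes a clean working tree and a linear range with no merge commits (those need `git revert -m` and are not handled here).

```shell
# revert_release FROM_REF TO_REF
# Reverts every commit in FROM_REF..TO_REF, newest first, adding one
# revert commit per original commit.
revert_release() {
  local from="$1" to="$2"
  git revert --no-edit "$from..$to"
}
```

Afterwards, `git diff --quiet FROM_REF HEAD` should exit with status 0 if the tree matches the earlier release again.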
Fix forward if:
- Issue is minor
- Fix is quick and obvious
- Rollback has risks
Step 6: Postmortem
Document while it's fresh:
## Incident: Login failures - Jan 15, 2024
### Timeline
- 14:30 - Alert triggered
- 14:35 - Issue confirmed
- 14:50 - Root cause identified
- 15:00 - Fix deployed
### Root Cause
Cache TTL was set to 0, so every request hit the database
### Resolution
Restored cache TTL to 3600 seconds
### Prevention
- Add monitoring for cache hit rate
- Review config changes before deploy
Essential Monitoring
- Error rate spikes
- Response time increases
- Database connection counts
- Queue backlogs
- Memory/CPU usage
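As an illustration of the first item, an error-rate check can be sketched from a web server access log. This assumes the common/combined log format, where the HTTP status is the 9th whitespace-separated field. `error_rate_percent` and `check_error_rate` are hypothetical helpers, and a real setup would more likely rely on a dedicated monitoring tool.

```shell
# error_rate_percent ACCESS_LOG
# Prints the percentage of requests (rounded down) that returned a 5xx status.
error_rate_percent() {
  local access_log="$1"
  awk '
    $9 ~ /^5[0-9][0-9]$/ { errors++ }
    END {
      if (NR == 0) print 0
      else printf "%d\n", errors * 100 / NR
    }
  ' "$access_log"
}

# check_error_rate ACCESS_LOG THRESHOLD_PERCENT
# Returns non-zero when the error rate exceeds the threshold.
check_error_rate() {
  local rate
  rate=$(error_rate_percent "$1")
  [ "$rate" -le "$2" ]
}
```

A cron job could then run something like `check_error_rate /var/log/nginx/access.log 5 || echo "error rate above 5%"`, where the log path and threshold are placeholders.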
