cloud phone fleet monitoring dashboard guide
cloud phone fleet monitoring in 2026 turns a black-box subscription into glass-box infrastructure. instead of finding out a phone is degraded when an engineer files a ticket, you see it on a dashboard and act before users notice. cloudf.one ships a fleet dashboard with the metrics that matter, plus the export hooks to mirror them into Datadog, Grafana, or your in-house observability stack. this guide walks through what to monitor, which alerts to set up, and how to drill from a red tile down to the device that needs attention.
if you are building admin discipline more broadly, team seats, RBAC, and audit logs all assume the fleet itself is healthy. this article covers how you keep it that way.
what to monitor
four metric families.
- availability: how many devices are healthy and ready to lock
- utilization: how many devices are in use vs idle, by team and tag
- performance: how long lock-to-ADB-ready takes, by region
- errors: failed locks, ADB disconnects, crashes, reboots, rate-limit hits
a useful default dashboard fits all four families on one screen.
[SCREENSHOT: fleet monitoring dashboard with 4 quadrants showing availability, utilization, performance, errors over the last 24 hours]
the availability tile
shows fleet-wide health.
- total devices in fleet
- available (idle, healthy, ready to lock)
- in use (currently locked)
- degraded (healthy enough to lock but flagging warnings)
- offline (network loss, ADB disconnect, scheduled maintenance)
green means more than 90% of devices are available or in use. yellow means 70-90%. red means under 70%: dig in immediately.
[SCREENSHOT: availability tile with stacked bar chart, green/yellow/red device counts, click-through link to device list]
drill into any state to see the device list. for offline devices, the dashboard shows last-seen timestamp, region, model, and a quick action to reboot or escalate to support.
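if you mirror device-state counts into your own tooling, the color bands above are trivial to reproduce. a minimal sketch in Python (the state names are illustrative, not the exact cloudf.one schema):

```python
def fleet_status(counts: dict[str, int]) -> str:
    """classify fleet health from device-state counts.

    green: >90% of devices are available or in use
    yellow: 70-90%
    red: <70%
    """
    total = sum(counts.values())
    if total == 0:
        return "red"  # an empty fleet is not a healthy fleet
    healthy = counts.get("available", 0) + counts.get("in_use", 0)
    ratio = healthy / total
    if ratio > 0.90:
        return "green"
    if ratio >= 0.70:
        return "yellow"
    return "red"

# example: 40 available, 52 in use, 5 degraded, 3 offline -> 92% healthy
print(fleet_status({"available": 40, "in_use": 52, "degraded": 5, "offline": 3}))
```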
the utilization tile
shows how busy your fleet is.
- account-wide utilization percentage (devices in use / total)
- per-team utilization (which teams are heavy users)
- per-tag utilization (qa-pool vs dev-pool vs prod-pool)
- peak vs average over the day
[SCREENSHOT: utilization tile with line chart over 24h, per-team breakdown, peak hour indicator]
watch for two patterns.
- chronic over-utilization (>85% peak): you need more devices or pool capacity. the fleet is constraining the team.
- chronic under-utilization (<40% peak): you are paying for capacity nobody uses. consider shrinking the pool.
most healthy fleets sit at 50-75% peak utilization with room to grow.
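the same rule-of-thumb bands are easy to automate against a day of in-use counts. a sketch, assuming one sample per minute from your metrics export:

```python
def utilization_summary(in_use_samples: list[int], fleet_size: int) -> dict:
    """summarize a day of in-use counts (e.g. one sample per minute)."""
    peak = max(in_use_samples) / fleet_size
    average = sum(in_use_samples) / len(in_use_samples) / fleet_size
    if peak > 0.85:
        verdict = "over-utilized: add devices or pool capacity"
    elif peak < 0.40:
        verdict = "under-utilized: consider shrinking the pool"
    else:
        verdict = "healthy: room to grow"
    return {"peak": peak, "average": average, "verdict": verdict}
```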
the performance tile
shows latency at three percentiles.
- lock-to-ready p50: median time from lock API call to ADB ready
- lock-to-ready p95: 95th percentile, the slow tail
- lock-to-ready p99: 99th percentile, the worst case
[SCREENSHOT: performance tile with three line charts at p50, p95, p99 over 24h, target lines drawn]
healthy targets in 2026.
- p50 < 5 seconds
- p95 < 15 seconds
- p99 < 30 seconds
if p95 or p99 is rising, the fleet is degrading. usually it means locks are queueing because availability is too low, or a region is under network load.
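if you export lock-to-ready samples into your own stack, checking the targets above is a few lines of numpy. a sketch with made-up sample data:

```python
import numpy as np

# lock-to-ready samples in seconds, pulled from your metrics store
samples = np.array([3.1, 4.8, 2.9, 6.2, 14.1, 3.5, 28.0, 4.2, 5.0, 3.8])

p50, p95, p99 = np.percentile(samples, [50, 95, 99])
targets = {"p50": 5.0, "p95": 15.0, "p99": 30.0}

for name, value in zip(("p50", "p95", "p99"), (p50, p95, p99)):
    status = "ok" if value < targets[name] else "DEGRADED"
    print(f"{name}: {value:.1f}s (target < {targets[name]:.0f}s) {status}")
```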
the errors tile
shows failed events by type.
- failed locks (device went offline mid-lock, returned 503)
- ADB disconnects (network loss during a session)
- crashes (app or system)
- reboots (scheduled or unscheduled)
- rate-limit hits (API token over quota)
[SCREENSHOT: errors tile with stacked bar chart showing each error type per hour over 24h, with red threshold line]
healthy fleets have a baseline error rate (<1% of operations). an order-of-magnitude jump is a real problem. cloudf.one alerts on anomalies automatically; you can also configure your own thresholds via the webhook automation pattern.
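a custom threshold via the webhook pattern can be as small as a sliding-window counter. a sketch, assuming error and operation events arrive as webhooks with unix-second timestamps (the payload fields are illustrative, not the exact cloudf.one schema):

```python
from collections import deque
from flask import Flask, request

app = Flask(__name__)

WINDOW = 600          # 10-minute sliding window, in seconds
BASELINE_RATE = 0.01  # <1% of operations is the healthy baseline

errors: deque = deque()      # unix timestamps of error events
operations: deque = deque()  # unix timestamps of all operations

def page_on_call(rate: float) -> None:
    print(f"error rate {rate:.1%} is an order of magnitude above baseline")

@app.post("/cloudfone-events")
def handle_event():
    event = request.get_json()
    now = event["timestamp"]  # assumed unix seconds
    operations.append(now)
    if event["type"].startswith("error."):  # field names are illustrative
        errors.append(now)
    for q in (errors, operations):          # expire events outside the window
        while q and q[0] < now - WINDOW:
            q.popleft()
    rate = len(errors) / max(len(operations), 1)
    if rate > BASELINE_RATE * 10:           # order-of-magnitude jump
        page_on_call(rate)
    return "", 204
```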
drill-down workflows
three common drill-downs.
drill 1: offline devices
dashboard shows 8 offline devices. click the offline tile, see the list, sort by last-seen.
[SCREENSHOT: offline device list with model, region, last-seen, action buttons]
three buckets, by last-seen age.
- offline <5 min: probably transient network, wait
- offline 5-30 min: try the reboot action
- offline >30 min: file a ticket with cloudf.one support
drill 2: slow region
p95 spiked from 12s to 28s in the last hour. click the performance tile, switch the breakdown to per-region, and see that the SG region is the culprit.
[SCREENSHOT: per-region performance breakdown, SG region in red, EU and US green]
two next steps.
- check the cloudf.one status page for SG region incidents
- check your own CI runner location, in case it is your network, not theirs
drill 3: error storm
errors tile shows 200+ failed locks in 10 minutes from a single API token. drill into errors, group by api_token, identify the offender.
[SCREENSHOT: errors grouped by api_token, one token responsible for 200+ failures, action menu with rate-limit, revoke, contact-owner options]
three actions.
- contact the owner via Slack
- temporarily lower the token’s rate limit
- if no response in 15 min, revoke the token
alerts to wire up
four alerts every team should have on, all routable to your Slack/Discord/Telegram channels.
| alert condition | sustained for | severity |
|---|---|---|
| availability < 80% | 5 min | high |
| utilization > 90% | 30 min | medium |
| p95 latency > 20s | 15 min | medium |
| error rate > 5% of operations | 10 min | high |
route via the webhook automation, surface in your existing ops chat, page on-call for the high-severity ones.
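a minimal router for those webhooks, assuming an alert payload with name, severity, and value fields (illustrative, not the exact schema) and a Slack incoming webhook:

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # your ops channel

def route_alert(alert: dict) -> None:
    """forward an alert webhook to chat; page on-call on high severity.

    the alert fields (name, severity, value) are illustrative.
    """
    text = f"[{alert['severity']}] {alert['name']}: {alert['value']}"
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=5)
    if alert["severity"] == "high":
        # swap in your pager of choice here (PagerDuty, Opsgenie, ...)
        requests.post(SLACK_WEBHOOK, json={"text": f"<!channel> {text}"}, timeout=5)
```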
exporting metrics to Datadog or Grafana
three export options.
option 1: native integrations
cloudf.one ships native Datadog and Grafana Cloud integrations as of 2026. one click on the integrations page, paste an API key, and metrics flow within 5 minutes.
[SCREENSHOT: integrations page with Datadog, Grafana, Prometheus, New Relic tiles, click to configure]
option 2: Prometheus endpoint
GET /metrics returns Prometheus-formatted metrics; point your scraper at it. it works with any Prometheus-compatible system (Mimir, Cortex, VictoriaMetrics). a minimal scrape config:
```yaml
scrape_configs:
  - job_name: cloudfone
    scheme: https
    metrics_path: /metrics
    scrape_interval: 60s
    # prometheus does not expand env vars like $CLOUDFONE_TOKEN in its
    # config; read the API token from a file instead
    authorization:
      credentials_file: /etc/prometheus/cloudfone_token
    static_configs:
      - targets: ['api.cloudf.one']
```
option 3: webhook + custom processor
for stacks that do not support Prometheus, use webhook events as the source. process and forward to your stack of choice.
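a minimal processor, sketched here as a webhook receiver that re-emits events as statsd counters (the event shape is an assumption; adapt to the real payload and your statsd host):

```python
import socket
from flask import Flask, request

app = Flask(__name__)
statsd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

@app.post("/cloudfone-events")
def forward_event():
    event = request.get_json()  # event shape is illustrative
    # re-emit as a statsd counter, e.g. "cloudfone.lock.failed:1|c"
    metric = f"cloudfone.{event['type']}:1|c"
    statsd.sendto(metric.encode(), ("localhost", 8125))
    return "", 204
```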
most teams use option 1 if they are on Datadog or Grafana, option 2 otherwise.
SLO setup
once you have monitoring, set service-level objectives that match your contract. examples.
- availability SLO: 99.5% over rolling 30 days (per SLA expectations)
- p95 lock-to-ready SLO: under 15 seconds 95% of the time
- error rate SLO: under 1% of operations
these become the burn-rate alerts that fire only when you are actually trending toward SLA breach, not on every transient blip.
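burn rate is the observed error rate divided by the error budget (1 minus the SLO target). a burn rate of 1.0 spends the budget exactly over the window; the common fast-burn alert fires around 14.4, which would exhaust a 30-day budget in about two days. a sketch:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """observed error rate relative to the error budget.

    slo_target 0.995 leaves a 0.5% budget. a burn rate of 1.0 spends
    the budget exactly over the 30-day window; 14.4 spends it in ~2 days.
    """
    budget = 1.0 - slo_target
    return error_rate / budget

# fire only on a sustained fast burn, not a transient blip, e.g.
# burn rate > 14.4 over both the last hour and the last 5 minutes
if burn_rate(0.08, 0.995) > 14.4:
    print("fast burn: trending toward SLA breach")
```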
the daily 5-minute monitoring routine
every morning, 5 minutes.
- open the dashboard, scan all four tiles
- check overnight errors and any auto-paged alerts
- peek at projected month-end usage (utilization vs capacity)
- close any acknowledged alerts in your ops channel
- file an action item for any drift you saw
teams that do this catch 80% of issues before users do.
frequently asked questions
how far back does the fleet dashboard data go?
12 months on paid plans, with 1-minute granularity for the last 24 hours and 5-minute granularity beyond. exports go to your stack for longer retention.
can I customize the dashboard layout?
partially. you can rearrange the four tiles, add up to 4 custom widgets, and save the layout per user. full custom dashboards live in your Datadog/Grafana export.
what is the latency on alerts firing?
real-time alerts trigger within 30 seconds of the threshold breach. webhook delivery is typically under 5 seconds after that.
does cloudf.one offer SLO-as-code?
yes, via the API. you can define SLOs in YAML, push them via POST /slos, and get burn-rate alerts back through the same webhook channel.
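a sketch of what that push could look like, assuming the endpoint accepts the parsed YAML as JSON (the SLO field names here are illustrative; check the API reference for the real schema):

```python
import os
import requests
import yaml  # pip install pyyaml

TOKEN = os.environ["CLOUDFONE_TOKEN"]

# field names are illustrative; check the API reference for the real schema
slo_yaml = """
name: lock-to-ready-p95
objective: 0.95
threshold_seconds: 15
window_days: 30
"""

resp = requests.post(
    "https://api.cloudf.one/slos",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=yaml.safe_load(slo_yaml),
)
resp.raise_for_status()
```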
can multiple admins share dashboard alerts?
yes. configure alerts at the account level (everyone with audit:view permission gets them) or scope to specific roles via the alert configuration page.
ready to make your fleet observable? start a cloudf.one trial, open the fleet dashboard, and watch the metrics flow in real time as you lock your first device.