cloud phone fleet monitoring dashboard guide
cloud phone fleet monitoring in 2026 turns a black-box subscription into glass-box infrastructure. instead of finding out a phone is degraded when an engineer files a ticket, you see it on a dashboard and act before users notice. cloudf.one ships a fleet dashboard with the metrics that matter, plus the export hooks to mirror them into Datadog, Grafana, or your in-house observability stack. this guide walks through what to monitor, which alerts to set up, and how to drill from a red tile down to the device that needs attention.
if you are building admin discipline more broadly, team seats, RBAC, and audit logs all assume the fleet itself is healthy. this article covers how you keep it that way.
what to monitor
four metric families.
- availability: how many devices are healthy and ready to lock
- utilization: how many devices are in use vs idle, by team and tag
- performance: how long lock-to-ADB-ready takes, by region
- errors: failed locks, ADB disconnects, crashes, reboots, rate-limit hits
a useful default dashboard fits all four families on one screen.
[SCREENSHOT: fleet monitoring dashboard with 4 quadrants showing availability, utilization, performance, errors over the last 24 hours]
the availability tile
shows fleet-wide health.
- total devices in fleet
- available (idle, healthy, ready to lock)
- in use (currently locked)
- degraded (healthy enough to lock but flagging warnings)
- offline (network loss, ADB disconnect, scheduled maintenance)
green means more than 90% of devices are available or in use. yellow means 70-90%. red means under 70%: dig in immediately.
[SCREENSHOT: availability tile with stacked bar chart, green/yellow/red device counts, click-through link to device list]
drill into any state to see the device list. for offline devices, the dashboard shows last-seen timestamp, region, model, and a quick action to reboot or escalate to support.
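if you mirror device-state counts into your own tooling, the color bands above are trivial to reproduce. a minimal sketch in Python (the state names are illustrative, not the exact cloudf.one schema):

```python
def fleet_status(counts: dict[str, int]) -> str:
    """classify fleet health from device-state counts.

    green: >90% of devices are available or in use
    yellow: 70-90%
    red: <70%
    """
    total = sum(counts.values())
    if total == 0:
        return "red"  # an empty fleet is not a healthy fleet
    healthy = counts.get("available", 0) + counts.get("in_use", 0)
    ratio = healthy / total
    if ratio > 0.90:
        return "green"
    if ratio >= 0.70:
        return "yellow"
    return "red"

# example: 40 available, 52 in use, 5 degraded, 3 offline -> 92% healthy
print(fleet_status({"available": 40, "in_use": 52, "degraded": 5, "offline": 3}))
```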
the utilization tile
shows how busy your fleet is.
- account-wide utilization percentage (devices in use / total)
- per-team utilization (which teams are heavy users)
- per-tag utilization (qa-pool vs dev-pool vs prod-pool)
- peak vs average over the day
[SCREENSHOT: utilization tile with line chart over 24h, per-team breakdown, peak hour indicator]
watch for two patterns.
- chronic over-utilization (>85% peak): you need more devices or pool capacity. the fleet is constraining the team.
- chronic under-utilization (<40% peak): you are paying for capacity nobody uses. consider shrinking the pool.
most healthy fleets sit at 50-75% peak utilization with room to grow.
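the same rule-of-thumb bands are easy to automate against a day of in-use counts. a sketch, assuming one sample per minute from your metrics export:

```python
def utilization_summary(in_use_samples: list[int], fleet_size: int) -> dict:
    """summarize a day of in-use counts (e.g. one sample per minute)."""
    peak = max(in_use_samples) / fleet_size
    average = sum(in_use_samples) / len(in_use_samples) / fleet_size
    if peak > 0.85:
        verdict = "over-utilized: add devices or pool capacity"
    elif peak < 0.40:
        verdict = "under-utilized: consider shrinking the pool"
    else:
        verdict = "healthy: room to grow"
    return {"peak": peak, "average": average, "verdict": verdict}
```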
the performance tile
shows latency at three percentiles.
- lock-to-ready p50: median time from lock API call to ADB ready
- lock-to-ready p95: 95th percentile, the slow tail
- lock-to-ready p99: 99th percentile, the worst case
[SCREENSHOT: performance tile with three line charts at p50, p95, p99 over 24h, target lines drawn]
healthy targets in 2026.
- p50 < 5 seconds
- p95 < 15 seconds
- p99 < 30 seconds
if p95 or p99 is rising, the fleet is degrading. usually it means locks are queueing because availability is too low, or a region is under network load.
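if you export lock-to-ready samples into your own stack, checking the targets above is a few lines of numpy. a sketch with made-up sample data:

```python
import numpy as np

# lock-to-ready samples in seconds, pulled from your metrics store
samples = np.array([3.1, 4.8, 2.9, 6.2, 14.1, 3.5, 28.0, 4.2, 5.0, 3.8])

p50, p95, p99 = np.percentile(samples, [50, 95, 99])
targets = {"p50": 5.0, "p95": 15.0, "p99": 30.0}

for name, value in zip(("p50", "p95", "p99"), (p50, p95, p99)):
    status = "ok" if value < targets[name] else "DEGRADED"
    print(f"{name}: {value:.1f}s (target < {targets[name]:.0f}s) {status}")
```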
the errors tile
shows failed events by type.
- failed locks (device went offline mid-lock, returned 503)
- ADB disconnects (network loss during a session)
- crashes (app or system)
- reboots (scheduled or unscheduled)
- rate-limit hits (API token over quota)
[SCREENSHOT: errors tile with stacked bar chart showing each error type per hour over 24h, with red threshold line]
healthy fleets have a baseline error rate (<1% of operations). an order-of-magnitude jump is a real problem. cloudf.one alerts on anomalies automatically; you can also configure your own thresholds via the webhook automation pattern.
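a custom threshold via the webhook pattern can be as small as a sliding-window counter. a sketch, assuming error and operation events arrive as webhooks with unix-second timestamps (the payload fields are illustrative, not the exact cloudf.one schema):

```python
from collections import deque
from flask import Flask, request

app = Flask(__name__)

WINDOW = 600          # 10-minute sliding window, in seconds
BASELINE_RATE = 0.01  # <1% of operations is the healthy baseline

errors: deque = deque()      # unix timestamps of error events
operations: deque = deque()  # unix timestamps of all operations

def page_on_call(rate: float) -> None:
    print(f"error rate {rate:.1%} is an order of magnitude above baseline")

@app.post("/cloudfone-events")
def handle_event():
    event = request.get_json()
    now = event["timestamp"]  # assumed unix seconds
    operations.append(now)
    if event["type"].startswith("error."):  # field names are illustrative
        errors.append(now)
    for q in (errors, operations):          # expire events outside the window
        while q and q[0] < now - WINDOW:
            q.popleft()
    rate = len(errors) / max(len(operations), 1)
    if rate > BASELINE_RATE * 10:           # order-of-magnitude jump
        page_on_call(rate)
    return "", 204
```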
drill-down workflows
three common drill-downs.
drill 1: offline devices
dashboard shows 8 offline devices. click the offline tile, see the list, sort by last-seen.
[SCREENSHOT: offline device list with model, region, last-seen, action buttons]
three buckets, by last-seen age.
- offline <5 min: probably transient network, wait
- offline 5-30 min: try the reboot action
- offline >30 min: file a ticket with cloudf.one support
drill 2: slow region
p95 spiked from 12s to 28s in the last hour. click the performance tile, switch the breakdown to per-region, and see that the SG region is the culprit.
[SCREENSHOT: per-region performance breakdown, SG region in red, EU and US green]
two next steps.
- check the cloudf.one status page for SG region incidents
- check your own CI runner location, in case it is your network, not theirs
drill 3: error storm
errors tile shows 200+ failed locks in 10 minutes from a single API token. drill into errors, group by api_token, identify the offender.
[SCREENSHOT: errors grouped by api_token, one token responsible for 200+ failures, action menu with rate-limit, revoke, contact-owner options]
three actions.
- contact the owner via Slack
- temporarily lower the token’s rate limit
- if no response in 15 min, revoke the token
alerts to wire up
four alerts every team should have on, all routable to your Slack/Discord/Telegram channels.
| alert condition | sustained for | severity |
|---|---|---|
| availability < 80% | 5 min | high |
| utilization > 90% | 30 min | medium |
| p95 latency > 20s | 15 min | medium |
| error rate > 5% of operations | 10 min | high |
route via the webhook automation, surface in your existing ops chat, page on-call for the high-severity ones.
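a minimal router for those webhooks, assuming an alert payload with name, severity, and value fields (illustrative, not the exact schema) and a Slack incoming webhook:

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # your ops channel

def route_alert(alert: dict) -> None:
    """forward an alert webhook to chat; page on-call on high severity.

    the alert fields (name, severity, value) are illustrative.
    """
    text = f"[{alert['severity']}] {alert['name']}: {alert['value']}"
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=5)
    if alert["severity"] == "high":
        # swap in your pager of choice here (PagerDuty, Opsgenie, ...)
        requests.post(SLACK_WEBHOOK, json={"text": f"<!channel> {text}"}, timeout=5)
```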
exporting metrics to Datadog or Grafana
three export options.
option 1: native integrations
cloudf.one ships native Datadog and Grafana Cloud integrations as of 2026. one click on the integrations page, paste an API key, and metrics flow within 5 minutes.
[SCREENSHOT: integrations page with Datadog, Grafana, Prometheus, New Relic tiles, click to configure]
option 2: Prometheus endpoint
GET /metrics returns Prometheus-formatted metrics; point your scraper at it. it works with any Prometheus-compatible system (Mimir, Cortex, VictoriaMetrics). a minimal scrape config:
```yaml
scrape_configs:
  - job_name: cloudfone
    scheme: https
    metrics_path: /metrics
    scrape_interval: 60s
    # prometheus does not expand env vars like $CLOUDFONE_TOKEN in its
    # config; read the API token from a file instead
    authorization:
      credentials_file: /etc/prometheus/cloudfone_token
    static_configs:
      - targets: ['api.cloudf.one']
```
option 3: webhook + custom processor
for stacks that do not support Prometheus, use webhook events as the source. process and forward to your stack of choice.
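a minimal processor, sketched here as a webhook receiver that re-emits events as statsd counters (the event shape is an assumption; adapt to the real payload and your statsd host):

```python
import socket
from flask import Flask, request

app = Flask(__name__)
statsd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

@app.post("/cloudfone-events")
def forward_event():
    event = request.get_json()  # event shape is illustrative
    # re-emit as a statsd counter, e.g. "cloudfone.lock.failed:1|c"
    metric = f"cloudfone.{event['type']}:1|c"
    statsd.sendto(metric.encode(), ("localhost", 8125))
    return "", 204
```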
most teams use option 1 if they are on Datadog or Grafana, option 2 otherwise.
SLO setup
once you have monitoring, set service-level objectives that match your contract. examples.
- availability SLO: 99.5% over rolling 30 days (per SLA expectations)
- p95 lock-to-ready SLO: under 15 seconds 95% of the time
- error rate SLO: under 1% of operations
these become the burn-rate alerts that fire only when you are actually trending toward SLA breach, not on every transient blip.
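burn rate is the observed error rate divided by the error budget (1 minus the SLO target). a burn rate of 1.0 spends the budget exactly over the window; the common fast-burn alert fires around 14.4, which would exhaust a 30-day budget in about two days. a sketch:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """observed error rate relative to the error budget.

    slo_target 0.995 leaves a 0.5% budget. a burn rate of 1.0 spends
    the budget exactly over the 30-day window; 14.4 spends it in ~2 days.
    """
    budget = 1.0 - slo_target
    return error_rate / budget

# fire only on a sustained fast burn, not a transient blip, e.g.
# burn rate > 14.4 over both the last hour and the last 5 minutes
if burn_rate(0.08, 0.995) > 14.4:
    print("fast burn: trending toward SLA breach")
```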
the daily 5-minute monitoring routine
every morning, 5 minutes.
- open the dashboard, scan all four tiles
- check overnight errors and any auto-paged alerts
- peek at projected month-end usage (utilization vs capacity)
- close any acknowledged alerts in your ops channel
- file an action item for any drift you saw
teams that do this catch 80% of issues before users do.
frequently asked questions
how far back does the fleet dashboard data go?
12 months on paid plans, with 1-minute granularity for the last 24 hours and 5-minute granularity beyond. exports go to your stack for longer retention.
can I customize the dashboard layout?
partially. you can rearrange the four tiles, add up to 4 custom widgets, and save the layout per user. full custom dashboards live in your Datadog/Grafana export.
what is the latency on alerts firing?
real-time alerts trigger within 30 seconds of the threshold breach. webhook delivery is typically under 5 seconds after that.
does cloudf.one offer SLO-as-code?
yes, via the API. you can define SLOs in YAML, push them via POST /slos, and get burn-rate alerts back through the same webhook channel.
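a sketch of what that push could look like, assuming the endpoint accepts the parsed YAML as JSON (the SLO field names here are illustrative; check the API reference for the real schema):

```python
import os
import requests
import yaml  # pip install pyyaml

TOKEN = os.environ["CLOUDFONE_TOKEN"]

# field names are illustrative; check the API reference for the real schema
slo_yaml = """
name: lock-to-ready-p95
objective: 0.95
threshold_seconds: 15
window_days: 30
"""

resp = requests.post(
    "https://api.cloudf.one/slos",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=yaml.safe_load(slo_yaml),
)
resp.raise_for_status()
```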
can multiple admins share dashboard alerts?
yes. configure alerts at the account level (everyone with audit:view permission gets them) or scope to specific roles via the alert configuration page.
ready to make your fleet observable? start a cloudf.one trial, open the fleet dashboard, and watch the metrics flow in real time as you lock your first device.