Retour au blog
Method

How I Audit a BigQuery Environment in 2 Hours

My method to quickly analyze a BigQuery environment and identify the main optimization opportunities. SQL queries included.

5 février 202618 min de lecture

How I Audit a BigQuery Environment in 2 Hours

When I do a BigQuery FinOps audit, I have 2 hours to understand where the money goes and identify the quick wins. Here's my method, step by step, with all the SQL queries.

My strong opinion: Never start an audit with storage. I see too many people spending hours analyzing storage costs when 90% of the time, it's compute (queries) that costs money. Storage is $0.02/GB/month. A single bad query can cost more than 10 TB of storage.


Prerequisites

Required access:

  • Read on INFORMATION_SCHEMA (project or region level)
  • Read on billing exports (if available)
  • Ideally: BigQuery Admin access to see all jobs

Tools:

  • BigQuery console or SQL client
  • Spreadsheet to note findings
  • Calculator (or mental math)

Phase 1: Overview (15 min)

1.1 Total cost for the last 30 days

-- Total on-demand cost (30 days)
-- Rate: $6.25/TB (US multi-region)
SELECT
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS cost_usd,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25 * 0.92, 2) AS cost_eur,
  COUNT(*) AS total_queries,
  COUNT(DISTINCT user_email) AS unique_users
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE';

What I note:

  • Estimated monthly cost
  • Number of queries (volume)
  • Number of active users
  • Cost/user ratio

1.2 Day-by-day evolution

SELECT
  DATE(creation_time) AS day,
  COUNT(*) AS queries,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS cost_usd  -- $6.25/TB
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day;

What I'm looking for:

  • Trend (increasing, stable, spikes)
  • Anomalies (days at 10x normal cost)
  • Weekday vs weekend pattern

At one client, I saw a bill that tripled every Monday. After investigation: an Airflow job that recalculated all metrics from the previous week using a full scan instead of using partitioning. 3 lines of SQL to change, -65% on the monthly bill.

1.3 Breakdown by cost type

SELECT
  job_type,
  COUNT(*) AS count,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS cost_usd,  -- $6.25/TB
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25 /
    SUM(SUM(total_bytes_billed)) OVER() * 100, 1) AS pct
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY job_type
ORDER BY cost_usd DESC;

Typical healthy breakdown:

  • QUERY: 70-85%
  • LOAD: 5-15%
  • COPY: 0-5%
  • EXTRACT: 0-5%

Phase 2: Identify heavy consumers (20 min)

2.1 Top users

SELECT
  user_email,
  COUNT(*) AS query_count,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tb_scanned,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS cost_usd,  -- $6.25/TB
  ROUND(AVG(total_bytes_billed) / POW(1024, 3), 2) AS avg_gb_per_query
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY cost_usd DESC
LIMIT 15;

What I'm looking for:

  • Service accounts (dbt, Airflow, Looker) vs humans
  • Concentration: the top 3 often account for 60-80% of costs
  • Users with abnormally high avg_gb_per_query

In my audits, I consistently find that 80% of costs come from 3-4 service accounts. The problem is that nobody really knows what they do. Data teams configured pipelines 2 years ago and haven't touched them since. First action: identify the owner of each service account.

2.2 Top individual queries

SELECT
  job_id,
  user_email,
  ROUND(total_bytes_billed / POW(1024, 4) * 6.25, 2) AS cost_usd,  -- $6.25/TB
  ROUND(total_bytes_billed / POW(1024, 3), 1) AS gb_scanned,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec,
  cache_hit,
  SUBSTR(query, 1, 300) AS query_preview
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND total_bytes_billed > 0
ORDER BY total_bytes_billed DESC
LIMIT 25;

Red flags:

  • Queries > $10 → investigate
  • Queries > $100 → potential incident
  • Same query repeated without cache hit

2.3 Most frequent queries

-- Hash queries to group similar ones
-- Rate: $6.25/TB (US multi-region)
SELECT
  FARM_FINGERPRINT(REGEXP_REPLACE(query, r'\d+', 'N')) AS query_hash,
  COUNT(*) AS executions,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS total_cost_usd,
  ROUND(AVG(total_bytes_billed) / POW(1024, 3), 2) AS avg_gb,
  ANY_VALUE(SUBSTR(query, 1, 200)) AS query_sample
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY query_hash
HAVING COUNT(*) > 10
ORDER BY total_cost_usd DESC
LIMIT 20;

What I'm looking for:

  • Queries executed 100+ times/day → candidate for materialization
  • High total cost but low unit cost → volume to optimize

Phase 3: Analyze problematic patterns (30 min)

3.1 SELECT * (the classic)

The mistake I see most often: SELECT * in dbt pipelines. Data engineers think "it's just for development, we'll optimize later". Except "later" never comes, and these queries run 50 times a day in prod.

SELECT
  job_id,
  user_email,
  ROUND(total_bytes_billed / POW(1024, 3), 1) AS gb_scanned,
  ROUND(total_bytes_billed / POW(1024, 4) * 6.25, 2) AS cost_usd,  -- $6.25/TB
  SUBSTR(query, 1, 150) AS query_preview
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  -- Heuristic: catches SELECT * and SELECT  * (multiple spaces)
  -- Some false negatives possible (exotic formatting), but covers 99% of cases
  AND REGEXP_CONTAINS(UPPER(query), r'SELECT\s+\*')
  AND total_bytes_billed > 10 * POW(1024, 3)  -- > 10 GB
ORDER BY total_bytes_billed DESC
LIMIT 20;

3.2 Full table scans (no partition filter)

SELECT
  job_id,
  user_email,
  referenced_tables,
  ROUND(total_bytes_billed / POW(1024, 3), 1) AS gb_scanned,
  ROUND(total_bytes_billed / POW(1024, 4) * 6.25, 2) AS cost_usd,  -- $6.25/TB
  SUBSTR(query, 1, 200) AS query_preview
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND total_bytes_billed > 100 * POW(1024, 3)  -- > 100 GB
  -- Hint: no visible date filter
  AND NOT REGEXP_CONTAINS(LOWER(query), r'where.*date|partition|_partitiontime')
ORDER BY total_bytes_billed DESC
LIMIT 15;

3.3 Cache rate

SELECT
  DATE(creation_time) AS day,
  COUNTIF(cache_hit) AS cache_hits,
  COUNTIF(NOT cache_hit) AS cache_misses,
  ROUND(COUNTIF(cache_hit) / COUNT(*) * 100, 1) AS cache_hit_rate
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day;

Benchmark:

  • < 20% cache hit → problem (queries too varied or tables modified often)
  • 20-40% → normal for analytics
  • 40% → good, well-configured dashboards

3.4 Failures and timeouts

SELECT
  state,
  error_result.reason AS error_reason,
  COUNT(*) AS count,
  ROUND(SUM(total_bytes_processed) / POW(1024, 4), 2) AS tb_processed_anyway
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND state != 'DONE'
GROUP BY state, error_reason
ORDER BY count DESC;

What I'm looking for:

  • Many CANCELLED → timeouts, possible waste
  • Recurring errors → broken pipelines that retry

Phase 4: Analyze storage (20 min)

Reminder: I do this phase last, not first. Storage rarely costs more than 10-15% of the total BigQuery bill.

4.1 Storage overview

SELECT
  table_schema AS dataset,
  COUNT(*) AS table_count,
  ROUND(SUM(active_physical_bytes) / POW(1024, 4), 2) AS active_tb,
  ROUND(SUM(long_term_physical_bytes) / POW(1024, 4), 2) AS longterm_tb,
  ROUND(SUM(active_physical_bytes + long_term_physical_bytes) / POW(1024, 4), 2) AS total_tb
FROM `project`.`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY dataset
ORDER BY total_tb DESC
LIMIT 20;

4.2 Active vs long-term ratio

SELECT
  ROUND(SUM(active_physical_bytes) / POW(1024, 4), 2) AS active_tb,
  ROUND(SUM(long_term_physical_bytes) / POW(1024, 4), 2) AS longterm_tb,
  ROUND(SUM(active_physical_bytes) /
    (SUM(active_physical_bytes) + SUM(long_term_physical_bytes)) * 100, 1) AS active_pct
FROM `project`.`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE;

Benchmark:

  • Active > 50% → missing lifecycle policies
  • Active < 30% → well optimized

4.3 Potentially unused tables

SELECT
  table_schema,
  table_name,
  ROUND(total_physical_bytes / POW(1024, 3), 2) AS size_gb,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified,
  DATE_DIFF(CURRENT_DATE(), DATE(TIMESTAMP_MILLIS(last_modified_time)), DAY) AS days_since_modified
FROM `project.dataset.__TABLES__`
WHERE DATE_DIFF(CURRENT_DATE(), DATE(TIMESTAMP_MILLIS(last_modified_time)), DAY) > 90
ORDER BY total_physical_bytes DESC
LIMIT 20;

At one client, I found 2 TB of "tmp_" and "test_" tables created by data scientists who had left the company 18 months ago. Nobody dared delete them "just in case". We cleaned up together: €400/month savings just on storage.

4.4 Tables without partitioning

-- Large tables without partitioning option
SELECT
  ts.table_schema,
  ts.table_name,
  ROUND(ts.total_physical_bytes / POW(1024, 3), 2) AS size_gb
FROM `project`.`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE ts
WHERE ts.total_physical_bytes > 10 * POW(1024, 3)  -- > 10 GB
  AND NOT EXISTS (
    SELECT 1 FROM `project`.`region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS opt
    WHERE opt.table_schema = ts.table_schema
      AND opt.table_name = ts.table_name
      AND opt.option_name IN ('partition_expiration_days', 'require_partition_filter')
  )
ORDER BY ts.total_physical_bytes DESC;

Note: This query detects tables without configured partitioning options. For exhaustive detection, also manually check _PARTITIONTIME or _PARTITIONDATE columns on suspect tables.


Phase 5: Check governance (15 min)

5.1 Configuration of critical tables

SELECT
  table_schema,
  table_name,
  ROUND(total_physical_bytes / POW(1024, 3), 2) AS size_gb,
  -- This info comes from table options
  'CHECK MANUALLY' AS require_partition_filter,
  'CHECK MANUALLY' AS partition_expiration
FROM `project`.`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE total_physical_bytes > 100 * POW(1024, 3)  -- > 100 GB
ORDER BY total_physical_bytes DESC
LIMIT 10;

For each large table, check manually:

SELECT option_name, option_value
FROM `project.dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE table_name = 'my_large_table';

5.2 Existing labels/tags

SELECT
  table_schema,
  table_name,
  labels
FROM `project`.`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE labels IS NOT NULL
LIMIT 20;

What I'm looking for:

  • % of tables with labels → governance in place or not
  • Labels used → team, env, cost_center?

5.3 Quotas and limits

Check in the GCP console:

  • Project-level quotas configured?
  • Custom quotas per user?
  • Budget alerts in place?

What I advise against: Setting user quotas too strict from the start. I've seen data teams completely blocked because a FinOps admin set a 1 TB/day quota without warning. Start by monitoring, then adjust.


Phase 6: Synthesis and recommendations (20 min)

Report template

Recommended structure:

1. Executive summary

  • Current monthly cost: €XX
  • Identified potential savings: €XX (XX%)
  • Priority #1: [Main quick win]

2. Findings

  • Critical: [Finding] - Impact: €XX/month - Effort: Low/Medium/High
  • Important: ...
  • Improvements: ...

3. Top 5 Actions

PriorityActionSavingsEffortTimeline
1...€XXLow1 day
2...€XXMedium1 week
3...€XX......
4...€XX......
5...€XX......

4. Metrics to monitor

  • Average daily cost
  • Cache hit rate
  • Top 3 consumers

Summary checklist

Phase 1: Overview (15 min)

  • Total 30-day cost
  • Day-by-day evolution
  • Breakdown by type

Phase 2: Heavy consumers (20 min)

  • Top 15 users
  • Top 25 queries
  • Frequent queries

Phase 3: Problematic patterns (30 min)

  • SELECT *
  • Full table scans
  • Cache rate
  • Failures

Phase 4: Storage (20 min)

  • Overview
  • Active/long-term ratio
  • Zombie tables
  • Unpartitioned tables

Phase 5: Governance (15 min)

  • Large table config
  • Labels
  • Quotas

Phase 6: Synthesis (20 min)

  • Written report
  • Top 5 actions
  • Monitoring metrics

Conclusion

In 2 hours, you can have a clear view of:

  • Where the money goes
  • Who consumes the most
  • Which patterns are problematic
  • What the quick wins are

The most important thing is not to be exhaustive, but to identify the 20% of actions that will bring 80% of the savings.

This method, I've refined over dozens of audits. It works as well for a startup with €500/month on BigQuery as for a scale-up at €50k/month. The difference is just the number of zeros on the identified savings.


If this type of content helps you, you can follow me on LinkedIn or subscribe to the newsletter for future articles.

Jonathan Kini

Jonathan Kini

J'aide les équipes data à réduire et maîtriser leurs coûts BigQuery et Snowflake, sans sacrifier la performance. De la startup aux environnements data à grande échelle.