Painless Data Augmentation with BigQuery

Quickly Augmenting Your Datasets with BigQuery Public Data

Written by Austin Poor

Published: 2021-01-07

An image of toy blocks being assembled into a tower by a little crane, generated by DALL-E.

Google Cloud’s BigQuery is a great tool for data scientists who want to quickly augment their datasets with external data. Specifically, BigQuery hosts a collection of public datasets from a variety of sources. All you need is a Google Cloud account and some basic SQL knowledge.

There are quite a few useful public datasets to choose from; the US Census data used below is just one example.

A Quick Example

I think one of the most useful of the BigQuery public datasets is the US Census Bureau’s American Community Survey (ACS) data, which provides multi-year data broken down geographically (by state, zip code, county, etc.).

It has a lot of great demographic information like population (broken down by age, race, gender, marital status, etc.), education levels, employment, income, and much more.

For example, say I wanted to query the total population and median household income for three zip codes in the NYC area. There’s a table called zip_codes_2018_5yr that contains the 2018 ACS 5-year estimates, broken down by zip code.

Here’s what my query will look like:

SELECT
  geo_id,  -- Where geo_id is the zip code
  total_pop,
  median_income
FROM
  `bigquery-public-data`.census_bureau_acs.zip_codes_2018_5yr
WHERE
  geo_id IN ("11377", "11101", "10708");

And I can run it in the BigQuery UI…

Screenshot of the BigQuery UI

And get the following results…

Viewing query results in the BigQuery UI

Great! I got my answer in 0.4 seconds, and now I can go back and expand the query to pull this data for multiple years. Or I can export the results to a CSV or JSON file and join them with my own data, as sketched below.

Screenshot showing export options for BigQuery results
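For instance, once the results are exported to CSV, a quick pandas join is enough to attach the census columns to your own data. The sketch below is illustrative only: the file names and the zip_code column are hypothetical stand-ins for whatever your dataset actually uses.

import pandas as pd

# Load my own dataset and the exported BigQuery results.
# (File and column names here are hypothetical, for illustration.)
my_data = pd.read_csv("my_listings.csv", dtype={"zip_code": str})
census = pd.read_csv("bq-results.csv", dtype={"geo_id": str})

# Join on zip code to attach total population and median income
# to each row of my dataset.
augmented = my_data.merge(
    census,
    how="left",
    left_on="zip_code",
    right_on="geo_id",
)

print(augmented[["zip_code", "total_pop", "median_income"]].head())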

Finally, as a bonus, you can query BigQuery directly from Python using the google-cloud-bigquery package.
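Here’s a minimal sketch of what that might look like, assuming you’ve already configured Google Cloud credentials on your machine; the project ID is a placeholder, and to_dataframe() requires pandas to be installed.

from google.cloud import bigquery

# Create a client -- assumes Google Cloud credentials are set up.
# Replace the project ID with your own.
client = bigquery.Client(project="your-project-id")

query = """
SELECT
  geo_id,  -- Where geo_id is the zip code
  total_pop,
  median_income
FROM
  `bigquery-public-data`.census_bureau_acs.zip_codes_2018_5yr
WHERE
  geo_id IN ("11377", "11101", "10708");
"""

# Run the query and load the results into a pandas DataFrame.
results = client.query(query).to_dataframe()
print(results)

With this approach the results land directly in a DataFrame, so you can skip the CSV export step entirely and join them with your data in the same script.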