Painless Data Augmentation with BigQuery
Quickly Augmenting Your Datasets with BigQuery Public Data
Written by Austin Poor
Published 2021-01-07

Google Cloud’s BigQuery is a great tool for data scientists to quickly and easily augment their datasets with external data. Specifically, BigQuery has a listing of public datasets from a variety of different sources. All you need is a Google Cloud account and some basic SQL knowledge.
Here are just a few useful public datasets:
- US Census American Community Survey
- COVID-19 Data from Google
- NOAA Global Weather Data
- Public Domain Artworks from The Met
- Bitcoin Transaction Data
A Quick Example
I think one of the most useful among the BigQuery public datasets is the US Census ACS data, which gives multi-year data broken down geographically (by state, zip code, county, etc.).
It has a lot of great demographic information like population (broken down by age, race, gender, marital status, etc.), education levels, employment, income, and much more.
For example, say I wanted to query the total population and median household income for three zip codes in the NYC area. There’s a table called zip_codes_2018_5yr
that gives a 5-year estimate of census data for the year 2018, broken down by zip code.
Here’s what my query will look like:
SELECT
geo_id, -- Where geo_id is the zip code
total_pop,
median_income
FROM
`bigquery-public-data`.census_bureau_acs.zip_codes_2018_5yr
WHERE
geo_id in ("11377","11101","10708");
And I can run it in the BigQuery UI…
And get the following results…
Great! I got my answer in 0.4 seconds and now I can go back and expand my query to get this data for multiple years. Or, I can export the results to a CSV or JSON file to join it up with my data.
Finally, as a bonus, you can connect to BigQuery through Python with the package google-cloud-bigquery.
Links / Further Reading
- More About BigQuery
- BigQuery’s Public Datasets
- BigQuery SQL Syntax
- Querying BigQuery Tables from Python
Thank you so much for reading! If you have any thoughts, questions, or comments, I'd love to hear them. You can find me on Twitter, Mastodon, or LinkedIn.
If you liked this post, you might also like:
-
Quickly Load CSVs into PostgreSQL Using Python and Pandas
Use Pandas to quickly create and populate a Postgres database
-
Handling ML Predictions in a Flask App
Don't let long-running code slow down your Flask app
-
Take Full Control of Your Python Plots with Jinja
Create fully custom plots in Python with SVG and Jinja