Quickly Load CSVs into PostgreSQL Using Python and Pandas

Use Pandas to quickly create and populate a Postgres database

Written by Austin Poor

Published: 2020-02-13

#csv #sql #postgres #pandas #data-science
[Figure: a flowchart from a CSV file, through Pandas, to a Postgres database. Caption: Use Pandas to quickly load your CSV data into a Postgres database.]

Sometimes it can be kind of a pain to deal with copying CSV data into a Postgres database — especially if you don’t want to write out a long schema. Why not let Pandas do all that legwork for you? I’ll walk you through a quick example using the Iris dataset (here’s a link to the data).

Loading CSV Data Into Postgres with Pandas

Assuming you have a Postgres server up and running, you’ll need to either create a database to store your data or use an existing one. Using your preferred Postgres interface (e.g. psql or pgAdmin), you can run the following to create your database…

CREATE DATABASE iris;

Now that you have a database to store your table, you can move over to Python to import your data. You’ll need the SQLAlchemy Python toolkit (if you don’t already have it, it can be installed with pip install SQLAlchemy), plus a Postgres driver for SQLAlchemy to use; the postgresql:// connection string used below defaults to psycopg2, which can be installed with pip install psycopg2-binary.

You’re going to load your CSV data with Pandas and then use SQLAlchemy to pass the data to Postgres. SQLAlchemy’s create_engine function will need the following six pieces of information:

  • The DBMS dialect, which in this case will be “postgresql”
  • A username for the database
  • That user’s password
  • The hostname for the Postgres server
  • The port for the Postgres server (5432 by default)
  • The name of the database to store the new table

And it will be formatted as follows:

create_engine("<dbms>://<username>:<password>@<hostname>:<port>/<db_name>")

Here’s how that will look in code…

# Imports
import pandas as pd
from sqlalchemy import create_engine

# This CSV doesn't have a header so pass
# column names as an argument
columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class"
]

# Load in the data
df = pd.read_csv(
    "iris.csv",
    names=columns
)

# Create the SQLAlchemy engine for the Postgres connection
engine = create_engine(
    'postgresql://postgres:my_password@localhost:5432/iris'
)

# Save the data from dataframe to
# postgres table "iris_dataset"
df.to_sql(
    'iris_dataset', 
    engine,
    index=False # Not copying over the index
)
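
Once that runs, you can do a quick sanity check by reading a few rows back with Pandas. This snippet reuses pd and engine from the code above, and the query is just for illustration…

# Read a few rows back from the new table
# to confirm the load worked
check = pd.read_sql("SELECT * FROM iris_dataset LIMIT 5", engine)
print(check)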

If your CSV data is too large to fit into memory, you might be able to use one of these two options…

Working with Large Datasets: Option 1

One option would be to use the chunksize argument of pd.read_csv, which returns an iterator that reads through the CSV and yields a DataFrame of (at most) chunksize rows at a time. Here’s what that might look like…

# Imports
import pandas as pd
from sqlalchemy import create_engine

# This CSV doesn't have a header so pass
# column names as an argument
columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class"
]

# Create the SQLAlchemy engine for the Postgres connection
engine = create_engine(
    'postgresql://postgres:my_password@localhost:5432/iris'
)

# Create an iterable that will read "chunksize=1000" rows
# at a time from the CSV file
for df in pd.read_csv("iris.csv", names=columns, chunksize=1000):
    df.to_sql(
        'iris_dataset',
        engine,
        index=False,
        # if the table already exists, append this data
        if_exists='append',
    )
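
As a side note, to_sql also has its own chunksize and method parameters that can speed up the inserts themselves. Whether method='multi' (which packs multiple rows into each INSERT statement) actually helps depends on your data and server, so treat the values below as something to experiment with rather than a recommendation…

# Optionally batch the INSERTs as well
df.to_sql(
    'iris_dataset',
    engine,
    index=False,
    if_exists='append',
    chunksize=1000,   # rows written per batch
    method='multi',   # multiple rows per INSERT statement
)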

Working with Large Datasets: Option 2

Another option (which is a bit more of a hack) would be to load just a few rows of data into pandas, enough for it to infer the column datatypes and create the table, and then read in the full CSV with psql’s \copy command. Here’s what that might look like…

# Imports
import pandas as pd
from sqlalchemy import create_engine

# This CSV doesn't have a header so pass
# column names as an argument
columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class"
]

# Load in the data
df = pd.read_csv(
    "iris.csv",
    names=columns,
    nrows=1000    # load in the first 1000 rows
)

# Create the SQLAlchemy engine for the Postgres connection
engine = create_engine(
    'postgresql://postgres:my_password@localhost:5432/iris'
)

# Save the data from dataframe to
# postgres table "iris_dataset"
df.to_sql(
    'iris_dataset', 
    engine,
    index=False
)

Then, once Pandas has created the table, you can run the following two commands in psql to delete the rows Pandas inserted and have Postgres re-import the full dataset…

DELETE FROM iris_dataset;

\COPY iris_dataset FROM 'downloads/iris.csv' DELIMITER ',' CSV; 
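
Note that \copy runs client-side, so the file path is resolved on the machine where psql is running. If the CSV lives on the database server itself and your role has the required file-read privileges, the server-side COPY command is an alternative (the path below is just a placeholder)…

COPY iris_dataset FROM '/path/on/server/iris.csv' DELIMITER ',' CSV;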

Hopefully that helps! Let me know if you have any other techniques you’d suggest.


Thank you so much for reading! If you have any thoughts, questions, or comments, I'd love to hear them. You can find me on Twitter, Mastodon, or LinkedIn.
