Solving the Data Dilemma: Efficiently Load CSV Data into a SQL Database

Madison Ostermann
Mar 20, 2022 · 5 min read

You’re probably reading this article because you have a similar challenge to the one I had — you need to get data from a spreadsheet into a SQL database. Pick your poison, but Python is a great tool to accomplish this.

This article will compare not one but FOUR ways to write data from a spreadsheet into a SQL table with Python in order to determine the most efficient method.

Connect to SQL

If you haven’t already connected to your SQL database from Python, you’ll want to do that first. If you’re not sure how, I have a separate article on how to do this!

Once you’re set, your code should look somewhat like this:

# PACKAGES
import pyodbc
from getpass import getpass
# INPUTS
username = input("Username: ")
password = getpass("Password: ")
database = input("Database Name: ")
schema = input("Schema: ")
table = input("Table: ")
dsn = input("DSN Name: ")
# CONNECT TO DATABASE
conn = pyodbc.connect("DSN="+dsn+";UID="+username+";PWD="+password)
print(conn)

Note: I’m on Mac and had lots of configuration troubleshooting that ultimately led me to use a combination of a DSN and UN/PW to connect. If you’re on Windows, you might be using a different driver or credentials setup, so don’t worry if your connection string looks a little different, as long as it works.
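For reference, here’s a minimal sketch of what a DSN-less connection might look like on Windows, assuming “ODBC Driver 17 for SQL Server” is installed and with a placeholder server name:

# CONNECT TO DATABASE - WINDOWS, NO DSN (sketch; swap in your own driver and server)
server = "my_server_name"  # hypothetical server name or IP
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER="+server+";"
    "DATABASE="+database+";"
    "UID="+username+";PWD="+password
)
print(conn)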

Hopefully you get an output that looks something like:

<pyodbc.Connection object at 0x7f847b10ef10>

Read Spreadsheet

If you have an Excel workbook:

# PACKAGES
import pandas as pd
# READ EXCEL FILE
df = pd.read_excel("my_data.xlsx", dtype=str)

If you have a CSV:

# PACKAGES
import pandas as pd
# READ CSV FILE
df = pd.read_csv("my_data.csv", dtype=str)
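Either way, it’s worth a quick sanity check that the spreadsheet loaded the way you expect before writing anything to the database:

# QUICK SANITY CHECK
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
print(df.dtypes)  # every column should show as object because of dtype=str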

Write to the SQL Database

So here’s where it gets tricky: there are a lot of packages and methods you can use to write the contents of the dataframe to the database, but they are not all created equal.

Method 1: BULK INSERT (PERMISSIONS MIGHT RESTRICT)

There are security concerns around giving developers permission to run the BULK INSERT statement, so if you’re using an organizational database you probably won’t be able to use this method. Also note that the file path in a BULK INSERT is resolved by the database server, not your local machine, so the CSV has to live somewhere the server can read. If you can use it, though, this is the code you’d want to try:

cursor = conn.cursor()
cursor.execute("BULK INSERT "+schema+"."+table+" FROM 'my_data.csv'")
conn.commit()
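
If your permissions do allow it, a real CSV usually needs a couple of options, for example to skip the header row. Here’s what that might look like, assuming SQL Server 2017 or later (which added FORMAT = 'CSV') and that 'my_data.csv' is a path the database server itself can read:

# BULK INSERT WITH CSV OPTIONS (sketch)
cursor = conn.cursor()
cursor.execute(
    "BULK INSERT "+schema+"."+table+" FROM 'my_data.csv' "
    "WITH (FORMAT = 'CSV', FIRSTROW = 2)"  # FIRSTROW = 2 skips the header row
)
conn.commit()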

I’m not able to try this out, so let me know in the comments if you’re able to do this and how it performs against the other methods.

Method 2: Loop + Single Execute + INSERT INTO (SLOW)

This is the most basic and least optimal approach, so I would not recommend it, but it would be remiss of me to exclude this method from the list.

chunk = df.values.tolist()
tuple_of_tuples = tuple(tuple(x) for x in chunk)
cols = ("?,"*df.shape[1])[:-1]  # one "?" placeholder per column
cursor = conn.cursor()
for t in tuple_of_tuples:
    cursor.execute(f"INSERT INTO {schema}.{table} VALUES ({cols})", t)
conn.commit()

Rule of thumb: before resigning yourself to looping through dataframe rows for reading/writing, check the documentation and Stack Overflow for more efficient solutions.

Method 3: Execute Many + INSERT INTO (DEPENDS ON DRIVER)

If your connection to the SQL database goes through Microsoft’s ODBC Driver for SQL Server, you’re one of the lucky ones.

I tried the following code and it failed with a segfault. After some research, I found that this was because fast_executemany = True relies on an ODBC feature called “parameter arrays”, which isn’t supported by FreeTDS and some other ODBC drivers.

However, one of my Windows-using teammates was able to run this with success.

chunk = df.values.tolist()
tuple_of_tuples = tuple(tuple(x) for x in chunk)
cols = ("?,"*df.shape[1])[:-1]  # one "?" placeholder per column
cursor = conn.cursor()
cursor.fast_executemany = True  # sends all rows in one batch via ODBC parameter arrays
cursor.executemany(f"INSERT INTO {schema}.{table} VALUES ({cols})", tuple_of_tuples)
conn.commit()
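
If you’re not sure which ODBC driver your connection is actually using, pyodbc can tell you. This is just a quick diagnostic, not part of the load itself:

# CHECK WHICH ODBC DRIVER THE CONNECTION IS USING
print(conn.getinfo(pyodbc.SQL_DRIVER_NAME))  # e.g. an msodbcsql library vs. a FreeTDS library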

Method 4: SQL Alchemy “to_sql” (WINNER?)

This method requires a little bit more setup, but the actual write operation is only one line of code. Too good to be true? Spoiler: it’s not!

First you need to create an engine object using SQL Alchemy by substituting in your credentials and database details. What this looks like will depend on how you connect to your database — I recommend having a DSN in your configuration files for ease and security in your code.

Once the engine is created, it’s only one line to write the data frame (e.g. your spreadsheet as a Python object) to the SQL table. Your parameters should be as follows:

  • con = engine you just created (con = connection)
  • name = SQL table to which you’re writing data
  • schema = SQL schema that the table is in
  • if_exists = what to do if the table already exists — I have it dropping the old data and inserting entirely fresh data (hence 'replace' in my code), but you can have it 'fail' if you don’t want any data to be written, or 'append' if you want to add data to the existing data.
  • index = whether or not to write the index column from the data frame
  • chunksize = the number of rows written to the table per batch. This parameter tripped me up at first: the default is to write all rows at once, but the driver limits how many parameters a single statement can carry (2,100 for SQL Server), and with method='multi' each row uses one parameter per column, so chunksize times the number of columns has to stay under that limit (see the short sketch right after this list).
  • method = this one matters the most!!! If you leave it blank, the default is for it to use single INSERT calls, which is exactly the same as using Method #2 (slow). Using 'multi' will pass multiple values into a single INSERT call and drastically speeds it up!!!
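Here’s a rough sketch of that chunksize math, assuming SQL Server’s 2,100-parameter limit (other databases may differ); you could pass the result as chunksize instead of a hard-coded number:

# PICK A SAFE CHUNKSIZE FOR method="multi" (sketch; assumes a 2,100-parameter limit)
param_limit = 2100
safe_chunksize = max(1, param_limit // df.shape[1])  # rows per batch = limit / number of columns
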
# PACKAGES
import sqlalchemy
# CREATE ENGINE - MAC WITH DSN
engine = sqlalchemy.create_engine("mssql+pyodbc://"+username+":"+password+"@"+dsn, connect_args={"database": database})

# CREATE ENGINE - WINDOWS WITH NO DSN
server_name_or_ip = "XXX.XXX.XXX.XXX"
engine = sqlalchemy.create_engine("mssql+pyodbc://"+server_name_or_ip+"/"+database+"?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")

# WRITE DATAFRAME TO SQL TABLE
df.to_sql(con=engine, name=table, schema=schema, if_exists="replace", index=False, chunksize=100, method="multi")

This method was very snappy for loading up my dataset with ~40 columns and ~18,000 rows.

Conclusion & Final Thoughts

The first two methods, BULK INSERT and the loop of single inserts, are the least practical: the first is often blocked by permissions and the second is slow. As for the most optimal, it’s probably a toss-up between Methods #3 and #4 for Windows users, and definitely #4 for Mac users.

Here’s the code from all four methods combined into a single gist — please try speed-testing these out and let me know if you had similar results!
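If you want to run your own quick comparison, a simple pattern is to wrap each method in a function and time it. The run_method_2 and run_method_4 names below are hypothetical placeholders for whichever snippets above you wrap up:

# SIMPLE TIMING HARNESS (sketch; run_method_2/run_method_4 are placeholders)
import time

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f} seconds")

time_it("Method 2 (loop of single inserts)", run_method_2)
time_it("Method 4 (to_sql with method='multi')", run_method_4)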

Hope this article helped you with your data-processing dilemma and consider following me for more coding- and data-related tutorials!
