Back

Data Scientists, Start Using Profilers

Find the parts of your algorithm that are ACTUALLY slowing you down

Written by Austin Poor

Published: 2020-08-07

An image of a robot holding a stop-watch, generated by DALL-E.
An image of a robot holding a stop-watch, generated by DALL-E.

Data scientists often need to write a lot of complex, slow, CPU- and I/O-heavy code — whether you’re working with large matrices, millions of rows of data, reading in data files, or web-scraping.

Wouldn’t you hate to waste your time refactoring one section of your code, trying to wring out every last ounce of performance, when a few simple changes to another section could speed up your code tenfold?

If you’re looking for a way to speed up your code, a profiler can show you exactly which parts are taking the most time, allowing you to see which sections would benefit most from optimization.

A profiler measures the time or space complexity of a program. There’s certainly value in theorizing about the big O complexity of an algorithm but it can be equally valuable to examine the real complexity of an algorithm.

Where is the biggest slowdown in your code? Is your code I/O bound or CPU bound? Which specific lines are causing the slowdowns?

Once you’ve answered those questions you’ll A) have a better understanding of your code and B) know where to target your optimization efforts in order to get the biggest boon with the least effort.

Let’s dive into some quick examples using Python.

The Basics

You might already be familiar with a few methods of timing your code. You could check the time before and after a line executes like this:

In [1]: start_time = time.time()
   ...: a_function() # Function you want to measure
   ...: end_time = time.time()
   ...: time_to_complete = end_time - start_time
   ...: time_to_complete
Out[1]: 1.0110783576965332 

Or, if you’re in a Jupyter Notebook, you could use the magic %time command to time the execution of a statement, like this:

In [2]: %time a_function()
CPU times: user 14.2 ms, sys: 41 µs, total: 14.2 ms
Wall time: 1.01 s 

Or, you could use the other magic command %timeit which gets a more accurate measurement by running the command multiple times, like this:

In [3]: %timeit a_function()
1.01 s ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Alternatively, if you want to time your whole script, you can use the bash command time, like so…

$ time python my_script.py

real    0m1.041s
user    0m0.040s
sys     0m0.000s 

These techniques are great if you want to get a quick sense of how long a script or a section of code takes to run but they are less useful when you want a more comprehensive picture. It would be a nightmare if you had to wrap each line in time.time() checks. In the next section, we’ll look at how to use Python’s built-in profiler.

Diving Deeper with cProfile

When you’re trying to get a better understanding of how your code is running, the first place to start is cProfile, Python’s built-in profiler. cProfile will keep track of how often and for how long parts of your program were executed.

Just keep in mind that cProfile shouldn’t be used to benchmark your code. It’s written in C which makes it fast but it still introduces some overhead that could throw off your times.

There are multiple ways to use cProfile but one simple way is from the command line.

Before we demo cProfile, let’s start by looking at a basic sample program that will download some text files, count the words in each one, and then save the top 10 words from each to a file. Now that being said, it isn’t too important what the code does, just that we’ll be using it to show how the profiler works.

import re
from collections import Counter
import urllib.request

def get_book(url):
  """Load text from a URL"""
  with urllib.request.urlopen(url) as response:
    text = response.read().decode(errors='ignore')
  return text.lower()
  
def split_words(book):
  """Extract the words with regex"""
  return re.split("[^A-Za-z]+",book)
  
def count_words(words):
  """Create a dictionary with word counts"""
  return Counter(words)
  
def read_books(urls):
  """For each url in urls, 
  load the book and count the words"""
  # Create a place to store word counts
  word_counts = {}
  # For each book: load it, count words, store the counts
  for title, path in urls.items():
    book = get_book(path)
    words = split_words(book)
    counts = count_words(words)
    word_counts[title] = counts.most_common()[:10]
  return word_counts
  
def save_results(results,path):
  """Save the results to a text file"""
  with open(path,'w') as f:
    for book, words in results.items():
      f.write(f"BOOK: {book}")
      for word, count in words:
        f.write(f"{' '*8}{word:10s}{count:6d}")
    
urls = {
  'pride-and-prejudice': \
    'https://www.gutenberg.org/files/1342/1342-0.txt',

  'alice-in-wonderland': \
    'https://www.gutenberg.org/files/11/11-0.txt',

  'sherlock-holmes': \
    'https://www.gutenberg.org/files/1661/1661-0.txt',

  'moby-dick': \
    'https://www.gutenberg.org/files/2701/2701-0.txt',
    
  'count-of-monte-cristo': \
    'https://www.gutenberg.org/files/1184/1184-0.txt'
}


output_file = "my-results.txt"

# Download the books and count words
results = read_books(urls)

# Save the results to a text file
save_results(results,output_file)

Now, with the following command, we’ll profile our script.

$ python -m cProfile -o profile.stat script.py

The -o flag specifies an output file for cProfile to save the profiling statistics.

Next, we can fire up python to examine the results using the pstats module (also part of the standard library).

In [1]: import pstats
   ...: p = pstats.Stats("profile.stat")
   ...: p.sort_stats(
   ...:   "cumulative"   # sort by cumulative time spent
   ...: ).print_stats(
   ...:   "script.py"    # only show fn calls in script.py
   ...: )
   
Fri Aug 07 08:12:06 2020    profile.stat46338 function calls (45576 primitive calls) in 6.548 secondsOrdered by: cumulative time
List reduced from 793 to 6 due to restriction <'script.py'>ncalls tottime percall cumtime percall filename:lineno(function)
     1   0.008   0.008   5.521   5.521 script.py:1(<module>)
     1   0.012   0.012   5.468   5.468 script.py:19(read_books)
     5   0.000   0.000   4.848   0.970 script.py:5(get_book)
     5   0.000   0.000   0.460   0.092 script.py:11(split_words)
     5   0.000   0.000   0.112   0.022 script.py:15(count_words)
     1   0.000   0.000   0.000   0.000 script.py:32(save_results)

Wow! Look at all that useful info!

For each function called, we’re seeing the following information:

  • ncalls: number of times the function was called
  • tottime: total time spent in the given function (excluding calls to sub-functions)
  • percall: tottime divided by ncalls
  • cumtime: total time spent in this function and all sub-functions
  • percall: (again) cumtime divided by ncalls
  • filename:lineo(function): the file name, line number, and function name

When reading this output, note the fact that we’re hiding a lot of data – in fact, we’re only seeing 6 out of 793 rows. Those hidden rows are all the sub-functions being called from within functions like urllib.request.urlopen or re.split. Also, note that the <module> row corresponds to the code in script.py that isn’t inside a function.

Now let’s look back at the results, sorted by cumulative duration.

ncalls tottime percall cumtime percall filename:lineno(function)
     1   0.008   0.008   5.521   5.521 script.py:1(<module>)
     1   0.012   0.012   5.468   5.468 script.py:19(read_books)
     5   0.000   0.000   4.848   0.970 script.py:5(get_book)
     5   0.000   0.000   0.460   0.092 script.py:11(split_words)
     5   0.000   0.000   0.112   0.022 script.py:15(count_words)
     1   0.000   0.000   0.000   0.000 script.py:32(save_results)

Keep in mind the hierarchy of function calls. The top-level, <module>, calls read_books and save_results. read_books calls get_book, split_words, and count_words. By comparing cumulative times, we see that most of <module>’s time is spent in read_books and most of read_books’s time is spent in get_book, where we make our HTTP request, making this script (unsurprisingly) I/O bound.

Next, let’s take a look at how we can be even more granular by profiling our code line-by-line.

Profiling Line-by-Line

Once we’ve used cProfile to get a sense of what function calls are taking the most time, we can examine those functions line-by-line to get an even clearer picture of where our time is being spent.

For this, we’ll need to install the line-profiler library with the following command:

$ pip install line-profiler

Once installed, we just need to add the @profile decorator to the function we want to profile. Here’s the updated snippet from our script:

@profile
def read_books(urls):
  """For each url in urls, 
  load the book and count the words"""
  # Create a place to store word counts
  word_counts = {}
  # For each book: load it, count words, store the counts
  for title, path in urls.items():
    book = get_book(path)
    words = split_words(book)
    counts = count_words(words)
    word_counts[title] = counts.most_common()[:10]
  return word_counts

Note the fact that we don’t need to import the profile decorator function — it will be injected by line-profiler.

Now, to profile our function, we can run the following:

$ kernprof -l -v script-prof.py

kernprof is installed along with line-profiler. The -l flag tells line-profiler to go line-by-line and the -v flag tells it to print the result to the terminal rather than save it to a file.

The result for our script would look something like this:

Wrote profile results to script-profile.py.lprof
Timer unit: 1e-06 s

Total time: 5.62982 s
File: script-profile.py
Function: read_books at line 19

Line #  Hits       Time  Per Hit  % Time  Line Contents
========================================================
    19                                    @profile
    20                                    def read_books(urls):
    21                                      """For each url in urls, 
    22                                      load the book and count the words"""
    23                                      # Create a place to store word counts
    24     1        1.0       1.0    0.0    word_counts = {}
    25                                      # Per book: load, count words, store counts
    26     6       15.0       2.5    0.0    for title, path in urls.items():
    27     5  5038674.0 1007734.8   89.5      book = get_book(path)
    28     5   431378.0   86275.6    7.7      words = split_words(book)
    29     5   121321.0   24264.2    2.2      counts = count_words(words)
    30     5    38429.0    7685.8    0.7      word_counts[title] = counts.most_common()[:10]
    31     1        0.0       0.0    0.0    return word_counts

The key column to focus on here is % Time. As you can see, 89.5% of our time parsing each book is spent in the get_book function — making the HTTP request — further validation that our program is I/O bound rather than CPU bound.

Now, with this new info in mind, if we wanted to speed up our code we wouldn’t want to waste our time trying to make our word counter more efficient. It only takes a fraction of the time compared to the HTTP request. Instead, we’d focus on speeding up our requests — possibly by making them asynchronously.

Here, the results are hardly surprising, but on a larger and more complicated program, line-profiler is an invaluable tool in our programming tool belt, allowing us to peer under the hood of our program and find the computational bottlenecks.

Profiling Memory

In addition to profiling the time-complexity of our program, we can also profile its memory-complexity.

In order to do line-by-line memory profiling, we’ll need to install the memory-profiler library which also uses the same @profile decorator to determine which function to profile.

$ pip install memory-profiler
$ python -m memory_profiler script.py

The result of running memory-profiler on our same script should look something like the following:

Filename: script.py

Line #    Mem usage    Increment   Line Contents
================================================
    19   38.426 MiB   38.426 MiB   @profile
    20                             def read_books(urls):
    21                               """For each url in urls, 
    22                               load the book and count the words"""
    23                               # Create a place to store word counts
    24   38.426 MiB    0.000 MiB     word_counts = {}
    25                               # For each book: load it, count words, store the counts
    26   85.438 MiB    0.000 MiB     for title, path in urls.items():
    27   72.137 MiB    9.281 MiB       book = get_book(path)
    28   95.852 MiB   23.715 MiB       words = split_words(book)
    29   85.438 MiB    0.496 MiB       counts = count_words(words)
    30   85.438 MiB    0.254 MiB       word_counts[title] = counts.most_common()[:10]
    31   85.438 MiB    0.000 MiB     return word_counts

There are currently some issues with the accuracy of the “Increment” so just focus on the “Mem usage” column for now.

Our script had peak memory usage on line 28 when we split the books up into words.

Conclusion

Hopefully, now you’ll have a few new tools in your programming tool belt to help you write more efficient code and quickly determine how to best spend your optimization-time.

You can read more about cProfile here, line-profiler here, and memory-profiler here. I also highly recommend the book High Performance Python, by Micha Gorelick and Ian Ozsvald [1].

References

[1] M. Gorelick and I. Ozsvald, High Performance Python 2nd Edition (2020), O’Reilly Media, Inc.