Python for the Social Sciences: Toolkit Essentials

Why Py?
Why should psychologists, or social scientists more generally, care about programming? The fact is, anyone who uses software for data analysis relies on programming, and many wonderful tools exist to give researchers better control over their data and a more efficient workflow. Although learning to program may seem an insurmountable task, with the right tools a few simple lines of code can accomplish wonderful things. I hope this blog presents, in a very understandable way, some of the best tools for social scientists looking to take advantage of programming to improve their data processing.
Some of the major tasks (and potential major time sucks) for many graduate students and quantitative researchers revolve around data management, processing, analysis and graphing. These tasks can be slowed down by the need to switch programs, convert data formats, and at times do manipulations or calculations by hand. Throughout my time as a researcher, I’ve relied on various programs for these steps, including Excel (pivot tables are great for summarizing data in long/row format), SPSS, E-Prime, Bash scripts, HLM and Python. These are just a few of the many software programs out there; see these Wikipedia articles for much longer lists of available numerical software and statistical packages. With so many options, each with its own capabilities and learning curve, it shouldn’t be surprising that processing pipelines vary so much across fields, between labs, and sometimes even between individuals within a lab. These differences often reflect the nature of the data (e.g., it may not make sense to process fMRI data the same way as survey data or financial time series data), but they are often simply a matter of convenience, comfort, or necessity for the individual researcher. Using multiple programs often means keeping multiple copies of your data file, running the risk of conversion errors, sometimes copying data by hand from one system to another, and generally making it harder to maintain a clean record of the full processing pipeline.
Learning scientific computing in Python was my solution for consolidating and streamlining data processing pipelines from several programs into one (actually two, but more on that below). Although learning to program seems like a daunting task, especially if you aren’t especially tech savvy (after all, I am a developmental psychologist, not a computer scientist), with the right tools you can quickly make the work much more manageable or even automated. I should also point out that I will only be presenting the advantages of Python over the other software I have used; I cannot speak to the merits of other programming languages (MATLAB, R).

Open Source Software and Modules
Python and many of its libraries are open source, meaning they are free to access and the entire source code is available, so you can always check what is happening to your data under the hood. Python is a high-level, general-purpose language, making it capable of accomplishing almost any task you can think of. There is also a ton of great online documentation, support (Stack Overflow is often the best source of help), and mailing lists for reference. To fully utilize Python, it is essential to learn how to manage libraries/modules/packages and how to set up an interactive environment for running and testing code. These packages can quickly and substantially expand the power of your code, but they can be a major headache to find, install and manage. Additionally, some packages have dependencies: other packages that must be installed for them to run. The remainder of this blog introduces some essential packages for scientific computing in Python and shows how to get them installed quickly.

Anaconda
Anaconda from Continuum Analytics is a free distribution of Python and 125+ additional packages geared towards scientific computing. Anaconda can be set up on Mac, PC and Unix systems and saves you from having to download and install each individual package manually (something I still haven’t managed to do successfully). Anaconda can also be used to manage and update these packages from your command prompt. Finding Anaconda was the jumping-off point for moving the vast majority of my data processing to Python, and it includes all of the packages and applications outlined below (except for a few in the Other Tools section). Once you’ve successfully installed Anaconda, you’re ready for scientific computing!
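For example, once Anaconda is installed, packages can be managed with a few short commands at the prompt (the package names here are just illustrations):

conda install pandas    # install a package along with its dependencies
conda update numpy      # update a single installed package
conda list              # show everything currently installed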
Python Environments
Once downloaded, Anaconda provides an app launcher (Launcher.app in the anaconda folder) with three environments: IPython Notebook, IPython Qt console and Spyder. These environments allow for interactive coding and control how Python interacts with your console, including printing output, executing code and displaying images. IPython and the IPython Qt console are interactive shells for running your code and seeing the output inline. Spyder is designed for scientific computing and sets up a display similar to RStudio or MATLAB. From any of these you can create scripts, import modules and run your code. To customize your environment, use import statements to bring in whichever of the 125+ modules you need (key packages outlined below).
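A typical session begins with a handful of imports; the aliases below are common conventions in the scientific Python community, not requirements:

import numpy as np               # numerical arrays
import scipy.stats as stats      # statistical tests
import pandas as pd              # data management
import matplotlib.pyplot as plt  # plotting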

SciPy and NumPy
SciPy (SP) and NumPy (NP) are two essential packages that let Python read, manipulate and analyze data in numerical array formats. These packages are very useful on their own and are also dependencies for many other packages. NP stores data in arrays of various types (integer, float, and object/string), including structured arrays with named fields (similar to sets of data with column labels). NP supports simple arithmetic and matrix operations, along with basic descriptive statistics such as mean and standard deviation. SP expands on these capabilities with more advanced statistical functions; the SP.stats module includes correlations, t-tests, ANOVA and rank-order tests. Together, these packages can largely replace manipulations previously done in Excel and SPSS; however, it is worth noting that there is no reliable repeated-measures GLM function, for which I still rely on SPSS.
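As a small illustration, here is a minimal sketch using made-up scores for two conditions; the numbers are invented purely for demonstration:

import numpy as np
from scipy import stats

# two made-up sets of scores, one per condition
condition_a = np.array([4.1, 5.0, 6.2, 5.5, 4.8])
condition_b = np.array([5.9, 6.4, 7.1, 6.0, 6.8])

# descriptive statistics with NumPy
print(condition_a.mean(), condition_a.std())

# independent-samples t-test and Pearson correlation with scipy.stats
t, p = stats.ttest_ind(condition_a, condition_b)
r, p = stats.pearsonr(condition_a, condition_b)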

Pandas
Pandas is my favorite Python package for data management and offers most of the capabilities of Excel and SPSS. Since discovering this package, I’ve tried to use it as my primary data management tool. Originally developed for quick processing of financial data, it can be used for cleaning data (e.g., calculating means and standard deviations, working with missing data, and removing outliers based on individual or group values), grouping and summarizing data, and basic plotting. For an idea of the power of Pandas, check out 10 Minutes to Pandas. Pandas can read in various data files (CSV, tab-delimited, etc.) and allows for user-defined functions that can be applied to data row-wise or column-wise and quickly iterated over multiple variables or ranges of values (e.g., outliers removed at 2 and 3 standard deviations from the mean). The groupby function easily bins data by labels or values for aggregation (think Excel pivot tables on steroids). Pandas also includes built-in basic plotting functions, which generate graphs using matplotlib under the hood.
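Here is a minimal sketch of that kind of workflow; the file name and the subject, condition and rt columns are hypothetical stand-ins for your own data:

import pandas as pd

# one row per trial, with subject, condition and rt columns (hypothetical file)
df = pd.read_csv('study_data.csv')

# drop trials more than n_std standard deviations from each subject's mean RT
def remove_outliers(group, n_std=3):
    mean, std = group['rt'].mean(), group['rt'].std()
    return group[(group['rt'] - mean).abs() <= n_std * std]

clean = df.groupby('subject').apply(remove_outliers)

# pivot-table-style summary: mean RT per subject and condition
summary = clean.groupby(['subject', 'condition'])['rt'].mean()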

Plotting
Numerous solutions for plotting data have been developed in Python; I will present two of the most versatile options besides Pandas: matplotlib and plot.ly. Matplotlib is a module that offers advanced plotting capabilities similar to MATLAB, in both 2-D and 3-D. For total control over plotting, matplotlib has the most options, but it can be difficult to navigate if you’re still new to Python. Describing the full potential of this module is well outside the scope of this blog, but the examples page offers a good idea of the possibilities.
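To give a flavor of the basics, here is a hedged sketch of a bar chart with error bars; all of the numbers are invented for illustration:

import matplotlib.pyplot as plt

# made-up condition means and standard errors
conditions = ['baseline', 'low', 'high']
means = [4.2, 5.1, 6.3]
errors = [0.3, 0.4, 0.35]

plt.bar(range(len(means)), means, yerr=errors)
plt.xticks(range(len(means)), conditions)
plt.ylabel('Mean response time (s)')
plt.show()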
Plot.ly is a Python-based platform for data management and plotting that combines easy user-interface options with code for custom plotting. There are numerous other options for plotting if neither of these meets your needs; a full list can be found here. I’ve found that Python can be especially useful for generating large numbers of similar plots. You can easily combine a for loop with a plot function to generate plots with the same axes, colors, error bars and settings for any number of groups or conditions. What previously took me hours to do by hand in Excel can now be accomplished in minutes.
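As a sketch of that loop-and-plot pattern (again with a hypothetical file and column names), something like this writes one identically formatted figure per condition:

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical long-format data with condition and score columns
df = pd.read_csv('study_data.csv')

for condition, group in df.groupby('condition'):
    fig, ax = plt.subplots()
    ax.hist(group['score'], bins=20)
    ax.set_xlim(0, 10)                       # identical axes across figures
    ax.set_xlabel('Score')
    ax.set_ylabel('Count')
    ax.set_title('Condition: %s' % condition)
    fig.savefig('scores_%s.png' % condition)
    plt.close(fig)                           # free memory between figures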

Bringing it all together
Over the past two years I’ve slowly been working to master Python, and discovering these great tools has been an essential part of growing my abilities. It took me a considerable amount of time to track down these resources, so I hope this blog helps jumpstart your own journey as a programmer. There are numerous free introductions and guides to Python available online if you’re looking for a place to get started: Codecademy, Google, & MIT.

Other Tools
os: Python’s built-in module for interacting with the operating system: moving and making directories; making, renaming, editing and deleting files; and locating files and directories by name or date. This is a particularly useful module for working with data stored on a remote server (see the sketch after this list).
sys: Python’s built-in module for interacting with the interpreter and manipulating system input and output (command-line arguments, stdout, etc.). When paired with os (and the subprocess module for running system commands), it provides Bash-like scripting capabilities.
collections: If you choose not to use Pandas, Python’s built-in alternative data structures are useful for working with matrix-like data. Ordered dictionaries are particularly handy.
csv: Python’s built-in tools for reading data in from (and writing it out to) delimited text files.
PsychoPy: This Python package can be used as a module or as a stand-alone application and is designed to manage stimulus presentation and task design (a Python-based E-Prime replacement). As anyone who has ever designed a task knows, this kind of software can be finicky. PsychoPy offers a stable way to generate tasks and has both a code-only and a GUI interface.
PyMVPA: A toolbox for multivariate pattern analysis (MVPA) of fMRI data.
NiBabel: A toolbox for reading fMRI datasets and file types (DICOM, NIfTI, BRIK/HEAD, etc.). There are also various add-ons for diffusion imaging and pipeline creation.
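As a quick illustration of os and csv working together, here is a hedged sketch that walks a hypothetical server directory and counts the rows in each subject’s data file (the paths and file names are invented):

import os
import csv

data_root = '/mnt/server/study_data'   # hypothetical remote mount

for subject in sorted(os.listdir(data_root)):
    path = os.path.join(data_root, subject, 'session1.csv')
    if not os.path.exists(path):       # skip subjects without this file
        continue
    with open(path) as f:
        rows = list(csv.DictReader(f)) # one dict per row, keyed by header
    print(subject, len(rows))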

Thanks for reading and please post any additional questions, comments or feedback!