Thursday, December 24, 2015

Set up a Python dev environment easily for data scientists

More than 3 years ago, I wrote about how to set up a Python 3 dev environment on Mac OS X. Back then, you needed to jump through several hoops to get the job done.

Now, driven by the needs of data science, Python has become the 4th most popular language (according to the TIOBE Index for December), and there has been a lot of interesting work to improve the usability of the tools.

Based on the homework I did today, the easiest way to set up your Python dev environment is simply to use Anaconda. It is a great open source analytics platform from Continuum Analytics. It comes with tools such as conda, the package manager, and many popular Python libraries for data science. The company also offers cloud-based services for life cycle management of Python packages, notebooks, etc.

Fat Installation with Anaconda

Simply follow the instructions here: http://docs.continuum.io/anaconda/install. By default it installs to your home directory (~/anaconda); the location can be customized in the installer. If the installer does not patch your PATH environment setting, add ~/anaconda/bin to your PATH yourself.
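If you are not sure whether the installer patched your PATH, a quick check like the one below can tell you. This is only a minimal sketch: `is_on_path` is a hypothetical helper name, and `~/anaconda/bin` is the default location mentioned above, so adjust it if you installed somewhere else.

```python
import os


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of a PATH-style string.

    Hypothetical helper for illustration; it is not part of Anaconda.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def normalize(p):
        # Expand "~" and drop trailing slashes so equivalent paths compare equal.
        return os.path.normpath(os.path.expanduser(p))

    target = normalize(directory)
    return any(normalize(entry) == target
               for entry in path_value.split(os.pathsep) if entry)


# Assumption: Anaconda was installed to the default ~/anaconda location.
anaconda_bin = "~/anaconda/bin"

if is_on_path(anaconda_bin):
    print("Anaconda is already on your PATH.")
else:
    # Line to add to a shell startup file such as ~/.bash_profile.
    print('export PATH="$HOME/anaconda/bin:$PATH"')
```

Putting the Anaconda directory first in PATH means its python and conda are found before any system-wide ones.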

To update your Anaconda installation, simply run:
>conda update anaconda

Conda is a great package manager for Python; more details on it later in the post.

So, what got installed? You can find the detailed list here. If you also need to work with R, you can install r-essentials by running:
>conda install -c r r-essentials

This installs "IRKernel and over 80 mostly used R packages including dplyr, shiny, ggplot2, tidyr, caret and nnet".

Slim Installation with Miniconda

If you do not want the fat installation from Anaconda, you can install Miniconda instead, which includes only Python, conda, and a few essential packages. Download the installer for your platform; see instructions here.
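Since Miniconda ships only the essentials, it can help to check which libraries the current interpreter can actually import. The snippet below is a rough sketch: `missing_packages` is a hypothetical helper, and the list of names is just an illustration of common data science libraries.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Illustrative list; swap in whatever your project needs.
wanted = ["numpy", "scipy", "pandas", "matplotlib"]

missing = missing_packages(wanted)
if missing:
    # With Miniconda you would add these yourself.
    print("conda install " + " ".join(missing))
else:
    print("All requested packages are available.")
```

Note that the import name and the conda package name are the same for these four libraries, but that is not true for every package, so treat the printed command as a starting point.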

Using Anaconda

With Anaconda or Miniconda installed, you are all set for development. Here are several quick notes that could help you have more fun.

conda, a package manager to rule them all

Conda is the command line package manager that solves a lot of issues with package and library management in Python. It is actually not just a package manager for Python; I even found Node.js libraries there.

A quick list of features conda provides:
  • virtual environments: it enables you to create separate environments with different Python versions, sets of libraries, etc. This is what Virtualenv tries to provide, but conda makes it much easier.
  • package management
  • build and distribute packages: you can either use the Anaconda Cloud service or host your own easily.

To learn more, check out the conda cheat sheet (PDF), read the official conda docs, and watch the demo video (around 20 min, highly recommended).
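To make the virtual environment feature a bit more concrete, here is a small sketch that assembles a `conda create` command for a separate environment. The helper name `conda_create_command`, the environment name `py35`, and the package list are all made up for illustration; only the `conda create` options `--yes` and `--name` come from conda itself.

```python
import shlex


def conda_create_command(env_name, python_version, packages=()):
    """Build the argument list for `conda create` (hypothetical helper)."""
    return ["conda", "create", "--yes", "--name", env_name,
            "python=" + python_version, *packages]


# Hypothetical environment: Python 3.5 with two libraries.
command = conda_create_command("py35", "3.5", ["numpy", "pandas"])

# Print a copy-and-paste version of the command:
# conda create --yes --name py35 python=3.5 numpy pandas
print(" ".join(shlex.quote(arg) for arg in command))
```

You can paste the printed line into a terminal, or pass the list to `subprocess.run(command, check=True)` if conda is on your PATH. Depending on your conda version, you then activate the environment with `source activate py35` or `conda activate py35`.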

Anaconda Cloud

Anaconda Cloud (previously known as Binstar) is a hosted package management service for notebooks, environments, conda and PyPI packages, etc. Several quick links:




IDE integration

Anaconda can be easily integrated with your favorite IDEs, as mentioned here. To be frank, I do not know that many Python IDEs. I mostly use either a text editor (such as Vim) or PyCharm from JetBrains.

The latest PyCharm already supports conda. All you need to do is add a new interpreter in the preferences and point it to your Anaconda Python installation (e.g. ~/anaconda/bin/python) or to the Python installation of a specific conda environment (PyCharm supports both Virtualenv and conda environments).
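To double check that the IDE really picked up the interpreter you intended, you can run a few lines like these from inside PyCharm. This is a sketch under two assumptions: that a conda installation or environment contains a `conda-meta` folder under its prefix, and that conda sets the `CONDA_DEFAULT_ENV` variable when an environment is activated. `interpreter_info` is a hypothetical helper name.

```python
import os
import sys


def interpreter_info(prefix=None):
    """Summarize where the running Python lives (hypothetical helper)."""
    if prefix is None:
        prefix = sys.prefix
    return {
        "executable": sys.executable,
        "prefix": prefix,
        # Conda keeps package metadata in a "conda-meta" folder inside
        # each installation or environment.
        "is_conda": os.path.isdir(os.path.join(prefix, "conda-meta")),
        # Set by conda when an environment is activated; may be None.
        "active_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in interpreter_info().items():
    print(f"{key}: {value}")
```

If `executable` points somewhere under ~/anaconda (or your environment's folder) and `is_conda` is True, the interpreter setting is doing what you expect.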


OK, that's about it. I hope you enjoy Anaconda and Python without the hassle of dev environment setup.
