Thursday, December 24, 2015

Set up a Python dev environment easily for data scientists

More than 3 years ago, I wrote about how to set up a Python 3 dev environment on Mac OS X. Back then, you needed to jump through several hoops to get the job done.

Now, driven by the needs of data science, Python has become the 4th most popular language (according to the TIOBE Index for December 2015), and there has been a lot of interesting work to improve the usability of the tools.

Based on the homework I did today, the easiest way to set up your Python dev environment is simply to use Anaconda. It is a great open source analytics platform from Continuum Analytics. It comes with tools such as conda, the package manager, and many popular Python libraries for data science. The company also offers cloud-based services for life cycle management of Python packages, notebooks, etc.

Fat Installation with Anaconda

Simply follow the instructions here: http://docs.continuum.io/anaconda/install. By default it installs to your home directory (~/anaconda), which can be customized with the installer. You need to add ~/anaconda/bin to your PATH if the installer does not patch your PATH environment setting.
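
If you are not sure whether the installer patched your PATH, a small check like the one below can tell you. This is only a sketch: the helper name dir_on_path is made up for illustration, and ~/anaconda/bin is the default location mentioned above, so adjust it if you installed elsewhere.

```python
import os
from pathlib import Path


def dir_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = Path(directory).expanduser()
    entries = [Path(p).expanduser() for p in path_value.split(os.pathsep) if p]
    return target in entries


if __name__ == "__main__":
    # Assumption: Anaconda was installed to the default ~/anaconda directory.
    conda_bin = "~/anaconda/bin"
    if dir_on_path(conda_bin):
        print(conda_bin, "is on PATH")
    else:
        print(conda_bin, "is NOT on PATH; add it in your shell profile")
```

If the check fails, add the directory to PATH in your shell profile and open a new terminal.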

To update your Anaconda installation, simply run:
>conda update anaconda

Conda is a great package manager for Python; more details on conda come later in the post.

So, what got installed? You can find the detailed list here. If you also need to work with R, you can install r-essentials by running:
>conda install -c r r-essentials

This installs "IRKernel and over 80 mostly used R packages including dplyr, shiny, ggplot2, tidyr, caret and nnet".

Slim Installation with Miniconda

If you do not want the fat installation from Anaconda, you can install Miniconda instead, which includes only Python and several essential packages. You can download the installer for your platform; see the instructions here.

Using Anaconda

With Anaconda or Miniconda installed, you are all set for development. Here are several quick notes that could help you have more fun.

conda, a package manager to rule them all

Conda is the command line package manager that solves a lot of issues with package and library management in Python. It is actually not just a package manager for Python: I even found NodeJS libraries there.

A quick list of features conda provides:
  • virtual environments: it enables you to create separate environments with different Python versions, sets of libraries, etc. This is what Virtualenv tries to provide, but conda makes it much easier.
  • package management
  • build and distribute packages: you can either use the Anaconda Cloud service, or easily host your own.

To learn more, check out the conda cheat sheet (PDF), read the official conda docs and watch the demo video (around 20 min, highly recommended).

Anaconda Cloud

Anaconda Cloud (previously known as Binstar) is a hosted package management service for notebooks, environments, conda and PyPI packages, etc. Several quick links:




IDE integration

Anaconda can be easily integrated with your favorite IDEs, as mentioned here. To be frank, I am not familiar with many Python IDEs. I mostly use either a text editor (such as Vim) or PyCharm from JetBrains.

The latest PyCharm already supports conda. All you need to do is add a new interpreter in preferences, and set it to your Anaconda python installation (e.g. ~/anaconda/bin/python) or the specific conda environment python installation (PyCharm supports both VirtualEnv and Conda env).
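
To confirm that PyCharm (or your shell) really picked up the interpreter you intended, you can run a short script like this from inside the project. It is a minimal sketch; the function name describe_interpreter is just for illustration.

```python
import sys


def describe_interpreter():
    """Collect a few facts about the interpreter running this script."""
    return {
        "executable": sys.executable,       # path of the python binary in use
        "prefix": sys.prefix,               # root of the installation or environment
        "version": sys.version.split()[0],  # e.g. "3.5.1"
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If the executable path points into your Anaconda directory or into the conda environment you selected, the interpreter is configured as expected.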


OK, that's about it. I hope you enjoy Anaconda and Python without the hassle of setting up a dev environment.

Tuesday, December 22, 2015

How Dropbox and Evernote file sync works

I spent some time trying to better understand how Dropbox and Evernote sync changes across different devices (mobile, desktop client, Web browser, etc.). Here are some of the interesting papers and articles that I found. Note that they might be a little outdated (you can see the timestamps).

Dropbox:


1. IMC'12 paper "Inside Dropbox: Understanding Personal Cloud Storage Services" by Idilio Drago et al. (Nov 2012).

The paper studies the Dropbox architecture by intercepting and analyzing its network traffic. It provides many insights into the control and data flows, service components, protocols, etc.

2. "Streaming File Synchronization" from Dropbox tech blog (July 2014).

This blog post gives an overview of file sync and presents a new stream-based sync mechanism that improves latency by up to 2x.

3. "Inside LAN Sync" from Dropbox tech blog (Oct 2015).

This blog post describes a new enhancement called LAN sync, which allows devices on the same network to share files without upload/download through Dropbox servers.
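
To make the general idea behind these references more concrete, here is a toy sketch of chunk-based sync: split a file into chunks, hash each one, and upload only the chunks the server does not already know. As far as I understand, the IMC'12 paper reports that Dropbox uses chunks of up to 4 MB identified by SHA-256 hashes, but treat the chunk size, the function names and everything else here as my assumptions, not as Dropbox's actual implementation.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size for this sketch


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Split bytes into fixed-size chunks and return one SHA-256 hex digest per chunk."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(local_hashes, known_hashes):
    """Return indexes of local chunks whose hash the server does not already store."""
    return [i for i, h in enumerate(local_hashes) if h not in known_hashes]


if __name__ == "__main__":
    # Tiny chunks (10 bytes) keep the demo readable.
    old = b"a" * 10 + b"b" * 10 + b"c" * 10
    new = b"a" * 10 + b"X" * 10 + b"c" * 10  # only the middle chunk changed
    server_has = set(chunk_hashes(old, chunk_size=10))
    print(chunks_to_upload(chunk_hashes(new, chunk_size=10), server_has))  # prints [1]
```

Only the changed chunk needs to travel over the network, which is why editing a small part of a large file can sync quickly.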

Evernote:


1. Core Concepts in the dev API doc, especially the Data Model, which defines various data structures that will be familiar if you use Evernote a lot, as I do.

2. Synchronization specification in dev doc: "Evernote Synchronization via EDAM" (Jan 2013)
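
As I understand the spec, the core idea is that the server stamps every change with an increasing update sequence number (USN), and a client remembers the last number it has seen so it can ask only for newer changes. The toy model below sketches that idea in memory. The ToyServer and ToyClient classes and their methods are hypothetical names of mine, not the EDAM API, and the real protocol handles much more (chunking, deletions, conflicts).

```python
class ToyServer:
    """In-memory stand-in for a sync server (hypothetical, not the EDAM API)."""

    def __init__(self):
        self.update_count = 0  # highest sequence number handed out so far
        self.notes = {}        # note id -> (usn, content)

    def save(self, note_id, content):
        """Store a note and stamp it with the next sequence number."""
        self.update_count += 1
        self.notes[note_id] = (self.update_count, content)

    def changes_after(self, usn):
        """Return (note_id, usn, content) for notes changed after `usn`, oldest first."""
        changed = [(nid, u, c) for nid, (u, c) in self.notes.items() if u > usn]
        return sorted(changed, key=lambda item: item[1])


class ToyClient:
    """Client that remembers the last sequence number it has synced."""

    def __init__(self):
        self.last_usn = 0
        self.notes = {}

    def sync(self, server):
        """Pull only what changed since the last sync; return the number of notes fetched."""
        if server.update_count == self.last_usn:
            return 0  # nothing new on the server
        changes = server.changes_after(self.last_usn)
        for note_id, _usn, content in changes:
            self.notes[note_id] = content
        self.last_usn = server.update_count
        return len(changes)


if __name__ == "__main__":
    server, client = ToyServer(), ToyClient()
    server.save("n1", "hello")
    server.save("n2", "world")
    print(client.sync(server))  # first sync fetches both notes
    server.save("n1", "hello again")
    print(client.sync(server))  # later sync fetches only the edited note
```

The nice property of this design is that the client never has to compare full copies of its data with the server; one number is enough to know where to resume.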


If you are interested, you can also look into FUSE (Filesystem in Userspace), which lets you implement a file system in user space. You can build your own Dropbox and Evernote ;-)