Thursday, December 24, 2015

Set up a Python dev environment easily for data scientists

More than 3 years ago, I wrote about how to set up a Python 3 dev environment on Mac OS X. You needed to jump through several hoops to get the job done.

Now, driven by the needs of data science, Python has become the 4th most popular language (according to the TIOBE Index for December), and there has been a lot of interesting work to improve the usability of the tools.

Based on the homework I did today, the easiest way to set up your Python dev environment is simply to use Anaconda. It is a great open source analytics platform from Continuum Analytics. It comes with tools such as conda, the package manager, and many popular Python libraries for data science. The company also offers cloud-based services for life cycle management of Python packages, notebooks, etc.

Fat Installation with Anaconda

Simply follow the instructions here: http://docs.continuum.io/anaconda/install. By default it installs to your home directory (~/anaconda), which can be customized with the installer. You need to add ~/anaconda/bin to your PATH if the installer does not patch your PATH environment setting.

To update your Anaconda installation, simply run:
>conda update anaconda

Conda is a great package manager for Python; more details on conda come later in the post.

So, what got installed? You can find the detailed list here. If you also need to work with R, you can install r-essentials by running:
>conda install -c r r-essentials

This installs "IRKernel and over 80 mostly used R packages including dplyr, shiny, ggplot2, tidyr, caret and nnet".

Slim Installation with Miniconda

If you do not want the fat installation from Anaconda, you can install Miniconda instead, which only includes Python and several essential packages. You can download the installer for your platform; see instructions here.

Using Anaconda

With Anaconda or Miniconda installed, you are all set for development. Here are several quick notes that could help you have more fun.

conda, a package manager to rule them all

Conda is the command line package manager that solves a lot of issues with package and library management in Python. It is actually not just a package manager for Python: I even found NodeJS libraries there.

A quick list of features conda provides:
  • virtual environments: it enables you to create separate environments with different Python versions, sets of libraries, etc. This is what Virtualenv tries to provide, but conda makes it much easier.
  • package management
  • build and distribute packages: you can either use the Anaconda Cloud service, or host your own easily.

To learn more, check out the conda cheat sheet (PDF), read the official conda doc and watch the demo video (around 20 min, highly recommended).

Anaconda Cloud

Anaconda Cloud (previously known as Binstar) is a hosted package management service for notebooks, environments, conda and PyPI packages, etc. Several quick links:




IDE integration

Anaconda can be easily integrated with your favorite IDEs, as mentioned here. To be frank, I am not familiar with many Python IDEs. I mostly use either a text editor (such as Vim) or PyCharm from JetBrains.

The latest PyCharm already supports conda. All you need to do is add a new interpreter in preferences, and set it to your Anaconda Python installation (e.g. ~/anaconda/bin/python) or to the Python installation of a specific conda environment (PyCharm supports both VirtualEnv and Conda env).


OK, that's about it. I hope you enjoy Anaconda and Python without the hassle of dev environment setup.

Tuesday, December 22, 2015

how Dropbox and Evernote file sync works

I spent some time trying to better understand how Dropbox and Evernote sync changes across different devices (mobile, desktop client, Web browser, etc.). Here are some of the interesting papers and articles that I found. Note that they might be a little outdated (you can see the timestamps).

Dropbox:


1. IMC'12 paper "Inside Dropbox: Understanding Personal Cloud Storage Services" by Idilio Drago et al. (Nov 2012).

The paper studies the Dropbox architecture by intercepting and analyzing its network traffic. It provides many insights into the various control and data flows, service components, protocols, etc.

2. "Streaming File Synchronization" from Dropbox tech blog (July 2014).

This blog post gives an overview of file sync and presents a new stream-based sync mechanism that improves latency by up to 2x.

3. "Inside LAN Sync" from Dropbox tech blog (Oct 2015).

This blog post describes an enhancement called LAN sync, which allows devices on the same network to share files without uploading and downloading them through Dropbox servers.

Evernote:


1. Core Concepts in the dev API doc, especially the Data Model, which defines various data structures that will be familiar to you if you use Evernote a lot, like me.

2. Synchronization specification in dev doc: "Evernote Synchronization via EDAM" (Jan 2013)


If you are interested, you can also look into FUSE (Filesystem in Userspace), which lets you implement a filesystem in user space. You can build your own Dropbox and Evernote ;-)

Sunday, September 27, 2015

order of event queue in Node.JS event loop

I was reading the book "Node.JS in Practice". In Chapter 2, Technique 14, it talks about the order in which I/O events, setImmediate, setTimeout/setInterval and process.nextTick are scheduled, as in the following screenshot from the book:


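However, it seems that the ordering is a bit different. From my test, it is more like: process.nextTick, setTimeout/setInterval, setImmediate. Note that the relative order of a timer and setImmediate scheduled from the main module depends on process timing, so this result may vary between runs and Node versions.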

Here is the sample test code and result (Node.JS v4.1.0):

setImmediate(function() {
    console.log("from setImmediate");
});

setTimeout(function() {
    console.log("from setTimeout");
}, 10);

process.nextTick(function() {
    console.log("from process.nextTick");
});

console.log("hello");

output:
hello
from process.nextTick
from setTimeout
from setImmediate
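
The order of setTimeout and setImmediate also depends on where they are scheduled. Here is a minimal sketch, assuming the calls are made inside an I/O callback (fs.stat on the current directory is used only as a convenient I/O operation, and the helper name orderInsideIoCallback is made up for this example). According to the Node.js event loop documentation, in that case setImmediate runs before a zero-delay timer, and process.nextTick still runs before both:

```javascript
const fs = require("fs");

// Hypothetical helper: schedules the three kinds of callbacks from inside
// an I/O callback and reports the order in which they actually ran.
function orderInsideIoCallback(callback) {
    const order = [];
    fs.stat(".", function () {
        // We are now inside an I/O callback (poll phase of the event loop).
        setTimeout(function () {
            order.push("setTimeout");
            callback(order); // the timer fires last, so report from here
        }, 0);
        setImmediate(function () {
            order.push("setImmediate");
        });
        process.nextTick(function () {
            order.push("process.nextTick");
        });
        order.push("sync");
    });
}

orderInsideIoCallback(function (order) {
    console.log(order.join(" -> "));
});
```

Run from a file, this should print the synchronous step first, then process.nextTick, then setImmediate, then setTimeout.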


Monday, September 21, 2015

A retouch of JavaScript and Node

It has been more than two years since I last touched JavaScript and Node. My current projects are mostly centered around Java. But it is always fun to come back and refresh my memory and knowledge about it.

It almost feels like traveling in a time machine: so many things have changed over the past two years. Node was forked into io.js, and ES6 has been released and is gaining traction. And there are new frameworks out each week. Frontend engineers must have had a tough time keeping up, and tons of fun as well.

Here I just keep some notes from reading the book "Node.js in Action". The book itself turned out to be a bit outdated, but many basics are still valid. There is also another, more recent Manning book, "Node.js in Practice".

Difference between exports and module.exports
exports is a reference to module.exports, so as long as you do not reassign it to something different, they are the same. require() returns module.exports, so reassigning exports has no effect on what the module exposes.

Module search sequence
Node looks for a node_modules directory in the current directory and then in each parent directory up to the filesystem root. If the module is still not found, it searches the directories listed in the NODE_PATH environment variable.
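
The directory walk can be sketched as follows. This is a simplified illustration, not Node's actual resolver, and the function name nodeModulesPaths is made up for this example. In a real module you can inspect module.paths to see the list Node itself uses:

```javascript
const path = require("path");

// Simplified sketch: list the node_modules directories that would be
// checked, starting at startDir and walking up to the filesystem root.
function nodeModulesPaths(startDir) {
    const dirs = [];
    let dir = path.resolve(startDir);
    while (true) {
        dirs.push(path.join(dir, "node_modules"));
        const parent = path.dirname(dir);
        if (parent === dir) {
            break; // reached the root, nothing further up
        }
        dir = parent;
    }
    return dirs;
}

console.log(nodeModulesPaths(process.cwd()));
```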

Module require search sequence
When the required path is a directory, Node first reads package.json to find the "main" definition, and falls back to index.js if there is none.

Flow control libraries
See this article for a discussion of how to handle async programming in JavaScript: How to survive asynchronous programming in JavaScript

unit test frameworks
should.js: assertion lib: https://github.com/tj/should.js/

functional test frameworks

Keep app running/restart


Node Web frameworks, so many of them
Hapi from walmartlabs: http://hapijs.com
A good intro about why Hapi was created: http://hueniverse.com/2012/12/20/hapi-a-prologue/

Kraken from paypal: http://krakenjs.com
Koa, next gen Node Web framework using ES6: http://koajs.com
LoopBack: http://loopback.io
Sailsjs, MVC pattern similar to RoR: http://sailsjs.org





Performance comparison of Express, hapi, and Restify: https://raygun.io/blog/2015/03/node-performance-hapi-express-js-restify/

Monday, April 20, 2015

Performance Testing with JMeter for REST Services - A Quick Start Guide

I started to pick up JMeter for a project that exposes a REST API. It is quite a versatile and popular performance and stress testing tool, and people have built plugins to extend it.

Official guide: https://jmeter.apache.org/index.html
Extra plugins: https://github.com/undera/jmeter-plugins
Two books by Bayo Erinle: Performance Testing with JMeter 2.9 (2nd edition is coming out soon) and JMeter Cookbook

1. Install JMeter and Plugins

If you are on a Mac, Homebrew is the easiest way. It installs both vanilla JMeter and the extra plugins:

brew install jmeter --with-plugins
==> Downloading https://www.apache.org/dyn/closer.cgi?path=jmeter/binaries/apache-jmeter-2.13.tgz
==> Best Mirror http://mirrors.sonic.net/apache/jmeter/binaries/apache-jmeter-2.13.tgz
######################################################################## 100.0%
==> Downloading http://jmeter-plugins.org/downloads/file/JMeterPlugins-Standard-1.2.1.zip
######################################################################## 100.0%
==> Downloading http://jmeter-plugins.org/downloads/file/ServerAgent-2.2.1.zip
######################################################################## 100.0%
==> Downloading http://jmeter-plugins.org/downloads/file/JMeterPlugins-Extras-1.2.1.zip
######################################################################## 100.0%
==> Downloading http://jmeter-plugins.org/downloads/file/JMeterPlugins-ExtrasLibs-1.2.1.zip
######################################################################## 100.0%
==> Downloading http://jmeter-plugins.org/downloads/file/JMeterPlugins-WebDriver-1.2.1.zip
######################################################################## 100.0%
==> Downloading http://jmeter-plugins.org/downloads/file/JMeterPlugins-Hadoop-1.2.1.zip
######################################################################## 100.0%
🍺  /usr/local/Cellar/jmeter/2.13: 1926 files, 115M, built in 74 seconds

If you are on another platform, do the following:

  1. Download JMeter from the official site.
  2. Download and install extra plugins based on your use cases. These plugins are grouped into several packages; see the detailed plugins package content page to decide what you need. I only installed "Standard Set" and "Extra Set". Installing these plugin packages is simply a matter of unzipping them into the installation directory of JMeter, e.g. they install jars into install_dir/lib/ext, or install_dir/bin, etc. See the plugin installation guide for more details and minor config changes. The download page has all the links for the plugin packages.
  3. ServerAgent-X.X.X.zip contains the PerfMon Server Agent that you need to run on the server under test. No special permissions are required to run the agent. Once the agent is running, you can use the PerfMon Metrics Collector Listener to connect to it and monitor various metrics for CPU, Memory, Swap, Disk and Network I/O, etc. See the documents for PerfMon Server Agent and Servers Performance Monitoring for more details.


2. Using JMeter and Plugins

JMeter runs in various modes: you can use it with or without a GUI client, and you can also set up remote test clients for distributed testing to simulate a more realistic workload and traffic pattern.

Here is a basic test of an HTTP GET request against a sample service API from geonames.org:

URL: http://api.geonames.org/citiesJSON?north=44.1&south=-9.9&east=-22.4&west=55.2&lang=de&username=demo
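
To make clear what a single sample of this test measures, here is a rough Node sketch of one timed GET against the same URL. It is only an illustration, not how JMeter works internally, and the helper names buildCitiesUrl and timedGet are made up for this example:

```javascript
const http = require("http");

// Build the request URL from its query parameters.
function buildCitiesUrl(params) {
    const url = new URL("http://api.geonames.org/citiesJSON");
    url.search = new URLSearchParams(params).toString();
    return url.toString();
}

// Issue one GET and report the status code and elapsed time in milliseconds.
// Not called automatically, so loading this file makes no network request.
function timedGet(url, callback) {
    const start = Date.now();
    http.get(url, function (res) {
        res.resume(); // drain the body, we only care about timing here
        res.on("end", function () {
            callback(null, { status: res.statusCode, ms: Date.now() - start });
        });
    }).on("error", callback);
}

const sampleUrl = buildCitiesUrl({
    north: "44.1", south: "-9.9", east: "-22.4", west: "55.2",
    lang: "de", username: "demo"
});
console.log(sampleUrl);
```

Calling timedGet(sampleUrl, function (err, result) { ... }) would perform one request. A load test repeats this many times from many concurrent threads and aggregates the timings.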

Here is the screencast (click on it). Note that the listeners whose names start with "jp@gc" come from the non-standard plugins we installed above.





3. Other Interesting Tools and Resources


gatling.io: another high performance open source load testing framework, built on Scala, Akka and Netty. It has a Scala-based DSL and nice integration with Jenkins.

yandex-tank: a load testing tool written in Python. For more details, check out its documentation.

BlazeMeter: a hosted performance testing service with which you can easily reuse your JMeter test scripts. It also provides integration with Jenkins CI/CD and supports mobile performance testing. Here is a quick screencast from its website:





Loadosophia: a service provided by BlazeMeter that stores and visualizes performance test results. The built-in visualization in JMeter is quite limited and non-interactive. This service makes analyzing performance data intuitive and fun. It supports test results from tools like JMeter, Apache Benchmark and Yandex.Tank. You can see examples provided publicly by existing users here: http://loadosophia.org/examples/

flood.io: a cloud load testing tool that supports JMeter and Gatling. Here is a sample report: https://flood.io/d384673f64e3a3


Sunday, February 22, 2015

Insights into the success of Storm

I just read through Nathan Marz's post about the history of Storm. It is a really nice recap of how to successfully start, grow and maintain a great open source project. Technical excellence is important, but marketing, community growth and adoption are even more critical.

Here is his post "History of Apache Storm and Lessons Learned".

I am really looking forward to his new book on the lambda architecture: "Big Data: Principles and best practices of scalable realtime data systems".