Tuesday, September 26, 2017

Probabilistic Data Structures and Stream Data Processing

The scale of modern data and the shift toward stream processing have given rise to many interesting data structures and algorithms.

Here are some good resources that cover topics like:

  • LogLog, HyperLogLog
  • Sketch (Count-Min, Count-Mean-Min, etc.)
  • Bloom Filter, Cuckoo Filter
  • t-digest, Q-digest, etc.
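To make one of these concrete: a Bloom filter answers set-membership queries with possible false positives but no false negatives. Here is a minimal sketch in Java (the class and the seeded hash mixing are my own simplification for illustration, not taken from any of the resources below):

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int k; // number of simulated hash functions

    public BloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Simulate k hash functions by mixing one hash with k different seeds.
    private int index(String item, int seed) {
        int h = item.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public void add(String item) {
        for (int i = 0; i < k; i++) bits.set(index(item, i));
    }

    // true means "possibly present" (may be a false positive);
    // false means "definitely not present" (never a false negative).
    public boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }
}
```

The trade-off is the classic one for this family of structures: constant space and time, in exchange for a tunable error rate (here controlled by the bit array size and the number of hash functions).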

"Probabilistic Data Structures for Web Analytics and Data Mining", a really nice overview from Highly Scalable Blog:

A comprehensive list of the papers, presentations and talks by debasishg

Stanford CS369G: Algorithmic Techniques for Big Data

MIRI Seminar on Data Streams (Spring 2015 Edition)

Thursday, December 8, 2016

Linksfest to Get Started with Apache Flink

Flink, another great data processing platform, has been a rising star this year. It is a high-performance stream and batch data processing platform, with fault-tolerant, scalable, distributed data stream computation at its core.

Here are several links and resources to get you started.

Company and Community
dataArtisans, the company behind Flink
Google Trends comparing Flink, Spark and Storm (Spark is still way more popular)


Introduction to Apache Flink, a book from Flink core developers; I highly recommend starting your Flink journey with it. It is a free download from MapR.

Flink in Action (MEAP, available in Spring 2017), the first chapter (PDF) is Free and gives a good overview.

Quick start guide

Talks and Videos

Alibaba slides on Blink, their fork of Flink. Alibaba is one of the biggest online e-commerce sites in China.

Performance and Benchmark


2016 Holiday Guide for Robot Toys

The holidays are just around the corner and it's time to order gifts from "Santa". This year, I decided to give my son something different, something other than candies, chocolates, Pokémon cards, Lego, etc.

He has already been exposed to basic programming concepts through code.org and Scratch. So..., how about a programmable robot for this Christmas? Sounds good.

After doing some research and comparison, we ordered the Ozobot Evo. Out of the box, it supports a color-code based language for various actions, e.g. follow the black line and move forward, stop on red, rotate on blue, and play colored lights and music. You can also customize the actions with a mobile app on a phone or tablet, in an environment similar to Scratch.

When he grows up a bit more, we might introduce Marty the Robot to him. It looks and works more like the robots we know about. It also teaches kids some real mechanical dynamics.

Here are some notes I took during the research; hope they are useful to you.

Ozobot Bit
only supports the colored line language
both 1.0 and 2.0 available on amazon (around $60)

Ozobot Evo
More advanced than the Bit: it supports the colored line language and a Scratch-like visual programming language, can be controlled with a mobile app, and supports social interactions with friends’ robots.
available on amazon (around $100):

available on amazon: $169.99

A playful companion, a robot that has personality, very cute!
available on amazon (around $300):

Marty the Robot
This is for more grown-up kids: a more maker-style robot, fully programmable, starting with Scratch and then moving to Python. The way they dance together looks so funny ;-)
not available yet; currently crowdfunding

A kids' education robot and companion, not that programmable.
The founder is from Shenzhen, China (around $230)

aido family robot
size of a toddler, family robot, assistant, voice control, helper, etc. Reminds me of Baymax in Big Hero 6 ;-)
available for pre-order (around $600): will ship in early 2017

Saturday, November 26, 2016

Java Concurrency Counters Benchmark

Java concurrency utilities have kept evolving and provide many different ways to achieve similar tasks. Recently, we had a task to implement a concurrent counter, which triggered my interest in comparing the different approaches and their performance under various read and write workloads.

The end result is a simple concurrent counter implemented in several ways.

The benchmark is implemented using JMH, the standard tool for reliable Java performance microbenchmarks. You can find several really nice tutorials on JMH in the References section.
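To give an idea of what is being compared, two of the simpler counter variants might look like the following sketch (the interface and class names here are mine for illustration, not necessarily those used in the repo):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

interface Counter {
    void increment();
    long get();
}

// Single shared atomic variable: every writer CASes on the same cell.
class AtomicLongCounter implements Counter {
    private final AtomicLong value = new AtomicLong();
    public void increment() { value.incrementAndGet(); }
    public long get() { return value.get(); }
}

// LongAdder spreads contention across internal cells and sums them on read,
// trading read cost for better write scalability under heavy contention.
class LongAdderCounter implements Counter {
    private final LongAdder value = new LongAdder();
    public void increment() { value.increment(); }
    public long get() { return value.sum(); }
}
```

This structural difference is exactly what the read-heavy vs. write-heavy results below reflect: AtomicLong reads are a single volatile load, while LongAdder reads must sum the cells.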

In my benchmark, there are write and read operations on the counter. A write takes 10ms and a read takes 2ms. I set the number of read and write threads to simulate different mixes of workload scenarios using JMH groups.

Both the source code and benchmark raw data, Excel sheets and visualizations can be found in the git repo: java-concurrency-counters-benchmark.

Here is a quick summary based on my experiment (I only ran 2 warmup rounds and 2 measurement rounds due to limited time):
  • AtomicLong and LongAdder have similar throughput overall. In read-heavy workloads, AtomicLong has better read and write throughput than LongAdder. In write-heavy workloads, LongAdder has slightly better write throughput.
  • A fair lock generally has lower throughput than a non-fair lock, but not always.
  • Consider using ReentrantLock or ReentrantReadWriteLock if you need high read throughput and the concurrency level is high.
  • StampedLock provides very good write throughput in all the read-write mixes; if write throughput is important to you, give it a try. If you also need comparatively good read throughput, try StampedLock's optimistic read: it has really good read throughput at high concurrency levels compared with a regular StampedLock read lock.
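For reference, an optimistic-read StampedLock counter might be sketched like this (my own minimal illustration; the benchmarked code in the repo may differ):

```java
import java.util.concurrent.locks.StampedLock;

public class StampedCounter {
    private final StampedLock lock = new StampedLock();
    private long count;

    public void increment() {
        long stamp = lock.writeLock();
        try {
            count++;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    public long get() {
        // Optimistic read: no blocking, no CAS on the read path.
        long stamp = lock.tryOptimisticRead();
        long current = count;
        if (!lock.validate(stamp)) {
            // A write intervened; fall back to a full read lock.
            stamp = lock.readLock();
            try {
                current = count;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return current;
    }
}
```

The optimistic path avoids blocking readers entirely when no write happens in between; validate() only forces the fallback to a real read lock when the stamp has been invalidated by a writer, which is why this variant scales so well for reads at high concurrency.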

Special thanks and references:

Tuesday, May 17, 2016

Learning Data Visualization

Data visualization provides insightful tools to visually analyze data, observe trends, compare data series, filter out noise, etc.

I spent some time learning several of the most commonly used JavaScript data visualization libraries. It is really exciting to turn monotonous numbers into beautiful charts.

Here is the git repo that has the sample charts I am playing with: https://github.com/guozheng/learn-dataviz

If you want to quickly create charts using existing chart types, I'd recommend either Highcharts or Google Charts. If you need to do heavy customization, or you need to create new chart types, then D3.js, NVD3, C3.js, and React D3 provide D3.js-based solutions that are very powerful and flexible.

Thursday, December 24, 2015

Set up a Python dev environment easily for data scientists

More than 3 years ago, I wrote about how to set up a Python 3 dev environment on Mac OS X. You needed to jump through several hoops to get the job done.

Now, driven by the needs of data science, Python has become the 4th most popular language (according to the TIOBE Index for December) and there has been a lot of interesting work to improve the usability of the tools.

Based on the homework I did today, the easiest way to set up your Python dev environment is simply to use Anaconda. It is a great open source analytics platform from Continuum Analytics. It comes with tooling such as conda, the package manager, and many popular Python libraries for data science. The company also offers cloud-based services for life cycle management of Python packages, notebooks, etc.

Fat Installation with Anaconda

Simply follow the instructions here: http://docs.continuum.io/anaconda/install. By default it installs to your home directory (~/anaconda), which can be customized with the installer. You need to add ~/anaconda/bin to your PATH if the installer does not patch your PATH environment setting.

To update your Anaconda installation, simply run:
>conda update anaconda

Conda is a great package manager for Python, more details on conda later in the post.

So, what got installed? You can find the detailed list here. If you also need to work with R, you can install r-essentials by running:
>conda install -c r r-essentials

This installs "IRKernel and over 80 mostly used R packages including dplyr, shiny, ggplot2, tidyr, caret and nnet".

Slim Installation with Miniconda

If you do not want to use the fat installation from Anaconda, you can also install Miniconda, which only includes Python and several essential packages. You can download the installer for your platform, see instructions here.

Using Anaconda

With Anaconda or Miniconda installed, you are all set for development. Several quick notes that could help you have more fun.

conda, a package manager to rule them all

Conda is the command line package manager that solves a lot of issues with package and library management in Python. It is actually not just a package manager for Python; I even found NodeJS libraries there.

A quick list of features conda provides:
  • virtual environments: conda lets you create separate environments with different Python versions, different sets of libraries, etc. Something virtualenv tries to provide, but much easier.
  • package management
  • build and distribute packages: you can either use the Anaconda Cloud service, or host your own easily.

To learn more, check out conda cheat sheet (PDF), read conda official doc and watch the demo video (around 20 min, highly recommend).

Anaconda Cloud

Anaconda cloud (previously known as Binstar) is a hosted package management service for notebooks, environments, conda and PyPI packages, etc. Several quick links:

IDE integration

Anaconda can be easily integrated with your favorite IDEs, as mentioned here. To be frank, I am not aware of that many Python IDEs. I mostly use either a text editor (such as Vim) or PyCharm from JetBrains.

The latest PyCharm already supports conda. All you need to do is add a new interpreter in preferences, and set it to your Anaconda python installation (e.g. ~/anaconda/bin/python) or the specific conda environment python installation (PyCharm supports both VirtualEnv and Conda env).

Ok, that's about it. Hope you enjoy Anaconda and Python without the hassle of dev environment setup.

Tuesday, December 22, 2015

How Dropbox and Evernote File Sync Works

I spent some time trying to understand better how Dropbox and Evernote sync changes across different devices (mobile, desktop client, web browser, etc.). Here are some of the interesting papers and articles I found. Note that they might be a little outdated (you can see the timestamps).


Dropbox

1. IMC'12 paper "Inside Dropbox: Understanding Personal Cloud Storage Services" by Idilio Drago et al. (Nov 2012).

The paper studies the Dropbox architecture and traffic by intercepting and analyzing traffic data. It provides many insights about various control and data flow, service components, protocols, etc.

2. "Streaming File Synchronization" from Dropbox tech blog (July 2014).

This blog post gives an overview of file sync and presents a new stream-based sync mechanism that improves latency by up to 2x.

3. "Inside LAN Sync" from Dropbox tech blog (Oct 2015).

This blog post describes a new enhancement called LAN sync, which allows devices on the same network to share files without upload/download through Dropbox servers.


Evernote

1. Core Concepts in the dev API doc, especially the Data Model, which defines various data structures that will be familiar if you use Evernote a lot, like me.

2. Synchronization specification in dev doc: "Evernote Synchronization via EDAM" (Jan 2013)

If you are interested, you can also look into FUSE, which is a user space filesystem abstraction. You can build your own Dropbox and Evernote ;-)