Stayman's Blog

A blog belonging to a DevSecOps consultant, hacker, and entrepreneur.

OpenCV Object Detection Tutorial - Getting Started by Training Your Own Car Detector

I mentioned visual recognition solutions in a previous blog post. However, I was not able to find a service that locates a specified object in an image, so I started my OpenCV journey. Well-trained classifiers for humans and faces are available. If you want to detect something else, you will probably end up training a classifier yourself. Unfortunately, I was not able to find a complete tutorial on how to train a classifier. If you run into the same situation, I hope this post can help you.

Visual Recognition / Computer Vision as of Feb 2016

Recently I did some research on computer vision, aka visual recognition, for a project. Surprisingly, I found out that this is yet another cutting-edge research field. Big companies are in this market. Meanwhile, new startups are also pushing the limits of the technology and trying to get a piece of it.

Build Your Own Search Engine - ElasticSearch, AWS CloudSearch and Sphinx - a Brief Comparison

Recently, I’ve built two search engines: 1) NodeJS + CloudSearch, and 2) Rails + ElasticSearch. I also have a third in progress, which uses Sphinx; for that one, I’m only working on the ops part. I think these are the most popular options for a search engine, and I’m going to share a few thoughts here. I’ll also briefly mention Solr: I didn’t try it, but it looks pretty good for heavy workloads.

All right. First of all, why even bother building a search engine? Take a look at this post. In short, natural language processing (NLP), information retrieval (IR), and search engines are very complex. How complex? Go to this course to learn more if you wish. You need something specialized for that. While relational databases also offer full-text search now, that’s still not their focus: they are neither optimized for search performance nor capable of processing complicated text search queries. Specialized search engines, on the other hand, have many features built in, and you can extend or customize features and behaviour via configuration and plugins. The only thing you lose by using a search engine is the ability to join tables. That could be a big deal depending on your use case.

ElasticSearch

Quite straightforward to set up via a Chef cookbook. An Ansible playbook is also available. It’s fairly easy to scale, although CloudSearch is certainly the simplest to set up, manage, and scale since it’s a fully managed service. ElasticSearch is open source, which means the software cost can be lower than CloudSearch. You can get started in a few minutes.

Another good thing about ES is that the ecosystem is rich: you have a variety of plugins for different languages and features. For example, for Chinese alone, there are five analyzers. There are also plug-and-play gems for Ruby on Rails. I used Searchkick. It integrates perfectly with ActiveRecord, and also works with ORMs that implement the ActiveModel interfaces. Super easy to use.

AWS CloudSearch

It’s fully managed. That means there’s no need for you to figure out how to configure the service, no need to worry about operations/maintenance, and no need to think about scalability. Besides, it integrates with certain other AWS services, S3 and DynamoDB to be specific. Fairly simple for ops.

However, the ecosystem is not quite mature yet. I could not find any easy-to-use Node package. There are some gems for Ruby, but they don’t look fully developed yet.

Sphinx

I don’t have much experience with this one yet. I spent a few hours trying to install it via Chef, but so far without success. That makes me feel that ES has a better user experience, at least for getting started.

Sphinx itself doesn’t actually take care of the data store; most likely you will still need to set up MySQL/PostgreSQL/Percona as its data store. That means you will need to manage two layers in order to use it. That also suggests it may not scale very well for huge amounts of data.

Despite those drawbacks, it’s still pretty good for dev. There is a gem available for Rails, Thinking Sphinx. It is similar to Searchkick: plug and play.

In addition, it can talk directly to MySQL (and other RDBMSs). That makes it a little easier to import documents if you already have some existing data.

One note about multi-language support

Regarding multi-language support, I had a hard time even making it work properly in ES. I didn’t try it with CloudSearch. The trouble started when someone created a document with certain Korean keywords, and searching for those keywords returned nothing.

The first thing that came to mind was an encoding/charset issue. However, it turned out that was not the case: everything was indeed encoded in UTF, and the charset was the same.

Later, after some research and asking a Korean speaker, I realized that Korean characters have two representations: composed and decomposed. There’s a gem, Gimchi, for that. You will understand after looking at the Gimchi example.

What I ended up doing was using a Unicode library to compose every string before putting it into the database, and composing every search query as well.
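The same idea can be sketched in Node with the built-in `String.prototype.normalize`. This is only an illustration, not the Ruby code used in the project: `normalizeForSearch` is a hypothetical helper name, and it assumes NFC (canonical composition) is the form wanted on both the indexing side and the query side.

```javascript
// Hypothetical helper: compose a string to Unicode NFC before it is
// stored or used as a search query, so both sides use the same form.
function normalizeForSearch(text) {
  return String(text).normalize('NFC');
}

// The syllable "한" written as three decomposed jamo (U+1112 U+1161 U+11AB)
const decomposed = '\u1112\u1161\u11AB';
// and as one precomposed syllable (U+D55C).
const composed = '\uD55C';

console.log(decomposed === composed);                     // false
console.log(normalizeForSearch(decomposed) === composed); // true
```

The two strings look identical on screen, which is why the mismatch is easy to mistake for an encoding problem.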

However, I’m still expecting issues once more languages get entered into the system. Ultimately, I’ll separate all the languages, because each language requires a separate processor to achieve optimized search results.

Some Resources Worth Checking Out Before You Start a NodeJS Project - Part 2

Part 1 is here

It has been a week since the first post. After seeing the application running in production, I’ve got some more to share.

The first thing is debugging and troubleshooting. That’s evil. As shared here: “Your debugging still largely remains as guess work.” This is 100% true. And the problem got even worse in prod.

In the dev phase, I saw a lot of stack traces like this:

Trace:
    at EventEmitter.<anonymous> (/project/src/routines/debug/exceptions.js:4:17)
    at EventEmitter.emit (events.js:88:20)

or this:

Error: Oh no! Event Error!
    at someAsyncHandler [as _onTimeout] (/Users/danielhood/Dev/Workspace/blab/routes/index.js:7:12)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)

Basically, in a nodejs codebase, you will have anonymous functions and async function calls everywhere. Usually, with those error messages, you don’t even know which line of your code is associated with the error. That sucks.

However, there are actually some solutions here. I wish I had read those tips before I started the project.
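One commonly used mitigation (an assumption here, not necessarily one of the tips referred to above) is to give callbacks a name, so the name appears as a frame in the stack trace. A minimal sketch, where `captureStack` and `loadUserProfile` are hypothetical names:

```javascript
// Returns the stack trace produced when `callback` throws.
function captureStack(callback) {
  try {
    callback();
  } catch (err) {
    return err.stack;
  }
  return '';
}

// Anonymous arrow function passed inline: no function name in the frame.
const anonymousStack = captureStack(() => {
  throw new Error('Oh no!');
});

// Named function expression: the name shows up as a stack frame.
const namedStack = captureStack(function loadUserProfile() {
  throw new Error('Oh no!');
});

console.log(namedStack.includes('at loadUserProfile'));     // true
console.log(anonymousStack.includes('at loadUserProfile')); // false
```

Naming does not fix traces that are cut off at an async boundary, but it makes the frames that do appear much easier to map back to the code.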

When the app got into prod, troubleshooting got tougher. The issue is that most app deployments offer you a catch-all error log, either from the web server (Apache, nginx, etc.), from the app server (php-fpm, Passenger, etc.), or from the framework (CakePHP, Rails, etc.). NodeJS, in contrast, has nothing out of the box. The app process will simply die when an uncaught error occurs, so there is nowhere to get the error log from. NewRelic couldn’t capture it either.
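A catch-all log can be approximated by hand with the `uncaughtException` and `unhandledRejection` events on `process`. The following is a minimal sketch under assumptions: `formatFatal` and `installCrashLogger` are hypothetical helper names, the log path is a placeholder, and the process still exits after logging so that a process manager can start a clean one.

```javascript
const fs = require('fs');

// Hypothetical helper: turn an error into one timestamped log entry.
function formatFatal(kind, err) {
  const detail = err instanceof Error ? err.stack : String(err);
  return `[${new Date().toISOString()}] ${kind}: ${detail}\n`;
}

// Hypothetical installer: write the error synchronously, then exit so a
// process manager can start a fresh process.
function installCrashLogger(logPath) {
  const handle = (kind) => (err) => {
    fs.appendFileSync(logPath, formatFatal(kind, err));
    process.exit(1);
  };
  process.on('uncaughtException', handle('uncaughtException'));
  process.on('unhandledRejection', handle('unhandledRejection'));
}

// Usage (the path is a placeholder):
// installCrashLogger('/var/log/myapp/crash.log');
```

The write is synchronous on purpose: the process is about to exit, so an async write might never complete.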

And actually, that’s why I like Ruby and Rails. Even if you are a newbie, you probably won’t mess the app up too badly. By simply choosing the framework, you are covered against many pitfalls you would encounter in the future, even ones you might not be aware of. For example, CSRF protection, asset caching, etc.

OK, here comes the second issue. NodeJS, by default, runs in single-process, single-thread mode. Other languages and frameworks are actually the same. The difference is that, when you deploy a non-nodejs app, the web server or app server usually spawns and manages multiple processes/threads for you. With nodejs, you have no process manager by default. Therefore, you need to spawn processes yourself to utilize multiple CPU cores. This is implemented in the cluster module. You will also need a process manager to oversee your application processes and restart them when one fails. You have several choices here: nodemon, pm2, and forever.
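As a rough illustration of what the built-in `cluster` module does, here is a minimal sketch that forks one worker per CPU core and replaces workers that exit. `workerCount` and `startCluster` are hypothetical names, and `./app` in the usage line is a placeholder for the real entry point.

```javascript
const cluster = require('cluster');
const os = require('os');

// Hypothetical helper: one worker per CPU core, at least one.
function workerCount(cpuCount) {
  return Math.max(1, cpuCount);
}

// Hypothetical entry point: the primary process forks workers and
// replaces any that exit; each worker runs `runWorker`.
function startCluster(runWorker) {
  // `isPrimary` exists on newer Node versions, `isMaster` on older ones.
  const isPrimary = cluster.isPrimary !== undefined
    ? cluster.isPrimary
    : cluster.isMaster;

  if (isPrimary) {
    const count = workerCount(os.cpus().length);
    for (let i = 0; i < count; i += 1) cluster.fork();
    cluster.on('exit', (worker, code, signal) => {
      console.error(`worker ${worker.process.pid} exited (${signal || code}), restarting`);
      cluster.fork();
    });
  } else {
    runWorker();
  }
}

// Usage (placeholder module path):
// startCluster(() => require('./app'));
```

A process manager such as pm2 can take over this forking and restarting for you, so a hand-rolled version like this is mostly useful for understanding what is going on.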

As a result, I encountered a third problem in prod. My app primarily consists of a bunch of background workers, and I used AWS OpsWorks to configure the NodeJS layer. Ever since I launched the app, I saw more and more tasks getting stuck in my task queue with active status. Since I had no error log, and I couldn’t find anything in NewRelic, it was total guesswork.

At first, I thought it was because of uncaught exceptions. I spent a whole day refactoring all the code, trying to wrap any suspect code that could throw an error, and adding error logs wherever essential.

On the second day, the task queue was still stuck. Then I thought, OK, maybe the tasks were actually processed fine, but the redis machine that stores the task queue got overloaded, since several thousand tasks get processed every 10 minutes. Therefore, I tried stopping all non-essential tasks. Unfortunately, that didn’t help.

My third thought: maybe somewhere in the code, especially in async functions, something throws an error and I didn’t wrap it? But I had no idea how I could verify or test that hypothesis.

And finally, on the third day, I thought, maybe the worker process got terminated/restarted in the middle of a task. I found this post. OpsWorks actually pings port 80 every minute as a health check for the nodejs layer. Because my workers did not listen on port 80, they were considered unhealthy and got restarted every minute. AWS really should mention that in their starter’s guide. In the end, I added a health check endpoint for the workers, listening on port 80. The issue got resolved, and I could take a rest.
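A health check of this kind can be very small. The sketch below is an assumption about its shape, not the code from the project: `healthHandler` and `startHealthServer` are hypothetical names, and it simply answers every request with 200 using the built-in `http` module.

```javascript
const http = require('http');

// Answers every request with 200 so the health check passes.
function healthHandler(req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok');
}

// Hypothetical helper: start the health endpoint next to the worker loop.
function startHealthServer(port) {
  return http.createServer(healthHandler).listen(port);
}

// Usage: binding to port 80 usually needs elevated privileges.
// startHealthServer(80);
```

A stricter version could report unhealthy when the worker loop has stalled, but even this trivial responder is enough to stop a port-based check from restarting the workers.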

I hope this post and part 1 can help people who are planning to write an app in node or deploy nodejs on AWS with OpsWorks.

Some Resources Worth Checking Out Before You Start a NodeJS Project - Part 1

Part 2 is here

Last week I rushed through a nodejs project. It’s my first nodejs project launched in production. I learned a lot about nodejs and got a deeper grasp of it. In this post I want to share my experience and my understanding so far. I’m also going to list a bunch of resources that I went through and feel are worth reading.

First of all, to get started with JavaScript, I’d recommend reading the book JavaScript: The Definitive Guide. I think it can serve as both a beginner tutorial and an advanced guide. It basically covers everything about the language, from lexical structure and syntax to the underlying concepts and mechanisms. I do agree that JS is a horribly designed language. It’s easy to start writing a functioning program without a good understanding, but it’s quite tricky for beginners to comprehend how the language actually works. This book is even worth reading a second time, probably after writing a good amount of JS code.

Ready to dive deeper into this language? Go to this book: Node.js Design Patterns. It covers the culture and philosophy of the community behind the language. It also clears up some very essential ideas that people need to deal with in practical tasks, for example the callback pattern, sync vs async, etc. I’d say this language goes very much against human nature. Although good design patterns (even in other languages) often require the developer to go a little wicky wacky, NodeJS requires you to think in a totally opposite way. If you code in a straightforward manner, I bet it will become a mess within 100 lines.

The very first thing you want to avoid is Callback Hell. It will simply make your code unreadable and unmanageable. Try every means to avoid it. I created several callback hells at the beginning of the project and ended up with hours and hours of work to refactor them.

The second thing is async vs sync. You will see callbacks everywhere if you write in JS. However, you usually cannot tell which ones are sync and which ones are async. At least, I couldn’t. At the beginning, I thought all functions that take a callback were async functions. That’s simply not true. One reason that drove me to use Node for this project is performance. The problem is that if you are not using async functions, you lose the benefit of using the language. So watch out for that.
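A small sketch makes the difference visible. `Array.prototype.forEach` takes a callback but calls it synchronously, while `setImmediate` defers its callback until after the current code has finished. `traceCallbacks` is a hypothetical name used only for this illustration.

```javascript
// Records the order in which callbacks run, and returns a snapshot
// taken before the deferred callback has had a chance to run.
function traceCallbacks() {
  const order = [];

  // forEach takes a callback but runs it synchronously.
  [1, 2].forEach((n) => order.push(`forEach ${n}`));

  // setImmediate queues its callback; it runs after this function returns.
  setImmediate(() => order.push('setImmediate'));

  order.push('end of function');
  return order.slice();
}

console.log(traceCallbacks());
// [ 'forEach 1', 'forEach 2', 'end of function' ]
```

Both calls take a callback, yet only the second one gives the event loop a chance to do other work in between.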

All right. There are still some good things about Node.

First, nvm (node version manager) and npm (node package manager). If you come from the Ruby world, those are your old friends. They are life savers that deliver you from Dependency Hell. It’s worth mentioning that it’s recommended to install nvm and npm as a non-root user; your app doesn’t need that privilege. Avoid installing packages with the -g switch. Include your ./npm/bin in your $PATH instead.

Next, there are a lot of packages available in npm. I believe the number is much greater than that of gems in Ruby (check it out yourself). Gems are rich in the webapp field, especially Rails + RDBMS. However, it looks like npm packages cover everything. For this project, I needed to handle natural language processing and couldn’t find many gems for it. I struggled for quite a while over whether to choose Ruby or Node.

A list of popular npm packages - https://www.npmjs.com/browse/star. I ended up using around half of them in this project.

Web frameworks - express, restify, hapi, sails and loopback. For my project, the frontend is pretty simple, so I chose Express. Rails is still my favorite, though.

A list of natural language processing packages. There are more not listed here. This is awesome.

* wordnet.js
* moby
* pos
* natural
* gingerbread

That’s pretty much it for now. I wish I could write more when I have some time. Hope you enjoy the post.

A Naive Benchmark for a Click Count Application: NodeJS + Cassandra vs ROR + MySQL 简单性能对比(基于一个点击计数应用)

This post is about a naive benchmark: NodeJS + Cassandra vs Ruby on Rails + MySQL. There are plenty of benchmarks in the wild (db/web). However, it’s still difficult to decide which stack to use for your application. Actually, it heavily depends on the application/use case/scenario. This benchmark is based on a very simple scenario: a click count application. This benchmark is very naive! Be aware of that when you use it as a reference.

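This post is a performance comparison of NodeJS + Cassandra vs Ruby on Rails + MySQL. There are actually many comparisons like this online (databases/web frameworks). But even after reading them, I still found it hard to decide which product to use for which application. Everyone argues from a different angle, and it’s hard for anyone doing a comparison to be proficient in all the products, so some bias is inevitable; some comparisons may even be suspected of being advertisements. In any case, performance comparisons depend heavily on the application/use case/scenario. My simple comparison is based on a simple click count use case. (This comparison is very preliminary and for reference only; I may dig deeper into it later.)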

Practical Linux - Part 19 - Final | 期终