Web Application Scalability Part-3 ES vs Redis

This is the third article in the Web Scalability series. These days I am looking at various projects going on in Indian start-ups, and I am thrilled by the variety of technologies being used.

While this sounds good, at the same time I feel we are unable to utilize the core strengths of the technologies we use.

How they are used in an architecture will always vary from person to person, depending on experience and comfort level. But some simple things, if adopted at an early stage, can give immense benefits.

One such technology is our humble cache, which is being used everywhere but, alas, not in the right way.

It’s called Redis. The kind of benefit this humble piece of software can bring while building a highly scalable website is awesome.

I have seen many startups using Elasticsearch as a “cache”. I cannot imagine anyone doing this unless they are totally nuts. ES is an indexing technology, not a cache. It does not update in real time, which is exactly what one expects from a cache; it is never fully in sync with the underlying storage and hence almost always has stale data.

Using it as the primary data source for anything is simply insane and shows the immaturity of the architect who chose it in the first place. It will not work except for the most trivial projects.

Anyway, ES has its niche, but it is not a cache; Redis is. Redis provides many advanced data structures and can do a hell of a lot in the caching layer itself.

One needs to take a hard look at the various data structures it provides and how they can be used to build a “real” caching layer, one that gives you the same output you were hoping to get from ES out of the box.

It has sets/hashes/sorted sets/lists etc. etc.
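For instance, a sorted set can answer the kind of “top N” or “most viewed” queries people often reach for ES to serve. Here is a minimal sketch using the Jedis client (the key and member names are made up for illustration):

    import redis.clients.jedis.Jedis

    object ProductViewsSketch extends App {
      val jedis = new Jedis("localhost", 6379)

      // bump a product's score every time it is viewed
      jedis.zincrby("products:by:views", 1.0, "product:42")

      // fetch the ten most viewed products, highest score first
      val topTen = jedis.zrevrange("products:by:views", 0, 9)
      println(topTen)

      jedis.close()
    }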

You can keep putting data in at a rate your site might never experience, and retrieve it at blazing speed.

Most importantly, its single-threaded model brings the consistency one needs for creating distributed counters.
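Here is a tiny sketch of such a counter, again assuming the Jedis client and a made-up key name. Because the Redis server executes commands one at a time, INCR calls coming from many application servers can never lose an update:

    import redis.clients.jedis.Jedis

    object DistributedCounterSketch extends App {
      val jedis = new Jedis("localhost", 6379)

      // INCR is atomic on the server, so concurrent app servers
      // incrementing the same key never step on each other
      val signups = jedis.incr("counters:signups")
      println(s"signups so far: $signups")

      jedis.close()
    }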

Think about it, and if you are not sure how it can be used for your use case then, as always, drop me an email. Who knows, I may find your use case interesting enough to help you out.


Akka – A practical Example

This problem is a real-life scenario from a production application and shows how Akka can provide highly scalable solutions to complicated problems.

Background

1.    Channels are a mechanism for broadcasting messages.
2.    Messages can be sent a variety of ways, including via SMS.
3.    People follow one or more channels, and receive messages when channels they follow broadcast a message.
4.    Some followers have email, some have SMS, some have both.
5.    Channels can have a phone number so that when SMS messages are sent from a channel, the user can see a phone number for that channel. The user can respond and we know to which channel they are responding because the phone number is associated with that channel.
6.    Phone numbers cost money, and so they are a scarce resource. If they were free, every channel would be assigned a discrete phone number upon creation.
7.    For this reason, more than one channel can use the same number. This is fine, as long as no two channels with the same phone number are followed by the same person. If one person is following two channels that both use the same phone number, then when that person sends an SMS message to that shared phone number, we won’t know which channel they are communicating with. When one person is following two channels that both have the same phone number, this is called a “Collision”.
8.    If we don’t allocate enough phone numbers, then a person who is following more than one channel could end up following two channels that share the same number, i.e. a collision.
9.    Numbers are allocated on demand, that is, the first time that a broadcast for that channel needs to go via SMS.
10.  If phone number X is being used for both Channel A and Channel B, and no follower of one follows the other, then we are fine. As soon as a follower of A starts to follow B, we must change the phone number of one of the channels.
11.  When the phone number of a channel changes, we need to communicate to the channel users that the number has changed. It is very disruptive to change a number, and so we must minimize the number of users that are impacted. For this reason, we must take into consideration the number of recent SMS messages each channel has sent to determine which channel will be impacted the least by a change, and choose that channel to change the number.
Data model
import java.util.UUID

case class Channel(id: UUID, name: String, phoneNumber: String)
case class User(id: UUID, name: String)
case class Following(channelId: UUID, userId: UUID)
case class PhoneNumber(number: String)
Don’t worry about persistence for now; maybe just use a mutable List to store the collections of objects.
You can populate the data store with random data.
If you find that this data model is insufficient, please augment it as needed. Also, this model is missing broadcasts and broadcast statistics. If you need data that is not represented here, as in the case of broadcast statistics, create placeholder functions that return random data for now.
Objective
1.    Build a mechanism in Scala that assigns phone numbers to channels and changes them as needed.
2.    Minimize the number of phone numbers allocated.
3.    Do not allow collisions.
4.    Minimize the impact of a channel phone number change.
5.    Minimize the number of potential channel phone number changes by balancing the phone number allocations such that the least used phone numbers are the first to be allocated when a new channel needs a number.
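To give a flavour of the shape a solution might take, here is a minimal, hypothetical sketch of the allocation mechanism as an Akka actor. The message names, the number pool and the naive allocation logic are all made up for illustration; the real solution (collision checks, impact scoring, rebalancing) lives in the repository below:

    import akka.actor.{Actor, ActorSystem, Props}
    import java.util.UUID

    // hypothetical protocol messages
    case class AssignNumber(channelId: UUID)
    case class NumberAssigned(channelId: UUID, number: String)

    class PhoneNumberAllocator extends Actor {
      // channelId -> assigned number; a mutable map stands in for real persistence
      private var assignments = Map.empty[UUID, String]
      private val pool = scala.collection.mutable.Queue("+15550001", "+15550002")

      def receive: Receive = {
        case AssignNumber(channelId) =>
          // an actor processes one message at a time, so allocation
          // decisions are serialized and cannot race with each other
          val number = assignments.getOrElse(channelId, {
            val n = pool.dequeue() // naive: the real logic must check for collisions first
            assignments += channelId -> n
            n
          })
          sender() ! NumberAssigned(channelId, number)
      }
    }

    object AllocatorApp extends App {
      val system = ActorSystem("phone-numbers")
      val allocator = system.actorOf(Props[PhoneNumberAllocator], "allocator")
      // fire-and-forget from outside an actor; a real caller would be an actor or use ask
      allocator ! AssignNumber(UUID.randomUUID())
    }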
Here is the repository you can download source code from

Flipkart – Big Billion day and Scalability problems

Recently Flipkart created quite a buzz in the Indian e-commerce industry. They sold Rs 600 crore worth of goods in a matter of 10 hours, as per their official statement. I want to focus on the tech side, since I fail to understand how they keep running into outages and what they are doing to fix them; it has happened many times. This time it was more prominent, but I hear from a lot of people that they are down every now and then.

Today technologies are so advanced that building an e-commerce website is not considered a big deal anymore. India has seen hundreds of e-commerce websites mushrooming over the last few years with the help of investors. You can build a store in a matter of a few days on Shopify, and even if you want to build your own you don’t have to build a payment gateway or search or a cart or networking middleware; readily available pieces of software will take you a long way. There is hardly any category for which a “specialized” website does not exist.

Coming back to a billion hits in a day on 5000 servers: that surprised me, so I decided to do a quick analysis of what might have happened, based on my past experience with e-commerce websites. Most of these websites, when they started, did not expect growth at such a rapid pace, hence they were built using a web framework, either Java based or PHP based, as a single “monolithic” chunk of software. Admin functions, inventory management, order processing, payment gateway, wishlist, search, email campaigns, all built into one big piece of software. Phew. We really can’t blame them, as Twitter was also one big Ruby on Rails application at one time. The difference is that unlike Twitter, these websites, led by less experienced people, were not able to embrace a distributed architecture, SOA based or otherwise. They continued building class after class in the same codebase and were unable to keep a balance between good software design and features. There are a couple of reasons for that:

a. Most of these websites are led by people in their late 20s or early 30s.

b. Their technical leads, and their CTOs, are far less experienced compared to companies like Amazon. Hence they are not able to enforce agile practices like TDD, or pick the most suitable stack for the job, because they themselves never went through that process.

c. They are afraid of changing things because they are afraid of breaking something which is already working.

d. Inexperienced in tech and generously funded by venture capitalists, they keep hiring people and want more head count to build new features quickly and support them, since they know that a new e-commerce website can be launched pretty easily today.

Anyway, let us focus on a technical analysis of the Flipkart issue. They issued a statement that they received 1 billion hits in one day, which comes down to roughly 12,000 hits per second on average. If we assume that two thirds of those hits came in the first 8 hours, the peak is about 24,000 hits per second. Spread across 2,000 servers, that is on the order of 12 hits per second per server. Today’s state-of-the-art commodity hardware can handle ten times that traffic, provided you don’t write bad software. Amazon’s conversion rate is around 7%, i.e. 7 out of 100 visitors who come to the website end up checking out. So the other 93 are just browsing products, either using search or by category. The key is to serve this traffic from different servers than the ones handling the other functions.
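To make the back-of-the-envelope numbers above concrete:

    object TrafficMath extends App {
      val hitsPerDay        = 1000000000L
      val peakWindowHits    = hitsPerDay * 2 / 3   // assume two thirds arrive in the first 8 hours
      val peakWindowSeconds = 8 * 3600
      val hitsPerSecond     = peakWindowHits / peakWindowSeconds
      val servers           = 2000

      println(s"average hits/sec: ${hitsPerDay / 86400}")      // ~11,574
      println(s"peak hits/sec:    $hitsPerSecond")              // ~23,148
      println(s"per server:       ${hitsPerSecond / servers}")  // ~11-12
    }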
So SOA is the only solution when a site grows big, but you need to build an infrastructure where you can roll out services and define a separate SLA for each of them. This infrastructure should support creation, consumption, discovery and seamless deployment. It will take time, experience and hard work, while making sure every person is on the same page. It will also reduce tech cost in the long term, provided you are aiming for that.

I recently came across an interview with the Flipkart CEO where he claims that their architecture is a well designed SOA model just like Amazon’s, but that there are many complex algorithms in their system which kick in only in specific scenarios and do not crop up during regular stress testing.

SOA is easier said than done. The biggest problem is the S (Service). How do you identify that a certain piece of code should be packaged as a separate service and not be part of an existing service? The guiding light, according to me, is the following questions:

a. Do you believe that in future you would want to maintain and evolve this code separately?

b. Do you think it is going to change very frequently, and is it going to be used by other services?

c. Can this piece of code affect the SLA of existing services significantly? That is, is it a complex piece of algorithm that will make the performance of existing services unpredictable when it is on the execution path?

If the answer to any of these questions is yes, then one should wrap that piece of code in a new service. Now, the problem is that many a time, while starting the project, you are not sure about the answers to these questions. Some day the business requirement changes and you find out it would have been better to have created a new service for it. But the code is a problem, because it is not designed in a way that lets you easily abstract it out and put it in a new service, so you are doomed. This is exactly the nature of enterprise software, and that is why experts are always looking for new ways of developing software, whether through new libraries or by going to the extent of creating a new language, so that problems can be addressed before they even appear.

Functional programming and tools like Play Framework and Akka offer a lot to developers today and can be leveraged to create a truly service-oriented architecture; if their programming paradigm is followed in the true sense, then Akka will help you carve a distributed architecture out of a monolithic piece of code in a much less painful way. One just needs to be brave enough to understand these tools and be ready to take the risk. In my next article I will show an e-commerce use case which does exactly this, i.e. without changing existing code, without creating a new web service, putting JMS in between or writing custom networking code, we will create a distributed app out of a monolithic chunk. Happy Coding!!


Web Application Scalability – Service Layer

In my last post I promised that we would talk about how to turn monolithic code into a distributed SOA architecture.

Well, it’s not easy. Once you have decided to re-architect that single chunk of software in a distributed manner, you have to work out which parts of the system can be deployed on different machines while everything still works fine.

There are several problems to solve while doing so:

a. How will you host these services (the S in SOA)?

b. How will you communicate with them, and how will serialization/deserialization happen?

c. How do you make sure you can implement the same workflow you had in the single monolithic component, assuming you were doing things sequentially?

 

To answer the first question, there are many options available, but I will list the ones I have personal experience with.

 

a. Design your services as RESTful and deploy them in your preferred container, like WebLogic

Pros

– Location transparency. You refer to the services using URIs.

– Interacting with a RESTful service is pretty easy from any language

Cons

– Serious overhead from the HTTP protocol

– Serious maintenance issues with container management, not to mention a huge performance hit

b. Design your services as Java processes and interact with them over a JMS service bus (ActiveMQ, Fiorano, etc.)

c. Design your services as Thrift-based services and interact with them through pre-generated client code

Pros

– Serialization/deserialization is taken care of by Thrift. You just interact with POJOs everywhere

– Lightweight, highly scalable, proven architecture (it came out of Facebook). Forget about converting from JSON to POJO and vice versa

Cons

– Every time the service or its schema changes, you need to redistribute the client code.

– Schema definition in the Thrift file has some limitations of its own

d. Design your services as separate Java processes and have a networking library like ZeroMQ take care of communication

Pros

– ZeroMQ is battle-tested and used in many mission-critical projects. It takes care of your networking needs

Cons

– Lower-level programming is needed compared to the other approaches

– Many scenarios still have to be handled by hand, so the code size will swell up
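To give an idea of option (c), here is a minimal, hypothetical sketch of a Thrift client in Scala. OrderService is an assumed service defined in a .thrift file; the Thrift compiler would generate OrderService.Client for you:

    import org.apache.thrift.protocol.TBinaryProtocol
    import org.apache.thrift.transport.TSocket

    object ThriftClientSketch extends App {
      val transport = new TSocket("localhost", 9090)  // assumed host/port of the service
      transport.open()
      val protocol = new TBinaryProtocol(transport)

      // OrderService.Client is generated code (hypothetical service)
      val client = new OrderService.Client(protocol)
      val order  = client.getOrder("ORD-1")            // a plain generated object comes back, no JSON handling
      println(order)

      transport.close()
    }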
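And for option (d), a minimal request/reply sketch with the JeroMQ binding (the endpoint and wire format are made up; note how much is left to you, which is exactly the con mentioned above):

    import org.zeromq.ZMQ

    object ZeroMqClientSketch extends App {
      val context = ZMQ.context(1)             // one IO thread
      val socket  = context.socket(ZMQ.REQ)    // request/reply pattern
      socket.connect("tcp://localhost:5555")   // assumed service endpoint

      socket.send("getOrder:ORD-1")            // you define the wire format yourself
      val reply = socket.recvStr()             // blocking receive
      println(s"reply: $reply")

      socket.close()
      context.term()
    }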

For a long time I had been dealing with the hub-and-spoke model, using a service bus like MSMQ or ActiveMQ.

A service bus is a popular approach, especially in financial institutions, which invest in enterprise-grade products like TIBCO MQ.

The problem with the service bus approach is that you have to write code specific to the messaging bus. So another layer is added, which increases complexity.

So for some time I had been looking for a way to remove this layer. Given my experience in the industry, there are some areas I believe a company should not invest in until absolutely necessary, and one such area is networking.

Having said that, one day while learning Scala I came across a project called Akka. Akka is a framework which addresses many of these problems with the simple concept of actors.

It handles networking and concurrency, providing an easy way to program local or distributed services. What you get is complete location transparency: you communicate with remote actors in exactly the same way as you communicate with local actors.
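To give a flavour, here is a tiny hypothetical sketch: the actor path, host and message are made up, and Akka remoting is assumed to be enabled in application.conf, but the "!" call is the same one you would use for a local actor:

    import akka.actor.ActorSystem

    object RemoteLookupSketch extends App {
      val system = ActorSystem("frontend")

      // looks exactly like talking to a local actor; only the path says "remote"
      val orderService = system.actorSelection(
        "akka.tcp://backend@192.168.1.10:2552/user/orderService")

      orderService ! "placeOrder:ORD-1"
    }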

There is no hub-and-spoke model, so no new software has to be installed or maintained, and no bus-specific code has to be written to make things work.

It’s peer-to-peer; the networking complexities are hidden deep inside and you never have to deal with threads. Wow!! Sounds too good to be true… it is.

It is completely asynchronous, with a programming paradigm that is easy to understand, and the concept of actors is already proven in the telecom industry.

In the next post we will talk about some sample code for how location transparency can be achieved.

 

 


Web Application Scalability – Part 1

Foundation

I have been thinking about writing on software architecture for a while. There is a lot of confusion among programmers about how to design software at scale. Teams fight over programming languages, frameworks, libraries, tools and what not. So I intend to share my experience in this field, and I will keep writing new articles and improving existing ones.

Software architecture is a field that is highly based on

  • your experience
  • how many choices you made independently
  • whether you learned from the mistakes you made
  • whether you are up to date with the latest technologies
  • knowing the pros and cons of each choice you are making

This will be a 3 part series.

Each part will deal with one layer of the web application. We will look at the scalability aspects, the problems faced today, the technologies relevant right now and how to utilize them.

A traditional Web Application is made of three layers.

  1. Presentation
  2. Application
  3. Persistence

For more than a decade this architecture has been very popular, both in the open source community (Java, Ruby on Rails, PHP) and in proprietary technologies (ASP.NET).

To achieve this architecture, excellent frameworks are available across different technologies, and it is easier to find talent.

From a typical web framework there are two main areas where developers seek help:

How to make web development easier?

Scale on demand – to be able to handle large traffic.

So if this is a popular architecture, then what’s the problem? All web frameworks that I have worked with, or that I know of, primarily solve the first problem.

But they fail miserably at the second. The developer is left on his own whenever scalability comes into the picture. Fortunately, large social networking websites have shown us the way forward and given us many tools which, if utilized effectively, can deliver this second aspect called scalability.

 

We will talk about this second point in the series. When we talk about handling traffic here, we are assuming that the hardware available to us is commodity hardware.

My approach will be practical, using as little jargon as possible, so that a programmer with a couple of years of experience can easily understand how to make design and framework choices that achieve this nasty goal called scalability.

 


My opinion about PHP

PHP has become one of the primary technologies for developing web applications these days. I can recall that while I was working for Talentica there was only one project on PHP and most people were doing Java. After PHP 5 it regained its strength and became one of the primary stacks for developers. Let me try to explain my thinking about it.
 
There are four stacks that are very popular these days for building an internet application:
 
a.) Java world – Spring/Grails (I have worked on both)
 
b.) .NET world – ASP.NET/C# combination
 
c.) Ruby World – Ruby on Rails
 
d.) PHP – LAMP Stack (Zend/Symfony)
 
Ruby on Rails, the PHP frameworks and Grails are built around the same philosophy, i.e. rapid application development. They follow an approach called convention over configuration.
 
They all follow the same development style and are backed by dynamic languages: Groovy, PHP and Ruby. MVC, for example, is now the universally accepted pattern for developing web applications.
 
An internet application would typically have three layers
a.) Presentation Layer – All frameworks provide a rendering engine to easily create web pages. Some even provide pluggable engines so you get a choice
 
b.) Business Layer – Usually written in one primary language, but often interacts with external services
 
c.) DB Layer – An ORM like Hibernate, GORM, ActiveRecord or Doctrine.
 
Almost all stacks provide an easy-to-use ORM layer which conceptually works in more or less the same manner.
 
I do not see how someone experienced in GORM would have any problem working with the Symfony ORM or ActiveRecord.
 
Consider search, which is a very important feature of a website. You will integrate a tool like Solr or Elasticsearch into your application. It hardly matters what language you use for accessing Solr; you will still be making similar calls.
 
As far as MySQL is concerned, it is a database and is totally independent of the language being used, though there might be some differences in the syntax of the ORM layers. I have worked on C++, .NET, Java and Grails, and I find myself equally at ease because the underlying concepts remain the same.
 
Yes, it is true that you can develop faster if you have someone who is experienced in exactly the stack you are using, but things change when it comes to scalability.
 
Having said that, the largest applications today follow an approach called “the right tool for the right job”. Facebook, for example, uses PHP for its front-end application, but in the backend they have a whole lot of languages, including Java, C++ and Erlang. That is how they achieve scalability.
 
Long ago Facebook created a project called Thrift precisely to achieve seamless integration between different languages, and it is widely used today. I have used it myself to call the Neo4J Java API from .NET, as .NET APIs were not available for this graph database.
 
The Twitter story is also very famous: they solved their scalability problems on the JVM and not on Ruby on Rails, which had been their primary stack for a long time.
 
PHP by design is not a multithreaded language, and I am sure that when people look for scalability, more often than not they go to the JVM and the Apache projects. This is the polyglot style of programming, where you do not stick to just one stack but do the right thing.
PHP is a tremendous tool for creating sites quickly and gives you many features and plugins out of the box, but it cannot solve all the problems a successful website will face in the future.
 
I know a few e-commerce companies who quickly created a website using Magento but now find it extremely difficult to scale, due to its database design, which follows the EAV model.
 
Scalability is, again, subjective, and it depends on what you are looking for. One may argue that many sites will never reach the scale of Facebook and Twitter, but my argument is simply that if using the right tool is your approach, it will not harm you. It is like test-driven development: some people find it a waste of time, but it undoubtedly gives you robustness.
 
Take another scenario: NoSQL. I have used a couple of NoSQL solutions. For a startup it is important to save cash, so they do not try to scale up, since bigger machines are quite expensive, but instead scale out using commodity hardware. Hence NoSQL. NoSQL is practically synonymous with big data these days.
 
Again, this is called polyglot persistence and is heavily used today. It is easier to keep using MySQL, but there are problems where MySQL does not scale well, and then you look to add a new stack to your development process.
 
I have introduced new technologies from time to time, very systematically, and that is how I have been able to solve scaling problems. These technologies have at the same time saved the team considerable time. Using MongoDB, for example, we were able to solve long-standing problems with sending and storing bulk emails for campaigns.
 
So I am of the opinion that different jobs require different tools, and that they should be introduced systematically and in a timely fashion. If the right strategy is adopted, they can give tremendous results in productivity and scalability.
 
I have also written a small article on my blog about Apache Camel, one of the tools that I used.
 
As far as my own experience with PHP is concerned, it does not go much beyond fixing some bugs in an existing PHP application.
 
As part of my job I would not mind using or learning any technology, including PHP, but my effort would always be in the direction of doing the right thing, keeping time and budget constraints in mind. Having worked with different stacks in my career, I think this is something I really like about myself.

NoSQL and its importance

I just attended a conference at the ThoughtWorks office in Delhi. It was a great talk. Neal Ford was phenomenal, and he really showed how technical presentations should be given.

They do not have to be boring. To my surprise he has also written a book about presentations.

Anyway, coming to the point. The talk started with an introduction to NoSQL, what it is and what kinds of use cases it might fit. As expected, a lot of people were from an RDBMS background, so it was initially very hard for them to understand the concept of NoSQL.

Fortunately that was not the case with me, as I have been exploring these technologies for the last couple of years and have delivered some successful projects using Neo4J and MongoDB.

So I would like to put my thought process forward. 

NoSQL means that people in the SQL world should look at alternative persistence technologies when the need arises. A lot of the time, SQL does not provide a natural way of storing the data that needs to be stored.

Take, for example, hierarchical data and unstructured data. Many-to-many relationships are not a pretty sight either.

I found SQL to be very limited in features and capabilities when it comes to storing hierarchical data.

All you can do is create a parent-child relationship and run recursive queries. As we all know, 7 times out of 10 in a big application the database is the first culprit, and you need real experts to fine-tune SQL queries once you start feeling that the application is not living up to expectations and users are drifting away.

NoSQL can be divided primarily into four categories:

a.) Document Based (Mongo, Couch)

b.) Key Value (Redis, Memcache, Dynamo, Riak)

c.) Columnar Database (Cassandra, HBase)

d.) Graph (InfiniteDB, Neo4J).

Out of these four, graph databases have a unique place and are the easiest to decide on, at least in my experience. Whenever data is hierarchical and the relations cannot easily be modeled using an RDBMS, one can go for Neo4J. Hierarchical data may require deep traversals, and an RDBMS definitely does not rock at this.

Document databases are the easiest to use, and MongoDB is a sheer pleasure to work with. It gets up and running very easily and has the richest querying features of any database in this space.

So I will divide the rest of this post into a few headings.

When to use NoSQL

My answer is: always. There is hardly any application today which does not have unstructured data. Everybody wants to grow, so it is very likely that sooner or later you are going to generate a large amount of data, be it from social media, your own clickstream capture, web logs or whatever. You want a lot of users to come to your site, the more the merrier, so yes, you will generate a lot of data.

So having polyglot persistence built into the application right from the beginning is going to help you at a later stage.

It is easier to define the kinds of use cases NoSQL is not a good fit for than to list the good ones (big data aside).

When you need strong ACID support, for financial information specifically: payments, user registration. I would never think about storing these in a NoSQL store; the risk is just too great.

Some people argue, like one gentleman at the conference, that Amazon uses Dynamo for storing user cart information. Maybe it can be used that way, but I will not agree with this 100%. The reason is simple: most NoSQL databases are eventually consistent, at least in their default replicated setups. That means that, due to replication, there is a delay in syncing the data across multiple machines.

So when you run a query you do not know which copy of the data will be returned, the latest information or old information. If you use Mongo and read from secondaries, there is in theory a chance that the user will see his old cart and not the latest one, and the next time he looks at his cart he might see the latest one. I would not want this, so consider this use case out of scope for MongoDB.

Some NoSQL databases like Riak provide vector clocks, but those have their own problems:

http://docs.basho.com/riak/latest/references/appendices/concepts/Vector-Clocks/

So one has to be very careful in such scenarios. 

Take another use case: promotional campaigns. A lot of companies run promotional campaigns; they need to store huge volumes of emails and even track their performance. That is a definite use case for NoSQL. The data is huge, it does not have to be transactional, and if we lose some of it due to a node failure we will not lose our jobs.

In the NoSQL world two principles are very prevalent:

a.) Prefer redundancy over normalization. Disk is cheap, theoretically infinite, and NoSQL stores, with their built-in horizontal scalability, have no problem handling the extra data. So when you have to optimize a query, do not twist your schema; instead, store redundant data in a separate collection shaped for that query only.

Design your schema for your queries. Write down your use cases and design your data storage accordingly; do not do it the other way round, as in the SQL world.
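As an illustration (the collection and field names are made up), with the MongoDB Java driver you might keep a redundant, query-shaped collection alongside your normalized data:

    import com.mongodb.client.MongoClients
    import org.bson.Document

    object DenormalizedWriteSketch extends App {
      val client = MongoClients.create("mongodb://localhost:27017")
      val db     = client.getDatabase("shop")

      // a collection shaped purely for the "orders by user" page
      val ordersByUser = db.getCollection("orders_by_user")

      val doc = new Document("userId", "u42")
        .append("userName", "Asha") // duplicated from the users collection on purpose
        .append("orders", java.util.Arrays.asList(
          new Document("orderId", "o1").append("total", 499)))

      ordersByUser.insertOne(doc)
      client.close()
    }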

b.) Design your app for consistency. Relations, rules and data quality are all handled in the application, as NoSQL does not guarantee them. There are no joins, and atomicity is typically limited to a single row or document.

Let’s look at some of the use cases for each database

MongoDB

It should be the first choice by default, more so when you do not have much experience with the NoSQL world. It is the closest to an RDBMS, is supported by excellent client drivers and is easily integrated with any stack: PHP, Java, Python, Node.js, you name it.

It can be used as a general-purpose database. It supports secondary indexes and shards easily.

The only problem I find with MongoDB is versioning: I never know which version of the data is going to be returned to me.

Neo4J

Most suitable for social graphs, deep traversals and recommendations. I implemented a subject hierarchy using it and traversals were damn fast. It provides excellent APIs in Java and supports REST. No other fully open source graph DB comes close to this one.

Hypergraph is comparable but its APIs let it down compared to Neo4J, especially for traversals.

http://www.neo4j.org/

Column Oriented

Cassandra and HBase

Both are column based, with minor differences here and there. Cassandra was developed at Facebook and later became an Apache project. HBase sits on top of Hadoop.

I see only one use case for them: when you have lots of data, hundreds of terabytes to petabytes, and you mostly want to run MapReduce jobs (though Cassandra also provides CQL). You will know when you have that much data.

Key-Value

e.g. Memcached and Redis. They are damn good at what they do. Primarily used in the caching layer, they can serve data to your clients insanely fast. They can be sharded across hundreds of servers easily, and Redis even provides many useful data structures built in.
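A typical cache-aside read with the Jedis client might look like the sketch below; loadUserFromDb and the key format are placeholders for your real data store and naming scheme:

    import redis.clients.jedis.Jedis

    object CacheAsideSketch extends App {
      val jedis = new Jedis("localhost", 6379)

      // stand-in for a real database lookup
      def loadUserFromDb(id: String): String = s"""{"id":"$id","name":"Asha"}"""

      def getUser(id: String): String = {
        val key = s"user:$id"
        Option(jedis.get(key)).getOrElse {
          val json = loadUserFromDb(id)  // cache miss: hit the database
          jedis.setex(key, 300, json)    // keep it hot for five minutes
          json
        }
      }

      println(getUser("42"))
      jedis.close()
    }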

So, in short, I have just shared my experience with NoSQL.

Comments are welcome 

 

 

 

 

 
