Mar 062011
 

Cloudata logoCloudata is a new open source implementation of Google’s BigTable paper. It can be found on Github here. It appears to be the project of a Korean developer named YKKwon.

As noted at MyNoSQL, there are only a couple commits and it’s not clear how serious this project is. But it will be of interest to big data, MapReduce and BigTable buffs.

Cloudata differentiates itself from Hadoop by offering an indexed but still non-relational database, but is probably more comporable to HBase and Hypertable, which are also open source BigTable implementations. The project’s website claims Cloudata can retrieve data within a few milliseconds.

Here’s a list of the current features:

Basic data service

  • Single row operation(get, put)
  • Multi row operation(like, between, scanner)
  • Data uploader(DirectUploader)
  • MapReduce(TabletInputFormat)
  • Simple cloudata query and supports JDBC driver

Table Management

  • split
  • distribution
  • compaction

Utility

  • Web based Monitor
  • CLI Shell

Failover

  • Master failover
  • TabletServer failover

Change log Server

  • Reliable fast appendable change log server

Support language

  • Java, RESTful API, Thrift
Mar 062011
 
Have you been wanting to learn Hadoop, but have no idea how to get started? Carlo Scarioni has a basic Hadoop tutorial that covers installing Hadoop, creating a Hadoop Distributed File System (HDFS), moving files into HDFS, and creating a simple Hadoop application. The tutorial also introduces the basic concepts of Map Reduce.

It doesn’t, however, get into distributing the application, which is the main point of using Hadoop in the first place. Scarioni leaves that to a future tutorial. But if you want to get your feet wet with Hadoop and/or Map Reduce, this seems like a pretty good place to start.

also gives us a pretty concise explanation of what Hadoop is:

Hadoop is an open source project for processing large datasets in parallel with the use of low level commodity machines.Hadoop is build on two main parts. An special file system called Hadoop Distributed File System (HDFS) and the Map Reduce Framework.

The HDFS File System is an optimized file system for distributed processing of very large datasets on commodity hardware.

The map reduce framework works in two main phases to process the data. Which are the Map phase and the Reduce phase.

Oct 242010
 

High Performance Server? yeah. Its no wonder that we can setup high performance server.

Server – A computer running a service. yep. its true servers are nothing but computers running appropriate services.

So whats a server? Hardware or Software or both? What about the big 6 feet high machines we think as servers?

A server can be any computer running service(s).

You can make your laptop as a server or desktop or even mobile phone. but performance matters.

Servers performance depend on hardware and software. we can upgrade hardware to improve performance but upgrading software may fix bugs, add new features to it but no change in performance or can reduce performance.

Our major concern is choosing right software for the right job. There is no all in one software, even if there is it will be slow or buggy. We need to choose each and every piece of software to make a better system.

I will illustrate this with an example.

Requirement : We 10 people from 6 geographic locations intend to develop a social network which may serve 100 million users.

Analysis : Assume that we have best hardware, now focus on software. Software includes Operating System, Programming Language, Database Server, Web Server, SCM Server, Cache server. Our software need to be small memory footprint, bug/virus free, secure, fast.

If we see existing high performance implementations like hotmail, yahoo, facebook, twitter, google its like this,

  • Google : Python, C++, Plan9 (i think), Bigtable, GWS, GFS, mysql, …
  • Facebook : Linux, C++, C, PHP, Scribe, Thrift, Apache Cassandra, Hip hop, Tornado, Apache Hive, Varnish, Mysql, Memcached.
  • Hotmail : IIS, Windows server, .Net, …
  • Youtube : Memcached, …
  • Yahoo : PHP, Mysql, Free BSD, …
  • Twitter : Ruby, memcached, …

OS : CentOS or Ubuntu We have lot of options to choose in Operating Systems like Windows/Solaris/Unix/BSD/Linux/Mac/Plan9/AIX But Linux kernel is in active development than any other, fast/scalable/secure and nearly as bug/virus free as UNIX. Linux is having lot of flavors so we choose advanced and know version among them.

Programming Language : C, PHP, Javascript, Python, Haskel. We have several kinds of languages like Procedural, Object Oriented, Dynamic, Functional, Event Driven, Structured, Statically typed, Dynamically typed and few more. Each kind having its advantages. Normally we come to choose programming language according to our need/purpose rather than what we fancy. In general we have C, C++, Perl, Python, Ruby, .net(VB, C#, F#), Java, PHP, Haskel, Scheme, Javascript, Tcl,… C is the best to do low level stuff, PHP to play with http/HTML, Javascript to enrich presentation layer, python good at playing with text, heskel to do any exteded programming to take advantage of functional language.

DBMS : Mysql or apache cassandra. We choose fast and distributed database management systems from Mysql, MS Sql Server, Postgresql, Oracle, DB2, …

Web Server : nginx. A Webserver serve’s performance depends on how it process request and how it handles scripts(PHP, Ruby, Python, Java,….). We have Lots of Webservers like Lighttpd, nginx, Apache, Cherokee, IIS, Tomcat, Glash fish, Weblogic, mongrel, Webrick,….

SCM Server : Git. Git is the best of its kind when compared with Mercurial, Subversion, Bazaar, CVS, Bitkeeper and several others.

Cache Server : Varnish

In future I will try to screen cast the best setup i can.

Sep 242010
 

The challenges that MySQL faces in the cloud – questions of elasticity, synchronization, scalability – are often referenced as part of the arguments in support of NoSQL database alternatives. But today,Xeround announces a solution to the problems of MySQL in the cloud that is actually a SQL solution – a SQL database for the cloud.

Xeround launches its new database that is, according to CEO Razi Sharir, “the best of both worlds.” In other words, Xeround’s new database technology promises both the transactional and query capabilities of relational databases alongside the scalability of NoSQL ones.

Xeround was built based on MySQL Storage Engine Architecture and acts as a pluggable storage engine. As such, it provides many of MySQL’s features, but was designed as a virtual database, optimized for the cloud. The database handles multi-tenancy and auto-scaling and is self-healing. It guarantees always-on service during schema changes, resource modifications, and the scaling process. And for those currently working with MySQL databases, all, according to Sharir, “without changing a line of code.”

Xeround_ss.jpgFor those in its private beta, Xeround will host its “Database-as-a-Service” in its private data center or on Amazon EC2. Other cloud service providers will soon be available. The DBaaS implementation simply requires uploading the database’s schema and pointing apps to the new location. The price for the offering will be competitive with other RDBS and by Amazon’s database offering.

Sep 192010
 

A Yahoo! engineer has reached a new cloud computing milestone. Using Hadoop, Yahoo! cloud computing engineer, Tsz Wo (Nicholas) Sze has determined the computation of π (pi) to the two quadrillionth decimal. And it equals zero. It’s the most amount of bits ever calculated for pi.

This is a remarkable achievement and shows the power of distributed systems for crunching big data.

It’s also an example of how data is changing our infrastructure as we discover new ways to develop applications with greater computational capabilities.

As Yahoo! points out, pi is a mathematical constant that is used to calculate the area of a circle.

Sze’s discovery follows a long line of mathematicians who have sought a bigger piece of the pi, no pun intended of course. This effort took some significant firepower and Hadoop magic:

“The circumference of the visible universe can be calculated using a mere 39 or 40 decimal places, so what are Yahoos like Nicholas doing out at those outer reaches? For Nicholas, this was a project ideally suited to flex the muscles of Hadoop, the open source technology at the epicenter of big data and cloud computing. The computation took 23 full days and required 1000 different machines using Hadoop.”

The achievement shows the power of distributed systems and points to how Yahoo! has adapted its business to the research it has done using Hadoop.

Yahoo is one of the pioneers in using Hadoop. It has helped the company better define the context of its network and the personal preferences of the millions of people use Yahoo! every day.

What that means for Yahoo! is a better understanding of how to use the computational powers of cloud computing. It provides the opportunity to do all sorts of things. Yahoo! engineers, for instance, are learning how to speed up calculations and algorithms. That faster crunch time? Think of one of the world’s largest networks doing mathematics equations and you have something that equates to a whole new experience. It goes beyond new ways to find new songs or predict the price of apples in the commodity exchange.

Instead, this points to the entrance of a new world where our culture meets the power of big data. A society that is driven more by information than machine.

Sep 062010
 

BigTable is a compressed, high performance, and proprietary database system built on Google File System (GFS), Chubby Lock Service, and a few other Google programs; it is currently not distributed or used outside of Google, although Google offers access to it as part of their Google App Engine.

BigTable development began in 2004 and is now used by a number of Google applications, such as MapReduce, which is often used for generating and modifying data stored in BigTable, Google Reader, Google Maps, Google Book Search, “My Search History”, Google Earth, Blogger.com, Google Code hosting, Orkut, YouTube, and Gmail. Google’s reasons for developing its own database include scalability, and better control of performance characteristics.

Design

BigTable is a fast and extremely large-scale DBMS. However, it departs from the typical convention of a fixed number of columns, instead described by the authors as “a sparse, distributed multi-dimensional sorted map”, sharing characteristics of both row-oriented and column-oriented databases. BigTable is designed to scale into the petabyte range across “hundreds or thousands of machines, and to make it easy to add more machines [to] the system and automatically start taking advantage of those resources without any reconfiguration”.

Each table has multiple dimensions (one of which is a field for time, allowing for versioning and garbage collection). Tables are optimized for GFS by being split into multiple tablets – segments of the table as split along a row chosen such that the tablet will be ~200 megabytes in size. When sizes threaten to grow beyond a specified limit, the tablets are compressed using the algorithm BMDiff and the secret algorithm Zippy, which is described as a less space-optimal variation of LZO but more efficient in terms of computing time. The locations in the GFS of tablets are recorded as database entries in multiple special tablets, which are called “META1″ tablets. META1 tablets are found by querying the single “META0″ tablet, which typically resides on a server of its own since it is often queried by clients as to the location of the “META1″ tablet which itself has the answer to the question of where the actual data is located. Like GFS’s master server, the META0 server is not generally a bottleneck since the processor time and bandwidth necessary to discover and transmit META1 locations is minimal and clients aggressively cache locations to minimize queries.

Related Implementations

Open source

  • HBase — Written in Java. Provides BigTable-like support on the Hadoop Core.
  • Hypertable — Hypertable is designed to manage the storage and processing of information on a large cluster of commodity servers.
  • Apache Cassandra — Facebook’s distributed storage system based on a Bigtable data model on Amazon’s Dynamo-like structure. Written in Java.
  • Neptune – Written in Java. Uses ZooKeeperHDFS of Hadoop.
  • KDIKosmix attempt to make a BigTable clone. Written in C++.

Google Fusion Tables

  • Google Fusion Tables was released on June 9, 2009, as an experimental system for data management in the cloud.
Sep 042010
 

Why can’t privacy and connectedness go hand-in-hand? That’s the question being raised by those behind the new Diaspora project, an ambitious undertaking to build an “anti-Facebook” – that is, a private, open source social network that puts you back in control of your personal data.

Envisioned by four NYU computer science students, the Diaspora project would replace today’s centralized social web (yes, they mean you, Facebook) with a decentralized one, while still offering something that’s convenient and easy for anyone to use.

According to the project’s homepage, the students, Daniel Grippi, Maxwell Salzberg, Raphael Sofaer, and Ilya Zhitomirskiy, “bonded over many late nights building a Makerbot,” (to you non-geeks, that’s a type of robot) and they “started discussing what a distributed social network would look like.”

The end result of those discussions was the idea for Diaspora. So they stopped talking about it and started building.

The project is now hosted on Kickstarter.com, a social fundraising platform that lets entrepreneurs and other creative types crowd-source funding by setting up a project goal, deadline and optional set of rewards for project backers.

In Diaspora’s case, they’re less than $2000 short of their $10,000 goal with under a month left to go until reaching their deadline. If the project receives the necessary level of funding by June 1st, it will be built and the code released as free software using theaGPL open-source software license.

What is a Decentralized Social Network?

So what is Diaspora anyway? Instead of being a singular portal like Facebook, Diaspora is a distributed network where separate computers connect to each other directly, without going through a central server of some sort.

Once set up, the network could aggregate your information – including your Facebook profile, if you wanted. It could also import things like tweets, RSS feeds, photos, etc., similar to how the social aggregator FriendFeed does. A planned plugin framework could extend these possibilities even further.

Your computer, called a “seed” in the Diaspora setup, could even integrate the connected services in new ways. For example, a photo uploaded to Flickr could automatically be turned into a Twitter post using the caption and link.

When you “friend” another user, you’re actually “friending” that seed, technically speaking. There’s not a centralized server managing those friend connections as there is with Facebook – it’s just two computers talking to each other. Friends can then share their information, content, media and anything else with each other, privately using GPG encryption.

Diaspora, the Turn-Key Solution

Because not everyone will be technically capable of (or interested in) setting up their computer to function as a “seed,” there are plans to offer a paid turn-key service too, similar to WordPress.com, the blogging platform. WordPress itself is software you can install and configure on your own server, if you’re inclined to do so, but if you’re less technically-savvy, you can opt to quickly start a blog via WordPress.com instead. Diaspora would function in a similar way.

If a lot this sounds reminiscent of Opera’s Unite project, the Web browser maker’s overly-hyped plan to “reinvent of the Web,” it should. In Opera Unite, users can share documents, photos, music, videos and run websites and chat rooms by directly linking two computers together.

However, in Unite’s setup, there are Opera-run proxy servers involved, which led to issues – especially when those servers went down. Diaspora wouldn’t have that problem.

Mainstream Success?

Still, the concepts behind Diaspora, while the sort of thing tech geeks will eat up, may be harder to grasp for the everyday Facebook user who is still trying to figure out how post a link or video to their Wall. Distributed, decentralized, open-source what?

If Diaspora is realized, it will be up to technology advocates to position the turn-key service in a way that will make it sound simple and appealing to precisely those sorts of mainstream users if it is to ever succeed. Taking shots at Facebook’s privacy issues may be a good course (Take back control with Diaspora!).

We would like to see Diaspora come to be, even if it never goes mainstream because it would finally offer privacy advocates a real alternative to the increasingly data-hungry Facebook.

Sep 042010
 

Researchers in South Korea have built a networking router that transmits data at record speeds from components found in most high-end desktop computers. A team from the Korea Advanced Institute of Science and Technology created the router, which transmits data at nearly 40 gigabits per second–many times faster than the previous record for such a device.

The techniques used by the researchers could lead to a number of breakthroughs, including the use of cheaper commodity chips, such as those made by Intel and Nvidia, in high-performance routers, in place of custom-made hardware. The software developed by the researchers could also serve as a testbed for novel networking protocols that might eventually replace the decades-old ones on which the Internet currently runs.

Most routers use custom hardware to route data as it passes between computer networks. Software routers perform the same tasks using commodity hardware–by mimicking the behavior of a hardware router in software. Commercial software routers from companies such as Vyatta can typically only attain transfer data at speeds of up to three gigabits per second. That isn’t fast enough to take advantage of the full speed of a typical network card, which operates at 10 gigabits per second.

“We started with the humble goal of being the first to get a PC router to 10 [gigabits per second], but we pushed it to 40,” says Sue Moon, leader of the lab in which the research was conducted. Her students Sangjin Han and Keon Jang developed software called PacketShader that made this possible. PacketShader uses a computer’s graphics processing unit (GPU) to help process packets of data sent across a network.

Modern routers are rarely dumb switches anymore. They are often called upon to manipulate packets in a number of different ways as they pass through. GPUs are ideal for this purpose because they can process data in parallel, which means they can handle several packets of data at once. According to Moon, a GPU is much faster at handling some packet-processing tasks, such as authenticating or encrypting all of the packets in a stream. When the GPU takes over these tasks, it gives the central processing unit (CPU) breathing room to handle other things that are more serial in nature, such processing several packets in turn to detect attempts to break into a network.

Mark Handley, a professor of networked systems at University College London, points out that for basic packet forwarding, which isn’t likely to overwhelm a computer’s CPU, there is no advantage to strapping the GPU onto the system. However, he agrees that the GPU is very well suited to encrypting or authenticating packets.

Gianluca Iannaccone, an engineer at Intel Labs Berkeley who is familiar with PacketShader, says it could slash the number of physical machine needed to comprise a terabit-per-second software router to one-third of what his research has previously indicated would be required.

“One terabit is the entry point for enterprise-grade routers–the routers in the core of the Internet,” says Iannaccone. His work on a system called RouteBricks points to a future in which routers aren’t the specialized hardware they are now, but instead function as software running on pools of servers. Lash enough software routers together that run at 40 gigabits per second, and you get what is essentially a single-terabit router. Using such a system, routers might some day run completely in software.

“We can expect killer apps out of this,” says KyoungSoo Park, another professor at the Korea Advanced Institute of Science and Technology who was involved with the project. “You can build an interesting packet- or network-management system on top of a PC-based software router that can’t be implemented with a hardware router. Ultimately, you can experiment with new protocols that are not used in today’s Internet.”

Sep 042010
 

Cassandra is an open source distributed database management system. It is an Apache Software Foundation top-level project designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powers their Inbox Search feature. Jeff Hammerbacher, who led the Facebook Data team at the time, has described Cassandra as a BigTable data model running on an Amazon Dynamo-like infrastructure.

Cassandra provides a structured key-value store with eventual consistency. Keys map to multiple values, which are grouped intocolumn families. The column families are fixed when a Cassandra database is created, but columns can be added to a family at any time. Furthermore, columns are added only to specified keys, so different keys can have different numbers of columns in any given family. The values from a column family for each key are stored together, making Cassandra a hybrid between a column-oriented DBMS and a row-oriented store.

History

Cassandra was developed by Facebook to power their Inbox Search feature by Avinash Lakshman (one of the authors of Amazon’s Dynamo) and Prashant Malik (Facebook Engineer). It was released as open source project on Google code in July 2008. In March 2009, it became an Apache Incubator project. On February 17, 2010 it graduated to a top-level project.

Features

Decentralized

Every node in the cluster is identical. There is no single point of failure.

Fault-tolerant

Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

Tunable consistency

Writes and reads offer a tunable ConsistencyLevel, all the way from “writes never fail” to “block for all replicas to be readable,” with the quorum level in the middle.

Elasticity

Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

Data model

A table in Cassandra is a distributed multi dimensional map indexed by a key. The value is an object which is highly structured. The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long. Every operation under a single row key is atomic per replica no matter how many columns are being read or written into. Columns are grouped together into sets called column families very much similar to what happens in the BigTable system. Cassandra exposes two kinds of columns families, Simple and Super column families. Super column families can be visualized as a column family within a column family. The top dimension in Cassandra is called Keyspace.

Therefore Cassandra can handle maps with 4 or 5 dimensions:

Map with 4 dimensions:

  1. Keyspace -> Column Family
  2. Column Family -> Column Family Row
  3. Column Family Row -> Columns
  4. Column -> Data value

Map with 5 dimensions:

  1. Keyspace -> Super Column Family
  2. Super Column Family -> Super Column Family Row
  3. Super Column Family Row -> Super Columns
  4. Super Column -> Columns
  5. Column -> Data value

The first dimension – Keyspace -> (Super) Column Family – is limited to a (small) set of schema-predefined keys. Cassandra documentation mentions similarities of this dimension with the table definition inside a relational database.

Furthermore, applications can specify the sort order of columns within a Super Column or Simple Column family. The system allows columns to be sorted either by time or by name. Time sorting of columns is exploited by applications like Facebook Inbox Search where the results are always displayed in time sorted order. Any column within a column family is accessed using the convention column_family : column and any column within a column family that is of type super is accessed using the convention column_family : super_column : column.

Typically applications use a dedicated Cassandra cluster and manage them as part of their service. Although the system supports the notion of multiple tables all deployments have only one table in their schema.

Prominent users

  • Facebook uses Cassandra to power Inbox Search, with over 200 nodes deployed.
  • Digg, the largest social news website, announced on Sep 9th, 2009 that it is rolling out its use of Cassandra and confirmed this on March 8, 2010
  • Twitter announced it is planning to use Cassandra because it can be run on large server clusters and is capable of taking in very large amounts of data at a time. However, these plans were subsequently scaled back or scrapped
  • Rackspace is known to use Cassandra internally
  • Cisco‘s WebEx uses Cassandra to store user feed and activity in near real time
  • IBM has done research in building a scalable email system based on Cassandra
  • Reddit switched to Cassandra from memcacheDB on March 12, 2010 and experienced some problems with overload handling in Cassandra in May
  • Cloudkick uses Cassandra to store the server metrics of their users
Sep 032010
 

wave_logo_sep09.jpgGoogle Wave is far from dead, and developers, early adopters and enterprises will be glad to hear it. Today Google announced it will expand on the code it has already open sourced, building Wave into a functional application that will allow users to run wave servers, host their own waves and build bigger and better applications with the real-time collaboration technology.

“Since the beginning, it has been our vision that the Google Wave protocols could support a new generation of communication and collaboration tools,” engineer Alex North wrote on the Google Wave developer blog.

Google had big plans for Wave – it was supposed to replace email and the killer app among Google Apps, but the company basically gave up on the project earlier this month. It still seems possible for the technology to do big things. But it sounds like Google plans to move on after the application is released. The future of the open source project will be defined by developers’ contributions, North wrote.

Wave In A Box will not have the “full functionality” of the Gmail-integrated Web app, but it will feature threaded conversations and support importing data from wave.google.com.

The release of Wave as an application could be a big deal for developers, especially at businesses that want to take advantage of Wave for real-time collaboration and discussion.

Wave was always intended for people to run on their own machines, but this release will make the process much easier. Developers and enterprise users that have been eyeing Wave will be more likely to take the technology into their own hands and build things like feature-rich Web forums, productivity tools and apps to facilitate collaborative projects.

Google did not specify a timeline for the release of the new code, but it has said thatwave.google.com will be available at least through the end of the year.