What is the cloud? - Cloud definition

"What actually is the Cloud?" - I keep hearing this question everyday. The problem is that there is no single, official definition. Actually, there are as many definitions as people talking about that. The problem is that when we started using this term several years ago it had not clarified yet what does it mean and it meant a lot of different things. However, this period is over!

So what does the Cloud actually mean? I spent some time clarifying it as part of my PhD thesis and prepared my own definition based on what I found in a number of books, articles and online resources. It goes as follows:

The Cloud is an ICT (Information and Communications Technology) infrastructure characterized by the following features:

  • it is completely transparent to users
  • it delivers value to users in the form of services
  • its storage and computing resources are infinite from users' perspective
  • it is geographically distributed
  • it is highly available
  • it runs on commodity hardware
  • it leverages computer network technologies
  • it leverages virtualization and clustering technologies
  • it is easily scalable (it scales out)
  • it operates on the basis of the distributed computing paradigm
  • it implements "pay-as-you-go" billing
  • it is multi-tenant

Phew ;). Obviously not every cloud exhibits all of these features, but most of them do. So what actually is the Cloud? I hope you will be able to answer quickly next time!

MapReduce Explained


Have you ever wondered how the Google Search engine - the core Google product that brought the company to the position of one of the biggest, if not the biggest, leaders on the ICT market - works? It all came about thanks to the MapReduce data processing framework. Although nowadays Google Search leverages much more powerful engines like BigTable and Caffeine, it has its origins in MapReduce.

MapReduce is everywhere nowadays. Search engines, banking systems, collaboration platforms - they are all using it. It is the de facto standard for processing and analysing big data sets. Any time you hear about Big Data, you will also hear about MapReduce.

But what actually is MapReduce? According to "MapReduce: Simplified Data Processing on Large Clusters" - the official publication by Google - MapReduce "is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key". Although you could just read the research paper, which I do encourage you to do anyway, I believe the best way to understand it is to go through a simple example. Let's move on to the next step then, where I'll show you a real-life problem for which MapReduce provides the simplest and lowest-cost solution.


I live in Krakow, Poland, which is a mid-sized EU city. Let's consider the following problem:

"I need a list of all streets in Krakow, both with the highest house number on particular street. The above needs to be completed in 2 days".

Sounds unworkable. I would need to visit each street, search for the house with the highest number, note it down, and so on until I have a full list. As the overall length of the streets in Krakow is around 1200 km, and assuming that I can walk 15 km per day (I am not much of a sportsman and need to spend some time looking for the house with the highest number too), the above would take me 80 days. Is there any way to accomplish the job then? The answer is: yes, there is a solution for that, and it is called MapReduce! Let's start with the Map step then.


First of all, you have forgotten that I have friends who will help me with that! I have 39 friends, so there are 40 of us. If we split the job so that each of us walks 30 km per day, together we cover 1200 km per day. That means we could walk the whole of Krakow in 1 day, so we would complete the job even before the deadline!

OK. But I said that I could only walk 15 km per day, and let's assume that it is indeed the average for all of us. I also mentioned that this limit comes both from our bodies' limitations (performance) and from the fact that we need to spend some time looking for the house with the highest number (processing time). I doubt we can do much about the performance, but could we speed up the processing time a bit?

Of course we could. Instead of looking for the house with the highest number on a particular street, we could just pass down the street quickly and note its name - the key - and each house number - the value. As house numbers are placed in the most visible spots, we would not even need to stop. Let's assume that this saves us so much time that we can double the distance we walk, up to 30 km per day.

Together we would then walk the whole of Krakow in 1 day. But will the job be completed after that? Not yet, as we would not have a list of streets with the highest house number, but a list of streets with all the house numbers on them instead. This is where the Reduce step comes in.


So far we have all spent 1 day on the job and we are almost done. The only task left is to sort the house numbers on each street and pick the highest one. Let's assume that this is done by me alone and it takes me 1 day. The job will then be completed on time!

But why did we hurry so much? Couldn't we just walk those 15 km per day and have the job completed in 2 days anyway, without performing the Reduce step? Either way I need to spend 2 days on the job.

That's right, but you might have forgotten about my friends. They are not involved in the Reduce step, so they get a day off and can help someone else, e.g. in Warsaw (the capital of Poland). I am not going to hold on to them any longer. Each of them did a really great task for me and we all did a really great job. Obviously I still have to sit down and manually sort the house numbers, which actually sounds like a nightmare ;).

Summary and Explanation

MapReduce is a framework for processing and generating large data sets in a quick and effective manner thanks to the distributed computing paradigm. The above example shows that concept in a direct and understandable way. There is a job that gets split into 40 Map tasks and 1 Reduce task. Each task is executed either by me or by one of my friends - the tasktrackers - while the whole job is coordinated by me alone - the jobtracker. Each task compiles a list of key-value pairs, which finally leads to the ultimate one. Keeping the Map function simple speeds up the processing time. Assigning only one tasktracker to the Reduce task frees up the resources of the remaining tasktrackers.

In a real MapReduce framework the jobtracker and tasktrackers are computer instances that cooperate in a distributed computing fashion. Map and Reduce are functions written in Java, Python or actually any other language suitable for data processing. The whole engine is coordinated by dedicated software such as Apache Hadoop, or by proprietary software as in the case of Google.
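
To make the mapping to code a bit more concrete, below is a minimal, self-contained Python sketch of the streets example. It is not tied to any particular framework, and the street names and house numbers are made up purely for illustration; in a real cluster the map calls would run on many tasktrackers and the grouped output would be fed to a single reduce task.

from collections import defaultdict

def map_street(street, house_numbers):
    # Map step: emit a (street, house number) pair for every house we pass
    for number in house_numbers:
        yield street, number

def reduce_street(street, numbers):
    # Reduce step: keep only the highest house number seen on the street
    return street, max(numbers)

# Hypothetical input: what the walkers noted down (made-up data)
walked_data = {
    "Florianska": [1, 5, 12, 47, 30],
    "Grodzka": [2, 8, 60, 33],
}

# Shuffle/sort phase: group all intermediate values by key (street name)
grouped = defaultdict(list)
for street, houses in walked_data.items():
    for key, value in map_street(street, houses):
        grouped[key].append(value)

# Reduce phase: a single worker compiles the final list
result = dict(reduce_street(street, numbers) for street, numbers in grouped.items())
print(result)  # {'Florianska': 47, 'Grodzka': 60}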

Starting Hadoop datanode: Error: JAVA_HOME is not set and could not be found. - Issues with CDH


CDH (Cloudera's Distribution Including Apache Hadoop) is the most popular and best documented distribution of Apache Hadoop. I have recently found some deficiencies in its documentation while following the CDH4 Quick Start Guide instructions. I installed the Oracle Java Development Kit and set up the JAVA_HOME environment variable according to the instructions, but when attempting to start the HDFS nodes I was receiving an error message stating that JAVA_HOME is not set and could not be found. After a quick research I finally found that the solution is simply to export JAVA_HOME inside the hadoop-env.sh configuration file in addition to the .bash_profile file. This solution comes very quickly to an experienced Hadoop administrator, but can be tricky for a beginner, so in my opinion it should be well documented by Cloudera. The following covers detailed troubleshooting steps together with the solution.


1) You have the Oracle Java Development Kit installed and the JAVA_HOME environment variable exported, which you can verify as follows:

[root@hadoop-standalone-mr1 ~]# env | grep JAVA_HOME
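
If the variable is exported correctly, the command should print a single line similar to the one below (the JDK path is only an example and will differ on your system):

JAVA_HOME=/usr/java/jdk1.7.0_45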

2) When attempting to start HDFS nodes you are receiving the following error messages:

[root@hadoop-standalone-mr1 ~]# for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do service $x start ; done
Starting Hadoop datanode:                                  [  OK  ]
Error: JAVA_HOME is not set and could not be found.
Starting Hadoop namenode:                                  [  OK  ]
Error: JAVA_HOME is not set and could not be found.
Starting Hadoop secondarynamenode:                         [  OK  ]
Error: JAVA_HOME is not set and could not be found.

How to fix the issue

1) Export the JAVA_HOME environment variable in the hadoop-env.sh configuration file:

echo export `env | grep ^JAVA_HOME` >> /etc/alternatives/hadoop-conf/hadoop-env.sh
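
If you prefer to edit the file by hand instead, the appended line should look roughly like the one below; the JDK path is only an example and has to match your actual installation:

export JAVA_HOME=/usr/java/jdk1.7.0_45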

2) You should be fine. All HDFS nodes start up properly now:

[root@hadoop-standalone-mr1 ~]# for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do service $x start ; done
Starting Hadoop datanode:                                  [  OK  ]
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-hadoop-standalone-mr1.out
Starting Hadoop namenode:                                  [  OK  ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-hadoop-standalone-mr1.out
Starting Hadoop secondarynamenode:                         [  OK  ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-hadoop-standalone-mr1.out


  • The above has been tested on the CDH4 package, on a CentOS 6.4 x86_64 system, in the Google Compute Engine environment.
  • The above solution works for both MRv1 and YARN.

How to connect Thunderbird to Exchange - DavMail Server


Mozilla Thunderbird, one of the most popular email clients, still suffers from one serious disease: it provides neither built-in mechanisms nor third-party plugins for an RPC/MAPI connection to MS Exchange Server. As an email services administrator I always used to support such customers by enabling a direct SMTP/IMAP connection instead. However, as my company has recently changed its security policy and decided to block raw SMTP/IMAP access to our MS Exchange infrastructure, I was forced to find an alternative solution for my Thunderbird users. After hours spent digging for the best possible solution I finally found DavMail. It turned out to be a kind of proxy, written in Java, that runs SMTP/IMAP servers locally and connects to MS Exchange via OWA. I managed to run DavMail in server mode on a standalone VM. After that I provided my Thunderbird users with the VM details, and now I have all of them connected to the company MS Exchange infrastructure. The following HowTo presents the detailed steps describing how I achieved that.


To set up DavMail in server mode on a VM, follow the instructions below:

1) Set up a VM with Ubuntu Server 12.04 64-bit with X Server (Unity preferably).

2) Install OpenJDK and SWT by running the following command:

# apt-get install openjdk-7-jre libswt-gtk-3-java

3) Download the newest available version of DavMail and install it together with the required dependencies by issuing the following commands:

# dpkg -i davmail*.deb
# apt-get -f install


1) To run DavMail in server mode, create the /etc/davmail directory and put the davmail.properties file there, adjusting its settings to fit your organization's requirements.


The most important ones are davmail.url, which points at your OWA URL, and davmail.allowRemote, which you need to set to true to support server mode. Moreover, in my case I also disabled the POP server and changed the davmail.caldavPort, davmail.imapPort, davmail.ldapPort and davmail.smtpPort values into the regular port numbers of the HTTPS, IMAPS, LDAPS and SMTPS services respectively.
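
As an illustration, a minimal davmail.properties covering these settings might look like the sketch below; the OWA URL is a placeholder for your own, and the davmail.server and davmail.popPort lines are my additions here, assuming the standard DavMail behaviour that davmail.server enables headless server mode and an empty port disables that listener:

# connect to Exchange through the OWA URL (replace with your own)
davmail.url=https://owa.example.com/owa/
# run as a standalone server and accept connections from remote hosts
davmail.server=true
davmail.allowRemote=true
# use the regular HTTPS, IMAPS, LDAPS and SMTPS port numbers
davmail.caldavPort=443
davmail.imapPort=993
davmail.ldapPort=636
davmail.smtpPort=465
# leave the POP port empty to disable the POP server
davmail.popPort=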

2) As all of the HTTPS, IMAPS, LDAPS and SMTPS services run over TLS, you will need a certificate in PKCS12 format attached. To generate it, assuming that you have the following files in PEM format: CA.pem, server.pem and server.key, run the following command:

openssl pkcs12 -export -in server.pem -inkey server.key -certfile CA.pem -out server.p12

Alternatively you can create a self-signed certificate or not attach one at all. Your setup will not be secure then, so it is highly recommended to use TLS anyway.

3) Point DavMail at the certificate by adjusting the TLS-related settings in the davmail.properties configuration file.
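
Assuming the server.p12 file generated above is stored in /etc/davmail, this part of the configuration would look roughly as follows (the keystore path and passwords are examples only):

# attach the PKCS12 certificate generated in the previous step
davmail.ssl.keystoreType=PKCS12
davmail.ssl.keystoreFile=/etc/davmail/server.p12
davmail.ssl.keystorePass=changeit
davmail.ssl.keyPass=changeit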


Running DavMail Server

To run DavMail, add the following line to the /etc/rc.local script before the exit 0 line:

nohup /usr/bin/davmail /etc/davmail/davmail.properties &

After that you will notice that your VM starts listening on TCP ports 443, 465, 636 and 993. Follow the instructions on the official DavMail website to configure your Thunderbird clients.
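
You can quickly confirm that with a generic check of the listening sockets, e.g.:

# netstat -tlnp | grep java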

Type: VPN Subtype: encrypt Result: DROP - asymmetric ACLs on Cisco IPsec VPN ASA edges


I have recently encountered a strange issue with a Cisco IPsec VPN between two sites of my organization. I had the VPN SAs established and proper ACLs permitting the desired subset of traffic attached to the crypto map, but the traffic was not able to pass anyway. Finally, it turned out that this was happening because of asymmetric ACLs on the neighboring Cisco ASAs.

Let's now have a look at a simplified version of the configuration I had. On site A I had subnets from one network, and on site B subnets from another. The Cisco ASA on site A was running Cisco Adaptive Security Appliance Software Version 8.2(5), while the Cisco ASA on site B was running version 7.2(4) (I suppose the issue might be related to a software version incompatibility, a bug in a certain software version, etc.). The relevant part of the configuration on each of the Cisco ASA instances was as follows:
  • site A:
object-group network A-site-subnets
object-group network B-site-subnets
access-list A2B extended permit ip object-group A-site-subnets object-group B-site-subnets
crypto map A-map 1 match address A2B
  • site B:
object-group network B-site-subnets
object-group network A-site-subnets
access-list B2A extended permit ip object-group B-site-subnets object-group A-site-subnets
crypto map B-map 1 match address B2A


To quickly test whether the communication issues are caused by asymmetric ACLs on the neighboring Cisco ASA edges, run the packet-tracer command on either instance, specifying parameters that should result in an ALLOW decision:

A-site-ASA# packet-tracer input inside icmp 0 0
Phase: 11     
Type: VPN
Subtype: encrypt
Result: DROP
Additional Information:

input-interface: inside
input-status: up
input-line-status: up
output-interface: outside
output-status: up
output-line-status: up
Action: drop
Drop-reason: (acl-drop) Flow is denied by configured rule

If the command results in output like the above, you can safely move on to the following section.

How to fix the issue

To get rid of the above errors, redesign the A-site-subnets and B-site-subnets object groups, and as a result the A2B and B2A ACLs, so that on both sides they either include the particular subnets or the whole network summary. To save time I chose the second approach:
  • site A:
object-group network A-site-subnets
  • site B:
object-group network B-site-subnets

At this point the traffic should be able to pass between the Cisco ASA instances. You can confirm it by re-running the packet-tracer command from the previous section; it should now end with an ALLOW decision.

The DFS replication service stopped replication on the replicated folder at local path ... - complex DFS issues


I have recently performed an upgrade of the DFS infrastructure at my company, which consists of 2 servers, one of which is the master and the other the slave for the DFS Replication service. As I needed to replace the disks on the slave node and permanently lost the replicated data as a result, I configured the DFS services from scratch and started the replication over again. Unfortunately, after a long time spent wondering whether the data were being replicated or not, I finally found the following message marked as a Warning in the Event Viewer on the master node:

"The DFS replication service stopped replication on the replicated folder at local path ... "

The following article presents how I bypassed the above issue thanks to articles on Technet and one of the Internet blogs. I have also included some of my own additions to the provided solutions. I hope you will find this information useful and consolidated.

How to fix the issue

Following the instructions on Technet:

1) Stop and disable the DFS Replication service.

2) Go to the drive containing the replicated folder and make sure that the following folder options are set as follows:
  • Show hidden files, folders, and drives - ENABLED
  • Hide protected operating system files (Recommended) - DISABLED

3) On the Security tab of the System Volume Information folder Properties, add the user that you are currently logged in as, with Full Control permissions and the scope of This folder only.

4) Go into the System Volume Information folder and, on the Security tab of the DFSR folder Properties, add the user that you are currently logged in as, with Full Control permissions and the scope of This folder, subfolders and files. Make sure that the permissions get propagated to all child items of the folder.

5) Remove the DFSR folder. If the above results in the Source Path Too Long error, like the one shown below:

The source file name(s) are larger than is supported by the file system. Try moving to a location which has a shorter path name, or try renaming to shorter name(s) before attempting this operation.

perform the following steps (thanks to this blog):
  • create an empty TEMP folder in the root of the C drive
  • run the following command, which mirrors the empty TEMP folder onto the DFSR folder and thereby wipes its content, including the paths that are too long to delete directly:
    robocopy C:\TEMP [DFSR folder path] /MIR
  • remove both the TEMP and DFSR folders, this time fortunately without the above error.

6) Remove the user that you are currently logged in as from the Security tab of the System Volume Information folder Properties.

7) Enable the DFS Replication service and start it again. You are done. The DFS replication starts over again!

DFS folder deeply hidden, invisible


It's another time when I'm working on Windows Server 2012 and discover something really odd and poorly documented. This time it was about DFS replication. I set it up between two servers in two remote AD sites, all according to the regular manuals. It was working fine: the folder was claimed to be replicated, there were no issues reported by the Event Viewer, and I was able to mount the share associated with the replicated folder on the remote server. So what was wrong? The folder was invisible in the file system on the remote server.

I was able to access it in Windows Explorer only when specifying the full path to it. The content was then displayed just fine. The same applied to the command line, even when running it as the Administrator user. After disabling the DFS replication the folder was still invisible. So what was wrong? After deep troubleshooting and looking for a solution on the Internet, I finally found a clue: DFS replicated folders have the hidden and system attributes set by default. That's why they don't appear in the filesystem! That's what is called deeply hidden folders.


In order to make a DFS replicated folder visible in the filesystem again, type the following command in the Windows CLI:

attrib -r -h -s [path to the folder]

Well done! The folder is visible in the filesystem again (the -h and -s switches clear the hidden and system attributes respectively). I'm just not sure why this isn't clearly stated and documented. I hope that the above quick CaseStudy will help some abashed sysadmins like me save a lot of time spent troubleshooting the strange behavior of Microsoft products.