December 03, 2014
It is your worst nightmare: Ambari loses touch with all of the nodes in your cluster, and you can no longer manage the Hadoop processes across the cluster via the GUI. So you check the ambari-server log to find out why it is no longer talking to the agents, and there you find an error like this:
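A quick way to confirm this is what you are hitting is to grep the server log for SSL exceptions. A minimal sketch, assuming the default log location of /var/log/ambari-server/ambari-server.log:

```shell
# Show the most recent SSL handshake failures in the Ambari server log.
# The log path default below is an assumption from a stock install;
# adjust it if your install logs elsewhere.
ssl_errors() {
  local log="${1:-/var/log/ambari-server/ambari-server.log}"
  grep -i 'SSLException' "$log" | tail -n 5
}
```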
WARN nio:651 – javax.net.ssl.SSLException: Received fatal alert: certificate_expired
So, following the suggestions on Ambari's Apache wiki page, you attempt to delete the existing certificates and regenerate new ones via openssl. It is on the official wiki, so it has to be right? Once you get past the certificate passphrase not being the default, you get a new error:
WARN nio:651 – javax.net.ssl.SSLException: Received fatal alert: unknown_ca
Things are not looking good. Luckily, recovering from this error is easy, and quite fast if you have the right tools installed. First, stop all of the agents and the Ambari server. Once everything is down, delete the agent certificates located here:
Using a tool such as pdsh makes this fast and easy. Once the keys are deleted, you can start all of the agents back up.
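A sketch of that agent-side cleanup, assuming the default agent key location of /var/lib/ambari-agent/keys (verify the path on your install before deleting anything):

```shell
# Remove an agent's certificates so the server can reissue them on restart.
# The default keys directory below is an assumption from a stock install.
clean_agent_keys() {
  local keys_dir="${1:-/var/lib/ambari-agent/keys}"
  rm -f "$keys_dir"/*.crt "$keys_dir"/*.key "$keys_dir"/*.csr
}

# Cluster-wide, the same thing via pdsh (hypothetical host list):
#   pdsh -w 'node[01-10].example.com' ambari-agent stop
#   pdsh -w 'node[01-10].example.com' \
#     'rm -f /var/lib/ambari-agent/keys/*.crt /var/lib/ambari-agent/keys/*.key /var/lib/ambari-agent/keys/*.csr'
#   pdsh -w 'node[01-10].example.com' ambari-agent start
```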
Now delete the keys on the server, located here:
echo "" > /var/lib/ambari-server/db/index.txt
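One way to script the server-side cleanup, assuming the default server key directory of /var/lib/ambari-server/keys (an assumption on my part; the index.txt path is the one shown above):

```shell
# Back up the server keys, delete them, and reset the certificate index
# so the server mints fresh keys on startup. The keys directory default
# is assumed; confirm it on your install before deleting anything.
reset_server_keys() {
  local keys_dir="${1:-/var/lib/ambari-server/keys}"
  local index="${2:-/var/lib/ambari-server/db/index.txt}"
  cp -a "$keys_dir" "${keys_dir}.bak"            # keep a backup, just in case
  find "$keys_dir" -maxdepth 1 -type f -delete   # remove key/cert files, keep subdirs
  echo "" > "$index"                             # reset the certificate index
}
```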
Once those files are backed up and removed, you can start the Ambari server; it will regenerate new keys and be able to communicate with all of the agents again.
By default it generates keys good for one year, so expect to repeat this process annually if you aren't doing a clean install with new releases.
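To avoid the surprise next year, you can check when a certificate expires with openssl; the ca.crt path in the usage comment is an assumed default server location:

```shell
# Print the expiration date of a certificate (notAfter field).
cert_expiry() {
  openssl x509 -enddate -noout -in "$1"
}

# e.g. cert_expiry /var/lib/ambari-server/keys/ca.crt
```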