II. Installing and configuring Git LFS locally
III. Using SSH with your Git repository
IV. Configuring and launching an Amazon EC2 instance
V. Accessing an Amazon EC2 instance in the browser
VI. Accessing an Amazon EC2 instance via SSH
VII. Cloning a Git repository on your Amazon EC2 instance
VIII. Installing and configuring the AWS Command Line Interface (CLI)
While a pleasant side effect of running large Bayesian models that strain your laptop to its breaking point for hours on end is ample free time for other worldly pursuits, there comes a point where it is necessary to don one’s big person pants and move all this heavy computation to the cloud. The goal of this tutorial is to (1) configure the hardware and software of an Amazon EC2 instance that we then create, (2) connect to this instance in order to clone a Git repository, run an R script (hereafter called GenericEC2Run.R), and save the R output, then (3) publish these changes to your project’s remote Git repository as a new branch, before (4) terminating the instance. In our case, the R script GenericEC2Run.R will be an R markdown file that runs a JAGS model, but it could be anything you wish. Let’s get started!
You will need Git LFS if you plan to commit any files >= 100 MB in size. The MCMC output from models with a large number of parameters could easily exceed this size, making LFS a must. This section describes installing Git LFS locally. Later on we describe how to install Git LFS on your EC2 instance, as it does not come pre-installed with the AMI we will be working with. Remember, Git LFS is a subscription service, and you first need to set this up on your GitHub account. If you don’t need Git LFS then you can skip this section and the code below where we install Git LFS on our Amazon EC2 instance.
Getting Git LFS to work correctly on the Mac can be confusing, since Git is typically installed in /usr/bin, while homebrew installs Git LFS in /usr/local/bin. Unfortunately, both Git and Git LFS need to be in the same directory to work. In addition, you need to set the PATH so that your shell looks in these directories when git is invoked. For a brief foray into PATH enlightenment see Chris Bednarski’s blog post. Lastly, you need to commit a .gitattributes file to your Git repository specifying which files should be handled using Git LFS. Got all that? Here are the steps. Make sure you have not committed any large files (>= 100 MB) before doing this!
Install homebrew from the terminal.
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install Git using homebrew.
brew install git
Install Git LFS using homebrew.
brew install git-lfs
Make sure that Git and Git LFS are both located in the same directory, your /usr/local/bin.
Add /usr/local/bin to your path.
echo 'export PATH="/usr/local/bin:/usr/local/sbin:~/bin:$PATH"' >> ~/.bash_profile
Type which git in the terminal to make sure your shell defaults to the git installation in /usr/local/bin.
If using Tower, select this version of Git under Preferences, Git Config.
Before you do any commits involving large files, either create or amend the .gitattributes file to associate a file type with Git LFS. In our case this will be all rda files, so we type the following. The pattern is quoted so that the shell passes it to Git LFS unexpanded.
git lfs track "*.rda"
Commit this change to .gitattributes and push it to the remote Git repository.
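Since the steps above assume no large files have slipped in yet, a quick scan of the working folder can help before the first commit. This is only a minimal sketch: the function name find_large_files is hypothetical, it takes 100 MB to mean 104857600 bytes (an assumption), and it looks at files on disk rather than at the commit history.

```shell
# List files of at least 104857600 bytes under a directory,
# skipping Git's own .git folder. The function name is hypothetical.
find_large_files() {
    search_dir=${1:-.}
    # "-size +104857599c" means "more than 104857599 bytes".
    find "$search_dir" -type f -size +104857599c ! -path '*/.git/*'
}

# Example (run from the top of your repository):
#   find_large_files .
```

Anything this prints is a candidate for a git lfs track pattern before it is committed.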
We will now generate an SSH key (two files - a public key that you share with the world and a private key you keep safe) that we will associate with our Git account. This will allow us to clone our Git repository on an EC2 instance without having to manually type in your username and password or (worse yet) put your password in cleartext when using a script.
In terminal create an SSH key, substituting your email address.
ssh-keygen -t rsa -b 4096 -C "cccweb@icloud.com"
Save the key to the default directory, ~/.ssh.
Skip entering a pass-phrase.
Check that the public and private key are in ~/.ssh by going to the directory and typing ls -l id_rsa*. You should see two files, the public key named id_rsa.pub and the private key named id_rsa.
-rw-r--r-- 1 coldwater staff 3243 Mar 15 10:19 id_rsa
-rw-r--r-- 1 coldwater staff 743 Mar 15 10:19 id_rsa.pub
From the terminal, make sure this private key is not publicly viewable.
chmod 600 ~/.ssh/id_rsa
Check that this worked by typing ls -l id_rsa*. Notice that now the private key can only be read and written to by me, while the public key can still be read by everyone.
-rw------- 1 coldwater staff 3243 Mar 15 10:19 id_rsa
-rw-r--r-- 1 coldwater staff 743 Mar 15 10:19 id_rsa.pub
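If you would rather have a script confirm the permissions than read the ls output by eye, something like the following could work. It is a sketch under assumptions: the function names file_mode and key_is_private are made up for illustration, and it treats only modes 600 and 400 as acceptable.

```shell
# Print a file's permission bits in octal (for example 600).
# GNU stat (Linux) uses -c and BSD stat (macOS) uses -f, so try both.
file_mode() {
    stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# Succeed only if the key can be read by its owner alone (600 or 400).
key_is_private() {
    mode=$(file_mode "$1") || return 1
    [ "$mode" = "600" ] || [ "$mode" = "400" ]
}

# Example:
#   key_is_private ~/.ssh/id_rsa || chmod 600 ~/.ssh/id_rsa
```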
Go to the settings under your GitHub account and then click SSH keys and New SSH key.
In terminal copy your public key to the clipboard.
pbcopy < ~/.ssh/id_rsa.pub
Paste this into the key box on GitHub and click save. This key is available to ALL your Git repositories.
OK, let’s configure and launch our Amazon EC2 instance. We can access it via the browser, as well as SSH from the terminal. An instance is a virtual machine that runs an AMI, which specifies its software configuration. You can start with a bare-bones Ubuntu image and install R, etc., but that is time-consuming. Fortunately, there are many community AMIs available that are configured for different tasks. Louis Aslett has created an RStudio Server AMI that we will use.
Go to the EC2 Dashboard under the AWS Management Console.
Make sure the region selected in the top right is as close to you as possible. If not, change it.
Click launch instance.
Click Community AMIs, and type in ami-d1e792b1. This is the RStudio Server AMI matching my region (Northern California).
Select Free-Tier eligible to get started. Later as you do more intensive runs, you can alter these hardware settings.
Under configure instance details, select terminate under shutdown behavior. Leave the other options as is.
Leave storage and tag instance options as is.
Under security group, click add rule, and select HTTP. These rules are what allow you to connect to your instance via the web (port 80) or via SSH (port 22). Don’t worry about the warning. Nobody can SSH into this instance without your private key, which only you have. The RStudio login over HTTP, however, is protected only by its password.
Click launch instance and select create a key pair. Name and download your private key. Remove the .txt extension from the private key. Move the key to ~/.ssh by typing in the terminal:
mv /Users/coldwater/Downloads/CheCastaldoAmazon.pem ~/.ssh/CheCastaldoAmazon.pem
In the terminal, make sure this private key is not publicly viewable.
chmod 400 ~/.ssh/CheCastaldoAmazon.pem
Check that this worked by typing ls -l CheCastaldoAmazon.pem. Notice that access to this file is even more restricted than the Git private key.
-r--------@ 1 coldwater staff 1692 Mar 15 11:42 CheCastaldoAmazon.pem
Click view instances. From this screen you can control your instance, as well as make changes to it. Note, stopping your instance is not the same as terminating it. If you want to kill an instance, you must use terminate.
Once your instance state is green (running) you can access it from the terminal or through the web browser.
To access it in the browser, go to the instance’s public DNS and log in with the username rstudio and the password rstudio. Welcome to your Amazon EC2 instance! To access the command line here, click shell, under tools.
Here we access your instance via SSH from the terminal. Why do this? Simple - script automation and scalability.
First, we will secure copy our GitHub private key to the EC2 instance. That way, it is there waiting for us when we log in and we can use it to securely clone our Git repository. To do this, get the Public DNS for our instance from the EC2 console. When using the secure copy (scp) or SSH command you will first need to specify your Amazon key pair and then the user name (ubuntu) and the public DNS of the instance you wish to connect to (ubuntu@public DNS). In the terminal, type:
scp -i ~/.ssh/CheCastaldoAmazon.pem ~/.ssh/id_rsa ubuntu@ec2-52-35-131-165.us-west-2.compute.amazonaws.com:~/.ssh/
Now, from the terminal, SSH into your instance.
ssh -i ~/.ssh/CheCastaldoAmazon.pem ubuntu@ec2-52-35-131-165.us-west-2.compute.amazonaws.com
After which you will say yes and see something similar to the following. Again, welcome to your Amazon EC2 instance!
The authenticity of host 'ec2-52-35-131-165.us-west-2.compute.amazonaws.com (52.35.131.165)' can't be established.
ECDSA key fingerprint is SHA256:iwfo8uhyBDJAthQQjxWnkcRSKK9XqE+0hSmr96zLHRQ.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-52-35-131-165.us-west-2.compute.amazonaws.com,52.35.131.165' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-74-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information as of Tue Mar 15 16:45:10 UTC 2016
System load: 0.52 Memory usage: 5% Processes: 81
Usage of /: 44.3% of 9.71GB Swap usage: 0% Users logged in: 0
Graph this data and manage this system at:
https://landscape.canonical.com/
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
ubuntu@ip-172-31-30-91:~$
Type ls -l ~/.ssh and you should see your Git private key, with appropriate permissions, saved in the ~/.ssh folder.
-rw------- 1 ubuntu ubuntu 399 Mar 15 16:45 authorized_keys
-rw------- 1 ubuntu ubuntu 3243 Mar 15 18:32 id_rsa
Sometimes at this point there is a permissions error, where the user ubuntu (you) is not the owner of the home directory. This makes it impossible to clone your Git repository there. To check this type:
cd /home
cd ..
ls -l
Looking below, I see that root is the owner of home.
ubuntu@ip-172-31-30-91:/$ ls -l
total 76
drwxr-xr-x 2 root root 4096 Dec 18 20:55 bin
drwxr-xr-x 3 root root 4096 Dec 18 20:55 boot
drwxr-xr-x 13 root root 3820 Mar 15 16:45 dev
drwxr-xr-x 99 root root 4096 Mar 15 16:45 etc
drwxr-xr-x 5 root root 4096 Mar 15 20:43 home
To fix this I type:
sudo chown -R ubuntu:ubuntu /home
And now, ls -l shows:
ubuntu@ip-172-31-30-91:/$ ls -l
total 76
drwxr-xr-x 2 root root 4096 Dec 18 20:55 bin
drwxr-xr-x 3 root root 4096 Dec 18 20:55 boot
drwxr-xr-x 13 root root 3820 Mar 15 16:45 dev
drwxr-xr-x 99 root root 4096 Mar 15 16:45 etc
drwxr-xr-x 5 ubuntu ubuntu 4096 Mar 15 20:43 home
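Because this ownership problem only shows up sometimes, a script may want to test for it before reaching for chown. The sketch below is one way to do that. The helper name owner_of is hypothetical, and it relies on the owner being the third column of ls -ld output, as in the listings above.

```shell
# Print the owner of a file or directory (third column of "ls -ld").
# The function name is hypothetical.
owner_of() {
    ls -ld "$1" | awk '{print $3}'
}

# Example, on the instance: only change ownership when it is needed.
#   [ "$(owner_of /home)" = "ubuntu" ] || sudo chown -R ubuntu:ubuntu /home
```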
FYI, to log out, type exit. Note this doesn’t kill the EC2 instance! You are just logged out of it. Go back to the EC2 console and your instance will still be running.
The good news is Git is installed by default on this AMI. However, the bad news is we still need to install Git LFS and securely get our private key onto our instance so we can clone our repository.
From the terminal, run a script from packagecloud that allows us to install Git LFS using the apt-get utility.
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
Now install Git LFS.
sudo apt-get install git-lfs
Typing which git shows the Git install in /usr/bin/ on your instance. Typing ls -l /usr/bin/git* shows Git and Git LFS in the same folder, which is exactly where they need to be.
Go to the repository you wish to clone on your GitHub account and copy the SSH link for the master branch under the code tab.
In the terminal, move to the home directory and then clone the master branch to this Amazon EC2 instance. Now your Git repository is cloned to the home directory of your Amazon EC2 instance.
cd /home
git clone git@github.com:CCheCastaldo/MAPPPD-Model.git
So that was a lot of hassle, no two ways about it. Let’s now take a different approach and use a set of command line tools to automate this entire process. Sound good? Some quick preliminaries - when you log into the AWS website, you use a username (your email) and password to verify who you are. To use the CLI, you will not authenticate using a username/password but instead with an access key. This is a little confusing, so remember this access key identifies who you are to AWS. This is different from the key pair you use to connect to an instance (see IV. Configuring and launching an Amazon EC2 instance). This is like being on a guest list to get into a convention (your AWS access key) and then having a ticket (your key pair) for a specific event (your instance) at the convention.
Like before, let’s move the downloaded access key file (rootkey.csv) to ~/.ssh. I’ll rename it now too and restrict its permissions.
mv /Users/coldwater/Downloads/rootkey.csv ~/.ssh/CheCastaldoAmazonAWS.csv
chmod 400 ~/.ssh/CheCastaldoAmazonAWS.csv
Check that Python is installed by typing python --version in terminal.
If not, download and install the latest 2.7 package from the Python website.
Download and unzip the AWS CLI Bundled Installer.
cd /Users/coldwater/Downloads
curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
unzip awscli-bundle.zip
Install the command line tools. From AWS, this command “installs the AWS CLI at /usr/local/aws and create the symlink aws at the /usr/local/bin directory. Using the -b option to create a symlink eliminates the need to specify the install directory in the user’s $PATH variable. This should enable all users to call the AWS CLI by typing aws from any directory.”
sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
Check the install with /usr/local/bin/aws --version. You should see the aws-cli version.
aws-cli/1.10.13 Python/2.7.10 Darwin/15.3.0 botocore/1.4.4
Type aws configure. You should now enter the access key id and secret access key from your AWS access key. To see these values quickly just type cat ~/.ssh/CheCastaldoAmazonAWS.csv in the terminal. Next choose the geographic region code closest to you. You can leave the output format blank (hit enter).
AWS Access Key ID [None]: AKIAIWPG6RASEO27MEUA
AWS Secret Access Key [None]: blahblahblah
Default region name: us-west-1
Default output format [None]:
The CLI stores the key information in a file named credentials and the region and output format in a file called config, both in ~/.aws.
Christians-MacBook-Pro:.aws coldwater$ ls -la
total 16
drwxr-xr-x 4 coldwater staff 136 Mar 16 10:51 .
drwxr-xr-x+ 34 coldwater staff 1156 Mar 16 10:49 ..
-rw------- 1 coldwater staff 29 Mar 16 10:51 config
-rw------- 1 coldwater staff 116 Mar 16 10:41 credentials
Define a security group that you will use for your EC2 instances. The vpc-id corresponds to the EC2-VPC from your chosen region. You can create your own, or use the default, which can be found under the Networking, VPC tab in the AWS Management Console.
aws ec2 create-security-group --group-name CheCastaldoDefaultSecurity --description "Default Security Group" --vpc-id vpc-9c5726f9
Add rule to this security group to allow SSH access to your EC2 over port 22 from any machine.
aws ec2 authorize-security-group-ingress --group-name CheCastaldoDefaultSecurity --protocol tcp --port 22 --cidr 0.0.0.0/0
To see or delete any of these security groups type the following in terminal. These security groups will also be available to you if you launch an instance through the AWS console. Also, this is how you find the security group id, which you will need to launch your instance.
aws ec2 describe-security-groups --group-names CheCastaldoDefaultSecurity
aws ec2 delete-security-group --group-name CheCastaldoDefaultSecurity
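Since the security group id is needed later when launching an instance, it can be convenient to pull out just that one value rather than reading through the full description. The following is a sketch, not the only way to do it: the function name security_group_id is hypothetical, and it relies on the AWS CLI's --query option with the path SecurityGroups[0].GroupId.

```shell
# Print only the id (sg-...) of a named security group.
# The function name is hypothetical; it needs a configured AWS CLI.
security_group_id() {
    group_name=$1
    if ! command -v aws >/dev/null 2>&1; then
        echo "aws CLI not found on PATH" >&2
        return 127
    fi
    aws ec2 describe-security-groups \
        --group-names "$group_name" \
        --query 'SecurityGroups[0].GroupId' \
        --output text
}

# Example:
#   sg_id=$(security_group_id CheCastaldoDefaultSecurity)
#   echo "$sg_id"
```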
If you haven’t done so already, you can create a key pair from the terminal. It is best to save it in the ~/.ssh directory. Also alter the permissions on this key so it can be read only by you.
aws ec2 create-key-pair --key-name CheCastaldoAmazon --query 'KeyMaterial' --output text > ~/.ssh/CheCastaldoAmazon.pem
chmod 400 ~/.ssh/CheCastaldoAmazon.pem
To see or delete any of these key pairs type the following in terminal. Like security groups, this key will also be available to you if you launch an instance through the AWS console.
aws ec2 describe-key-pairs
aws ec2 delete-key-pair --key-name CheCastaldoAmazon
We will now use AWS CLI, Git, bash shell, and R commands to (1) configure and create an Amazon EC2 instance, (2) connect to it in order to clone a Git repository, run the R script GenericEC2Run.R, and save the R output, then (3) publish these changes to your project’s remote Git repository as a new branch, before (4) terminating the instance. This will involve a sequence of four scripts, each of which we discuss below in turn.
This script, Startup.bash, does what we did in section IV, except this time we will be using the AWS CLI commands to create and connect to EC2 instances.
Assign a shell variable called run that is the name of the R script we ultimately want to run. It also specifies the folder containing all files associated with running this script on our instance as well as the branch name we will check out, push, and publish on our remote Git repository. Here that is GenericEC2Run.
run=GenericEC2Run
Launch EC2 instance from terminal. Notice we supply the hardware configuration, the AMI, the number of instances we wish to create, the name of our key pair, and our security group. We also specify the shutdown behavior. Finally, we query the output from this command to get the instance id, which we save to a shell variable called instance_id whose value is the unique instance id associated with the instance we just created.
instance_id=$(aws ec2 run-instances --image-id ami-d1e792b1 --security-group-ids sg-7342b517 --count 1 --instance-type t2.micro --key-name CheCastaldoAmazon --instance-initiated-shutdown-behavior terminate --query 'Instances[0].{d:InstanceId}' --output text)
Wait until the instance is up and running before doing anything else. We need our instance’s public DNS to access this instance, but it takes Amazon a few minutes to create our instance in the first place before we can even grab it. The wait command will only finish after the specified condition is met; here that is when our particular instance is running.
aws ec2 wait instance-running --instance-ids $instance_id
Save the public DNS of the instance we just created to a variable called dns. To see it type echo $dns.
dns=$(aws ec2 describe-instances --instance-ids $instance_id --query 'Reservations[*].Instances[*].PublicDnsName' --output text | grep a)
echo $dns
Even though the instance is running, it can take several moments before you can SSH in. This bit of code forces our script to wait until port 22 is available on our new instance. Thanks to James Saryerwinnie for this bit of ingenious code, which can be seen in his AWS re:Invent 2015 presentation and found on his GitHub.
wait_for_port() {
    local port=22
    local host=$dns
    while ! nc -z "$host" "$port" >/dev/null; do
        sleep 5
    done
}
wait_for_port
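One thing to be aware of is that a loop like this waits forever if the port never opens, for example when the security group has no SSH rule. A possible variant with a time limit is sketched below. The helper retry_until is hypothetical (it is not part of the original script) and simply reruns whatever command it is given until the command succeeds or the time runs out.

```shell
# Rerun a command every <interval> seconds until it succeeds,
# giving up after roughly <timeout> seconds. Hypothetical helper.
# Usage: retry_until <timeout_seconds> <interval_seconds> <command...>
retry_until() {
    retry_timeout=$1
    retry_interval=$2
    shift 2
    retry_elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$retry_elapsed" -ge "$retry_timeout" ]; then
            return 1
        fi
        sleep "$retry_interval"
        retry_elapsed=$((retry_elapsed + retry_interval))
    done
    return 0
}

# Example: wait up to five minutes for SSH on the new instance.
#   retry_until 300 5 nc -z "$dns" 22 || { echo "port 22 never opened" >&2; exit 1; }
```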
Secure copy our GitHub private key to the EC2 instance. The UserKnownHostsFile and StrictHostKeyChecking options remove the host verification prompt (the computer asks you to type yes or no to continue). Removing this is essential if we want to run these actions in a script.
scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/CheCastaldoAmazon.pem ~/.ssh/id_rsa ubuntu@$dns:~/.ssh/
Use SSH to remotely change the ownership of the /home and /usr/local/lib/R/site-library folders to ubuntu so we can clone our Git repository and install new R packages that did not come preinstalled with the AMI.
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/CheCastaldoAmazon.pem ubuntu@$dns "sudo chown -R ubuntu:ubuntu /home"
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/CheCastaldoAmazon.pem ubuntu@$dns "sudo chown -R ubuntu:ubuntu /usr/local/lib/R/site-library/"
Secure copy a shell script Job.bash containing the things we actually want to do on our new instance to /home. Note we are using the shell variable run we defined at the start of Startup.bash.
scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/CheCastaldoAmazon.pem /Users/coldwater/Library/Mobile\ Documents/com~apple~CloudDocs/Research/MAPPPDModel/ModelBuild/$run/Job.bash ubuntu@$dns:/home
Run the shell script Job.bash.
run=GenericEC2Run
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/CheCastaldoAmazon.pem ubuntu@$dns "bash /home/Job.bash"
After Job.bash is done, terminate the instance.
aws ec2 terminate-instances --instance-ids $instance_id
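As written, the terminate command is only reached if every earlier step finishes, so a failed scp or ssh could leave the instance running. One way to guard against that is a shell trap, sketched below. The names terminate_instance and run_with_cleanup are hypothetical, the instance id in the example is made up, and the sketch defaults to a dry run (DRY_RUN=1) that only prints the command. Set DRY_RUN=0 to actually terminate.

```shell
# Terminate an instance, or just say so when DRY_RUN is 1 (the default here).
terminate_instance() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: aws ec2 terminate-instances --instance-ids $1"
    else
        aws ec2 terminate-instances --instance-ids "$1"
    fi
}

# Run a command in a subshell and terminate the instance when the
# subshell exits, whether the command succeeded or failed.
# Usage: run_with_cleanup <instance_id> <command...>
run_with_cleanup() {
    (
        instance_id=$1
        shift
        trap 'terminate_instance "$instance_id"' EXIT
        "$@"
        status=$?
        exit "$status"
    )
}

# Example with a made-up instance id and a job that fails:
#   run_with_cleanup i-0123456789abcdef0 false || echo "job failed, cleanup still ran"
```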
This script, Job.bash, performs the tasks we did in section VII, mainly installing Git LFS and cloning our Git repository. This script then runs Rsetup.R, which we discuss below.
Assign the same shell variable called run that we did above.
run=GenericEC2Run
Add the packagecloud repository for apt-get, install Git LFS, and finally clone our own Git repository. The only new wrinkle is that we create a config file in ~/.ssh and add a line to suppress host checking when accessing github.com. This will keep the instance from forcing you to type yes when cloning. Lastly, we set our Git identity globally on this instance so we can publish our commits.
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
cd /usr/bin
sudo apt-get install -y git-lfs
cd /home
echo 'Host github.com
StrictHostKeyChecking no' > ~/.ssh/config
git clone git@github.com:CCheCastaldo/MAPPPD-Model.git
git config --global user.email "cccweb@icloud.com"
git config --global user.name "CCheCastaldo"
Check out a new branch, which is named using the run shell variable value, GenericEC2Run.
cd MAPPPD-Model
git checkout -b $run
Run Rsetup.R from the command line. Remember, Rsetup.R sources the actual R job we want to run and is the motivation for this entire tutorial! Here we pass it a command line argument, which is again the run shell variable value, GenericEC2Run.
cd ModelBuild/$run
Rscript --no-save --no-restore --verbose Rsetup.R $run
Lastly, stage and commit our changes to the local branch named GenericEC2Run, before publishing this branch to the remote Git repository.
git add --all
git commit -m "ec2 run complete"
git push -u origin $run
This script, Rsetup.R, installs the R libraries needed for our R script, GenericEC2Run.R, to run and then sources this script.
Use commandArgs to store in a vector called args the arguments passed to Rsetup.R. In this case it stores only the run shell variable value, GenericEC2Run, as the single element args[1].
args = commandArgs(trailingOnly=TRUE)
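Use args[1] to define a text file where we will save all our R output and messages with the sink command. Messages can only be diverted to an open connection, and sink handles one type per call, so we open the file first and call sink once for each type.
out <- paste(args[1],".txt",sep="")
con <- file(out, open = "wt")
sink(con, type = "output")
sink(con, type = "message")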
Install necessary R packages to run GenericEC2Run.R.
install.packages(pkgs=c("plyr","gdata","knitr","markdown","rjags","coda"), repos="https://cloud.r-project.org", quiet=TRUE)
Change the working directory to the folder containing GenericEC2Run.R and source this script.
modelBuild <- "/home/MAPPPD-Model/ModelBuild/"
wd <- paste(modelBuild,args[1],sep="")
setwd(wd)
file <- paste(args[1],".R",sep="")
source(file, verbose=TRUE)
This is my tentative workflow suggestion.
Create a folder that has the four scripts discussed in the previous section.
When you want to do an EC2 run, make a copy of the folder and replace GenericEC2Run.R with whatever R script you want to run on your instance.
Rename the folder with this same name.
Open Startup.bash and Job.bash and change the run variable to also match this name.
In the terminal go to this folder and run Startup.bash.
bash Startup.bash
When the job is done running, it should appear in your remote Git repository as a new branch.
Merge the branch with your master branch and delete the remote branch and then pull to your local Git repository.
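The copy and rename steps of this workflow could themselves be scripted. What follows is a rough sketch under a few assumptions: the function name new_run and the run name MyNewRun are hypothetical, the run name is a plain word with no slashes or spaces, and both scripts set the variable on a line beginning with run=. You would still drop your own R script into the new folder by hand.

```shell
# Copy a template run folder to a sibling folder named after the new run,
# then point the run= line in Startup.bash and Job.bash at the new name.
# Usage: new_run <template_dir> <new_name>
new_run() {
    template_dir=$1
    new_name=$2
    target_dir=$(dirname "$template_dir")/$new_name
    if [ -e "$target_dir" ]; then
        echo "refusing to overwrite $target_dir" >&2
        return 1
    fi
    cp -R "$template_dir" "$target_dir" || return 1
    for script in Startup.bash Job.bash; do
        if [ -f "$target_dir/$script" ]; then
            # Write to a temporary file first; "sed -i" differs between macOS and Linux.
            sed "s/^run=.*/run=$new_name/" "$target_dir/$script" > "$target_dir/$script.tmp" &&
                mv "$target_dir/$script.tmp" "$target_dir/$script"
        fi
    done
    echo "$target_dir"
}

# Example:
#   new_run ModelBuild/GenericEC2Run MyNewRun
```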