Celgene research collaboration environments
Learn how Turbot Guardrails enables a small team to rapidly deliver research collaboration environments.
Disclaimer: Automated Transcript
In the past decade we've seen tremendous growth in collaborative research taking place on the cloud. Even the quotes listed here are a few years old at this point, and the new normal is for almost all of your collaborative research to take place on the cloud, rather than just using it to meet in the middle. The factors behind this are both technical and scientific. On the technical side, we now have the ability to do things we couldn't before; I'm sure most of you remember the days when we were primarily sending hard drives through the mail, or even trading spreadsheets, as we tried to exchange scientific data. In addition, many of the scientific problems being addressed today are so large that they exceed the scale or ability that any one institution can muster, so you find a lot more institutions taking advantage of the fact that they can collaborate much more easily, and actually doing so on the cloud.
We hope that AWS is the home for all of your collaborative scientific research needs. Over the years it has evolved from a simple on-site model, like I said, trading hard drives and keeping high-powered workstations under your desk, the sort of shadow IT that was built around any scientific research effort or collaboration, to meeting in the middle: using S3 to trade data instead of sending hard drives, and using common compute platforms made available that way instead of everybody working individually on workstations they had set up on their own. What we're seeing now is growth into what we call DevOps, which is really just a loose name for the coordinated development and deployment of common applications, common environments, and common platforms, so that these collaborative efforts actually take place within a common framework. So today we have a review of a couple of those architectures for collaborative research, brought to you by Ryan and Stephen from HLI and Lance from Celgene, who will go over what they've set up and the general architecture they've deployed.

My name is Ryan Ulasek, and this is Stephen Terrill; we're from a company called Human Longevity, and the title of our talk is Building a Platform for Collaborative Scientific Research on AWS. The topics we're going to cover today are our company, Human Longevity, some of the challenges we face, the solutions we came up with, the journey to get there, and some closing thoughts.

This is a chart from the CDC, and it describes life expectancy over three different generations. The first shift you see, from the blue to the yellow line, is an improvement in life expectancy due to the eradication of infectious disease in the early 20th century through antibiotics and vaccines. The second shift, from the yellow to the red line, is an improvement in life expectancy due to better health care, better practice of medicine, improvements in sanitation, and improvements in the economy.
So the big challenge for the 21st century healthcare system is to continue this progress by managing the progression of chronic disease across the population. HLI's mission is to change the practice of medicine, making it more preventive, predictive, personalized, and genomics-based, with the goal of empowering individuals to manage the progression of chronic disease and live healthier, fuller lives. In order to do that we need to move beyond medicine as a clinical science to medicine as a data science. If you look at our current, descriptive form of medicine, we don't really collect that much data on us as patients: maybe laboratory results or medical reports, about three and a half gigabytes of data in total.
So our first step, with our Health Nucleus clinics, was to start doing deep numerical analysis on all the organ systems of people going through the clinic and capturing a lot of different data: we do things like whole genome sequencing, identifying your microbiome and your metabolome, and MRIs, and we collect this body of data, about 150 gigabytes in total, to create a digital you, a digital health profile. Now, to understand that profile it has to be put in context against a population, ideally of everyone, and to do that you need population studies. One such study at Human Longevity analyzed the viruses in the blood of 10,000 people; the idea was to establish a background of what's normal, so that we can compare these digital-you individuals against it. That study involved about a petabyte of data and about a trillion similarity searches, and it ran over the course of three weeks on ten m4.4xlarge instances. The idea is that we need to be doing studies like this not on 10,000 people but on tens of millions of people, ultimately exabytes of data and thousands of compute instances, and that's just one study; we need thousands or tens of thousands of studies to create a knowledge base against which we can compare these individuals, really quantify people's biology, and realize this vision of empowering people to manage the progression of chronic disease throughout their lives.
So scale matters: scale of data storage, scale of compute, and scale of the platform, being able to plug in new analytical tools, accommodate new data sets, push the science forward, and take advantage of opportunities as they present themselves to bring products to market. We'll talk about that platform piece today. In realizing this vision, one of the first key challenges we ran into was that we grew as a company really quickly: we went from 0 to 300 people in two years, and during that time we sequenced about 35,000 whole genomes, processed them, and generated about 35 petabytes of data. One of the things we observed is that teams were building out different pipelines on different platforms in different ways, with different technologies and different approaches, and that made it really difficult for people to collaborate across teams and share tools across pipelines. The bioinformaticians were getting really bogged down in infrastructure and not able to focus on the science, and we accumulated a lot of redundant infrastructure that consumed time and resources. Some of these pipelines are complex: they can have complex workflow orchestration logic and diverse resource requirements across the steps within the pipeline. We also accumulated significant storage and compute costs, sometimes due to using On-Demand instances or underutilizing resources, and finally we found it really challenging to get these genomics pipelines into production, often because the infrastructure they were using was very different. So the solution was to create a common platform for genomics pipelines using AWS managed services. The idea is to take this platform and have a much simpler pipeline definition that we can use across the organization, to optimize for cost at the platform level so everyone can benefit from that optimization, and to move to a continuous delivery model so that we're always a button push away from production.
Whether somebody changes part of the platform, changes a pipeline, or changes a tool, you're a button push away, with automated quality gates that get more rigorous the closer you get to production, because quality matters: the data has to be right, since people are making health decisions based on this analysis. We also need to ease and accelerate the transition from R&D to production by having a common language, a common way of defining pipelines. The journey to get there was iterative for us, and it had five key steps along the way. The first was a new customer that needed an exome report, with things like ancestry, trait predictions, and pharmacogenomic predictions. At the time we were generating these reports manually, and we needed to automate that process, and we needed it up and running in two weeks because we needed to start integrating other systems around this one. Our solution was simple: we used SQS and put sample messages into a queue, and we had very simple Python applications that would poll that queue, pull down a sample, and run a sequence of bioinformatics tools on that EC2 instance, producing JSON results and putting them in S3; downstream systems would take that data and create a PDF report. We took the bioinformatics tools and baked them into the AMIs themselves; that's how we deployed the tools into this environment, and we used OpsWorks to manage the instances. Here's an example of what the code looks like: on the right we have a pipeline definition in JSON, and what we have here is just one step within that pipeline.
What I've highlighted here is how we describe a step: you give the step a name, its inputs, its command, its outputs, and any file dependencies, maybe a reference genome file. The code was really simple: you pull a sample message from the SQS queue, and when you have that sample you iterate over each step, pull external file dependencies from S3, pull the sample files, maybe VCF files, from S3, and then run that bioinformatics tool locally; remember, the tools are baked into the AMI. It's a pretty simple idea, and one of the key benefits is that it worked really well as a starting point: we were able to get up and running in two weeks and get the project going. Auto scaling in OpsWorks is really easy; you can scale on CPU load or on a CloudWatch alarm, and in our case we scaled on queue depth, so every time the queue depth for samples went up we would add more nodes, and every time it decreased we would remove nodes. But there were some drawbacks. There was pain around manually building and updating the AMI: every time a bioinformatics tool changed, we had to update the tools, update the AMI, and deploy it. We realized we were building our own workflow engine, which wasn't really a business we wanted to be in. We couldn't optimize resources for the workload at each step, because we were taking a sample and running all the analysis, all four or five tools, on one instance, so we couldn't right-size each step with the appropriate instance type, and we weren't taking advantage of Spot; we were using On-Demand at the time. The next challenge we faced was adapting to tool change: a lot of requests were coming in to add traits and pharmacogenomic information to these reports, and that was triggering a lot of tool changes, which meant rebuilding these AMIs. We wanted a more flexible way of accommodating the tools, so we migrated to Docker, which allowed a bioinformatician to create a tool, dockerize it, run it on their own machine, and make sure it worked.
They could spin up an EC2 instance, run it there, make sure it worked, and when they were happy with it push it to ECR, Amazon's Docker image repository, and then we could incorporate it into the pipeline. We made some simple changes to the architecture: we got rid of the baked AMIs and moved to a standard Linux AMI with Docker installed, and we just ran Docker images on those machines, pulling them from ECR, so we essentially swapped the custom AMI out for Docker. The pipeline definition is pretty much the same, but now it includes a path to the image in ECR, so we know where to pull the Docker image from, plus some arguments potentially. When we run a step there's one additional thing to do: pull the image from ECR, then pull the external files and sample files, and then run the step with docker run; before, we were running the tool directly, and now we run docker run (a rough sketch of what such a worker loop could look like is shown below). The benefit is that now we can easily accommodate tool changes, but we realized we didn't want to be in the business of supporting Docker ourselves; that was kind of painful, for example we would run into conflicts between Docker versions and AMI versions, and there was pain around managing the Docker images themselves.
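This is not HLI's actual code, but a minimal sketch of the kind of Docker-based worker loop described above, assuming hypothetical queue, bucket, and image names: a Python process polls SQS for a sample message, walks the steps of a JSON-style pipeline definition, pulls each tool image from ECR, and shells out to docker run.

```python
import json
import subprocess

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/sample-queue"  # hypothetical

# Hypothetical pipeline definition, in the spirit of the JSON step described in the talk.
PIPELINE = [
    {
        "name": "ancestry",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ancestry:1.0",  # hypothetical
        "command": ["/opt/ancestry/run.sh", "/data/sample.vcf"],
        "file_dependencies": ["s3://example-refs/reference-genome.fa"],  # hypothetical
    },
]


def download(s3_uri, local_path):
    """Pull an external file dependency or sample file from S3."""
    bucket, key = s3_uri.replace("s3://", "", 1).split("/", 1)
    s3.download_file(bucket, key, local_path)


def run_pipeline(sample):
    download(sample["vcf"], "/data/sample.vcf")
    for step in PIPELINE:
        for dep in step["file_dependencies"]:
            download(dep, "/data/" + dep.rsplit("/", 1)[-1])
        # Previously the tool was baked into the AMI and invoked directly;
        # with Docker we pull the image from ECR and run it in a container instead.
        subprocess.run(["docker", "pull", step["image"]], check=True)
        subprocess.run(
            ["docker", "run", "--rm", "-v", "/data:/data", step["image"]] + step["command"],
            check=True,
        )


def main():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            run_pipeline(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrL if False else QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    main()
```

In practice the step outputs would then be written back to S3 for the downstream report-generation systems, as described above.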
The next challenge was that we became victims of our own success: the system was working really well for a given report pipeline, and now we wanted to be able to run lots of different report pipelines on the same platform. We had flexibility in accommodating the tools with Docker; now we wanted to accommodate flexible pipeline definitions as well, and to do that we dropped our custom workflow engine and moved to SWF and Flow. SWF is an AWS service that essentially does workflow management: it gave us things like versioning of steps and workflows, retry logic, a console to track things, and APIs; it's really a tracker for complex workflows in the cloud. The AWS Flow Framework is a convenience framework, written in Ruby, that makes it really easy to write distributed applications that use SWF; it does a lot of the heavy lifting for you, so you can focus on your own code. The architecture evolved a bit here: we dropped SQS and the queue and moved to a publish/subscribe model with a topic; an event comes off that topic, and a Lambda subscribed to it takes the event and submits a workflow execution to SWF. We also swapped out the Python applications for Ruby Flow applications. Within this model you have deciders and activity workers: a decider process pulls a workflow down, decides what the next step should be, and submits that back to SWF, making it available to activity workers, which pull those jobs down, run the activities, and report the results back to SWF. There's also a really great Chef cookbook that makes it easy to create these decider and activity worker processes in OpsWorks stacks, so we used that as well. The pipeline definition is similar, but now we've added an async flag so you can run some steps in parallel, and we migrated from Python to Ruby because Flow is a Ruby framework. The two things you need to be aware of with Flow are send_async and an exec step.
send_async submits a job to run and returns a future that you can wait on, so that's your mechanism for running things in parallel; an exec step is your mechanism for running things in sequence. So you create an empty array of futures, iterate over each step, submit the ones that can run in parallel, collect the futures, and wait for them all; when they're all done, you run the remaining steps in sequence, which is how you make sure that steps depending on a previous step run in the right order, putting outputs in S3 for example. This code is the decider code that gets deployed within a decider process, and this other code is the activity code that gets deployed in an activity process, and those recipes make it easy to stand up these decider and activity workers within a cluster. The big benefit here is that it's much easier to accommodate new pipelines and run steps in parallel, and we get the ability to handle failures, retries, and versioning. But this workflow and pipeline definition approach wasn't flexible enough to accommodate the more complex pipelines we had, so we needed to move beyond our simple JSON definition to something more sophisticated, and for that I'll hand it off to Stephen.

What we had found is that we'd become even greater victims of our own success: people were coming to us and asking if we could onboard any type of pipeline at HLI to this platform, whether that be a pipeline for secondary analysis or maybe a pipeline that generates data that's fed into a data lake, in addition to the report generation pipelines we were already running. And once we had everything onboarded to this common platform, we wanted to optimize for cost at the platform level and use something like Spot Fleet to heavily reduce compute costs. What we needed was a managed service for running dockerized pipelines. AWS doesn't really offer such a service, but they do have all the component pieces you see here to put one together yourself, so that's the approach we took, and we call it Docker Pipeline. There are three key concepts with Docker Pipeline that you need to be aware of, and the first is registering a task with the system.
Previously, if you had a bioinformatics tool you needed to define it within the pipeline itself, but here we've broken that out: now all you need to do is register that bioinformatics tool as a task with Docker Pipeline, telling Docker Pipeline how to run the tool and the resources required to run it. Next, you register your pipeline with the system. A pipeline is composed of two important parts: the first is your steps file, which references tasks that are already registered with Docker Pipeline, and the other is a little bit of orchestration code, again written in Ruby, that tells the platform how to orchestrate the individual steps within your pipeline. Once your pipeline is registered, you can simply call DPL pipeline run, and the platform will pull that pipeline and run it for you. Here's what that looks like. If I'm a new researcher at HLI, let's say I've written a bioinformatics tool that generates some ancestry data. You can see that I've installed the Docker Pipeline command-line tool, which we call DPL, and I'm registering my ancestry task with the system. Again we have a bit of JSON that describes that task, and it looks pretty familiar: we're telling it what image to run and the arguments to that image, but we also have this new resource requirements block that defines the resources needed to run this specific task. Here I've said I need 40 cores and a small EBS size; I can specify what EBS type I'd like, maybe a snapshot as well that contains things like reference genomes or other static files, and a memory multiplier that I can use to select a high-memory instance for my task. Next I register a new pipeline that I've built on Docker Pipeline, and the first piece I need to include is my steps file. Here, instead of defining the step within the pipeline itself, I'm including tasks that are registered with Docker Pipeline: this is my ancestry task, but I'm also including a traits task that is already registered with the system; it could have been created and registered by me, by someone else on my team, or by a different team entirely.
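Docker Pipeline and its DPL command-line tool are HLI-internal, so the exact schema wasn't shown; purely as an illustration of the shape described above, a registered task and a steps file might look something like the following (all field names and values are hypothetical):

```python
# Hypothetical shapes only: Docker Pipeline / DPL is HLI-internal tooling,
# so these field names are illustrative, not the real schema.

ancestry_task = {
    "name": "ancestry",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ancestry:1.2",  # hypothetical ECR path
    "args": ["--sample", "{{sample_id}}"],
    "resource_requirements": {
        "cores": 40,
        "ebs_size": "small",
        "ebs_snapshot": "snap-0123456789abcdef0",  # e.g. reference genomes and other static files
        "memory_multiplier": 2,                    # pushes the task onto a higher-memory instance type
    },
}

# The steps file references tasks already registered with the platform:
# my new ancestry task plus a traits task that someone else registered.
report_pipeline_steps = {
    "pipeline": "exome-report",
    "steps": [
        {"task": "ancestry"},
        {"task": "traits"},
    ],
}
```

With definitions like these registered, the DPL pipeline run call described in the talk only needs to name the pipeline and its input sample.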
I don't really need to know how to use that tool or what's needed to run it, because Docker Pipeline knows how to run it for me, so it's very simple to include it in my pipeline. I also need to include a little bit of Ruby code that tells Docker Pipeline how to orchestrate the various steps within my pipeline; the person registering the pipeline writes this code. Similar to Flow, this is how we run something synchronously, we also have the ability to run things asynchronously, and if I choose I can write conditional logic in here so I can do things like splits, merges, and multiple choice, all of those complex workflow patterns that are needed in more complex pipelines. Once I've got that registered, I can call DPL run and I get back a run ID and a workflow ID that I can use to track my pipeline as it's being run by Docker Pipeline. Here in the SWF console you can see at the top that I've got a workflow execution running, and down below there's an activity running in my pipeline; this corresponds to my ancestry task, which, since I'm running synchronously, runs first. And if we look in the ECS console, we can see my ancestry container running on an ECS cluster.

So here's how all of this works; it's a nice little architecture slide, and it's a little complex, so let's step through each piece. The first piece is my docker push: I register my Docker image with a repository, in this case ECR. Then I register a task that references that Docker image and tells Docker Pipeline how to run it and what's needed to run it. Then I register my pipeline with the system, which uses tasks that are already registered, and finally I call DPL run. That's what it looks like from the end user's perspective, someone building a pipeline at HLI. Behind the scenes, after the DPL run call is made, a Lambda function starts a workflow with SWF. From there, a decider process running in our OpsWorks stack pulls that workflow down from SWF; inside it is the work for the pipeline we're trying to run and the parameters that need to be passed into that pipeline. We then pull the pipeline definition from our registry, which in this case is DynamoDB, and execute the bit of Ruby code that orchestrates the pipeline. In our case that runs the ancestry task first, which makes an activity available in SWF; that activity is pulled down by an activity worker, also running in our OpsWorks stack. Once the activity worker receives the activity, it calls a Lambda function saying, I would like to start this task on Docker Pipeline, and here are the parameters to that task; it's then given back a task identifier that it can use to poll for the status of that task. The Lambda function retrieves the task from our task registry and determines the appropriate ECS cluster to submit that specific task to. The idea is that we have an ECS cluster set up for every combination of resource requirements: on the fly, we look at the task and say, OK, I know it needs this specific set of resources, and any task that needs that same set of resource requirements gets routed to the same ECS cluster. If that ECS cluster doesn't exist, we create it at that time and configure a Spot Fleet that can handle that type of work.
That means adding the right instance types with the right EBS size and other resource requirements, and attaching the fleet to that specific ECS cluster. In front of that cluster we put an SQS queue that the task actually gets submitted to. We have Lambda functions monitoring the queues that sit in front of the ECS clusters; they pull down a message and attempt to submit that work to ECS, to actually run the Docker image. If ECS comes back and says there aren't enough resources, the message simply goes back on the queue to be retried later. We're also monitoring all of these queues, and when messages pile up and cross our threshold we know we need to add more capacity, so we look up the Spot Fleet that is powering that ECS cluster and add capacity to it. Once those nodes come online, they join the ECS cluster and the work is submitted to them.
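This is not HLI's code, just a compressed sketch of the submit-and-scale pattern just described, using real boto3 calls with hypothetical queue, cluster, and fleet identifiers: try to run the task on ECS; if ECS reports insufficient resources, leave the message on the queue so it reappears after the visibility timeout, and if the backlog is deep, grow the Spot Fleet behind the cluster.

```python
import json

import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")
ec2 = boto3.client("ec2")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cluster-40core-queue"  # hypothetical
CLUSTER = "pipeline-40core-small-ebs"        # hypothetical: one cluster per resource profile
SPOT_FLEET_ID = "sfr-0123456789abcdef0"      # hypothetical fleet backing that cluster
BACKLOG_THRESHOLD = 10


def handler(event, context):
    # If too many tasks are waiting, add capacity to the Spot Fleet behind the cluster.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    if backlog > BACKLOG_THRESHOLD:
        config = ec2.describe_spot_fleet_requests(SpotFleetRequestIds=[SPOT_FLEET_ID])
        current = config["SpotFleetRequestConfigs"][0]["SpotFleetRequestConfig"]["TargetCapacity"]
        ec2.modify_spot_fleet_request(SpotFleetRequestId=SPOT_FLEET_ID, TargetCapacity=current + 2)

    # Attempt to submit waiting tasks to ECS.
    messages = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=5).get("Messages", [])
    for msg in messages:
        task = json.loads(msg["Body"])
        resp = ecs.run_task(cluster=CLUSTER, taskDefinition=task["task_definition"])
        if resp["failures"]:
            # ECS reported e.g. RESOURCE:CPU or RESOURCE:MEMORY; leave the message alone
            # so it becomes visible again after the visibility timeout and is retried later.
            continue
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```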
ECS starts our Docker image on a node, and once the container is running it reads and writes files to S3. Eventually all of the activities in a workflow complete; they're all tracked in SWF, and when they all finish we know our workflow is done and we've run our pipeline successfully. That's how a pipeline gets run on Docker Pipeline. So now we can accommodate complex workflow patterns and define them pretty simply. We can also share tools across pipelines: since we've taken the bioinformatics tool definitions out of the pipeline definitions and registered them separately on the system, they can easily be included in other pipelines, because people know that when a task is registered on Docker Pipeline it will just run. We're now also optimizing our instance types for each step in a workflow, because we're being very specific about the resource requirements needed for each task, so we're right-sizing our instances. We're no longer supporting Docker ourselves, which is a big win: ECS makes it very easy to just start Docker images and run them, and Amazon handles all the configuration needed on the Docker side for us. And we're getting some pretty massive cost savings because we're using Spot to run all of our jobs. The next step was to go faster with continuous delivery: we needed to be able to deploy into production very quickly, and that means automation. We need to understand our deployment process and automate it, have really great integration testing at each step of the deployment, and have a push-button path to prod. At HLI we use CodePipeline to orchestrate our continuous delivery pipeline. It starts a deployment in our dev environment, which uses AWS CodeDeploy to stand up all the infrastructure, and we run a quick smoke test to make sure that looks good. We then do a deployment in our integration environment and run a much broader suite of integration tests that exercises all parts of the system to make sure they're working as intended. From there we push to our stage environment, where we do blue-green deployments.
We'll have an inactive stack that we deploy to, run integration tests there to make sure everything looks good, and then switch that inactive stack to be the active one. Finally, we send a message to SNS, which sends out an email with a link that someone can click to push the latest changes to production. This is what our integration environment actually looks like in the CodePipeline console: here we're running our AWS CodeDeploy deployment, and we also wrote a Lambda integration that notifies our Slack channel when we've started a deployment (a sketch of that kind of notifier is shown below); here we're deploying our OpsWorks application into our OpsWorks stack, and then we're running all of our integration tests in the integration environment and notifying our Slack channel again when the deployment is complete. So now it's very easy for us to go from hearing about a bug to having a fix deployed in production within minutes, and we're deploying to production multiple times a day. With that, I'll turn it back over to Ryan for a quick summary.

There are a couple of key benefits we got from all this. A dramatic simplification in pipeline complexity: in one example we went from about two thousand lines of code for a pipeline to about 20 lines of config file. A significant reduction in the time to generate reports: some of these were taking a couple of people three weeks, and we got that down to about five hours. Significant cost savings with Spot: these are the compute costs for a particular report, and in some cases we went from about $32 to $6 per report. Daily deployments of platform changes to production: it was sometimes taking us weeks or months to get something into production, and we got that down to daily. And a dramatically easier handoff between bioinformatics and engineering, because now we've gone from code to configuration; it's just passing a configuration file along.
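The Slack integration itself wasn't shown in the talk; purely as an illustrative sketch (the webhook URL and message text are hypothetical), a CodePipeline-invoked Lambda that posts a deployment notification to a Slack incoming webhook could look roughly like this:

```python
import json
import urllib.request

import boto3

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"  # hypothetical

codepipeline = boto3.client("codepipeline")


def handler(event, context):
    # CodePipeline passes job details to Lambda invoke actions under "CodePipeline.job".
    job = event["CodePipeline.job"]

    message = {"text": "Deployment started (CodePipeline job {})".format(job["id"])}
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

    # Report success back to CodePipeline so the stage can continue.
    codepipeline.put_job_success_result(jobId=job["id"])
```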
Finally, some next steps, things we're thinking about for the future. We've done a lot of work to simplify pipeline definition, but now the big challenge is defining and building the bioinformatics tools themselves, and we think there are lots of opportunities with AWS managed services and frameworks to make that a lot easier for bioinformatics scientists. The last thing is that there's a desire to run Spark clusters for a given step, so instead of running a step on one instance with maybe 40 cores, you can run it on a cluster of 60 machines. Those are some of the things we're thinking about next, and with that I will hand it over. Thank you.

Hello everyone, I'm Lance Smith from Celgene. Real quick, this is what we'll be talking about today: a little bit about the company, some of the trends we're seeing, the collaboration models (plural) that we have, and some of the configuration and lessons that we've learned along the way. If you don't know who we are, we're a biotech that goes all the way from discovery through sales and distribution, and that last bullet point, that we have 60 sites globally, has a big implication for the networking we have at our sites, which I'll talk about. Some of the trends we're seeing, as Patrick talked about earlier, are a lot of collaborations and partnerships. At Celgene this is what we do: we sign up a lot of collaborators and work with a lot of universities, probably including some of your organizations in this audience. R&D, of course, moves very quickly; they will sign deals and not tell IT, so it's our job to make it happen. And the last bullet point here, cloud native solutions: we're starting to see the software market mature. We're no longer seeing forklifted applications; we're seeing applications written directly for Amazon, where you either run them as software-as-a-service from the vendor or run them in your own account, maybe someplace in between. The one thing you can't do, though, is run them on premise. Real quick on our collaborations, this coming weekend we're going to have an announcement.
I can't quite put the press release up here, but this coming Saturday: we've been in stealth mode for the last year on the Multiple Myeloma Genome Project, working with a number of universities, and we're going to be opening it up to the greater world, so if your organization does myeloma research, please contact us and maybe we can work together on this. I talked a little bit about our collaborations: we have multiple collaborations and many different types of science, so on any given day we have hundreds of biologists and chemists around the world doing all sorts of science, and we need to help them out. If they want to work with a university, it's our job to help enable that. Each of the scientists has a different type of science, different software, and different types of output, and we can't have a single platform that supports them all. We have a couple of dozen collaborations, but we can pretty much group them into two categories. One is the bench chemists: these are true end users who just want a simple interface, click here, upload your data, and your data is there, with single sign-on that ties into our corporate network. Then we have the other type of collaboration, the HPC users. These are our PhD scientists, who for all practical purposes are also computer scientists; they're not hardcore computer science developers, but they know how to code, they want to write their own algorithms, and they want shell access and API access.
And because it's HPC, we're talking hundreds, thousands, tens of thousands of nodes working on petabyte-scale data. From IT's point of view, though, we treat both of those categories the same. We have multiple vendors coming in that sell us software and multiple research groups working together, they all want API access, and we have to keep them safe from each other and between projects. So what we do is a multi-account model, where each collaboration gets a separate account. We park them in an account, and they can affect their own project and delete their own data, but they can't affect anyone else's; that has multiple benefits, and I'll go into them a little later, but from our point of view the management is the same. Here's one of our architecture diagrams for one of our mass spec systems: the users come in and hit one of the web servers in an auto scaling group behind an ALB, they upload their files to S3, and single sign-on hits our AD servers in our DMZ. Processing is also done in auto scaling groups with Spot, which saves us a lot of money. All primary data is kept in S3, metadata is kept in RDS, and SQS is used for queueing tasks between jobs. The HPC picture is a little simplified compared to the last slide: the scientists themselves write the pipeline, the algorithms, and the coordination, and they work with the data directly, so we can't give them a simple web interface. They shell in through a bastion host; we have a couple of different bastion hosts, each with a role attached to it. The scientists themselves generally don't get Amazon keys: they work on the bastion host, which has the IAM role under which they can operate, so they can work on their algorithms there, and they upload their data directly to S3 from wherever they may be coming from.
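Purely as a small illustration of that last point (the bucket name is hypothetical): on a bastion host with an IAM role attached, the AWS SDK resolves temporary credentials from the instance profile automatically, so no access keys need to be handed out or stored on the machine.

```python
import boto3

# No access keys are configured anywhere on the host; boto3 falls back to the
# EC2 instance profile (the IAM role attached to the bastion) for temporary credentials.
s3 = boto3.client("s3")

# Hypothetical collaboration bucket.
for obj in s3.list_objects_v2(Bucket="example-collaboration-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```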
Then we have a couple of different compute clusters. For the pipelines whose compute nodes can handle interruptions, those pipelines have better failover and they're cheaper, and we use Spot; for our other pipelines we unfortunately have to use On-Demand, but we're working with the scientists to move those to Spot as well. Connectivity: how do we connect into these environments? When a project first comes to IT, collaboration or not, we're here to try to help them; we're not trying to be the department of no. But one of the few things that is not negotiable for us is that when a project comes along, they have a choice: IGW or VGW, they can't have both. What that means is that with the IGW the VPC can talk to the outside world directly, or with the VGW it can talk to our on-premise systems. We are a pharmaceutical company, we have a lot of data that bad people want, and we can't let them have it. Collaboration accounts, however, have to have the IGW so that the collaborators can come in, so we chop off the VGW, but we can still use the company Direct Connect to communicate and upload data: with a public interface on the Direct Connect we can access S3, and we can also access the public IPs with SSH and whatnot. So there are a number of options if your organization is looking to connect into Amazon. The easiest is just internet access: SSH in, boom, you're in, with some security issues there. You could go with a VPN connection, but with our multiple-account model, managing multiple VPN connections is very, very painful; we did that for a few months and it's not for us. We then jumped directly to our 10-gigabit Direct Connects; we have two of those, with two more 10-gigs coming, one on each side of the country, and we're firing one up in the EU as well. There's a big decision between sub-1-gig and 1-gig-or-higher Direct Connects, and it's not just the speed; the speed is almost a separate concern for us. If you're at less than one gig on your Direct Connect, you only get one virtual interface; at 1 gig and 10 gig you start with 50 and can go to about 100 or so. The reason that's important is that each VPC needs a virtual interface, and if you want to upload lots of data you also want a public interface to upload to S3, so right off the bat, for a multi-account model like ours, you need 20, 30, 50 separate interfaces.
If you're at a sub-1-gig Direct Connect you only get one, so we now recommend 1 gig or higher. But before you go off and buy your Direct Connect, there are some additional decisions you have to make, and a big one for us is region selection. If you're on the East Coast it's easy: us-east-1. If you're on the West Coast, and half of my operations are on the West Coast, the choice is a very hard decision: us-west-1 versus us-west-2. We have a lot of on-premise databases and we're going to be working in a hybrid mode for the next 5-10 years, so latency is a huge concern for us; that's why we went with us-west-1. However, every week scientists and users say, you know, EFS is great, we'd like to have it, and it's not available in us-west-1. When Amazon releases new features, they always come out first in the Seattle, Virginia, and Ireland data centers. We're also working on firing up Frankfurt, fully knowing that new features come out in Ireland and not in Frankfurt, but latency is a big deal for us and that's the trade-off we have to make. On to the multi-account model: like I said, we have a number of projects, not just collaboration accounts; we have a dozen or so collaborations and 20 or so on-premise or company-connected accounts, and this is how we manage them. Our total team is two FTEs, counting me, and actually a part-time FTE from Europe.
We have one additional headcount opening, so if anyone's looking for a job, contact me. We have a very, very small staff, and this isn't all we do; we also maintain our on-premise clusters at six or seven research sites, all with a skeleton crew. So we have this tool from Turbot (that's their website there) that helps us manage all these environments: 30-plus accounts managed by basically two people, so we're talking hundreds of nodes, automated security (I'll talk a little more about that), security policies, and automated auditing across all the accounts. One of the things the tool allows us to do is harden the Amazon environment in which these collaborations take place. We want to give these developers freedom to develop their software, and freedom for the vendors to upload and maintain their software, but what we can't have is these developers compromising our on-premise systems. So what we do with Turbot is isolate off all the network controls: anything to do with VPCs, security groups, peering, and the virtual interfaces is restricted away from the project team, and we, the IT group, maintain it. We work with the individual project teams on their security groups: they tell us they need these ports and these IPs, no problem, and the project team can then take those security groups and assign them to the virtual machines they create, so I don't need to get involved; they create the instances and pick which of those security groups they want. Object controls are also very, very important, since S3 is not part of a VPC and is potentially accessible from outside. So with Turbot we allow our end users to create their own buckets (don't call me, I don't really care, projects can create their own buckets), and the tool automatically picks that up and, boom, slaps on a policy; we've predefined a policy that gets applied to all buckets, no matter who made them. Credentials: we really don't like to give out credentials, and we sure as heck don't want them baked into any sort of virtual machine.
So with Turbot we can automate that; we don't give out S3 credentials (more on S3 in a bit). And then auditing: if there is a potential security violation, the tool will find it and either fix it automatically or tell me or my team, and we'll come along and fix it. For the collaborations in general, of course we use EC2 and ECS; S3 is the document repository for everything; EFS in the regions we use that have it (we really want it in us-west-1); and of course VPC and Direct Connect. Some of the other services we use include EMR and a few others. Those are the primary regions we use; the ones without the stars are the ones with Direct Connect, and we have a couple of collaborations that don't need Direct Connect, so they fire up in us-west-2, primarily because that's where EFS is. A lot of people ask why we went with AWS. For speed, of course, but security and isolation is also very big for us: we are a pharma company, we have a lot of good stuff, and our on-premise network is not sub-segmented; it's a flat network, so if we have a problem, poof, they're everywhere. Going into Amazon, all these projects are instantly isolated from each other, so in the event of a breach we can cut off that one little project, poof, it's gone, but everything else is safe. And of course the elastic nature: some projects come along and say, hey, we could use 100 terabytes, and I could do 100 terabytes on premise, but scientists are not always thinking down the road, so we had a project at 100 terabytes up front that three months later was at a petabyte, and that I can't do on premise. Access for our collaborations: these are the software packages for our bench chemists and the collaborations they do. We don't write code, we don't write software; our job is to help find cancer cures, we're not a software company, so for most of our bench scientists we have common, cloud-native software that we purchase, and that helps us do the collaborations. So what access do these vendors need, and what access do these developers need? The users come in with single sign-on, of course, but the vendors themselves need additional access, and we make them tell us exactly what they're looking for.
We don't just say, oh, you get star: no, you don't get star, and it's a bit of a negotiation with these vendors, a give and take. Interestingly, the smaller the software company, the better they are at telling you what they need; some of the small companies already have the document on their website, here are the permissions you need, here you go, and we love that. The large consulting companies, not so much. A couple of months ago a large consulting company said, oh, just give us star; that's not going to happen, and there was a bit of disagreement there. The next day they sent us a 40-page document, and it was pretty clear they had just gone to the Amazon website, copied off the entire permission list, and sent it to me. That's not going to cut it either. Our HPC environments are where our scientist developers come in, so we give them shell access, but a lot of it depends on how they launch their software. Some of them have a golden AMI, so we have to work with them on how to launch these; we make sure the scientists themselves use good AMIs, so we whitelist them, and if they come to us with an AMI that isn't whitelisted, they can't launch it (a small illustration of that kind of check follows below). The reason they have golden AMIs is that, even though we don't want to be in the business of maintaining software or maintaining images (like the HLI folks were saying, it's a pain every time there's a software or image update, new keys, new software, and we have to do a fair amount of work), when you're launching a thousand or ten thousand nodes, the five or ten minutes of overhead from Chef or another software deployment package adds up a lot when multiplied by ten thousand. So these users come in via a bastion host and SSH in; we don't give them keys for that (well, they have SSH keys, but they don't have Amazon keys). The bastion host has an IAM role, and from there they can access the other Amazon services.
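The talk doesn't show how the whitelist is enforced; purely as an illustrative sketch under that assumption (the approved AMI IDs are hypothetical), an audit script in this spirit could scan running instances and flag anything launched from a non-approved image:

```python
import boto3

# Hypothetical whitelist of approved ("golden") AMIs.
APPROVED_AMIS = {"ami-0123456789abcdef0", "ami-0fedcba9876543210"}

ec2 = boto3.client("ec2")


def find_unapproved_instances():
    """Return (instance_id, image_id) pairs for instances not launched from approved AMIs."""
    unapproved = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["pending", "running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["ImageId"] not in APPROVED_AMIS:
                    unapproved.append((instance["InstanceId"], instance["ImageId"]))
    return unapproved


if __name__ == "__main__":
    for instance_id, image_id in find_unapproved_instances():
        # In practice this could alert the team or stop the instance instead of just printing.
        print(f"{instance_id} was launched from non-whitelisted AMI {image_id}")
```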
As for the AWS console, nobody has a direct login to the console, not even me; no one in IT, nobody. Instead we have the Turbot platform: it's single sign-on into that platform, and we then use it to launch into the AWS console with an STS token. The good thing about that is that my Unix admins and I have what, 30-plus accounts to manage, and we can't manage 30 different passwords for 30 different consoles; that's crazy. Now we have a single login and, poof, we're into any one of them, and in the event that one of us leaves the company, you can quickly remove that person from that single console and they're removed from all the sub-projects. I'm not going to go through each one of these, but these are some of the access rules we put in place for our buckets. It's automated, like I said: the projects are allowed to create their own buckets, and Turbot comes along and, poof, slaps on this rule. In general we require server-side encryption for everything, as well as encryption in transit; depending on what the project needs we may tighten or loosen that, but by default that's what they get, and then it's a negotiation: if they need it changed, they can come talk to us about that. We also use bucket policies and IAM policies to put business rules in place.
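Turbot applies these controls automatically and its internals weren't shown; as a rough illustration of the general pattern only (a remediation function with hypothetical event wiring and simplified policy statements), applying a default policy that requires encryption in transit and server-side encryption to a newly created bucket could look like this:

```python
import json

import boto3

s3 = boto3.client("s3")


def default_policy(bucket):
    """Deny unencrypted transport and unencrypted uploads on the given bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
        ],
    }


def handler(event, context):
    # Assumes a CloudTrail/EventBridge rule that fires on CreateBucket events and
    # invokes this function with the event payload (hypothetical wiring).
    bucket = event["detail"]["requestParameters"]["bucketName"]
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(default_policy(bucket)))
```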
We have a number of collaborations with a number of universities, and a lot of universities are a little sensitive about, well, if another university is in there, maybe they can publish first; okay, all right, let's get it together, guys. So what we do with our buckets is have a particular bucket that's used for upload only, and any of the projects can upload their data there. Our data science team then comes along and manages that data: each organization has a slightly different data format, we put it all into a common format for the collaboration, and then IT or the data science team moves it into the final repository, and that final repository is not readable from the outside. We have a petabyte of data and people are sensitive; we don't want collaborator A stealing from collaborator B, fine, so that bucket is only accessible from within the VPCs that we run. So what do we do with all this data? We use GitHub Enterprise to manage the code: we have a number of accounts, and GitHub is accessible not just from one account but from all of them, and if a scientist is traveling and wants to work on some code on an airplane, they can do that without getting into our Amazon environment; they just go to GitHub, check out the files, and work on them. As for the data, we're talking about a massive quantity here, something our legal team ten years ago didn't even think about; it just wasn't on the radar. Right now our data retention policies amount to three years, some data needs to be kept forever, and if you see the bills like I do, that's an enormous amount of money at petabyte scale. We're working with the legal team and the data science team on what to do with this quantity of data; currently we're sending it all to Glacier with lifecycle rules (a sketch of that kind of rule is shown below), and hopefully over the next twelve months or so we'll have a long-term solution. For some of our collaborations, with the cancer moonshot, we are working to open source the data; there are some contractual issues there, but long term that's what we want to do.
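As an illustrative sketch only (the bucket name, prefix, and day count are hypothetical, not Celgene's actual retention settings), a lifecycle rule that transitions objects to Glacier can be set with boto3 like this:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and timing: archive raw results to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-collaboration-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-results",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```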
This is my last slide, some of the lessons we've learned. Expect failure: we're talking tens of thousands of nodes, and on any single day there are going to be some failures; we cannot manually fix 10,000 nodes, so you have to automate some sort of repair mechanism, ideally. If anyone here has heard of pets versus cattle (Google it), we've tried to get our scientists to adhere to that concept: if there's a problem with a virtual machine, cull it and let it rebuild itself. Use services as they are intended; we're still struggling with this a little. One of the key best practices we're violating is that our users go directly into S3 with Cyberduck or CloudBerry, and they need a common folder structure to navigate and find their files, so the first five, ten, forty characters of a particular key are the same, and when you launch a thousand or ten thousand nodes hitting that same partition, you're going to overtax that partition of S3. There's nothing we can do about that; we've been yelled at many times by the S3 team, but they're end users, what are you going to do? Another thing we're regularly working on with our scientists is to not use a FUSE file system to mount S3: we have some software that unfortunately cannot read S3 and is looking for a block storage system, so the scientists use FUSE tools to mount a petabyte-scale S3 bucket on a UNIX file system, and it chokes regularly. Data transfer: we're talking hundreds of millions of files being sent around, and even with a very, very small failure percentage we do have some failures, so it's very, very important to check MD5 sums (a small sketch of one way to do that is shown below). Some of the softer lessons we've learned: our company is not that old, but it is 25 or 30 years old, and we have a lot of traditional enterprise folks, and it's a big jump; the cloud is not the same, and it's been challenging to work with some of the folks who want to apply their past experience; they want to help us out, but it's a challenge. Vendors and users: we work with a lot of small companies, and new employees are also coming to us from startups, and in the past they're used to putting an Amazon account on the credit card and getting full root access; well, when you come into Celgene, you're not getting star star. Server-side encryption for users has also been very, very interesting: every time we give users a set of keys we tell them, you have to use server-side encryption, this is how you do it, okay, good; then they go to upload and say, hey guys, you didn't give me access; yes we did, and this is how you do it. The last lesson here is to get buy-in from the security team. In the very beginning it was hard working with them, but now the second phone call I make after getting a project request is to the security team, and we're now best friends; we get along great, they trust me, I trust them, and we're here to work for the users. That's why we're here. Thank you very much.
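On the MD5 point above, a minimal illustration with hypothetical bucket and file names: computing a file's MD5 locally and passing it as Content-MD5 on upload makes S3 verify the digest server-side and reject the object if it doesn't match.

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")


def upload_with_md5(path, bucket, key):
    """Upload a file and have S3 verify its MD5; a mismatch raises a BadDigest error."""
    with open(path, "rb") as f:
        body = f.read()  # reads the whole file into memory, fine for a small-file sketch
    digest = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    s3.put_object(Bucket=bucket, Key=key, Body=body, ContentMD5=digest)


# Hypothetical usage.
upload_with_md5("results/sample001.vcf", "example-collaboration-data", "uploads/sample001.vcf")
```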
Okay, so just a quick AWS summary. I hope you enjoyed reviewing the architectures and some of the lessons learned from both of these projects. I think they've done tremendous work, and both have really similar advantages in terms of allowing for rapid infrastructure deployment and isolated work areas, a lot of what Lance referred to; being able to isolate users and different collaborative efforts from each other is very valuable, and it's almost essential in this area. They've been able to draw common components into a larger reusable framework, utilize all the elastic resources available to them, and of course, especially in a case like Celgene's, it's accessible worldwide, so you can have collaborators anywhere on the planet still reach a common infrastructure, and hopefully use all of this to drive toward reliable and reproducible collaborative science at a scale that was previously unachievable. That's about it. We thank you for attending this talk; we'll be able to take some questions, not here but outside, so you can catch us between now and the next session, which begins at 2:20 in this room. Thank you.
If you need any assistance, let us know in our Slack community #guardrails channel. If you are new to Turbot, connect with us to learn more!