AWS re:Invent 2019 - Labs of the Future in Life Sciences
Disclaimer: Automated Transcript
Hello, everybody. Thank you very much for coming today. We're going to talk a little bit today about the labs of the future and some of the things that Celgene is doing within life sciences. Here's the agenda we'll work from: we have a few introductions, and then we'll move into two parts. One is how we integrate AWS in the laboratory space, and the last example is using digital pathology to see what we can do with this technology. With us today we have a very early adopter and long-term customer of AWS, Celgene. They're currently storing, processing, and analyzing about 20 petabytes of data, and they're working on three major R&D initiatives with us: one is the Myeloma Genome Project, and the others are around computational chemistry and AI workloads. They've recently completed a merger with another very long-term and good customer, Bristol-Myers Squibb, and the new company hopes to become the leader in biopharma. So today we have two speakers with us. One is Lance Smith, who is director of R&D Cloud and HPC, and the other is Pascal Saanich, who is senior director of Translational Computation. Both of them are long-term IT professionals and research scientists and come from the Celgene side of the world. So what are we facing today, and what are the challenges these labs are facing? We've broken it into four parts. First, we see a bifurcation of data between the wet lab and the computational space, but in reality this is a unified workload that needs to be data driven. You do your research in your wet lab, then you run it through different instrumentation, and that instrumentation produces large amounts of data that we then have to analyze.
We have found that we've gotten most of the large, low-hanging fruit that we can get; we need to move to more challenging targets, and we need to come to the concept of a batch size of one. So there's a need for data-driven techniques, and then for using that data to optimize those workflows. And finally, we have multiple streams of data coming in that need to be combined and tagged so that they act as one cohort.
And what we're hearing from organizations is that there are four major problems or challenges they're facing. We have an inability to aggregate and share the data we're collecting, both internally and with researchers and peers externally.
We then have these data streams coming in that we need to combine and, more importantly, tag and catalog so we can actually make them actionable, with the ability to do look-backs, so that when we're looking for different methods or different paths of research, they're there to search for.
These workflows, unfortunately, are not optimized, especially when it comes to critical lab equipment, much of which carries very large capital investments. And the thing that I think we, as people and as employees, find most troublesome is the repetitive, low-value manual tasks that these researchers have to do every day, in and out. If we can move some of these tasks to an automated fashion and take that workload off of them, it allows them to do what they actually got into science to do: actually understand research. So the first thing we're going to move into is cloud integration and how we've integrated these labs with it. And with that, I'd like to hand this over to Lance.
Thank you, Sam.
Good evening, everyone. My name is Lance Smith. I'm from Celgene, now part of the BMS family, so we're excited about what the future holds for us. We are a global pharmaceutical company, primarily specialty oncology and immunology, and we span drug discovery to clinical to manufacturing and sales. So it's a really challenging environment: we have our R&D and commercial all in the cloud. How do we keep them separated? How do we empower our scientists to do all this new, cool stuff in the cloud? We also have a global footprint, which makes it very, very tricky to send our data from on premise, from our wet labs, into the cloud. But the good thing is we have a cloud-first strategy, and that's not just our CIO saying we have a cloud-first strategy: we have about 50 or 60 people attending re:Invent just to learn. IT people, data scientists, engineers, and actual bench scientists are here to learn how they can do their data processing in the cloud. So this is a big thing for us, and we really, fully believe in AWS. As people in this room know, the biotech lab environment is very, very complicated. It's very hard to put into the cloud; it's where all the really messy, wet, experimental stuff happens. We want all this data to move to one centralized location in the cloud to make the processing much easier. However, if you walk into a lab at a startup, you'll see dozens or hundreds of instruments; at Celgene you'll see thousands of these things. At my last company, we supported over 10,000 lab systems. And the corporate side of me says: can't we just standardize? Right.
If you go into any chemistry lab, you'll see a bunch of HPLCs — Agilent 1100s, 1200s — with different firmware and different connectivity. And whichever scientist bought the instrument will have different licensing, so they'll have a different version of the software, and each individual scientist will save the files slightly differently. So there's no such thing as a standard in the lab. And of course we have dozens of sites around the world, and every scientist does things slightly differently. Even if two scientists are using the same instrument, they could be doing completely different things with it; for instance, with a plate reader they can have a different plate layout and different dilutions in the plate. That makes things really difficult when we're trying to standardize that data. Then there's storage and networking. I originally came from corporate IT, and we could provide some storage to research scientists, but we were a cost center: we were only given a certain amount of budget to provide for our researchers. If a researcher comes along and says, I could use 50 terabytes a week — I'm not budgeted for that. Yet we still have to support it; that's our job as IT professionals, to support our scientists. So how do we do it? Ultimately, we want to move our data to the cloud, and not just for archiving purposes. We found a lot of our scientists were trying to get around IT. They have a job to do; they have data and they need to process it. But we in IT were not able to provide them petabytes of storage, so they just saved the data locally and didn't tell IT. They won't tell IT until the lab system crashes, and five years down the road I get a phone call: how come you didn't back up my PC? I didn't know you had data on there. Multiply that by a thousand and we have a huge problem in our labs. We've estimated we have approximately a petabyte of hidden data in our labs, and right now we have a project ongoing to scoop all that up and move it into AWS. That's one of the huge projects right now. And then once it's in S3 or another storage tier, we can put it into tiering.
We can use S3 Infrequent Access, and we can use cold storage and Glacier to make it even cheaper. And interestingly enough, once we centralize that data, we can make it available to our data science teams, who can do data analysis and AI — Pascal is going to talk later about what we can do now that we have the data in the cloud. Now, it's very, very hard to move that data, not only because our labs are challenging in themselves, but because of the applications: some of our applications go back 20 or 30 years, and some of the vendors don't exist anymore. Celgene isn't that big, so we can't go to Agilent and say, change your software for us. That's just not going to happen. We estimate about 1 percent of our applications are cloud aware, and those are NGS workloads. However, even those applications are often hardcoded to save their data to the vendor's bucket; they won't allow us to save our data to our own bucket. So it can be challenging working with the vendors. And also, we don't want to change our scientists. They went to school for this; it is their job to invent new products and new cures and therapies. We don't want to say, you have to do this because we're IT. No, it's our job to help them do what they do best, and that is science. We do have a challenge with data transfer: we are generating terabytes and terabytes every single day. How do we move that to the cloud? I'll show you how we do that coming up. But real quick — and this is really not a sales pitch, even if it starts off like one — why do we use AWS storage? How come we're not just buying servers on premise? For us, one huge benefit is no maintenance. When you upload your files to Amazon, they take care of that. You don't have to patch it, and you don't have to deal with any viruses. You know, you still get those e-mails: System X is going down this weekend for upgrades.
Well, we don't have that problem. It's also great value. We cannot beat the cost of S3; it is so incredibly cheap relative to what we do on premise. We've gotten some pushback: we have some large projects at petabyte scale, and for some workloads the S3 storage alone is over a hundred thousand dollars a month. But when they see the multi-million dollar quote they would get from corporate IT to do it on premise: all right, good, we believe you.
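Since tiering data down to S3 Infrequent Access and Glacier came up above, here is a minimal sketch of a lifecycle rule that does that automatically with boto3; the bucket name, prefix, and day thresholds are assumptions for illustration, not Celgene's actual configuration.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-lab-raw-data"  # hypothetical bucket of instrument output

lifecycle = {
    "Rules": [
        {
            "ID": "tier-raw-instrument-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "instruments/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm data -> IA
                {"Days": 180, "StorageClass": "GLACIER"},     # cold data -> Glacier
            ],
        }
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET, LifecycleConfiguration=lifecycle)
```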
Another benefit we didn't anticipate when we first started our cloud journey is the integration of the cloud services. I started in the cloud a long time ago, and the cool things like Athena and SageMaker didn't exist; AWS was just an infrastructure play. But it has matured so much that we can do all this new, cool stuff just by having the data there. So not only do we support the primary use cases and the computation, but our data science teams can do so much more with it, because it's centralized and the Amazon environment has these new tools. And of course, you'll see a lot of slides saying AWS storage is infinitely scalable. That's absolutely true; you can consume and spend as much as you can take. However, what's more important, and not really on the slide, is that it's infinitely scalable immediately. Not too long ago a scientific project came to me and said, Lance, we could use 100 terabytes of storage.
And we were thinking, we could do that. But the estimates were wrong: that 100 terabytes was really two to three hundred terabytes every quarter, and that's for one collaborator out of five. So that hundred-terabyte request quickly turned into a couple of petabytes, and they asked if we could turn it around in two months. There's no way we could do that on premise. In the cloud, we now have this ability; we can really operate at the speed of science, and IT is no longer the bottleneck. If someone needs a petabyte: here you go, done.
You just have to pay for it. Storage options: this is how we see storage at AWS, and there's actually an order of preference here. Our first preference is to push people to S3, because it's super cheap, has a lot of security built in, and has very, very high bandwidth, so it can work very, very fast. We have large clusters that go to 8,000 nodes simultaneously accessing petabytes, all from S3. It does have higher latency, so it takes a little bit of getting used to. It's object storage, and a lot of traditional IT people have a hard time grasping what object storage is. It is different: a lot of on-premise people think they can just click on a link and a file pops up. Object storage isn't like that. You can't just open and edit a file in place; you have to download it, edit it, and put it back. So it works very, very differently. If we have a workload that can't use S3, our next option is EFS. This is a managed NFS-type service: you just mount it on your Linux workstation, your EC2 instances, or your VMs, and it just works; you don't have to do anything with it. And there's an Infrequent Access tier now that can save up to 90 percent on the cost, so there are some cost-saving options there. However, it is the Linux option. If you have Windows workloads, FSx for Windows works reasonably well and integrates with Active Directory. And if those three don't work — and we have a couple of such use cases — we will use EBS, block storage that you attach to an instance, where you can have any type of file system. However, you have to make the file system, you have to do your backups, and you have to maintain the EC2 instances; if they go down, someone has to go fix them. That's why we prefer the first three, but there are some use cases where we have to do EBS on EC2. Now, there are two options here that I didn't list; I'll go over them because they are edge cases. One is ephemeral storage: many of the EC2 families have SSDs directly attached to the hardware, and they can do hundreds of thousands of IOPS. However, as EC2 experts will know, if you stop your virtual machine, when you spin it back up it could spin up on the other side of the data center, and the SSD attached to the hypervisor doesn't move. So if you save data to ephemeral storage and you turn your server off, you're going to lose all your data. It's very useful for scratch storage, though, so we use it for our NGS workloads; we use petabytes and petabytes every day. But we don't use it for any sort of long-term storage. Not dissimilar is FSx for Lustre. It seems like the same kind of thing as FSx for Windows; however, how it works is that it grabs a copy of the data from S3, brings it to the cluster, and presents it to a large HPC cluster. You can hit it with thousands and thousands of nodes, and at the end of the computation run it puts the data back into S3. So it's not intended for long-term storage; it's intended for a very, very fast HPC experience.
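To make the earlier object-storage point concrete — you can't edit an object in place, you download it, change it, and put it back — here is a minimal sketch of that round trip with boto3; the bucket and key names are made up for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key; S3 objects can't be edited in place, so an
# "edit" is really a download, a local change, and a fresh upload of the
# whole object.
bucket, key = "example-lab-raw-data", "plate-reader/run-42/results.csv"

s3.download_file(bucket, key, "/tmp/results.csv")

with open("/tmp/results.csv", "a") as fh:
    fh.write("A12,0.873\n")  # the local "edit"

s3.upload_file("/tmp/results.csv", bucket, key)  # replaces the object
```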
The one caveat with FSx for Lustre is that you have to have the Lustre client. Now, when we first started our cloud journey — I had just started at the company and proposed that instead of going out and buying a petabyte of storage, we should just use the cloud — the first reaction was: we're not putting our data on the Internet. So there is a bit of an organizational aspect to this. There are multiple layers when we put data into the cloud; it's not just putting it on the Internet. There's a lot of protection in place. The way we're organized at Celgene is around multiple project teams. A particular department or project will have a small team — what Amazon calls a two-pizza team — that is reasonably self-contained. They have all the experience and knowledge to create a particular platform or project: they do the architecture and their own programming, and if they need some database help, that's all self-contained within that project team. On top of that, or alongside them, is the cloud team. We make sure they have good architecture and are adhering to security practices, and we work very, very closely with the networking and security departments. There are certain services and functions that we blacklist and don't allow individual projects to use. For instance, we don't allow people to open up a VPC to the Internet; it's just not allowed. So we make sure all the security rules are followed. And of course, on top of us or below us, depending on your point of view, we rely heavily on the AWS services for security. Most of our computation takes place within a VPC, and none of our VPCs are internet facing; they can only be accessed from on prem. So if there's a bunch of servers, they can't be hacked from the internet, because they're not on the internet. To get to them, you have to come through our company network, go over the fiber connection — Direct Connect — into our VPC, and they're completely isolated from the internet. We also use Security Hub to do scanning on our logs and so on, and Security Hub has access to a tremendous amount of data that we don't have ourselves. For instance, flow logs: some of our accounts have flow logs on, some don't. But Security Hub will see a bad login attempt from a location we normally don't come from, and it will let us know that there was a bad login and maybe we should do something about it. We also do data encryption; new GDPR workloads are coming online, and depending on the storage subsystem, you can enforce an encryption policy. On premise, we only have a paper policy that says you have to encrypt, but in the cloud you can hard-code it: you can say that if your data is not encrypted, it cannot be uploaded. And of course we rely on AWS for the automation that we, as the cloud team, use as part of the company culture. We have automated processes that watch what people are doing: if someone creates a bucket — that's my next slide, I believe. So these are some of the rules; you don't have to take a picture of them. On the left-hand side are some of the policies we apply depending on the workload. If a GDPR workload comes along, we'll put some of these policies on that account; if they create a bucket, automation automatically picks that up and applies these policies, and they have to encrypt. On the right-hand side, for instance, we can put a bucket policy on so that the bucket can only be accessed from a particular VPC, only from certain IPs, or only through a VPC endpoint, depending on what the workload needs.
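As a rough illustration of that kind of guardrail, here is a sketch of a bucket policy applied through boto3 that denies unencrypted uploads and locks access to a single VPC endpoint. The bucket name, VPC endpoint ID, and exact statements are assumptions for illustration, not Celgene's actual policies.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-gdpr-workload"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # refuse any PutObject that doesn't request server-side encryption
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
        {   # refuse any access that doesn't arrive via the named VPC endpoint
            "Sid": "AccessOnlyViaVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": "vpce-0123456789abcdef0"}},
        },
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```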
In the middle, we have a couple of workloads where users use S3 as a home drive, so we have to keep people's workloads separate from others, and that's the bucket policy we use for that sort of thing. Now, the storage itself is one thing, but how do you get the data to the cloud? That's what the remainder of the talk is about. If you're going to use S3, one of the main tools we use to transfer is the Storage Gateway family, which includes DataSync. The Storage Gateway is an appliance that you can buy, or download as a virtual machine, that you put on your local network, and it presents a network share. From your users' point of view, it looks like a Windows share: you save your data to it, and boom, it pops up in the bucket. Same thing with DataSync; the difference is that it uses an existing network share, and the benefit of that is you can have a larger share. If you're coming from an organization that isn't quite ready for object storage but uses SFTP, you can use AWS Transfer for SFTP and the data will automatically go into your bucket. Now, the next couple of options are a lesson learned that you can take away from us. We thought this was going to be the way we would transfer data to the cloud, and it didn't work out so well. Because we're cloud experts, we thought we knew what we were doing; this is what we normally do, and our data engineering teams were fine with it: use the SDK, the CLI, or some sort of GUI to transfer data to the cloud. But when we rolled out CloudBerry and some other GUI tools to our end users, it did not take. We tried for about a year: you should use this, this will get your data to the cloud. But the concept of object storage was alien to our end users, and after about a year we decided maybe we should try something else; this was definitely not working. On the slide you see something that is crossed out, and some people do it, but we would caution you very strongly not to do this. It is an option, but please don't: some sort of FUSE mount, mounting an S3 bucket as a file system. Ten years ago, that's what we did; we thought we were being really cool that you could have a mount point on Windows and Linux and just mount a bucket. The problem with how these clients work is that they take your entire bucket listing and put it into memory, and if you want to edit a file, they download the file into memory, you make your edit, and then they upload it, all unbeknownst to the application. If you're a small business with a little Excel file, that's fine. But our buckets now hold hundreds of millions of objects, and when you try to cache hundreds of millions of objects in memory, that has a huge overhead; and our genomic files are 30 to 50 gigabytes each, and the client is going to try to put those in memory too. About a year ago we finally convinced some of our users not to do that anymore. The servers that had these mounts were crashing every five minutes. It was so bad that we had a cron job
that would just restart the service. But we finally convinced them not to do it. These clients are unstable, and you are almost guaranteed to have some data loss: if two clients have a bucket FUSE-mounted, one will make a change and the other won't see it, or they'll try to make a change simultaneously and step on each other. Next, Transfer Acceleration. It's not a transfer method by itself, but it's an additional option if you have to move data from across the world to a bucket in another region. You turn the feature on, you get a different address for your S3 bucket, and it uses the Amazon backbone to move your data to the bucket so it doesn't have to ride your company's network the whole way. The Amazon network is faster than ours, so it definitely helps over long distances: maybe a third to 50 percent improvement in transfer time. If you're using EFS, again the DataSync client works reasonably well, and there's also simply mounting your EFS file system from on premise and rsyncing data up. Same thing with FSx, and FSx also supports Storage Gateway. If you're running EBS on EC2, traditional NFS or SMB copies from Windows work just fine. Snowball: we use Snowball for transferring large data sets where time isn't necessarily a hindrance. When you put in the order for a Snowball, it takes at least a few days to arrive; you set it up, take it into your data center, spend a couple of days copying your data over, close it up, the FedEx person shows up, and you send it off. So it's a minimum two-week process, but each device can hold up to 100 terabytes. It can be good if you're trying to send data that's just never going to make it over a T1; Snowball is the way to do it. We also used to use third-party products to copy data — Aspera and NetApp tools — but we had some issues with them and we no longer do. If your organization uses them, they will absolutely work. We found that a lot of these tools require an EC2 instance to be stood up in your account, working with some sort of virtual machine on your local network to transfer the data back and forth. However, a lot of these tools encapsulate your data within some sort of proprietary binary. They will say, we will put your data in S3 — and that's absolutely true, but it can only be accessed by that particular client. If you're only doing a couple of files, that's fine. But if you're hitting it with 8,000 nodes, that one EC2 instance becomes a huge bottleneck, plus it's additional expense and maintenance that you otherwise don't need. Now, getting your data into the cloud can be easy or it can be very, very hard. We do a lot of work with startups and academics, and we see that the small companies we work with all go over the Internet. But if you're in this room, you're probably biotech or pharma, and that's not secure enough: generally, if you're going over the Internet, security is by IP address, maybe some sort of login, and for us that's not secure enough. Once companies start to mature in their cloud journey, they'll do a VPN connection. It still runs over the Internet; however, there's an IPsec tunnel between Amazon and yourself that encrypts all the data and hides it from the bad people. And of course, there's the big kid in the room, Direct Connect, and we strongly recommend it. It comes in different flavors: if your organization is looking at Direct Connect, it comes in two major categories, as we put it.
Less than one gigabit, and one gigabit and over. It seems like one gigabit is just ten times a hundred megabit, right? But that's not the case. Anything less than 1 gigabit only has one virtual interface, and if you're a network nerd, that matters a great deal, because right off the bat you're going to need at least two. If you buy a 100-megabit connection, you'll actually need two 100-megabit connections, and that alone costs more than a single gigabit connection. A one-gigabit or ten-gigabit connection comes with 50 virtual interfaces. If you're wondering what a virtual interface is: every VPC that you create will need a virtual interface, and you'll also need one or more interfaces for public services like S3. Now, if you have Transit Gateway or another topology, you won't necessarily need more than a few, but you're going to need at least two. So that's why I recommend one gigabit and up, and you also get the additional benefit of faster speed. This is how we connect. Like I said, we are a global company with sites all around the world. I think combined we have about 100 accounts and about 200 VPCs, give or take, and different workloads need to work with other workloads.
So we separate our accounts out into different projects and departments. One department will have their costs in one account and their own VPC, but they may work very, very closely with another department and need to exchange data with them, even though they may be in different regions. So in our network we have a mesh that connects all of our sites, so any site can get to any VPC in any region. If we were to do this as point-to-point Direct Connects for each site, we would need over 30 Direct Connects, and we don't want that mess. So instead of every site having a separate connection to one of the regions we use, they just connect to the mesh, and we have an Equinix colocation facility where all of the networks meet, with a connection from the Equinix facility to Amazon. What you see on the slide are the sites at the bottom that connect to the mesh and then go to a customer cage; we have maybe a few hundred meters of fiber-optic cable that runs to the Amazon cage and from there into the Amazon network. You see the two purple icons at the top: one for the public endpoints and one for the VPCs. But we have many, many VPCs, so on the right-hand side you see a lot of those. Bringing it all together, this is how our architecture looks from on premise all the way to Amazon. On the left-hand side you have our research scientists: they go to their bench, do their experiment, load some plates, go to the laboratory PC, hit File Save, maybe do some preliminary analysis. And that's all they need to know; from then on, the automation kicks in. When they hit File Save, it goes to the local file server, and for us that's a NAS device, a NetApp. We specifically picked the NetApp — not a Windows server, not a Linux server — because NetApp supports multiple protocols on the same share or export. That's a huge win for us: we can have Apple, Linux, and Windows all working off the same directory. Those files are then picked up by the DataSync agent running on a VM, which moves them up with the AWS DataSync service and pops them directly into S3. You can also use Storage Gateway. We use a NetApp because it's big — our NAS is over a petabyte, with multiple redundant controllers. But if you're only doing plate readers and HPLCs and you don't have the NGS workloads, you don't need a large petabyte system; you can get by with a Storage Gateway, whose maximum share size is 16 terabytes. If your workloads fit within that, you can absolutely do it, and it's far easier and cheaper, too. Once our data gets to S3, it can kick off all these other workloads that we have: we have some users using Athena for queries, and SageMaker if you're doing AI, all on the same bucket. And then there are our CROs, at the top. We struggled in the beginning working with the CROs; they only wanted to FedEx hard drives to us. We said, you should really go to S3, and 80 percent of our CROs now support S3: they send data directly to our bucket and we're happy. We have a couple that are still resisting, but they support SFTP, so we don't have to maintain an FTP or SFTP server for them; we say, here are your credentials, and they can upload automatically into S3.
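To illustrate the "data lands in S3 and kicks off other workloads" point, here is a hedged sketch of querying that landed data with Athena from Python; the database, table, and result-location names are hypothetical, and the table is assumed to have been defined over the S3 data (for example, by a Glue crawler).

```python
import boto3

athena = boto3.client("athena")

# Hypothetical Athena database and table defined over the instrument data
# that DataSync drops into S3; the output location is also made up.
response = athena.start_query_execution(
    QueryString="SELECT plate_id, AVG(signal) AS mean_signal "
                "FROM lab_results GROUP BY plate_id",
    QueryExecutionContext={"Database": "lab_analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Started Athena query:", response["QueryExecutionId"])
```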
And with that, next up will be Pascal, who will talk about digital pathology using the data that we have in S3. Thanks, Lance.
Good afternoon. My name is Pascal, and I'm the IT director for translational research at Celgene, now BMS. The infrastructure that Lance just outlined, and that we built at Celgene, enables us to share information and data with our colleagues and collaborators all over the world, and one of the applications that takes great advantage of that infrastructure is our digital pathology. Pathology is mainly concerned with phenotyping, or classifying, the tissue and cell types of patients. In the workflow that we have created, the samples are stained and labeled: you apply a color to highlight the tissue or the presence of a molecule in a cell in the image. This assay development happens up front, and the slides are then scanned, typically using a microscope, and the stains show up in different colors. We combine these colors into an image where we can investigate what has been labeled. Typically, multiple pathologists — experts in reading these images — are involved in analyzing the state of the sample and therefore help determine the course of treatment for the patient, and depending on their availability, they can be all over the country. Not too long ago, less than two years ago, we would save these images on a disk drive and ship it to the pathologists. That would take about two weeks, and then we would have to create a session with them where they would all look at the images and analyze them collaboratively. With the infrastructure that we have now, we can bypass the shipping of hard drives to the pathologists: basically, directly from the microscope, we upload the images to an S3 bucket, and we skip that step that takes about two weeks — time that is critical for the patient. Pathologists and experts all over the country can now collaborate in real time, with special tools available to manipulate the images: not just to look at them, but to pan, to crop, to zoom in, to annotate, and to analyze them. So what I'm going to explain here is what we can do next. Now that we have these images with the pathologists, can we do more? For example, can we help them in predicting the phenotype of these tissues and cells? I must highlight that what I'm about to explain here is a very simplified version of a larger effort at Celgene; the scope of the analysis and research is much broader and deeper than I'm explaining here, but the approach is essentially the same. What we want to do is this: on the top left you see a raw image. We want to find the cells in this image — do a segmentation — and then, from the regions of the cells, extract some features. We have about 130-plus features that we have defined for each of these cells, and we want to know if we can use them to train a model that can help us with phenotype prediction. One approach is to use commercial software: every vendor will ship you commercial software that helps with segmentation of cells in an image, but they take a middle-of-the-road approach. They're not specific to the cell types you want to see; in general, they overestimate the number of cells, and the cell boundaries are too wide. So what we want to know is: can we create our own process based on machine learning and adapt it to our cell types? So how could AI aid in the process?
What we're looking at is natural object segmentation, where there is a pixel-wise assignment of a class to identify the objects. On the left side you see a famous image with three object classes: one is the person, one is the bike, and one is the background.
On the right side, you see a different type of image — a cell image — that we have fed through our algorithm, a pixel classification algorithm explained in much more detail in other sessions, to arrive at an output that outlines these cells. So why do we want to use deep learning in image analysis? I thought this was a fun slide to look at. If you look at the error rate in image analysis over the years: before 2012, before machine learning came on the stage, it was all based on classical computer vision, and these methods had an error rate of about 20 to 30 percent. Once we introduced machine learning and deep learning, and learned how to use it and improve on it, you see these error rates drop drastically, and around 2015 it started to become better than human classification. So that's why we want to use it. If you look at the images that we want to process, there are a few challenges we have to take into account. One of them is that there are so many different types of cells that you want to image, and so many different types of
experimental conditions, that it is hard to find training data — more specifically, to find the ground truth for these image types. Granted, there are now lots of teams around the world looking at this and publishing their efforts, so more and more ground truth is becoming available for these types of images.
The pixel value variation — the gray value — is also a concern. These are 16-bit images, and you see cells ranging from deep dark to bright gray.
Another typical issue: tumor cells tend to cluster, so there is not enough background to segment them away from the background and identify the individual cells. And the last thing we need to take into account is the sheer number of cells — as you can see here, there are thousands of them. So the first thing we tried is: what about the models that are already out there, already trained? We tried Mask R-CNN, which is a region-based convolutional neural network, with a ResNet-50 backbone. As you can see, the results are kind of disappointing: it doesn't find all the cells, and it cannot separate the cells that are touching. But that's actually not that surprising, because it wasn't trained to do this. So we had to do it ourselves.
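For reference, trying an off-the-shelf pretrained model like this is only a few lines with torchvision; this is a minimal sketch, and the image file name and score threshold are assumptions rather than the talk's actual setup.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.  COCO has no
# "cell" class, which is one reason the out-of-the-box results on microscopy
# images are disappointing.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = to_tensor(Image.open("cells.png").convert("RGB"))  # hypothetical image
with torch.no_grad():
    pred = model([img])[0]

keep = pred["scores"] > 0.5  # keep only reasonably confident detections
print(f"{int(keep.sum())} objects detected")
```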
The tools that we use: we use SageMaker, and within SageMaker, scikit-image, which is a Python-based image processing package; OpenCV, which is an open-source computer vision library of classical image analysis algorithms; and PyTorch, a Python machine learning framework.
If we look at the segmentation, we divide it into two steps — again, this is a simplified approach. The first thing we apply is a semantic segmentation using U-Net, which is itself a CNN-based segmentation network for medical images. This separates the cells from the background; it does not separate the cells that are touching. The next step is cell detection using a Faster R-CNN algorithm, which is also region based and draws bounding boxes around the objects. These two together we combine to create the mask of the cells.
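A rough sketch of how those two outputs could be combined; the `unet` and `detector` models are placeholders for networks trained elsewhere, and the output shapes and thresholds are illustrative assumptions, not the talk's actual pipeline.

```python
import numpy as np
import torch

def combine(unet, detector, image_tensor):
    """Turn a U-Net foreground map and Faster R-CNN boxes into watershed inputs."""
    with torch.no_grad():
        # Assumed: unet returns a 1-channel logit map for a single CxHxW image.
        fg_prob = torch.sigmoid(unet(image_tensor[None]))[0, 0]      # HxW in [0, 1]
        # torchvision-style detector: list of images in, list of dicts out.
        boxes = detector([image_tensor])[0]["boxes"].round().long()  # Nx4 (x1, y1, x2, y2)

    foreground = (fg_prob > 0.5).cpu().numpy()

    # One seed marker per detected box centre; the marker-based watershed in
    # the next step uses these seeds to split touching cells.
    markers = np.zeros(foreground.shape, dtype=np.int32)
    for i, (x1, y1, x2, y2) in enumerate(boxes.cpu().numpy(), start=1):
        cy, cx = (y1 + y2) // 2, (x1 + x2) // 2
        markers[cy, cx] = i

    return foreground, markers
```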
This is the result of the segmentation. On the left side you see the image that we used in our example, and on the right side you see the result: yellow is cell, purple is background. The Faster R-CNN detection draws boundaries around all these cells. It's hard to see, so if you zoom in a bit you see all these bounding boxes around the cells, and if you take a closer look, you see that it actually finds the cells that are touching; it will separate them based on the characteristics of the model. Now, these two together — the object mask and the bounding boxes — we combine using a marker-based watershed to find the actual outlines of the cells, and on the right side you see the results. If we compare that to the vendor software result on the left side — well, I don't know which one is better, so I asked our scientists. If you zoom in a little, you see that our cells are more uniformly shaped; the vendor's cells are bigger and there are more of them, so there are more false positives. Now, sometimes cells aren't round, so it's not a good thing to find round cells all the time. But if you look at the highlights in the red boxes, you see the difference between the two methods: in this case they find the same cells, but I think the cells that we find have a better boundary. So now that we have the masks, we're going to extract the features based on these masks for the cells and feed them into our machine learning to do phenotype prediction. The tools that we use there are again SageMaker, pure Python, and the scikit-learn package.
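The marker-based watershed step mentioned above can be done with scikit-image; a minimal sketch, assuming the `foreground` mask and `markers` seed array from the previous sketch.

```python
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def split_touching_cells(foreground, markers):
    """Split the binary foreground into one labelled region per seeded cell."""
    # Distance transform: pixels far from the background form basins that the
    # watershed floods outward from each marker, separating touching cells.
    distance = ndi.distance_transform_edt(foreground)
    labels = watershed(-distance, markers, mask=foreground)
    return labels  # 0 = background, 1..N = individual cells
```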
I'll go back to this image. This is actually our test image, where we have a population of cells of which some are CD3 positive, and we put in a label. What we're asking the machine learning to do is, based on the features it created, point out which cells are CD3 positive. All of these features are extracted for every cell region that we find — and this is a very sparse image; remember from the previous slide that we have many, many more. So that's 130-plus features for each of these regions. We fed those into a number of machine learning algorithms and ranked them based on accuracy compared to what a human would classify as CD3 positive. It is by no means optimized, but you can see that the first four or five methods have an accuracy of 95 percent or more, which is amazing compared to what we had before with classical computer vision.
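A hedged sketch of that per-cell feature extraction and classifier comparison; the handful of regionprops features and the two classifiers shown are a small illustrative subset, not the actual 130-plus features or the full model list from the talk.

```python
import numpy as np
from skimage.measure import regionprops_table
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rank_classifiers(labels, intensity_image, cd3_truth):
    """Rank a few classifiers by cross-validated accuracy on per-cell features.

    labels: labelled mask from the watershed step
    intensity_image: the CD3-channel image
    cd3_truth: human CD3-positive/negative call per labelled cell (0/1)
    """
    props = regionprops_table(
        labels, intensity_image=intensity_image,
        properties=("area", "eccentricity", "perimeter",
                    "intensity_mean", "intensity_max"))
    X = np.column_stack([props[k] for k in sorted(props)])
    y = np.asarray(cd3_truth)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200),
    }
    scores = {name: cross_val_score(m, X, y, cv=5).mean()
              for name, m in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```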
So the lasso regression algorithm seems to perform best. The final step was: if we do a principal component analysis, which of these features actually contribute most of that accuracy — in this case 96 percent? So we rank-ordered them, and it turns out the first 10 or so make the biggest contribution. So you can trade off the computational effort of calculating all of these features for every cell against a little bit of accuracy and be much more efficient. So here we have it: we have all these images being delivered to pathologists anywhere in the world, who can access them and have tools to look at them. And on top of that, we can highlight these images and annotate them with regions of interest that say: hey, this is a part of the sample that is highly positive for, in this case, CD3 — you might want to take a look at it.
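One plausible way to rank-order feature contributions like that, sketched with scikit-learn; the use of LassoCV and the cutoff of 10 features are assumptions for illustration, not the talk's exact method.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def top_features(X, y, feature_names, keep=10):
    """Return the `keep` features with the largest absolute lasso coefficients."""
    Xs = StandardScaler().fit_transform(X)          # put features on one scale
    lasso = LassoCV(cv=5).fit(Xs, y)                # sparse linear fit to CD3 labels
    order = np.argsort(np.abs(lasso.coef_))[::-1]   # most influential first
    return [feature_names[i] for i in order[:keep]]
```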
I can see a future where the role of these algorithms becomes much more prominent, where the pathologist is only involved if an issue has been detected, and for all the other analyses we take the machine's results. So that's where we are at this point. I'm going to hand it back to Sam.
So thank you very much. I would like to remind you of the other life science sessions that we have, and there's a health care and life sciences lounge down the hallway and to the left, where we will be afterwards to answer questions as well. But we also have a microphone set up here, and I'll run around the side of the room, so we are available to take any questions from the audience right now. If anybody has any questions, we're happy to answer.
Raise your hands. There's a microphone right up here if you want to come up — just so I can hear you, if you don't mind. Thank you.
Hi, quick question for Lance. Your architecture didn't have databases at all. Are you using databases, Redshift, or anything like that? Do they play a role as well?
Yeah, we're a big proponent of database as a service as well. So RDS, yes: we have a number of workloads using RDS MySQL, a little bit of Aurora depending on what your needs for recovery are, and I think a couple of use cases on MariaDB, which is the functional equivalent of MySQL. We also have a few workloads using DynamoDB. Within discovery we don't have any Redshift — I think our sales folks are using Redshift — but today's topic was primarily about getting the file data out of the labs. So yes, we are using some database services.
Any other questions? OK. Like I said, I do encourage you: there is the health care and life sciences lounge, and we'll be down there to answer any questions or just generally meet everybody. Once again, I really thank you all for coming and listening, and have a good evening.