Disclaimer: Automated Transcript

00:00

Hello everybody, thank you very much for coming today. We're going to talk a little bit today about the labs of the future and certain things that Celgene is doing within life sciences. We have an agenda we're going to work off of today: we've got a few introductions we want to do, and then we're going to move into two phases. One of them is going to be how we integrate AWS in the laboratory space, and the last example is using digital pathology to see what we can do with this technology.

00:38

Today we have with us a very early adopter and long-term customer of AWS, Celgene. They're currently storing, processing, and analyzing about 20 petabytes of data. They're working on three major R&D initiatives with us: one is the Myeloma Genome Project, and the others are around computational chemistry and AI workloads. They've recently done a merger with another very long-term and good customer, Bristol-Myers Squibb, and this new company hopes to become the leader in biopharma.

01:13

Today we have two speakers with us. One is Lance Smith, who is the Director of R&D Cloud and HPC, and the other is Pascal, who is the Senior Director of Translational Computation. Both are long-term IT professionals and research scientists, and both come from the Celgene side of the world.

01:45

So what are we facing today, and what are the challenges that these labs are facing? We've broken it into four parts. One: we see a bifurcation of data between the wet lab and the computational space, but in reality this is a unified workflow that needs to be data-driven. You do your research in your wet lab, then you run it through different instrumentation, and that instrumentation produces large amounts of data that we then have to analyze. We have found that we've gotten most of the low-hanging fruit that we can get, and we need to move to more challenging targets and come to the concept of a batch size of one. So there's a need for data-driven techniques, and then for using that data to optimize those workflows. And finally, we have multiple streams of data coming in that need to be combined and then tagged so they act as one cohort.

02:50

What we're hearing from organizations is that there are four major problems, or major challenges, they're facing. There's an inability to aggregate and share the data being collected, both internally and with researchers and peers externally. We have these data streams coming in; we need to combine them and, more importantly, tag and catalog them so we can actually make them actionable, and so we have the ability to go back and do look-backs — so that when we're looking for different methods or different paths in research, they're there to search for. These workflows, unfortunately, are not optimized, especially when it comes to some critical lab equipment, and much of this lab equipment carries very large capital investments. And then the thing that I think, as people and as employees, we find most troublesome is the repetitive, low-value manual tasks that these researchers have to do every day, in and out. If we can move some of these tasks to an automated fashion and take that workload off of them, it allows them to do what they actually got into science to do and actually understand research.

04:07

So the first thing we're going to move into is the cloud integration and how we've integrated these labs, and with that I'd like to hand this over to Lance.

04:17

Thank you, Sam. Good evening, everyone. My name is Lance Smith, I'm from Celgene, now part of the BMS family, so we're excited about what the future holds for us. We are a global pharmaceutical company focused primarily on oncology and immunology, and we span drug discovery to clinical to manufacturing and sales. So it's a really challenging environment: R&D and commercial all in the cloud, and how do we keep them separated? How do we empower our scientists to be able to do all this new cool stuff in the cloud? We have a global footprint, which also makes it very tricky to send our data from on-premise, from our wet labs, into the cloud. But the good thing is we have a cloud-first strategy, and that's not just our CIO saying we have a cloud-first strategy — we have about 50 or 60 people attending re:Invent just to learn. IT people, data scientists, engineers, and actual bench scientists are here to learn how they can do their data processing in the cloud. So this is the big thing, and we really fully believe in AWS.

05:23

Now, the biotech lab environment — if you're in this room, you know what it is — is very, very complicated. It's very hard to put into the cloud: it's all this really messy wet experimental work, yet we want all that data to move to one centralized location in the cloud to make the processing much easier. However, if you walk into a lab, at a startup you'll see dozens or hundreds of instruments; at a Celgene you'll see thousands of these things; and at my last company we supported over 10,000 lab systems. The corporate side of me says, can't we just standardize? But if you go into any chemistry lab, you'll see HPLCs — Agilent 1100s and 1200s — with different firmware and different connectivity, and whichever scientist bought one will have different licensing, so they have a different version of the software, and each individual scientist will save their files slightly differently. So there's no such thing as a standard in the lab. And of course we have dozens of sites around the world, and every scientist does things slightly differently, so even if two scientists are using the same instrument, they could be doing completely different things with it — for instance, with a little plate reader, they can have a different plate layout and different dilutions in a plate. That makes things really difficult when we're trying to standardize that data.

06:49

And then there's storage and networking. Officially I came from corporate IT, and we could provide some storage to our research scientists, but we were a cost center, so we were only given a certain amount of budget to provide for our researchers. If a researcher comes along and says, "I could use 50 terabytes a week" — I'm not budgeted for that, yet we still have to support it, and that's our job as IT professionals, to support our scientists. So how do we do it?

07:17

Ultimately we want to move our data to the cloud, and not just for archiving purposes. We found a lot of our scientists were trying to get around IT: they have a job to do, they have data they need to process, but as IT we aren't able to provide them petabytes of storage, so they just save data locally — "I'll just save it locally and not tell IT" — and they wouldn't tell IT until the lab system crashed. Five years down the road I get a phone call: "How come you didn't back up my PC?" "I didn't know you had data on there." Multiply that by a thousand and we have a huge problem in our labs. We estimated we have approximately a petabyte of hidden data in our labs, and right now we have a project going on to scoop all of that up and move it to AWS — that's one of our huge projects right now. Once it's in S3 or another storage tier, we can put it into tiering — S3 IA, cold storage, and Glacier — to make it even cheaper. And interestingly enough, once we centralize it, we can make it available to our data science teams, who can do data analysis and AI — Pascal is going to talk later about how we can do that now that we have the data in the cloud.
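
To make the tiering Lance mentions concrete, here is a minimal boto3 sketch that applies a lifecycle rule so older lab data transitions to S3 Infrequent Access and then Glacier. The bucket name, prefix, and transition windows are hypothetical placeholders, not Celgene's actual configuration.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "example-lab-archive-bucket"  # placeholder bucket name

s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-lab-data",
                "Filter": {"Prefix": "instruments/"},  # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    # Move to Infrequent Access after 30 days...
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # ...and to Glacier after 180 days.
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```

Once a rule like this is in place, objects age into the cheaper tiers automatically, which is what makes "scoop up the hidden data and park it in S3" economical.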

08:37

Now, it's very hard, of course, to move that data — not only because our labs are challenging in themselves, but because of the applications. Some of our applications go back 20 to 30 years, and some of the vendors don't exist anymore. Celgene isn't that big, so we can't go to Agilent and say "cloud-enable your software" — that's just not going to happen. We estimate about 1% of our applications are cloud-aware, and those are our NGS workloads; however, those applications are hard-coded to save their data to a vendor's bucket — they won't allow us to save our data to our own bucket — so working with vendors can be challenging. Also, we don't want to change our scientists. They went to school, and it is their job to invent new products and new cures and therapies. We don't want to say "you have to do this because we're IT" — no, it's our job to help them do what they do best, and that is science. And we do have a challenge with data transfer: we're generating terabytes and terabytes every single day, so how do we move that to the cloud? I'll show you how we do that coming up.

09:44

But real quick — this is really not a sales pitch, I know it starts off like that — why do we use AWS storage? Why aren't we just buying servers on premise? One, for us it's a huge benefit that there's no maintenance: when you upload your files to Amazon, they take care of it. We don't have to patch it, we don't have to deal with viruses. You know those emails — "system X is going down this weekend for upgrades" — we don't have that problem. It's also a great value: we cannot beat the cost of S3, it is so incredibly cheap relative to what we do on premise. We've gotten some pushback — we have some large projects at petabyte scale, and for some of those workloads the storage alone is over a hundred thousand dollars a month — but when they see the multi-million-dollar quote from corporate IT to do it on premise, it's "all right, we're good, we believe you."

10:35

Another benefit we didn't anticipate when we first started our cloud journey is the integration with cloud services. I started in the cloud a long time ago, when the cool things like Athena and SageMaker didn't exist — AWS was just an infrastructure play — but it has matured so much that we can do all this new cool stuff just by having the data there. So not only do we support the primary use cases and the computation, but our data science teams can do so much more with the data because it's centralized and the Amazon environment has these tools.

11:08

And of course you'll see a lot of slides that say AWS storage is infinitely scalable. That's absolutely true — you can consume and spend as much as you can take — but what's more important to us, and it's not really on the slide, is that it's infinitely scalable immediately. Not too long ago a scientific project came to me and said, "Lance, we could use a hundred terabytes of storage," and we were thinking, oh, we could do that. What was wrong was that the 100 terabytes was really two to three hundred every quarter, and that was for one collaborator out of five. So that 100-terabyte request quickly turned into a couple of petabytes — "and if we could turn that around in two months, that would be great." There's no way we could do that on premise. In the cloud we now have the ability to operate at the speed of science, where IT is no longer in the way: someone needs a petabyte — here you go, done. You just have to pay for it.

12:12

Storage options. This is how we see storage at AWS, and this is our preference — there's actually an order here. Our first preference is to push people to S3, because it's super cheap, has a lot of security built in, and has very high bandwidth. It is kind of high latency, but it can work very fast: we have large clusters that go to 8,000 nodes simultaneously accessing petabytes, all from S3. It is higher latency, so it takes a little getting used to, and it's object storage, and a lot of traditional IT people have a hard time grasping what object storage is. It is different. A lot of on-premise people think "I can just click on a link and it pops up" — object storage isn't like that. You can't edit a file in place; you have to download it, edit it, and put it back. So it works very differently.

13:06

If we have a workload that can't use S3, our next option is EFS. This is a managed NFS-type service: you mount it on your Linux workstation or your EC2 instances or VMs and it just works — you don't have to do anything with it. And there's an IA tier now where you can save up to 90% on the cost, so there are some cost-saving options there.
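
The EFS cost-saving option referred to here is the Infrequent Access lifecycle policy. As a sketch (the file system ID below is a placeholder), it can be enabled with a single boto3 call:

```python
import boto3

efs = boto3.client("efs")

# Hypothetical file system ID for illustration only.
efs.put_lifecycle_configuration(
    FileSystemId="fs-0123456789abcdef0",
    LifecyclePolicies=[
        # Files not accessed for 30 days move to the cheaper IA storage class.
        {"TransitionToIA": "AFTER_30_DAYS"},
    ],
)
```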

13:31

However, EFS is a Linux option. If you have Windows workloads, FSx for Windows works reasonably well and integrates with Active Directory. And if those three don't work — and we have a couple of use cases — we will then use EBS storage. It's block storage that you can attach, and you can have any type of file system, but you have to make the file system, you have to do your backups, and you have to maintain the EC2 instances: if they go down, someone has to go and fix them. That's why we prefer the first three, but there are some use cases where we have to do EBS on EC2.

14:07

Now, there are two options here that I didn't list; I'll go over them — they're edge cases. First, ephemeral storage: many of the EC2 families have SSDs directly attached to the hardware, with tremendous performance, hundreds of thousands of IOPS. However, if you're an EC2 expert, you'll know that if you stop your virtual machine, when you spin it up it could come up on the other side of the data center, and the SSD attached to the hypervisor doesn't move. So if you save data to ephemeral storage and turn your server off, you're going to lose all your data. It's very useful for scratch storage, though — we use it for our NGS workloads, petabytes and petabytes every day — but we don't use it for any sort of long-term storage. Not dissimilar is FSx for Lustre. It sounds like the same thing as FSx for Windows, but how it works is that it grabs a copy of the data from S3, brings it to the cluster, and presents it to a large HPC cluster; you can hit it with thousands and thousands of IOPS, and at the end of the computation run it puts the data back into S3. So it's not intended for long-term storage; it's intended for a very fast shared-file-system experience, but you have to have the Lustre client.

15:17

When we first started on our cloud journey — I had just started at the company — I proposed to a project that we not go out and buy a petabyte of storage and use the cloud instead, and the first reaction was, "we're not putting our data on the internet." So there is a bit of an organizational aspect to this slide: there are multiple layers when we put data into the cloud; it's not just putting it on the internet, there's a lot of protection in place. The way we organize it at Celgene is that there are multiple project teams. A particular department or project will have a small — as Amazon calls it, a two-pizza — team that is reasonably self-contained: they have all the experience and knowledge to create a particular platform or project, so they do their own architecture and their own programming, and if they need some database help, that's all self-contained within the project team. On top of that, or alongside them, is the cloud team. We make sure they have good architecture and are adhering to security practices, and we work very closely with the networking and security department. There are certain services and certain functions that we blacklist, that we don't allow individual projects to use — for instance, we don't allow people to open up a VPC to the internet, it's just not allowed — so we make sure all the security rules are followed.

16:43

And on top of us, or below us depending on your point of view, we rely heavily on the AWS services for security. Most of our computations take place within a VPC, and none of our VPCs are internet-facing: they can only be accessed from on-prem. So if there's a bunch of servers, they can't be hacked from the internet, because, well, they're not on the internet. To get to them you have to come through our pharma company network, go up the fiber connection, Direct Connect into our VPC — they're completely isolated from the internet.

17:14

We also use Security Hub to do AI scanning on our logs and so on, and Security Hub has access to a tremendous amount of data that we don't — for instance flow logs: some of our accounts have flow logs on, some don't, but Security Hub will see a bad login attempt from a location we normally don't come from and let us know: "we see some bad logins, maybe you should do something about that."

17:43

We also have to take care of new GDPR workloads coming online, and depending on the storage subsystem, you can enforce the encryption policy. On-premise we have a paper policy — "you have to encrypt," right — but in the cloud you can hard-code it: you can say that if your data is not encrypted, it cannot be uploaded. The automation that we use as the cloud team comes from a company called Turbot, so we have automated processes that watch what people are doing — if something creates a bucket — and that's my next slide, I believe.

18:14

So these are some of the rules; you don't have to take a picture of them. On the left-hand side are some of the policies we apply depending on the workload: if a GDPR workload comes along, we'll put some of these policies on that account, and if they create a bucket, automation automatically picks that up, applies the policies, and they have to encrypt. On the right-hand side, for instance, we can put bucket policies on so that the bucket can only be accessed from a particular VPC, or only from certain IPs or a VPC endpoint, depending on what the workload needs. And in the middle, we have a couple of workloads where users use S3 as their home drive, so we have to keep people's workloads separate from others, and that's the bucket policy we use for that sort of thing.
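
As an illustration of the kind of bucket policies described here — not Celgene's actual policies — the sketch below denies uploads that don't request server-side encryption and locks a bucket down to a single VPC endpoint. The bucket name and endpoint ID are placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "example-gdpr-workload-bucket"   # placeholder
VPC_ENDPOINT = "vpce-0123456789abcdef0"   # placeholder

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Reject any PutObject that does not request server-side encryption.
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
        {
            # Only allow requests that arrive through the named VPC endpoint.
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": VPC_ENDPOINT}},
        },
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```

A blanket deny like the second statement also blocks console and CLI access from outside the endpoint, so in practice a policy of this kind would be rolled out carefully, typically by the automation the speakers describe.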

19:05

Of course, the storage itself is one thing, but how do you get the data to the cloud? That's what the remainder of the talk is about. If you're going to use S3, one of the top options we use to transfer is the Storage Gateway family, which includes DataSync. The Storage Gateway is an appliance that you can buy, or download as a virtual machine, and put on your local network. It presents a network share — from your users' point of view it looks like a Windows share — you save your data to it and, boom, it pops up in the cloud. Same thing with DataSync; the difference is that it uses an existing network share, and the benefit of that is you can have a larger share. If you're coming from an organization that isn't quite object-aware yet but uses SFTP, you can use AWS Transfer for SFTP, and data will automatically go into your bucket.

19:53

Now, the next couple of options — this is a lesson learned that you can take away from us. We thought this was going to be the way we would transfer data to the cloud, and it didn't work out so well. We thought, because we're cloud experts, we kind of know what we're doing — this is what we normally do — and our data engineering teams were like, "great, we use the SDK, the CLI, or some sort of GUI to transfer data to the cloud." But when we rolled out CloudBerry and some other GUI tools to our end users, it did not take. We tried for about a year — "you should use this, this will get your data to the cloud" — but the concept of object storage is alien to our end users, and after about a year it was "maybe we should try something else, this is definitely not working."

20:36

On the slide you'll see something that is crossed out. Some people do it, but we caution you very strongly not to: it is an option, but please do not use a FUSE mount, or mount an S3 bucket as a file system. Ten years ago that's what we did — we thought we were being really cool, having a mount point on Windows or Linux and just mounting a bucket. The problem is how these clients work: they take your entire bucket listing and put it into memory, and if you want to edit a file, the client downloads the file into memory, you make your edit, and it re-uploads it, all unbeknownst to the application. If you're a small business with a little Excel file, that's fine, but our buckets now hold hundreds of millions of objects, and caching hundreds of millions of keys in memory has a huge overhead. And our genomic files are 30 to 50 gigabytes each, and the client is going to try to put those in memory too. About a year ago we finally convinced our scientists not to do that anymore — the servers with these mounts were crashing every five minutes, so bad that we had a cron job just to restart the service — but we finally convinced everyone not to do it. The file locking is also unstable: you are almost guaranteed to have some data loss, because if two clients have a bucket FUSE-mounted, one will make a change, the other won't see it, they'll try to make changes simultaneously, and they'll step on each other.

21:59

With that, transfer acceleration is not a transfer method in itself, but an additional option you have for moving data from across the world into a bucket. You turn the feature on, you get a different address for the S3 bucket, and it uses the S3 back end to transfer your data, so it doesn't have to ride your company's network — and the Amazon network is faster than ours, so it definitely helps over long distances: maybe a thirty to fifty percent improvement in transfer time.
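
For completeness, here is a hedged sketch of what Transfer Acceleration looks like from the SDK side — enabling it on a (placeholder) bucket and then uploading through the accelerate endpoint; the file name is also just an example.

```python
import boto3
from botocore.config import Config

BUCKET = "example-global-transfer-bucket"  # placeholder

# One-time: enable acceleration on the bucket.
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket=BUCKET,
    AccelerateConfiguration={"Status": "Enabled"},
)

# Uploads then use the <bucket>.s3-accelerate.amazonaws.com endpoint.
s3_accel = boto3.client(
    "s3", config=Config(s3={"use_accelerate_endpoint": True})
)
s3_accel.upload_file(
    "run_20191203.fastq.gz",          # placeholder local file
    BUCKET,
    "ngs/run_20191203.fastq.gz",      # placeholder object key
)
```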

22:29

If you're using EFS, again the DataSync client works reasonably well, and there's also simply mounting your EFS file system from on-premise and rsyncing up. Same thing with FSx — and FSx also supports Storage Gateway. If you're running EBS on EC2, the traditional Unix copies, or SMB copies from Windows, work just fine.

22:52

Snowball — we use Snowball for transferring large data sets where time isn't necessarily a hindrance. When you put in an order for a Snowball, it takes at least a few days to arrive; you set it up in your data center, spend a couple of days copying your data over, close it up, the FedEx person shows up, and you send it off. So it's a minimum of a two-week process, but each device can hold up to a hundred terabytes, so it can be good if you're trying to send data that would otherwise crawl over a T1 — that's just not going to happen.

23:22

There are some other ways to do it. We used to use third-party products to copy data — Aspera and the NetApp tooling. We had some issues with them and no longer use them, but if your organization uses them, they will absolutely work. We found that a lot of these tools require an EC2 instance to be stood up in your account, working with some sort of virtual machine on your local network to transfer the data back and forth. However, a lot of these tools encapsulate your data in some sort of proprietary binary. They will say "we will put your data in S3" — that is absolutely true — but it can only be accessed by that particular client. If you're only doing a couple of files, that's fine, but if you're doing it with a thousand nodes, that one EC2 instance becomes a huge bottleneck, and it's additional expense and maintenance that you otherwise don't need.

24:16

Now, getting your data to the cloud can be easy or it can be very, very hard. We do a lot of work with startups and academics, and we see that the small companies we work with will go over the internet, but if you're in this room you're probably big biotech or pharma, and that's not secure enough. Generally, if you're going over the internet, security is by IP address and maybe some sort of login, and for us that's not secure enough. Once companies start to mature in their cloud journey, they'll do a VPN connection: it still runs over the internet, but there's an IPsec tunnel between Amazon and yourself that encrypts all the data and hides it from the bad people.

24:56

And of course there's the big kid in the room, Direct Connect, and we strongly recommend it. It comes in different flavors: if your organization is looking at Direct Connect, they come in two major categories, as we put it — less than one gigabit, and one gigabit and over. It seems like one gigabit is just ten times a 100-megabit connection, right? It seems like it, but that's not the case. Anything less than one gig has only one virtual interface, and if you're a network nerd that's a very big deal, because right off the bat you need at least two. So if you buy 100-megabit connections, you'll need two of them, and that alone is more than the price of a single one-gigabit connection. A one-gigabit or ten-gigabit connection comes with 50 virtual interfaces. If you're wondering what a virtual interface is: every VPC that you create will need a virtual interface, and you'll also need a public virtual interface for public services like S3. Now, if you have Transit Gateway there are other topologies where you won't necessarily need more than a few, but you need at least two — so that's why we recommend one gigabit and up, and you also get the additional benefit of faster speed.

26:06

This is how we connect. Like I said, we are a global company with sites all around the world; I think combined we have about a hundred accounts and about 200 VPCs, give or take, and different workloads work with other workloads. We separate our accounts out into different projects and departments, so one department will have their account and their VPC, but they work very closely with another department and their VPC — they need to exchange data, but they may be in different regions. On our network we have a VPLS mesh that combines all of our sites, so any site can get to any VPC in any region. If we were to do this point-to-point, with a Direct Connect for each site, we would need over 30 Direct Connects — we don't want that mess. So instead of every site having its own connection to one of the regions that we use, they just connect to the mesh, and we have a colocation facility at Equinix where all of our network gear is, with a connection from the Equinix facility to Amazon. You can see it on the slide: all the sites at the bottom connect to the mesh, go to our customer cage, and a few hundred meters of fiber-optic cable run to the Amazon cage and from there into the Amazon network. You see the two purple icons at the top — one for the public endpoints and one for the VPCs — but we have many, many VPCs, so on the right-hand side you see a lot of those.

27:35

To bring it all together, this is what our architecture looks like, from on-premise all the way to Amazon. On the left-hand side you'll see our research scientists: they go to their bench, do their experiment, load some plates, go to the laboratory PC, do some preliminary analysis, what have you — and that's all they need to know about. From then on, all the automation kicks in. When they hit file-save, the data goes to the local file server, and for us that's a NAS device, a NetApp. We specifically picked NetApp — we didn't pick a Windows server or a Linux server — because NetApp supports multiple protocols on the same share or export. That's a huge win for us: we can have Apple, Linux, and Windows all working off the same directory. Those files are then picked up by our DataSync agent running on a VM, which moves them up with the aid of the AWS DataSync service, and they pop directly into S3. You could also use Storage Gateway. We use a NetApp because it's big — ours is over a petabyte, with multiple controllers, and it's redundant — but if you're only doing plate readers and HPLCs and you don't have the NGS workloads, you don't need a large petabyte system; you can get by with a Storage Gateway, which has a maximum size of 16 terabytes. If your workloads fit within that, you absolutely can do that — it's far easier and cheaper anyway.
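
As a rough sketch of the DataSync piece of this pipeline — the ARNs, hostnames, and bucket names below are placeholders, not the actual configuration — the task that copies an NFS export on the NAS into S3 can be defined with a few boto3 calls:

```python
import boto3

datasync = boto3.client("datasync")

# Source: an NFS export on the on-prem NAS, reached through a DataSync agent VM.
nfs_loc = datasync.create_location_nfs(
    ServerHostname="netapp.example.internal",        # placeholder
    Subdirectory="/export/lab_instruments",          # placeholder
    OnPremConfig={
        "AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0example"]
    },
)

# Destination: an S3 bucket, accessed via an IAM role DataSync can assume.
s3_loc = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-lab-raw-data",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/ExampleDataSyncRole"},
)

# The task ties the two together; executions can then be run on a schedule.
task = datasync.create_task(
    SourceLocationArn=nfs_loc["LocationArn"],
    DestinationLocationArn=s3_loc["LocationArn"],
    Name="lab-share-to-s3",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```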

29:05

Once your data gets to S3, it can kick off all these other workloads that we have: we have some users using Athena for queries, SageMaker if you're doing AI — all on the same bucket.
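
The "data lands in S3 and kicks off other workloads" pattern is usually wired up with S3 event notifications. The sketch below is a hypothetical Lambda handler — not Celgene's code — that reacts to a newly uploaded object and could, for example, tag it for the catalog or submit it to a processing job.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by an S3 ObjectCreated event notification."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Example downstream action: tag the object so it is discoverable by
        # the data catalog; a real pipeline might instead submit an AWS Batch
        # or SageMaker job here.
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "ingested", "Value": "true"}]},
        )
        print(json.dumps({"bucket": bucket, "key": key, "status": "tagged"}))
```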

29:21

And of course our CROs are at the top of the slide. We struggled in the beginning working with our CROs: they only wanted to FedEx hard drives to us. We said, "you should really go to S3," and now 80% of our CROs support S3 — they send data directly to our bucket and we're happy. We have a couple that are still resisting, but they support SFTP, so we don't have to maintain an NFS or FTP server for them: we say "here are your credentials," and they can automatically upload into S3 without having to be S3-aware. And with that, the next step will be Pascal, who will talk about digital pathology with the data that we have in S3.

30:02

Thank you, Lance. Good afternoon, my name is Pascal, and I'm the IT Director for Translational Research at Celgene, now BMS. The infrastructure that Lance just outlined, and that we built at Celgene, enables us to share information and data with our colleagues and collaborators all over the world, and one of the applications that takes great advantage of that infrastructure is our digital pathology.

30:37

Pathology is mainly concerned with phenotyping, or classifying, the tissue and cell types of patients. In the workflow that we have created, the samples are stained and labeled — you apply a color to highlight the tissue, or the presence of a molecule in a cell, in the image — and this assay development happens up front. The samples are then scanned, typically using a microscope, and the stains show up in different colors. We combine these colors into an image where we can investigate what has been labeled. Typically, multiple pathologists — experts in reading these images — are involved in analyzing the state of the sample, and therefore help determine the course of treatment for the patient, and depending on their availability they can be all over the country.

31:48

Not too long ago — less than two years ago — we would save these images on a disk drive and ship it to the pathologists, which would take about two weeks, and then we would have to create a session with them where they would all look at the images and analyze them collaboratively. With the infrastructure that we have now, we can bypass the shipping of hard drives: basically, directly from the microscope we upload the images to an S3 bucket, and we skip that step that took about two weeks — time that is critical for the patient. Pathologists and experts all over the country can now collaborate in real time, with special tools available to manipulate the images — not just to look at them, but to pan, crop, zoom in, annotate, and analyze them.

32:56

What I'm going to explain here is what we can do next. Now that the pathologists have these images, can we do more — for example, can we help them in predicting the phenotype of these tissues and cells? I must highlight that what I'm about to explain is a very simplified version of a larger effort at Celgene, where the scope of the analysis and research is much broader and deeper than I'm describing here, but the approach is kind of the same. What we want to do is this: on the top left you see a raw image. We want to find the cells in this image — we want to do a segmentation — and then, from the regions of these cells, we want to extract some features. We have about 130-plus features that we have defined for each of these cells, and we want to know if we can use them to train a model that can help us in phenotype prediction.

34:06

One approach is to use commercial software. Every vendor will ship you commercial software that helps with segmentation of cells in an image, but they take the middle of the road: they're not specific to the cell types you want to see. In general they overestimate the number of cells, and the cell boundaries are too wide. So what we want to know is: can we create our own process, based on machine learning, and adapt it to our cell types?

34:47

How could AI aid in the process? What we're looking at is natural object segmentation, where there is a pixel-wise assignment of a class to the pixels in order to identify the objects. On the left side you see a famous image with three object classes: one is the person, one is the bike, and one is the background. On the right side you see a different kind of image — a cell image — that we have fed through our algorithm, a pixel-wise classification algorithm (explained in much more detail in other sessions), to arrive at an output that outlines these cells.

35:33

Why do we want to use deep learning for image analysis? I thought this was kind of a fun slide to look at. If you look at the error rate in image analysis over the years: before 2012, before machine learning came onto the stage, it was all based on classical computer vision, and those methods had an error rate of about 20 to 30 percent. Once machine learning and deep learning were introduced, and we learned how to use and improve them, you see the error rates drop drastically, and around 2015 it started to become better than human classification. That's why we want to use it.

36:22

If you look at the images that we want to process, there are a few challenges we have to take into account. One is that there are so many different types of cells we want to image, and so many different experimental conditions, that it is hard to find training data — more specifically, to find a ground truth for these image types. There are now lots of teams around the world looking at this and publishing their efforts, so more and more ground truth is becoming available for these types of images. The pixel value variation — the gray values — is also a concern: these are 16-bit images, and you see cells ranging from the deep dark to the bright gray. Another typical issue with tumor cells is that they tend to cluster, so there is not enough background to segment them away from the background and identify individual cells. And the last thing we need to take into account is the sheer number of cells — as you can see here, there are thousands of them.

37:37

The first thing we tried was: what about all the models that are out there that have already been trained? We tried a Mask R-CNN — a region-based convolutional neural network — with a ResNet-50 backbone, and as you can see, the results are kind of disappointing: it doesn't find all the cells, and it cannot separate the cells that are touching. But that's actually not that surprising, because it wasn't trained to do this. So we had to do this ourselves.
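
For readers who want to reproduce the "try an off-the-shelf model first" experiment, here is a minimal PyTorch/torchvision sketch: it loads a COCO-pretrained Mask R-CNN with a ResNet-50 backbone and runs it on one image. This illustrates the approach, not the exact model or code used in the talk, and the image path is a placeholder.

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# COCO-pretrained Mask R-CNN with a ResNet-50 FPN backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Hypothetical path to a microscopy image, converted to 3-channel RGB.
img = Image.open("example_cell_image.tif").convert("RGB")
tensor = transforms.ToTensor()(img)

with torch.no_grad():
    prediction = model([tensor])[0]

# Keep only reasonably confident detections; as noted above, a model trained
# on natural images tends to miss cells and cannot split touching ones.
keep = prediction["scores"] > 0.5
masks = prediction["masks"][keep]   # (N, 1, H, W) soft masks
boxes = prediction["boxes"][keep]   # (N, 4) bounding boxes
print(f"detections kept: {int(keep.sum())}")
```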

38:08

The tools that we use: we work in SageMaker, and within SageMaker we use scikit-image, which is a Python-based image processing package; OpenCV, which is an open-source computer vision library of classical image analysis algorithms; and PyTorch, which is the Python machine learning package.

38:39

If we look at the segmentation, we divide it into two steps — again, this is a simplified approach. The first thing we apply is a semantic segmentation using U-Net, which is itself a CNN-based segmentation network for medical images; this separates the cells from the background, but it does not separate cells that are touching. The next step is cell detection using a Faster R-CNN algorithm, which, since it's region-based, draws bounding boxes around the objects. These two together we combine to create the mask of the cells.

39:33

This is the result of the segmentation. On the left side you see the image that we use in our example, and on the right side you see the result: yellow is cell, purple is background. The Faster R-CNN detection then draws boundaries around all these cells. That's hard to see, so if you zoom in a bit you see all the bounding boxes around the cells, and if you take a closer look you can see that it finds the cells that are touching — it separates them based on the characteristics of the model. These two outputs — the object mask and the bounding boxes — we combine using a marker-based watershed to find the actual outlines of the cells, and on the right side you see the result.

40:35

side you see the results there now if we

40:37

compare that to the vendor software

40:39

result on the left side and I don't know

40:43

which one is better so I asked ours our

40:45

scientist and if you zoom in a little

40:47

bit you see that our cells are more

40:52

uniformly shaped defend ourselves or are

40:56

bigger there's more of them so there's

40:58

more false positives

41:00

now sometimes cells aren't around so

41:03

it's not a good thing to find around all

41:05

the time but if you look at these

41:08

highlights in the red boxes you see the

41:12

difference between the two methods in

41:15

this case they find the same cells I

41:18

think that the cells that we find have a

41:20

better boundary so now that we have the

41:23

masks we're going to extract the

41:27

features based on these masks for these

41:30

cells and then feed that into our

41:32

machine learning to do a phenotype

41:35

prediction so the tools that we use

41:37

there are simple scanning sage maker we

41:41

use pure Python and scikit-learn

41:44

package and I'll go back to this image

41:50

this is actually our test image where we

41:54

have a population of cells of which some

41:56

of them are cd3 positive and we put in a

42:00

label so what we're asking the machine

42:02

learning to do is based on the features

42:06

that it created

42:07

point out which cells are silly three

42:09

positive so all these features are being

42:16

extracted for every region of the cell

42:18

that we find so this is a very sparse

42:21

image you have to remember what we just

42:24

showed in the previous slide where we

42:26

have many many many more

42:27

so 130-plus features for each of these
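
Per-cell features of this kind can be pulled out of a label image with scikit-image. Here is a small sketch; the property list is a short, illustrative subset, not the full 130-plus feature set from the talk.

```python
import numpy as np
import pandas as pd
from skimage.measure import regionprops_table


def extract_cell_features(labels: np.ndarray, intensity: np.ndarray) -> pd.DataFrame:
    """labels: integer label image (one label per cell).
    intensity: the original image, so per-cell intensity statistics can be computed."""
    features = regionprops_table(
        labels,
        intensity_image=intensity,
        properties=(
            "label", "area", "perimeter", "eccentricity", "solidity",
            "mean_intensity", "max_intensity", "min_intensity",
        ),
    )
    return pd.DataFrame(features)   # one row per segmented cell
```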

42:31

We fed those features into a number of machine learning algorithms and ranked them based on their accuracy compared to what a human would classify as CD3. This is by no means optimized, but you can see that the first four or five methods have an accuracy of 95 percent or more, which is amazing compared to what we had before with classical computer vision. The logistic regression algorithm seems to perform the best.
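
That ranking exercise maps directly onto a few lines of scikit-learn. The sketch below shows the idea; the candidate models and cross-validation setup are illustrative, not the exact ones used in the talk.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def rank_classifiers(X: np.ndarray, y: np.ndarray) -> list:
    """X: per-cell feature matrix (cells x features); y: 1 if CD3-positive."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200),
        "gradient_boosting": GradientBoostingClassifier(),
        "svm_rbf": SVC(),
    }
    scores = []
    for name, model in candidates.items():
        pipeline = make_pipeline(StandardScaler(), model)
        acc = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()
        scores.append((name, acc))
    # Highest mean cross-validated accuracy first.
    return sorted(scores, key=lambda item: item[1], reverse=True)
```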

43:13

The final step we did was to ask, using principal component analysis and t-SNE, which of these features actually contribute the most to that 95-plus percent accuracy. We rank-ordered them, and it turns out that the first 10 or so make the biggest contribution. So you can trade off the computational effort of calculating all of these features for every cell against a little bit of accuracy, and be much more effective.
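
One simple way to approximate this analysis — the talk mentions PCA and t-SNE; the sketch below substitutes permutation importance as one concrete way to rank features and then measures the accuracy/cost trade-off of keeping only the top-ranked ones. It assumes a feature matrix and labels are available and is not the exact analysis from the talk.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split


def top_feature_tradeoff(X: np.ndarray, y: np.ndarray, feature_names, k: int = 10):
    """Rank features by permutation importance, then compare accuracy
    using all features versus only the top-k."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    importance = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=0
    )
    order = np.argsort(importance.importances_mean)[::-1]
    top_k = order[:k]

    full_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    top_acc = cross_val_score(LogisticRegression(max_iter=1000), X[:, top_k], y, cv=5).mean()
    return [feature_names[i] for i in top_k], full_acc, top_acc
```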

44:01

So here we have it. We have all these images being delivered to pathologists anywhere in the world, who can access them and have tools to look at them. On top of that, we can annotate these images with regions of interest that say, "hey, this is a region that is highly positive for — in this case — CD3; you might want to take a look at that." I can see a future where the role of these algorithms becomes much more prominent, in that the pathologist is only involved if an issue has been detected, and for all the other analyses the machine results are used. So that's where we are at this point. I'm going to hand it over to Sam.

45:15

Thank you very much. I'd like to remind you of the other life sciences sessions we have, and there's a healthcare and life sciences lounge at the end of the hallway — take a left and we'll be there afterwards to answer questions as well. But we also have the opportunity to take questions right now: we have a microphone set up here, and I'll run around on this side of the room, so if anybody has any questions, we're happy to answer — raise your hand. There's a microphone right up here if you want to come up, just so everybody can hear you, if you don't mind.

45:57

Thank you. Hi, a quick question for Lance: the architecture didn't have databases at all. Are you using databases, Redshift or things like that? Because they play a role as well.

46:07

Yeah, we're a big proponent of database-as-a-service as well, so RDS. We have a number of workloads using RDS MySQL, a little bit of Oracle depending on what your recovery needs are, I think a couple of use cases on MariaDB, which is functionally equivalent to MySQL, and we also have a few workloads using DynamoDB. Within discovery we don't have any Redshift — I think our sales folks are using Redshift — but today's topic was primarily about getting the file data out of the labs. We are using some database services.

46:55

Any other questions? Okay. Like I said, I do encourage you — there is the healthcare and life sciences lounge; we'll be down there to answer any questions or just generally meet everybody. Once again, I really thank you all for coming and listening, and have a good evening.

47:12

[Applause]