Tuesday, July 31, 2018

Auto news on Youtube Aug 1 2018

>> Hello, everyone. Thanks for coming,

it's my pleasure to introduce Mark.

He is a Systems professor at the Technion.

He's going to tell us about his work on

General-purpose Application Offload to Near-Network Processors. Mark, welcome.

>> Hi, it's great to be here.

Apparently, the room was fully packed, so it's very nice.

So, let me, first of all,

tell you that it's

a challenge to give this talk

in front of the super-technical,

highly skilled audience because usually I give

this talk in front of students,

so I don't expect the level of

understanding that I get here.

So, I will be very quick on some slides.

Stop me or speed me up if I'm too slow.

I want to start with motivation,

and I shouldn't be spending too much time on this slide.

CPUs are slow, networks are fast, you don't have time to process packets at line speed, packets are small, the usual stuff.

You can do a fixed-function offload.

A fixed-function offload is not really

good because things are changing fast.

Actually, I'm preaching to the choir

here in the company that

deployed over a million FPGAs around the datacenter.

I shouldn't be telling this.

So, you're already convinced.

So, surprise surprise, there are things outside of

the Microsoft Research and Catapult

projects that do very similar things.

This is, in particular, Mellanox's FPGA-enabled NIC.

So, what you can see is there is a bump-in-the-wire,

right on the wire between the NIC

and the rest of the world.

Actually, the NIC is sitting on the PCI bus and it's a regular NIC.

It's really important for, in particular,

for Mellanox and other vendors not to change

the silicon on the NIC

itself because this is very expensive.

So, they basically have almost no changes on the NIC silicon, just really a bump on the wire that everything goes through, and you go and figure out how to use it.

So, again, this is a very easy-to-sell slide here in this audience; the first bullet, go and ask questions about it, right? I don't expect that.

But, on the second bullet, I think it's pretty clear that the industry is either following Microsoft, or Microsoft is somewhat leading the industry, I don't know, it's a chicken-and-egg thing.

Clearly, the technology trend is

toward the adoption of this kind of architecture.

So, the takeaway is this hardware is here,

it's becoming popular, it's

becoming widely available, what can we do with it?

What people usually do with it is pretty infrastructural stuff, the things that really hurt in terms of performance in datacenters: you want to offload infrastructure like virtual networking, or encryption, compression, et cetera.

So, these are the usual suspects.

I believe that Catapult is

being actively used for this specifically.

There is also another type of offload where you use this basically as an accelerator: you have FPGA-accelerated applications, and you just push them into the FPGA. It happily sits on the NIC, but it doesn't matter that it's on the NIC, you just don't use that fact; you can connect to this FPGA and just offload compute there.

So, things are pretty

expected and I think no surprises here,

you probably know it even better than me.

However, what we are trying to

think out loud here is whether it makes sense

to really take intensive applications

and push some of them and basically

accelerate them with the help of

this interesting hardware that is available.

Again, there is a subtle difference between what people have been doing with it before and what we are proposing here, because usually it was infrastructure, it was generic,

and I will actually highlight the main differences

that stemmed from the fact that

we are targeting general purpose applications here.

So, why move some of the application logic onto the NIC?

Again, people have already shown

that doing things in

hardware is much faster than doing that in software.

Apparently, it reduces a lot of I/O bottlenecks on the host side, and improves power efficiency.

So, all the goodies that FPGAs can do for you already,

but now in a single package that is really close to the NIC, and the NIC is really close to the host, and it's also going to be widely available.

So, why don't we try using it?

So, this talk is going to start with a high-level introduction, or a high-level overview, of what applications we think would be suitable for this kind of hardware.

Then, I will go and get a little bit more technical

with the actual solution

that we propose in order to support those applications,

design and implementation a little bit,

a little bit of evaluation,

and then a bigger picture which I

believe is very important to keep in

mind because we are not

dealing with just network cards in isolation.

We really have to think, in a broad perspective, about what these future accelerated systems will look like, and how this thing actually fits into that puzzle.

I would also invite you to ask questions along the way.

It would make it much easier because

my stack is pretty shallow.

So, I will not remember

what I told you before after the lecture.

Okay. So, let's start with a simple example.

The example is really engineered

to show the benefits of network application offloading,

and what I want to do here in

this example is to explain why I believe that

this particular type of architecture

of bump-in-the-wire NICs is

the best one that suits

the purpose of accelerating this type of workload.

Then, we'll see other applications

and type of applications that would be suitable.

Okay.

So, you have a very simple anomaly detection service. This anomaly detection service does a really simple thing: it receives some data, it invokes the classifier to check against the model whether the sensor reading, or some event, is normal or abnormal, and in the normal case it just drops it or records some statistics.

If it's abnormal, then it

alerts and probably updates the model as well.

So, this is a pretty common, I would say,

anomaly detection service, where

the classifier can be of any complexity as you wish.

From very simple, just averaging or thresholding, to very complex, doing some fancy classification like deep neural networks or whatnot.

So, now we want to accelerate this.

The question is which architecture,

or system architecture, would be the most suitable for accelerating this type of workload?

So, let's start with the normal, usual accelerators, GPUs or FPGAs that sit on the PCI bus. Let's try that. So, what we can use this for is to offload the compute part of this workload, of this application. In order

to do that basically what you do is you

put the classifier and probably the model

update logic on the accelerator,

but everything else is being handled on CPU.

So, the packets go through the PCIe bus

to the CPU first and then they get shuffled

into the accelerator and then

back and so you can clearly see that there is

a lot of stuff going on on the bus that is

totally redundant in the sense that

the packet gets moved back and forth from CPU

to the accelerator and back,

and so, it's a little bit of a mess.

In terms of efficiency, it's definitely not where we want to be, particularly if the classifier is pretty simple. So it would not be really good to push a classifier that is just a simple threshold into the accelerator in this case.

Okay, so, what are the other options? Another option is something that is commonly used or promoted in the NetFPGA community, I guess, where the accelerator is a stand-alone FPGA with a NIC,

and you can just offload some stuff in there

and use it as a separate machine if you will,

but just with FPGA.

So, in this case, what you see is offloading everything. This is one model, and it is clearly very complex, because you have to push the whole application: the 10 lines of code that you really want to accelerate and the other million lines that you don't want to accelerate but have to put on this thing anyway.

So, basically development is becoming a real pain,

and you can clearly see things like

Xeon Phi, where the claim to fame was that you just put

the whole application on

the accelerator and everything runs smoothly,

but it ran smoothly but very slowly,

and actually not so smoothly.

So, eventually this model is really hard

to work with from the development perspective.

Right, so, another case of a stand-alone accelerator is basically, although it is stand-alone, you still use Ethernet to connect to it and push computation and state.

It's very simple. It's very similar

to the PCI kind of model,

but it's just over the network,

and so here you clearly see that

there are problems again with

the extra network bandwidth

that gets consumed just without any reason.

Alright, so here comes the savior.

Here comes the bump-in-the-wire design

where the nice thing

about it is that the NIC is tightly coupled with the CPU,

but it's also tightly coupled with the accelerator.

So, because of this,

the cost of basically pushing data back and forth between

the accelerator and the NIC

is zero because they're really close together.

Now, what it really gives you is the ability to basically take the data in flight and process it on the accelerator, and then maybe even drop some of it without pushing it down to the host. So, clearly you can do some kind of filtering.

It's very efficient because it

basically piggybacks on network processing here,

and that is the key to

allowing very lightweight types of offloads, like even thresholding. That would really work well, because it's not that you would accelerate the threshold computation itself,

you would accelerate everything that includes

the network processing plus you would

piggyback on that for

threshold computation and then you will

filter quite a lot of traffic

because, again, we're assuming here that anomaly detection would usually encounter normal cases, normal events.

What is really important, though, is that most of the logic that you want to keep on the CPU can stay there, because again it's very close to the NIC.

So, if you want to do something rare like model updates, you can do it even quite frequently, and the CPU can trigger that without

wasting cycles or without

wasting too much time on

communicating with this accelerator.

So, this is just a template for one type of workload

that it seems to me would make

sense in case of this converged architecture.

Now, a very natural question to ask is whether this is the only workload. Building infrastructure for a single workload does not sound too appealing, so let's go and try to do some analysis of what kinds of workloads would really fit this architecture.

So one thing that makes sense is filtering,

and again what we can do here is to basically transform

the traffic coming in or going out in

a way where you compress,

you drop a lot of the packets or a lot of the traffic that comes in,

and you do the fast path processing on the NIC,

but then you do the heavy lifting for

the rare cases where you actually

need some handling of this stuff on CPU,

and you can think about data analytics workloads that

just do some statistical analysis on the data.

Caching, for example, is a classical example of this kind of filtering, and at the end of the talk I will show how to build a little memcached cache that does exactly that.

Another obvious example is data transformation.

You want to do serialization and deserialization. You want to batch things together and give them out to the application already in a packed form, but it can also be used for optimizing GPU and storage layouts.

So, you can really go from the NIC directly to GPU and

pack the data so that GPU would not need to parse it.

You can think about other different things like sampling, which is really bad for all these regular types of accelerators like GPUs, which cannot really do sampling properly because sampling involves very irregular memory access, so you lose a lot of efficiency. If you can do that on this architecture and already prepare the data and give it to the GPU, suddenly the GPU becomes more appealing for this type of workload.

Steering is another case where

you don't want to

rely on the infrastructural kind of steering

like RSS, in order to be aware of, let's say, the load on different processors.

So, it's like a load balancer internal to the machine,

but it's also application aware.

So, it can parse some of the data and figure out where to put the data in the machine, whether it's a GPU, another NIC, whatever you want.

So, this is steering,

and last but not least, generation. So, you can think about different use cases where you would want to offload basically packet generation work, or I would say network message generation cases; for example, in consensus you would want to offload that, and actually people did build a stand-alone box for consensus acceleration.

This might be a really nice case, because most of the consensus protocol would actually run close to the NIC, with only part of it on the CPU.

So, some weird cases would be handled by CPU,

but the most common would be handled by the NIC.

It could also be a MapReduce shuffle, for example. So, if you think about it, if you are doing word count, all that you need is to get some data from storage and push it up to the NIC, where it would do the mapping, a very easy streaming kind of mapping, and then according to the key it would send it to the right reducer without involving the CPU at all.
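To make that word-count shuffle concrete, here is a minimal C++ sketch of the streaming map-and-route step; the record handling, the fixed reducer set, and the reducer_for() hash routing are illustrative assumptions, not something specified in the talk.

    // Hypothetical sketch of the word-count map stage running on the NIC:
    // every word parsed out of the data pushed up from storage is routed to a
    // reducer chosen by a hash of the key, without involving the host CPU.
    #include <cstdint>
    #include <string>
    #include <functional>

    struct Reducer {
        // Placeholder: a real implementation would serialize (key, count)
        // and transmit it to this reducer's network endpoint.
        void send(const std::string& key, uint32_t count) { (void)key; (void)count; }
    };

    static Reducer reducers[16];   // assumed fixed set of reducer endpoints

    static Reducer& reducer_for(const std::string& key) {
        return reducers[std::hash<std::string>{}(key) % 16];   // route by key hash
    }

    // Called for every word extracted from the incoming byte stream.
    void map_word(const std::string& word) {
        reducer_for(word).send(word, 1);   // emit (word, 1) straight to its reducer
    }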

So, again, what you see here is four types of workloads; I'm not sure how completely orthogonal they are, they definitely have some overlap.

So, the classification is not super rigid, but I think that these four types of workloads already kind of make sense in terms of pursuing the idea of making it possible, and not super hard, to use this hardware for these types of workloads.

So, in the hope that I have convinced you that it makes sense, now let me tell you why it is not that easy.

So, just a reminder, I'm sure you know that this is how a regular application looks. So, you have the NIC, which is basically, to a degree, protected from the application by the network stack.

The application never really goes all the way to the NIC; it just talks to the socket, and the network stack goes and controls the NIC.

If you really want to do what we want to do

with offloading some parts of the application logic,

you really need to basically

push part of the application into the NIC,

so that now, the application actually

combines two components here,

one running in the NIC,

and another in CPU.

Apparently, data and control take very different paths, because

now the control of the application should stretch

all the way down to

the part of the application running in the NIC.

The data may also go directly between the application and

the NIC without passing through the network stack.

So, things definitely change here.

So, let's go one by one and see

what these changes imply

in terms of the system infrastructure.

Before I continue: you would imagine that this FPGA comes with some nice framework to develop applications for it.

Because vendors are probably very

interested in this kind of applications.

Well, surprise, surprise, they're not. They're not, because basically their killer app is infrastructure acceleration.

So, what they give you there is a very rudimentary set of tools or abstractions, if you can call them abstractions at all.

These are things that you can

use in order to implement infrastructure acceleration.

So, there are low-level interfaces, like ports to the other side, to the network or to the host, and there's a stream of bytes; you see raw bytes.

This is not like packets or L3, L4 level messages.

You also get some buffer management.

Very, very rudimentary, at the level of tokens and stuff like that.

So, really, really low-level stuff.

I'm sure you don't want to deal with that.

>> Clarification, Sir. Often you've

[inaudible] the application comes directly to

the FPGA over the PCI bus and certain memory and stuff.

So how will that change?

>> Right. So, you're right.

So, in this particular scenario,

I'm not saying that it's impossible.

It is actually possible and doable what you're asking.

Right. If I understand the question correctly.

So, you're asking what if the application wants to

update memory directly over PCI bus,

how does that come into this picture? Is that correct?

Yeah. So, when I'm looking at the control,

I'm not exactly specifying

how this control is being done.

Does it go through the NIC specifically,

which is the case for the architecture that we work with?

Or it goes through PCI?

But it doesn't really matter for

the purpose of the definition of interfaces.

I just want to know that there is a control and

it's not on the kind of data path if you will.

So, I will talk a little bit about.

>> [inaudible] can go straight from

the application to the FPGA, right?

>> Yeah.

>> You can basically send data back and forth through the infrastructure and memory.

>> Right.

>> It cuts through the network stack a bit.

>> That's correct, and this is what

we are going to do as well.

So, the picture is cutting through the network stack.

It doesn't necessarily go through the network stack.

>> [inaudible] the PCI or not?

>> Not yet, it will be in the second-generation.

But, again, for the purpose of the discussion,

what kind of things we want?

I think that this is less of an issue.

So, let's start by going into detail on the challenges here.

So, first of all,

we need to add some way for the CPU application to

interact with the offloaded part.

So, you have some state here and this state

definitely needs to be communicated

or updated together between this part and that part.

As I mentioned, there should be some network stack interaction, because some data can pass through or go over it. There is also an issue with the fact that the FPGA would want to send packets, but some of the tables, like the neighbor table, are actually on the network stack side.

So, how do you make sure that

this table makes it all the way into

the FPGA for this functionality to be available there?

There is application control.

Apparently, you want to start and stop and invoke the applications on this side,

on the FPGA side, which is

definitely something that is not available today.

Another interesting question is,

how do you program this thing?

Assuming that all the CPU side is being taken care of.

What kind of interfaces do you need inside the FPGA in order to support this kind of functionality?

So, today, there is no way to expose state in a proper way. There are no interfaces for that. You see only raw packets.

So, some kind of parsing is necessary, and support for higher-level protocols.

As I mentioned, there are only

very rudimentary low level protocols,

low-level interfaces for sending and receiving data

which also definitely makes

things really hard to operate.

But here comes the really interesting part.

So, the really interesting part is that usually your machine is not a single-application accelerator, or a single-application workload.

You have a bunch of applications.

You have a lot of processes.

Suddenly, when you want to allow

general purpose acceleration and

not infrastructure level acceleration,

you really have to provide CPU-ish kinds of protection and isolation, which on the CPU side are normally done by the operating system, but also inside the FPGA.

This becomes already interesting because there are

a lot of interesting issues that come up here.

What do isolation and protection really mean, and what does it mean for this kind of processing to be protected, and against what?

So, let's start with a very simple problem. The white application on the right side can easily get access to the state of the blue application on

the left side, and this is not good.

That shouldn't happen. That's pretty clear.

Another problem is that because the FPGA is completely ignorant of the applications on the CPU, all the offloaded parts will be able to see all the packets, which is absolutely unacceptable, because that messes up the basic premises of the network abstraction on the CPU.

But another interesting problem here is that the applications running on the NIC can actually monopolize it, because there's no rate-limiting that is enforced. As you can see in this architecture, the NIC is after the FPGA.

So, even the rate-limiting that might be available in

the NIC would not be actually applicable,

would not work for this kind of architecture.

But there are also other things, for example spoofing.

Think about it: now the FPGA may actually be able to inject packets belonging to, basically spoof packets to, another application,

effectively breaking the basic semantics

of this application as well.

So, you shouldn't receive packets that do not belong to you, but here this is easily possible.

Last but not least,

because this is a general-purpose offload, there are other applications that do not use the FPGA, and there are plenty of those, I believe, on those servers.

You don't want them to be affected

by something that is running on FPGA.

So, you really have to have this bypass over the whole FPGA logic, and you want to isolate it from the rest of the logic that is running there.

So essentially what you have here is

a bunch of requirements that are derived

from the sheer fact that all that you want to do is to

accelerate not the infrastructure

but the application itself.

And to handle that,

we need to introduce several new abstractions and, I would say, a runtime environment,

that helps us achieve these goals and basically

overcome those challenges both

on the CPU side and inside the NIC.

Now, in the whole talk, I kind of mix in and use the fact that it's an FPGA on the NIC, but practically

it can be any other architecture because I don't think

that whatever we are doing here

is specialized for the FPGA.

Whatever processor you would have on the NIC, you would still need to overcome those challenges that I outlined earlier.

>> The programming model will be very different.

>> The programming model inside the NIC would be

different but the services that are

expected on the NIC would be the same.

Okay so again I'm

not really saying that we are

proposing a programming model for the NIC.

This is kind of complementary if you think about it,

I have not said a word

about whether we program in Vivado HLS, Verilog, or OpenCL, I don't know,

your favorite FPGA programming stuff.

I mean whatever I said here was just about

the high level interfaces that I would need

and how you provide them, that's your choice.

So the key abstraction that we introduce here

is that of what we call ikernel.

So when we originally started thinking about it,

our first idea was, okay, let's use something like the GPU model: you just invoke the accelerator and you tear it down whenever you need to stop it.

But it turns out that because this is tightly coupled with the networking infrastructure, you really need to do something that is much more network-specific, in the spirit of network processing if you will.

So from that perspective, what we want to provide is to encapsulate the application code and state on the NIC through this abstraction, the ikernel, but what really makes it special is the way you use it.

The way you use it is that you take

your open socket that already communicates with

the rest of the system and you

attach this ikernel to this thing.

Okay. So this attachment process basically invokes logic that reroutes everything that was supposed to arrive at the socket, or to leave it, through this ikernel.

Now it is very similar

to the concept of extended Berkeley Packet Filters

to a degree and

actually there was a recent proposal to provide eBPF programs that attach to a socket and not at the infrastructure level.

So I am not claiming that this abstraction is

super super novel from

the perspective of the network community.

But definitely it is something that helps establish

a much tighter link between acceleration of

networking and accelerators in general.

>> Just a question. Does this mean

that every application running on

the iNIC must have, like, a counterpart in the OS?

>> Precisely.

>> So effectively that means you cannot

completely bypass the CPU.

>> No. Okay so completely bypass is

an interesting question that I

will also address in the end of this talk.

Completely, meaning you don't want the CPU even to initialize it?

Yes I think we need to have

some grounding on the CPU here in this abstraction.

The data path, that's a different story.

>> Sure.

>> All right so let me give you

kind of a glimpse into how it actually looks like.

I will introduce the interfaces that we use by example.

So the example is really simple.

This is anomaly detection service.

Really stupid one, right,

very very simple you have

a whole bunch of thermal sensors that send the data,

you want to know whether the temperature is

rising somewhere beyond a certain threshold,

above a certain threshold and if it does,

you want to fire some alarm or something.

So very, very simple.

And all the data that comes in,

you want to update some kind of counter that gives you

the average and the number of alerts and stuff like that.

You want to accelerate it on this iNIC

and with the use of ikernel and the idea is basically

that you really offload the threshold comparison to

the iNIC and all the alerts

are going to be handled by the CPU.

Makes sense, right? So what

I really want to show to you is the code.

It is not the exact code that we

have but it's very close.

So first of all you'll create the ikernel.

And this creation is a process that is kind of not defined in the specification, so it's do-it-how-you-wish. If it is an FPGA, it will just link some kind of UID from there to here.

But it will also

initialize a whole bunch of other things.

Then we have a set-parameter call with which we initialize the state in the ikernel.

Then we just create the socket, regular stuff.

Nothing unusual, and then you attach.

And so from the point you attach,

this is where the traffic starts flowing.

Okay. So you can ask all sorts of questions, like how do you make sure that some of the packets are not already in the CPU? There are all these kinds of bootstrapping issues.

Let's not talk about it for now.

Then from that point on,

what happens is that the host is going to

issue just regular POSIX calls. Nothing special.

But with a catch: the only packets, or the only messages, that it is going to receive are those that are passed on by the ikernel, and many others are going to be handled by the ikernel itself.

And therefore we know for sure that a message arriving over here is one that needs to trigger

that alert because all the normal messages

are handled by the ikernel.

So, after the alert is triggered, or actually even before that, you want to get the average temperature; that's where you retrieve some state from the iNIC before actually going and triggering the alert. Finally, you do the detach and destroy after the application is done.
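A minimal sketch of that host-side flow, with hypothetical ik_* calls standing in for the real API; the names, signatures, and command codes below are assumptions based on the description above, and the actual interface may differ.

    // Hypothetical host-side flow for the threshold ikernel.
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>
    #include <cstdint>
    #include <cstddef>

    // Assumed ikernel runtime API (declarations only, for illustration).
    typedef int ik_handle_t;
    extern ik_handle_t ik_create(const char* name);
    extern void ik_set_param(ik_handle_t ik, int param, uint32_t value);
    extern void ik_attach(ik_handle_t ik, int socket_fd);
    extern void ik_command(ik_handle_t ik, int cmd, void* out, size_t len);
    extern void ik_detach(ik_handle_t ik, int socket_fd);
    extern void ik_destroy(ik_handle_t ik);
    enum { PARAM_THRESHOLD = 0, CMD_GET_AVERAGE = 1 };

    int main() {
        ik_handle_t ik = ik_create("threshold_ikernel");   // instantiate on the NIC
        ik_set_param(ik, PARAM_THRESHOLD, 90);             // initialize offloaded state

        int sock = socket(AF_INET, SOCK_DGRAM, 0);         // regular UDP socket
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5000);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(sock, (struct sockaddr*)&addr, sizeof(addr));

        ik_attach(ik, sock);   // from here on, socket traffic is rerouted through the ikernel

        char msg[64];
        while (recv(sock, msg, sizeof(msg), 0) > 0) {      // only abnormal readings reach the host
            uint32_t avg = 0;
            ik_command(ik, CMD_GET_AVERAGE, &avg, sizeof(avg));   // RPC into the FPGA-side state
            printf("ALERT: running average %u\n", (unsigned)avg);
        }

        ik_detach(ik, sock);   // dynamic: could also detach and re-attach at any point
        ik_destroy(ik);
        close(sock);
        return 0;
    }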

So a couple of

important things to understand about this model.

First of all, it works dynamically.

So you can attach or

detach the ikernel whenever you want.

Then, you use just regular POSIX.

It's not the only way to communicate.

I will talk about data paths later.

But this is really the key advantage of

this architecture because it's really backward

compatible and it supports legacy code, if you wish.

So, if you just comment out

all the IK prefixed calls, it will just work.

There's nothing that requires special attention here.

Another interesting part here is

this ik command, which is basically an RPC call into the FPGA that allows you to invoke some logic that will eventually provide or retrieve the state, or will do some computation, but it is not a shared-memory interface.

It's done deliberately because, otherwise,

it's very hard to provide

high performance from the iKernel itself,

because it would have to go through memory that is shared between the CPU and the iNIC, and that is a serious issue in terms of performance. That memory is much slower.

>> So, where is the app itself? The kernel app itself.

>> Just hold, hold your breath.

>> In this model you always assume that

one packet will always

go to the CPU and the network stack.

>> No, no, no.

>> What do you do with TCP state and stuff like that?

>> So, I expected this question to come earlier

but we are going to support only UDP at this point.

Practicality of this approach is questionable.

It's limited, it's not questionable, it's limited.

But TCP is really,

really, really, really hard to do.

People have done TCP stack on FPGAs.

We haven't gotten to that point.

As far as I heard, Catapult also decided

to forego TCP and

build something on top of UDP

that is reliable and proprietary.

So, the complexity of that probably deserves a couple of other talks, but you're right.

For the type of protocols that we want to support,

we definitely want to push all the protocol

handling onto FPGA, for now.

We have some ideas how to combine this with

CPU but it's very hazy so far.

>> Just to clarify, so you're not supporting any TCP?

>> No, not yet.

TCP is a significant undertaking I would say.

So, let's look at the other side of the divide.

What happens on the iNIC?

What happens on the iNIC is three types of interfaces.

Some of them refer to the I/O channels,

they're basically the way to

communicate with the host and with the network.

Another is data access and protocol processing, with passing decisions on whether the packets or messages are going to be passed to the host or not.

Finally, the iKernel control and

state access which is going to

be implemented through RPC.

So, let me show very quickly

a little bit more detail on these interfaces.

The data input is coming from two virtual ports,

and why are they virtual?

Host and network, two sides.

But why are those virtual interfaces?

Those are virtual because they're actually

shared across all the iKernels,

as well as the bypass.

So, it's the responsibility of GAON to multiplex those in a way that would prevent abuse of these interfaces, denial of service, and stuff like that.

So, basically rate-limiting and multiplexing,

this is a shared resource that we need to handle.

Another is that apparently

the iKernel cannot observe

the data that belongs to other processes.

So, anything that is going to the socket and coming from it, these are the only packets that the iKernel is going to observe.

Now, there is this control interface

that allows us to drop packets,

to create new packets,

and to pass them to the other side effectively.

>> Can you do in-line modification of packets as well? Because you need that, right?

>> Yeah, sure. So, I'm going to be talking

about that next part.

So, basically what I'm going to provide here is

the interface that is at this point very FPGA specific,

or, I would say, not FPGA-specific, but very specific to a pipelined way of processing networking.

Because pipelining is the number one concern

in order to sustain line rate.

You cannot possibly do everything in a single cycle.

You really have to pipeline, by breaking the data coming in into what some of your colleagues actually called a flit in the ClickNP paper.

The flit is a piece of data that effectively matches the bus width, and it can be received all in one cycle, so you can actually process it in as many cycles as you want, as long as you can pipeline it, and then you will sustain line rate.

But this is something that is, in a way, dictated by the architecture: by breaking the data up like this, you process different flits in parallel, in a pipelined way. This is the interface that we provide.

The packets are parsed into metadata and data, so you can actually modify whatever you want, in the way you want, as well as drop, as is already clear from the previous slide.

Now, in terms of the state as I mentioned,

there is this RPC interface that really

helps to implement some of the functionality.

So, let's go in and show how it really works.

So, what kind of code do you write? The code that I'm showing here is a very, very cleaned-up version of the Vivado HLS implementation of this very simple kernel that I mentioned.

It's not too far from reality but it's not the reality.

You can see that it's an event-driven model; on every event the iKernel gets invoked, and so there is this code that runs on a flit, and so you get the flit itself, and then you figure out whether it's the first flit, so it's the beginning of the packet.

Basically, it gives you the location in a way.

Then, you have to parse the packet in the way that you want,

and we know that the first bytes of this packet are going

to contain the actual data.

This is going to be parsed here,

then there is this comparison,

and then you decide whether you drop

the packet or you pass it.

Very simple. In parallel, you update the counters.

What you can see here is that it is

very clear that you have to report

to the interfaces that you're going to drop this packet,

or you can go and,

I'm sorry, or you can go and pass it.

If you pass it, it's a different control path. In the next iteration of this you are going to be handling this part, which is basically writing the flit to the host side.
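A heavily simplified, hypothetical HLS-style version of that handler might look as follows; the flit_t layout, the action_t values, and the assumption that the sensor reading sits in the first four bytes of the already-parsed payload are illustrative, not the actual code.

    // Hypothetical HLS-style flit handler for the threshold ikernel.
    #include <ap_int.h>   // Vivado HLS arbitrary-precision types

    struct flit_t {
        ap_uint<256> data;    // one bus-width worth of payload (headers already parsed off)
        bool         first;   // set on the first flit of a packet
        bool         last;
    };

    enum action_t { DROP, PASS_TO_HOST };

    static ap_uint<32> threshold;              // set via the RPC/control interface
    static ap_uint<32> sum, count;             // running statistics, also exposed via RPC
    static action_t current_action = DROP;     // decision for the packet being processed

    // Invoked once per flit; pipelined so one flit can be accepted every cycle.
    action_t on_flit(const flit_t& flit) {
    #pragma HLS PIPELINE II=1
        if (flit.first) {
            ap_uint<32> reading = flit.data.range(31, 0);   // first bytes hold the sensor value
            sum   += reading;                               // update the counters in parallel
            count += 1;
            current_action = (reading > threshold) ? PASS_TO_HOST : DROP;
        }
        return current_action;   // later flits of the same packet follow the same decision
    }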

Make sense? Now, this

is the interface that allows

you to retrieve and update state,

and this is a generic interface that

is private to the iKernel.

You're going to be able to retrieve the state

or even perform some computation off the critical path.

There will be logic that will be

generated specifically for that,

and this logic is going to be accessing some of the state

that is shared by the

main data path logic in the iKernel.

But this is definitely not on the critical path,

this is a slow path.

You're not expected to

stream data in and out by using this thing.

In particular for this simple example,

in order to set threshold you go in and set it.

This is something that you can

implement any way you want.
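Continuing the same sketch, the off-the-critical-path RPC side could be as simple as this; the opcode numbers and the division used for the average are, again, just illustrative.

    // Hypothetical RPC/control handler: runs on the slow path and touches the
    // threshold/sum/count state shared with the data-path logic above.
    ap_uint<32> on_command(ap_uint<8> opcode, ap_uint<32> arg) {
        switch (opcode.to_uint()) {
        case 0:  threshold = arg; return 0;                       // set threshold
        case 1:  return (count != 0) ? ap_uint<32>(sum / count)   // get average reading
                                     : ap_uint<32>(0);
        default: return 0;
        }
    }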

All right, so the complete design looks like this.

There is one side that is over here,

it handles isolation, protection, rate-limiting, anti-spoofing, and also implements all the interfaces that I mentioned,

and there is this whole thing

that integrates with the network stack.

I will really briefly talk about the data path specifically, because you don't need the network stack in some cases; you can actually bypass it.

So, let's talk about data path.

This is the most interesting part, getting even a little bit more technical.

So, the first way to do data path as I mentioned,

is to support legacy through POSIX, that's easy.

But, there is another type of interface that

we provide which we call custom ring.

So, custom ring is something for advanced users.

If so far it was really simple, now it's for advanced users.

So, what you're going to do here is

you're going to create a special abstraction that would be somehow implemented in the NIC, and that would allow you to access a specific ring, both for TX and RX, inside the host.

That means that all the network stack

is completely bypassed at this point.

So, the only thing

that makes it very different from RDMA,

for example, is the fact that it's not

RDMA because it's a ring, right.

So, there is some kind of actual ring, but instead of network packets

that need to be parsed by the stack,

there will be actual application messages

that will just be stacked up in this ring.

This is really convenient to basically offload a lot of

the network processing off the critical path

on the host if it so happens that

the host is also loaded with packets,

but there are many uses for this if you think about it.

There are really plenty of uses and, therefore, I'm really, really looking forward to this functionality actually working.

So far, it's been pretty buggy.

But, once it's available, we can do a lot of things.

We can do actual batching, because now the application can get the messages in batch form, and we can do application-level steering; everything that I said I wanted to do, we will be able to do by the sheer fact that we have this custom ring abstraction.
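To give a feel for it, here is a small host-side sketch of such an RX ring drained in batches; the slot layout, sizes, and field names are assumptions for illustration, and the real custom ring may well be organized differently.

    // Hypothetical host-side view of the custom RX ring: the ikernel (producer)
    // deposits already-parsed application messages, and the host (consumer)
    // drains them in batches without going through the network stack.
    #include <atomic>
    #include <cstdint>
    #include <cstddef>

    struct ring_msg {                 // one application-level message slot
        uint32_t len;
        uint8_t  payload[60];
    };

    struct custom_ring {
        static const size_t SLOTS = 1024;
        ring_msg              slot[SLOTS];
        std::atomic<uint64_t> head;   // advanced by the ikernel (producer)
        std::atomic<uint64_t> tail;   // advanced by the host (consumer)
    };

    // Drain everything currently in the ring; returns how many messages were handled.
    size_t drain(custom_ring& r, void (*handle)(const ring_msg&)) {
        uint64_t tail = r.tail.load(std::memory_order_relaxed);
        uint64_t head = r.head.load(std::memory_order_acquire);   // see the producer's writes
        size_t n = 0;
        for (; tail != head; ++tail, ++n)
            handle(r.slot[tail % custom_ring::SLOTS]);            // process a whole batch
        r.tail.store(tail, std::memory_order_release);            // hand the slots back
        return n;
    }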

So, another interesting fact here, which is to a degree a downside of this bump-in-the-wire architecture, is something that turned out to be an interesting twist in the story.

So look at this: you have an FPGA, and the actual structure of the NIC looks like this. There is an FPGA with two physical interfaces that do the Ethernet and physical layer.

Then, there's a NIC and then there is a CPU.

What happens here on the way is that from here to here, it's basically regular, lossy Ethernet. There's nothing special about this interface for the iNIC kind of infrastructure, for the bump-in-the-wire.

What it really means is that packets can be dropped.

Now, it's totally legitimate to drop

packets in regular Ethernet,

as you can imagine.

But, the problem, yes sir?

>> I'm sorry for interrupting. So, [inaudible] that link between the CPU and the NIC, it's not PCIe anymore?

>> Right. It is PCIe, but at the hardware level you send packets reliably; it's not RDMA. So, there is a protocol at the higher levels of the stack. You have Ethernet there.

>> We've got [inaudible] scheme running on the PCIE that will do

the flow control and ensure that the link is lossless.

>> Right. So, the PCIe link is lossless. What is lossy is the fact that the host may not be able to keep up with processing the data coming from the NIC.

Again, ignore this part,

ignore this thing; there's nothing special, it's a regular NIC.

If you just send packets and the host

cannot keep up, it will start dropping.

That's exactly what's going to happen.

>> The checksum can be wrong.

>> What?

>> The checksum [inaudible].

>> You know, the checksum being wrong is a really rare occasion,

but really what happens is that

you're firing 40 gig at the host,

you know, it will not sustain.

So, the buffers will get overloaded

and the story is over. It will start dropping.

But, here's the catch,

the packets can be dropped,

but suddenly this whole nice abstraction of extending application logic into the NIC breaks, because suddenly there is an unreliable link between one function in the application and another function in the application.

That really is not good.

So, when we figured that out,

it was too late to fix it properly,

but it turns out that if you think hard, you still can work around it for most of the applications that we were interested in at this first stage.

But it is very clear that we need to implement proper flow control between the iNIC itself and the host side of the application.

Because otherwise, some applications may just not work.

So, for example, in the application where you have

this number of alerts,

you count the number of alerts.

If there are too many alerts and the host starts dropping, this counter will be much higher than the actual number of alerts that the host was able to handle.

All right. So, this is kind of

a nitty-gritty detail. Yes.

>> I'm interested. Are you saying you believe this is fundamentally the right way to construct this? Or are you saying this is essentially a bug in the fact that the FPGA is currently too far from the CPU? There's two ways of addressing this. One is to say the current thing, with the FPGA essentially on the far side of the NIC, is a temporary hack that we do right now, where we basically have to re-digitize [inaudible] the level of processing in the FPGA and then we lower it again on the real [inaudible]. This is a broken hack, and the right thing to do is to say, well, the NIC really ought to have, if you like, a co-processor interface at the right point in the NIC, and then the FPGA should be plugged into that co-processor interface. Then we don't have all this stupid raising and lowering of the data as it goes through the abstractions, and therefore we wouldn't have this problem. Or maybe you were saying, I guess, that the right point for that co-processor interface in the NIC is actually still at a point at which you might have to deal more cleverly with flow control between that processing and the CPU. So, where are you on this?

>> So, it's a complicated question,

some of it I would be happy to take

offline and we've been thinking about it a lot.

One answer is that the newer generations of the NICs, and also Catapult as a matter of fact, already allow a direct link from the FPGA to the CPU through PCI.

So, effectively, you're not under

the spell of network interface at all.

So, you can really have real flow control there.

Is it a good idea for data path?

I don't know. Quite frankly,

so far we figured

that the ability for the CPU to drop packets is

actually essential to build

applications with high performance.

So, if you would want to do this

back pressure all the way down to the network,

because some application gets stuck,

it's not really clear that this is the best solution in the grand scheme of things.

>> Why isn't dropping packets the

first thing you push in the FPGA?

>> Okay, that's a good point because

the FPGA might not really know that it needs to drop packets because the host is already overloaded.

So, this is a subtle kind of question of flow control,

how far it should go.

Should it affect the FPGA as well or should it affect

only when the FPGA wants to talk to the host?

Again, this is a really elaborate discussion that probably deserves a little bit longer treatment.

>> To elaborate on that: if you operate the link between the FPGA and the NIC at a slightly lower rate than the link between the NIC and the CPU, the FPGA will always know when to drop packets. If [inaudible] from the NIC to the CPU is 40 gigs, the queue will always be empty at the FPGA, and it should be possible to always know when to drop.

>> I'm not saying that I cannot know when to drop.

That's not the question.

The question, I think, is whether I should be dropping earlier, on the FPGA, or I should be dropping on the host,

or maybe the fact that FPGA

would sit in the right place on

the NIC would alleviate this problem.

Again, the answer is about this raising and lowering up and down the stack.

It's not clear that the solution of putting

the FPGA after the NIC

is going to alleviate this problem.

For example, the NIC mostly does not support TCP.

Okay. So, RDMA, maybe it's a different story.

But again, there are a lot of things that could be done

differently, in a more efficient way, if the FPGA would sit at the right point,

but I'm not sure that

this particular question would be addressed in this case.

Okay. So again, let me rush through

the evaluation part because I think

that's what maybe is a little bit of interest after all.

So, we ran a bunch of applications. The first three are microbenchmarks: echo, a traffic generator that just does bursts of network sends. Then we have this look-through memcached cache and top-K heavy hitters implemented on the NIC.

So, I want to talk about

memcached because this is the most elaborate,

I think, example and the most interesting one.

So, I'll skip that and we'll get to memcached.

So, the memcached thing is actually a humongously large cache, as you can imagine: 4K keys, with keys and values of 10 bytes each.

So, this is certainly a really,

really prototyper's thing just to

understand how this thing can work.

Eventually, we're not claiming

that this is the real implementation.

So, the idea is that this is a look-through cache.

It's a very simple implementation as I mentioned,

really. Hash table, basically.

It's populated upon GET,

and it's finally invalidated upon SET

and the reason for this is that if

your set requests get dropped then your cache in

the NIC would be out of sync with those with the host.

Okay. So therefore, we definitely need to be consistent with what the host has, and this possibility of drops is basically the reason why it works this way.

So, correctness is guaranteed

in terms of the consistency of the cache with the host.

Okay. So, just one important point here to understand.

The performance results that we got for zero hit rate

and 100% hit rate are remarkably different.

Alright. So, the host is

a really significant bottleneck here.

So, whatever the host needs to handle, it will be 50 times slower than what the NIC can do at basically line rate.

This is only 70% of line rate because there is

some bug in the shell that does

not allow us to generate packets quickly.

But fundamentally, the throughput here is bounded by this formula involving the CPU throughput, because when the CPU becomes overloaded it starts dropping, and at that point it doesn't matter whether the NIC is fast or not, the packet gets dropped.

So, you actually have to apply back pressure at the application level.
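The formula on the slide isn't readable from the transcript, but the argument above gives a bound of roughly this shape for a look-through cache, where T_NIC is the rate at which the NIC serves hits, T_CPU the rate at which the host serves misses, and h the hit rate (a reconstruction, not necessarily the exact formula shown):

    T \;\le\; \min\!\left( T_{\mathrm{NIC}},\ \frac{T_{\mathrm{CPU}}}{1 - h} \right)

The (1 - h) fraction of requests that miss has to be handled by the CPU, so once that stream exceeds T_CPU the host starts dropping and the end-to-end throughput saturates at about T_CPU / (1 - h).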

So, this is the theoretical curve here that says how this should look. The measured one is apparently very similar, because that's how we regulate the load generator.

Okay. So, it should not come as

a surprise and then the question becomes, "Okay,

you have very fast cache

but it's pretty tiny and what can you possibly do?"

So, here, apparently, first of all,

there are FPGAs and NICs with large memory,

even large fast memory.

But also, Zipf is a really nice thing. The skewed distribution of accesses is going to allow us to get by with a really small cache. With a cache of size one percent, we would be able to get a 50% or 60% hit rate,

which is nice and it still

gives us reasonable speed up here.
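As a rough sanity check of that claim (a back-of-the-envelope estimate, not numbers from the talk): under a Zipf distribution with exponent close to 1, caching the k most popular of N keys gives a hit rate of about

    \frac{H_k}{H_N} \approx \frac{\ln k + \gamma}{\ln N + \gamma}

so with, say, N = 10^6 keys, a one-percent cache (k = 10^4) gives roughly 9.8 / 14.4, about 0.68, which is in the same ballpark as the 50-60% figure.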

Okay. So, I'm going to skip to the lessons

learned and really three slides

maybe, for the bigger picture.

Sorry for taking two minutes more. I'm sorry about that.

So, what are the lessons learned?

Expected lessons for software people

doing some hardware stuff.

Hardware development sucks to the degree that it's just

insane and it burns out students,

professors, and it's really, really,

really, really not convenient.

I think it's an opportunity for

those of us who actually got into

this to introduce something new and useful.

That's what we've done with our debug interface.

I didn't have time to talk about it.

But the bigger picture is,

the world is changing and there will be a lot of programmable stuff lying around in your machine; programmable NICs are just one of those,

but there are also GPUs and a bunch of others.

Although they're all nice and programmable, eventually,

you are faced with a really, really,

really tough challenge which is

the programmability wall, right?

Go and program the system with a bunch of accelerators,

each one having its own constraints and low-level programming frameworks and whatnot, without any abstraction.

So, doing this is really, really tough.

So, the goal that we set out

to achieve several years ago was to break this wall.

Okay. The way we are going to break this wall is, we're going to break out of the CPU-centric operating system architecture in a way that would allow us to get rid of the CPU being the one and only point of contact in order to do operating system stuff, do I/O, and get operating system services.

With the idea of building

an Accelerator-Centric Operating System

with the operating system services actually spread

out over all these programmable devices.

We have done a little bit of work on that for GPUs,

with the file system and networking

and RDMA support from GPUs.

We are now doing that with NICs and

the next frontier is storage, and then basically,

the idea is to build

a coherent architecture that would allow something like a unikernel, with different library OSes running on each one of those and providing a coherent view of the state of

the system without necessarily

having the CPU involved and all that.

So, this is the bigger picture.

This is where we're heading to.

With that, you can

easily derive the future work that we have.

Apparently, some of it is to improve what we already have, but there are also other things; in particular, the last bullet is something that is of particular interest here. It's the use of the NIC to communicate with and to invoke tasks on other devices, and vice versa.

So with that, I would like to thank you.

GAON: General-purpose Application Offload to Near-Network Processors (Duration: 1:00:03)
