>> Hello, everyone. Thanks for coming,
it's my pleasure to introduce Mark.
He is a Systems Professor at the Technion.
He's going to tell us about his work on
General Application Offload to
Near-network Processors. Mark, welcome.
>> Hi, it's great to be here.
Apparently, the room was fully packed, so it's very nice.
So, let me, first of all,
tell you that it's
a challenge to give this talk
in front of the super-technical,
highly skilled audience because usually I give
this talk in front of students,
so I don't expect the level of
understanding that I get here.
So, I will be very quick on some slides.
Stop me or speed me up if I'm too slow.
I want to start with motivation,
and I shouldn't be spending too much time on this slide.
CPUs are slow, networks are fast,
you don't have time to process
packets up to line speed,
packets are small, the usual stuff.
You can do a fixed-function offload.
A fixed-function offload is not really
good because things are changing fast.
Actually, I'm preaching to the choir
here in the company that
deployed over a million FPGAs around the datacenter.
I shouldn't be telling this.
So, you're already convinced.
So, surprise surprise, there are things outside of
the Microsoft Research and Catapult
projects that do very similar things.
This is, in particular,
what Mellanox calls an "FPGA-enabled NIC."
So, what you can see is there is a bump-in-the-wire,
right on the wire between the NIC
and the rest of the world.
Actually, the NIC is sitting on
the PCI bus and it's a regular NIC.
It's really important for, in particular,
for Mellanox and other vendors not to change
the silicon on the NIC
itself because this is very expensive.
So, they basically have almost no changes to the NIC silicon,
just really a bump on the wire
that everything goes through,
and you go and figure out how to use it.
So, again, this is a very
easy to sell slide here in this audience,
the first bullet, you could go and ask questions about it,
but I don't expect that.
But, in the second bullet,
it turns out that I think,
it's pretty clear that
the industry is either following Microsoft,
or Microsoft is somewhat leading the industry,
I don't know, it's a chicken-and-egg thing.
Clearly, the technology trend is
toward the adoption of this kind of architecture.
So, the takeaway is this hardware is here,
it's becoming popular, it's
becoming widely available, what can we do with it?
What people usually do with it is
pretty infrastructural stuff, the things that really
hurt in terms of performance in datacenters:
you want to offload
infrastructural stuff like virtual networking,
or encryption, compression, et cetera.
So, this is our usual suspects.
I believe that Catapult is
being actively used for this specifically.
There is also another type of offload
where you use this basically as an accelerator,
so you have FPGA-accelerated applications,
you just push them into the FPGA;
it happily sits on the NIC,
but it doesn't matter that it's on the NIC,
you just don't use that fact,
you can connect to
this FPGA and just offload compute there.
So, things are pretty
expected and I think no surprises here,
you probably know it even better than me.
However, what we are trying to
think out loud about here is whether it makes sense
to really take network-intensive applications
and push some of them, basically
accelerate them, with the help of
this interesting hardware that is available.
Again, there is
a subtle difference between what
people have been doing with it before
and what we are proposing here,
because usually it was infrastructure, it was generic,
and I will actually highlight the main differences
that stem from the fact that
we are targeting general-purpose applications here.
So, why move some of
the application logic onto the NIC?
Again, people have already shown
that doing things in
hardware is much faster than doing that in software.
Apparently, it reduces a lot of
I/O bottlenecks on the host side, and how about power efficiency?
So, all the goodies that FPGAs can do for you already,
but now, in a single package,
that it's really close to the NIC
and NIC is really close to the host,
and it's also going to be available.
So, why don't we try using it?
So, this talk is
going to start with a high-level introduction,
or high-level overview, of what applications
we think would be suitable for this kind of hardware.
Then, I will go and get a little bit more technical
with the actual solution
that we propose in order to support those applications,
design and implementation a little bit,
a little bit of evaluation,
and then a bigger picture which I
believe is very important to keep in
mind because we are not
dealing with just network cards in isolation.
We really have to think from a broad perspective
about what these future accelerated systems will look like,
and how this thing actually fits into that puzzle.
I would also invite you to ask questions along the way.
It would make it much easier because
my stack is pretty shallow.
So, I will not remember
what I told you before after the lecture.
Okay. So, let's start with a simple example.
The example is really engineered
to show the benefits of network application offloading,
and what I want to do here in
this example is to explain why I believe that
this particular type of architecture
of bump-in-the-wire NICs is
the best one that suits
the purpose of accelerating this type of workload.
Then, we'll see other applications
and type of applications that would be suitable.
Okay.
So, you have a very simple anomaly detection service.
This anomaly detection service does a really simple thing:
it receives some data,
it invokes the classifier to check against the model
whether the sensor reading,
or some event, is normal or abnormal,
and in the normal case it just
drops it or records some of the statistics.
If it's abnormal, then it
alerts and probably updates the model as well.
So, this is a pretty common, I would say,
anomaly detection service, where
the classifier can be of any complexity as you wish.
From very simple, just
averaging or thresholding, to very complex,
doing some fancy classification
like deep neural networks or whatnot.
So, now we want to accelerate this.
The question is which architecture,
or system architecture, would be the most
suitable for acceleration of this type of workload.
So, let's start with the normal,
usual accelerators: GPUs or FPGAs that sit on the PCI bus.
Let's try that. So, what we can use this for is to
offload the compute part of this application.
In order to do that, basically what you do is you
put the classifier and probably the model
update logic on the accelerator,
but everything else is handled on the CPU.
So, the packets go through the PCIe bus
to the CPU first, and then they get shuffled
into the accelerator and then
back, and so you can clearly see that there is
a lot of stuff going on on the bus that is
totally redundant, in the sense that
the packet gets moved back and forth from the CPU
to the accelerator and back,
and so it's a little bit of a mess.
In terms of efficiency, it's definitely not
where we want to be, particularly
if the classifier is pretty simple.
So it would not be really good to push a classifier
that is just a simple threshold into
the accelerator in this case.
Okay, so, what are the other options?
Another option is something that is commonly
used or promoted in the NetFPGA community, I guess,
where the accelerator is a stand-alone FPGA with a NIC,
and you can just offload some stuff in there
and use it as a separate machine if you will,
but just with an FPGA.
So, in this case,
what you see is offloading everything.
This is one model, and
this is clearly very complex, because you want
really to push the whole application:
the 10 lines of code that you really want to
accelerate and another million lines that you don't
need to accelerate but have to put on this thing.
So, basically, development is becoming a real pain,
and you can clearly see things like
Xeon Phi, where the claim to
fame was that you just put
the whole application on
the accelerator and everything runs smoothly,
but it ran smoothly but very slowly,
and actually not so smoothly.
So, eventually this model is really hard
to work with from the development perspective.
Right, so, another case
of a stand-alone accelerator is that,
although it is stand-alone,
you still use Ethernet to
connect to it and push computation and state.
It's very simple. It's very similar
to the PCI kind of model,
but it is just over the network,
and so here you clearly see that
there are problems again with
the extra network bandwidth
that gets consumed without any reason.
Alright, so here comes the savior.
Here comes the bump-in-the-wire design
where the nice thing
about it is that the NIC is tightly coupled with the CPU,
but it's also tightly coupled with the accelerator.
So, because of this,
the cost of basically pushing data back and forth between
the accelerator and the NIC
is zero because they're really close together.
Now, what it really gives you is the ability
to basically take the data on the fly
and process it on the accelerator, and
then maybe even drop some of it
without pushing it down to the host.
So, it's clear that you can do some kind of filtering.
It's very efficient because it
basically piggybacks on network processing here,
and that is the key that
allows very lightweight types of
offload, even thresholding.
That would really work well, because it's
not that you accelerate
the threshold computation itself;
you accelerate everything that includes
the network processing, plus you
piggyback on that for
the threshold computation, and then you
filter quite a lot of traffic,
because again, we're assuming here that
anomaly detection would
usually encounter normal cases, normal events.
What is really important, though, is that
most of the logic that you want to keep on
the CPU can stay there, because again, it's very close to the NIC.
So, if you want to do something rare
like model updates, you can do it
even quite frequently, and the CPU can trigger that without
wasting cycles or without
wasting too much time on
communicating with this accelerator.
So, this is just a template for one type of workload
that it seems to me would make
sense in case of this converged architecture.
Now, a very natural question
to ask is whether this is the only workload.
Building infrastructure for a single workload
does not sound too appealing, so
let's go and try to do some analysis of what kinds
of workloads would really fit this architecture.
So one thing that makes sense is filtering,
and again, what we can do here is basically transform
the traffic coming in or going out in
a way where you compress it,
you drop a lot of
the packets or a lot of the traffic that comes in,
and you do the fast-path processing on the NIC,
but then you do the heavy lifting for
the rare cases where you actually
need some handling of this stuff on the CPU.
You can think about data analytics workloads that
just do some statistical analysis on the data.
Caching, for example, is
a classical example of
this kind of filtering, and
at the end of the talk I will show how to
build a little memcached cache that does exactly that.
Another obvious example is data transformation.
You want to do serialization and deserialization.
You want to batch things
together and give them
out to the application already in a packed form,
but it also can be used for
optimized GPU and storage layout.
So, you can really go from the NIC directly to GPU and
pack the data so that GPU would not need to parse it.
You can think about all other different things
like sampling that is
really bad for all this regular type of
accelerators like GPU that cannot
really do sampling properly because sampling
involves very irregular memory access so you
kind of lose a lot of efficiency
if you can do that on this architecture
and already prepared the data and give it to the GPU,
suddenly GPU becomes more appealing
for this type of workloads.
Steering is another case, where
you don't want to
rely on the infrastructural kind of steering
like RSS in order to
be aware of, let's say, the load on different processors.
So, it's like a load balancer internal to the machine
if you think about it,
but it's also application-aware.
So, it can parse some of the data and figure out
where to put the data in the machine, whether it's a GPU,
another NIC, whatever you want.
So, this is steering,
and last but not least, generation.
So, you can think about
different use cases where you would want to offload
some basically packet generation work,
or I would say network message generation cases;
for example, in consensus you would
want to offload that, and actually
people built a stand-alone box for
consensus acceleration.
This might be a really nice case,
because most of the consensus
would run close to the NIC, with part of it on the CPU.
So, some weird cases would be handled by the CPU,
but the most common would be handled by the NIC.
It could also be a MapReduce shuffle, for example.
So, if you think about it,
if you are doing word count, all that you need
is to get some data from storage
and kind of push it up
to the NIC, where it would do the mapping,
a very easy streaming-like mapping, and then, according to
the key, it would send it to
the right reducer without involving the CPU at all.
So, again, what you see here is
four types; I'm not sure
how completely orthogonal they are,
they definitely have some overlap.
So, the classification is not super rigid,
but I think that these four types of workloads
already kind of make sense
in terms of trying to pursue the idea of
making it possible and not super
hard to use this hardware for these types of workloads.
So, in hope that I have convinced you that it
makes sense now let me tell you why it is not that easy.
So, just a reminder, I'm sure you
know that this is how a regular application looks.
So, you have the NIC,
which is basically, to a degree,
protected from the application by the network stack.
The application never really goes all the way to the NIC;
it just talks to
the network stack, and the network stack goes and controls the NIC.
If you really want to do what we want to do
with offloading some parts of the application logic,
you really need to basically
push part of the application into the NIC,
so that now, the application actually
combines two components here,
one running in the NIC,
and another in CPU.
Apparently, data and control
take very different paths, because
now the control of the application should stretch
all the way down to
the part of the application running in the NIC.
The data may go also directly between the application and
the NIC without passing through the network stack.
So, things definitely change here.
So, let's go one by one and see
what these changes imply
in terms of the system infrastructure.
Before I continue:
you would imagine that this FPGA
comes with some nice framework
to develop applications in it,
because vendors are probably very
interested in this kind of application.
Well, surprise, surprise they're not.
They're not, because basically
their killer app is infrastructure acceleration.
So, what they give there is
a very rudimentary set of tools or abstractions,
if you can call these abstractions at all.
These are things that you can
use in order to implement infrastructure acceleration.
So, there are low-level interfaces,
like ports to the other side, to the network
or the host, and there's a stream of bytes;
you see raw bytes.
These are not packets or L3/L4-level messages.
You also get some buffer management,
very, very rudimentary, at the level
of tokens and stuff like that.
So, really, really low-level stuff.
I'm sure you don't want to deal with that.
>> A clarification, sir. Often you've
[inaudible] the application talks directly to
the FPGA over the PCI bus and shared memory and stuff.
So how will that change?
>> Right. So, you're right.
So, in this particular scenario,
I'm not saying that it's impossible.
It is actually possible and doable what you're asking.
Right. If I understand the question correctly.
So, you're asking what if the application wants to
update memory directly over PCI bus,
how does that come into this picture? Is that correct?
Yeah. So, when I'm looking at the control,
I'm not exactly specifying
how this control is being done.
Does it go through the NIC specifically,
which is the case for the architecture that we work with?
Or it goes through PCI?
But it doesn't really matter for
the purpose of the definition of interfaces.
I just want to know that there is a control and
it's not on the kind of data path if you will.
So, I will talk a little bit about that.
>> [inaudible] can go straight from
the application to the FPGA, right?
>> Yeah.
>> You can basically send data back
and forth between the infrastructure and memory.
>> Right.
>> Take us through the network stack a bit.
>> That's correct, and this is what
we are going to do as well.
So, in the picture it's cutting through the network stack;
it doesn't necessarily go through the network stack.
>> [inaudible] the PCI or not?
>> Not yet, it will be in the second-generation.
But, again, for the purpose of the discussion,
what kind of things we want?
I think that this is less of an issue.
So, let's start with the kind
of going in detail over the challenges here.
So, first of all,
we need to add some way for the CPU application to
interact with the offloaded part.
So, you have some state here and this state
definitely needs to be communicated
or updated together between this part and that part.
As I mentioned, there should be
some network stack interaction, because
some data can pass through it or
go over it, and there is also an issue
with the fact that the FPGA would want to send packets,
but some of the tables, like
the neighbor table, are actually
on the network stack's side.
So, how do you make sure that
this table makes it all the way into
the FPGA for this functionality to be available there?
There is application control.
Apparently, you want to start and stop and invoke
the applications on this side,
on the FPGA side, which is
definitely something that is not available today.
Another interesting question is,
how do you program this thing?
Assuming that all the CPU side is being taken care of.
What kind of interfaces you need
inside the FPGA in
order to support this kind of functionality?
So, today, there is no way to expose state in a proper way.
There are no interfaces for that.
You see only raw packets.
So, some kind of parsing is
necessary, as is support for high-level protocols.
As I mentioned, there are only
very rudimentary low level protocols,
low-level interfaces for sending and receiving data
which also definitely makes
things really hard to operate.
But here comes the really interesting part.
So, the really interesting part is that usually
your machine is not a single-application accelerator,
or running a single-application workload.
You have a bunch of applications.
You have a lot of processes.
Suddenly, when you want to allow
general-purpose acceleration and
not infrastructure-level acceleration,
you really have to provide
a CPU-ish kind of protection and isolation,
which on the CPU side is normally done by
the operating system, but now also inside the FPGA.
This becomes interesting because there are
a lot of interesting issues that come up here.
What do isolation and protection really mean,
and what does it mean for this kind of
processing to be protected, and against what?
So, let's start with a very simple problem.
The white application on the right side
can easily get access to the state of
the blue application on
the left side, and this is not good.
That shouldn't happen. That's pretty clear.
Another problem is that, because
the FPGA is completely ignorant
of all the applications on the CPU,
it's clear that all the applications,
all the offloaded parts, will be able to see
all the packets, which is absolutely
unacceptable, apparently, because that messes up
the basic premises of the network abstraction on the CPU.
But another interesting problem here is that
the applications running on
the NIC can actually monopolize it,
because there is no rate-limiting that is enforced.
As you can see in this architecture,
the NIC is after the FPGA.
So, even the rate-limiting that might be available in
the NIC would not actually be applicable;
it would not work for this kind of architecture.
But there are also other things, for example spoofing.
Think about it: now the FPGA may actually be
able to inject packets belonging to,
basically spoof packets to, another application,
effectively breaking the basic semantics
of that application as well.
You shouldn't receive packets
that do not belong to you.
So, this can easily happen here.
Last but not least,
because these are general-purpose offloads,
there will be other applications
that do not use the FPGA,
and there are plenty of those, I believe, on these servers.
You don't want them to be affected
by something that is running on the FPGA.
So, you really have to have this bypass
over this whole FPGA logic, and you want to
isolate it from the rest
of the logic that is running there.
So, essentially, what you have here is
a bunch of requirements that are derived
from the sheer fact that all that you want to do is to
accelerate not the infrastructure
but the application itself.
To handle that,
we need to introduce
several new abstractions and,
I would say, a new runtime environment
that helps us achieve these goals and basically
overcome those challenges, both
on the CPU side and inside the NIC.
Now, throughout the whole talk,
I kind of rely on the fact that it's an FPGA on
the NIC, but practically
it can be any other architecture, because I don't think
that what we are doing here
is specialized to FPGAs.
Whatever processor you have on
the NIC, you would still need to
overcome those challenges that I outlined earlier.
>> The programming model will be very different.
>> The programming model inside the NIC would be
different but the services that are
expected on the NIC would be the same.
Okay, so again, I'm
not really saying that we are
proposing a programming model for the NIC.
This is kind of complementary if you think about it.
I have not said a word
about whether we program in Vivado HLS,
Verilog, or OpenCL, I don't know,
your favorite FPGA programming framework.
I mean, whatever I said here was just about
the high-level interfaces that I would need;
how you provide them is your choice.
So, the key abstraction that we introduce here
is what we call an ikernel.
When we originally started thinking about it,
our first idea was, okay, let's use something like a GPU kernel:
as with any accelerator, you just invoke it and you
tear it down whenever you need to stop it.
But it turns out that, because this is tightly
coupled with the networking infrastructure,
you really need to do something that is much more
network-specific, in the spirit
of network processing if you will.
So, from that perspective, what we want to
provide is to
encapsulate the application code
and state on the NIC through
this abstraction, the ikernel, but what
really makes it special is the way you use it.
The way you use it is that you take
your open socket that already communicates with
the rest of the system and you
attach this ikernel to this thing.
Okay. So this attachment process
basically, logically, reroutes
everything that was supposed to
arrive at the socket or to leave
it through this ikernel.
Now, it is very similar,
to a degree, to the concept of extended Berkeley Packet Filters,
and actually there was a recent proposal to provide
eBPF programs that actually attach to a socket and
not at the infrastructure level.
So I am not claiming that this abstraction is
super super novel from
the perspective of the network community.
But definitely it is something that helps establish
a much tighter link between acceleration of
networking and accelerators in general.
>> Just a question. Does this mean
that every application running on
the iNIC must have, like, a counterpart in the OS?
>> Precisely.
>> So effectively it means you cannot
completely bypass the CPU.
>> No. Okay, so completely bypassing is
an interesting question that I
will also address at the end of this talk.
Completely meaning you don't want
the CPU even to initialize it.
Yes, I think we need to have
some grounding on the CPU here in this abstraction.
The data path, that's a different story.
>> Sure.
>> All right, so let me give you
kind of a glimpse into how it actually looks.
I will introduce the interfaces that we use by example.
So the example is really simple.
This is an anomaly detection service.
A really stupid one, right,
very very simple you have
a whole bunch of thermal sensors that send the data,
you want to know whether the temperature is
rising somewhere beyond a certain threshold,
above a certain threshold and if it does,
you want to fire some alarm or something.
So very, very simple.
And all the data that comes in,
you want to update some kind of counter that gives you
the average and the number of alerts and stuff like that.
You want to accelerate it on this iNIC
with the use of an ikernel, and the idea is basically
that you offload the threshold comparison to
the iNIC, and all the alerts
are going to be handled by the CPU.
Makes sense, right? So what
I really want to show you is the code.
It is not the exact code that we
have but it's very close.
So first of all you'll create the ikernel.
And this creation is a process that
is kind of not defined by the specification,
so it's do-it-how-you-wish.
If it is an FPGA, it will just
link some kind of UID from there to here,
but it will also
initialize a whole bunch of other things.
Then we have a set-parameter call with which we want to
initialize the state in the ikernel.
Then we just create the socket, regular stuff.
Nothing unusual, and then you attach.
And so from the point you attach,
this is where the traffic starts flowing.
Okay. So, you can ask
all sorts of different questions, like how do you make
sure that some of the packets are not already in the CPU?
There are all these kinds of bootstrapping issues.
Let's not talk about that for now.
Then from that point on,
what happens is that the host is going to
issue just regular POSIX calls. Nothing special.
But with a catch:
the only packets or messages that it is going to
receive are those that are passed by
the ikernel, and many of
the others are going to be handled by the ikernel itself.
And so, therefore, what we know for
sure is that the message that is going to
arrive over here is the one that needs to trigger
that alert, because all the normal messages
are handled by the ikernel.
So, from that point on, after the alert
is triggered, or actually even
before that, you want to get
the average temperature; that's
where you retrieve some kind
of state from the iNIC
before actually going and triggering
the alert, and finally you do
detach and destroy after the application is done.
So a couple of
important things to understand about this model.
First of all, it works dynamically.
So you can attach or
detach the ikernel whenever you want.
Then, you use just regular POSIX.
It's not the only way to communicate.
I will talk about data paths later.
But this is really the key advantage of
this architecture because it's really backward
compatible and it supports legacy as you wish.
So, if you just comment out
all the IK prefixed calls, it will just work.
There's nothing that requires special attention here.
Another interesting part here is
this IK command, which is basically an RPC call
into the FPGA that allows you to
invoke some logic that would basically,
eventually, provide or retrieve
the state or do some computation,
but it is not a shared memory interface.
It's done deliberately because, otherwise,
it's very hard to provide
high performance from the iKernel itself,
because it would have to go to the memory that
is shared between the CPU
and the iNIC, and that is
a serious issue in terms of performance.
This memory is much slower.
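To make that flow concrete, here is a minimal host-side sketch of the sequence just described; the ikernel.h header, the ik_* calls, the UUID constant, and the command codes are illustrative placeholders based on the walkthrough, not the exact API.

```cpp
// Hedged sketch of the host-side flow described above. The ik_* calls,
// THRESHOLD_IKERNEL_UUID, and the command codes are illustrative, not the real API.
#include "ikernel.h"       // hypothetical host-side ikernel library
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Create the ikernel: binds a host handle to the offloaded logic (e.g. by UUID).
    ik_handle *ik = ik_create(THRESHOLD_IKERNEL_UUID);

    // Initialize offloaded state (the temperature threshold) via the RPC-style command.
    ik_command(ik, SET_THRESHOLD, 42);

    // Regular UDP socket, nothing special.
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    // Attach: from this point, traffic for this socket is rerouted through the ikernel.
    ik_attach(ik, sock);

    char msg[64];
    while (recv(sock, msg, sizeof(msg), 0) > 0) {     // plain POSIX; only abnormal readings arrive
        long avg = ik_command(ik, GET_AVERAGE, 0);    // pull aggregated state off the NIC
        std::printf("ALERT: reading above threshold, running average %ld\n", avg);
    }

    // Tear down when the application is done.
    ik_detach(ik, sock);
    ik_destroy(ik);
    close(sock);
    return 0;
}
```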
>> So, where is the app itself? The kernel app itself.
>> Just hold, hold your breath.
>> In this model, you always assume that
a packet will always
go to the CPU and the network stack?
>> No, no, no.
>> What do you do with protocol state and stuff like that?
>> So, I expected this question to come earlier,
but we are going to support only UDP at this point.
The practicality of this approach is limited;
it's not questionable, it's limited.
But TCP is really,
really, really, really hard to do.
People have done TCP stack on FPGAs.
We haven't gotten to that point.
As far as I heard, Catapult also decided
to forego TCP and
build something on top of UDP
that is reliable and proprietary.
So, the complexity of this thing probably
deserves a couple of other talks, but you're right.
For the type of protocols that we want to support,
we definitely want to push all the protocol
handling onto the FPGA, for now.
We have some ideas on how to combine this with
the CPU, but it's very hazy so far.
>> Just to clarify, so you're not supporting any TCP?
>> No, not yet.
TCP is a significant undertaking, I would say.
So, let's look at the other side of the divide.
What happens on the iNIC?
What happens on the iNIC is three types of interfaces.
Some of them refer to the I/O channels;
they're basically the way to
communicate with the host and with the network.
Another is data access and protocol processing,
and passing decisions, whether the packets
or messages are
going to be passed to the host or not.
Finally, the iKernel control and
state access which is going to
be implemented through RPC.
So, let me very quickly show
a little bit more detail on these interfaces.
The data input is coming from two virtual ports,
the host and the network, the two sides.
But why are those virtual interfaces?
They are virtual because they're actually
shared across all the iKernels,
as well as the bypass.
So, it's the responsibility of the runtime to multiplex those in
a way that would prevent abuse of these interfaces,
denial of service, and stuff like that.
So, basically rate-limiting and multiplexing;
this is a shared resource that we need to handle.
Another thing is that, apparently,
the iKernel cannot observe
the data that belongs to other processes.
So, anything that is going to or from its socket,
those are the only packets
that the iKernel is going to observe.
Now, there is this control interface
that allows us to drop packets,
to create new packets,
and to pass them to the other side effectively.
>> In-line modification of packets as well,
because you need that, right?
>> Yeah, sure. So, I'm going to be talking
about that in the next part.
So, basically, what I'm going to provide here is
an interface that is at this point very FPGA-specific,
or I would say not FPGA-specific,
it's very specific to
a pipelined way of processing networking,
because pipelining is the number one concern
in order to sustain line rate.
You cannot possibly do everything in a single cycle.
You really have to pipeline,
by breaking the data coming in into what
some of your colleagues called a flit in the ClickNP paper.
The flit is this piece of data that
effectively matches the bus width,
and it can be received all in one cycle,
and so you can actually process
it in as many cycles as you want,
as long as you can pipeline it,
and then you would sustain the line rate.
But this is something that is basically, in
a way, dictated by the architecture.
By breaking the data up, you basically
process different flits in parallel, in a pipelined way.
This is the interface that we provide.
The packets are parsed into metadata and data, so you can
actually modify whatever you want, in the way you want,
as well as drop,
as is already clear from the previous slide.
Now, in terms of the state as I mentioned,
there is this RPC interface that really
helps to implement some of the functionality.
So, let's go in and show how it really works.
So, what kind of code do you write?
The code that I'm showing here is a very,
very cleaned-up version of the Vivado HLS
implementation of
this very simple kernel that I mentioned.
It's not too far from reality, but it's not reality.
You can see that it's an event-driven model:
on every event the iKernel gets invoked,
and so there is this callback on a flit,
so you get the flit itself,
and then you figure out whether it's the first flit,
so it's the beginning of the packet.
Basically, it gives you the location in a way.
Then, you have to parse in a way that you want,
you have to parse the packet,
and we know that the first bytes of this packet are going
to contain the actual data.
This is going to be parsed here,
then there is this comparison,
and then you decide whether you drop
the packet or you pass it.
Very simple. In parallel, you update the counters.
What you can see here is that
you have to report
to the interface that you're going to drop this packet,
or you can go and,
I'm sorry, or you can go and pass it.
If you pass it, it's a different control path.
In the next iteration of this you are going
to be handling this part,
which is basically writing the flit to the host side.
Make sense? Now, this
is the interface that allows
you to retrieve and update state,
and this is a generic interface that
is private to the iKernel.
You're going to be able to retrieve the state
or even perform some computation off the critical path.
There will be logic that will be
generated specifically for that,
and this logic is going to be accessing some of the state
that is shared by the
main data path logic in the iKernel.
But this is definitely not on the critical path,
this is a slow path.
You're not expected to
stream data in and out by using this thing.
In particular for this simple example,
in order to set threshold you go in and set it.
This is something that you can
implement any way you want.
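For reference, here is a rough HLS-style sketch of the flit handler along the lines of what was just walked through; the flit struct, the drop/pass action codes, and the on_flit name are illustrative assumptions, since the real interface was only shown on the slide.

```cpp
// Hedged HLS-style sketch of the threshold iKernel described above.
// The flit layout, action codes, and names are illustrative placeholders.
#include <ap_int.h>

struct flit {
    ap_uint<256> data;    // one bus-width worth of payload
    bool         first;   // true for the first flit of a packet
    bool         last;
};

enum action { DROP, PASS };

static ap_uint<32> threshold;       // set through the RPC/state interface
static ap_uint<32> alert_count;     // counters updated alongside the data path
static ap_uint<64> reading_sum;
static ap_uint<32> reading_count;

// Invoked on every flit event; pipelined so the design can hold line rate.
action on_flit(const flit &f) {
#pragma HLS PIPELINE II=1
    static action decision = PASS;

    if (f.first) {
        // The first flit carries the start of the payload: parse the sensor reading.
        ap_uint<32> reading = f.data.range(31, 0);
        reading_sum   += reading;
        reading_count += 1;

        if (reading > threshold) {
            alert_count += 1;
            decision = PASS;    // abnormal: forward the packet to the host socket
        } else {
            decision = DROP;    // normal: filter it out on the NIC
        }
    }
    // Subsequent flits of the same packet follow the decision made on the first one.
    return decision;
}
```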
All right, so the complete design looks like this.
There is one part that is over here;
it handles isolation, protection, rate-limiting,
anti-spoofing, and also implements
all the interfaces that I mentioned,
and there is this whole thing
that integrates with the network stack.
I will really briefly talk about
the data path specifically,
because you don't need the network stack in some cases;
you can actually bypass it.
So, let's talk about data path.
This is the most interesting part,
getting a little bit more technical still.
So, the first way to do data path as I mentioned,
is to support legacy through POSIX, that's easy.
But, there is another type of interface that
we provide which we call custom ring.
So, custom ring is something for advanced users.
If so far it was really simple, now it's for advanced users.
So, what you're going to do here is
you're going to create a special abstraction
that would be somehow
implemented in the NIC and that would allow you to
access a specific ring, both for TX and RX, inside the host.
That means that all the network stack
is completely bypassed at this point.
So, the only thing
that makes it very different from RDMA,
for example, is the fact that it's not
RDMA because it's a ring, right.
So, there is some kind of actual ring,
but instead of network packets
that need to be parsed by the stack,
there will be actual application messages
that just get stacked up in this ring.
This is really convenient to basically offload a lot of
the network processing off the critical path
on the host if it so happens that
the host is also loaded with packets,
but there are many uses for this if you think about it.
There are really plenty of uses, and therefore
I'm really, really looking
forward to this functionality actually working.
So far, it's been pretty buggy.
But, once it's available, we can do a lot of things.
We can do actual batching because now application
can get the messages in a batch form,
and we can do application-level steering like
everything that I said that I want to do,
we will be able to do by sheer fact that
we would have this custom ring abstraction.
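To illustrate what the custom ring could look like from the host side, here is a hedged sketch; the msg_slot layout, the ring fields, and handle_message are assumptions made for the example, not the implemented interface.

```cpp
// Hedged sketch of a host-side custom ring: the ikernel deposits parsed
// application messages (not raw packets) into a ring the application polls directly,
// bypassing the network stack. The layout and names are illustrative assumptions.
#include <cstdint>

struct msg_slot {
    uint32_t len;                 // length of the application-level message
    uint8_t  payload[256];
};

struct custom_ring {
    msg_slot *slots;              // memory shared with the NIC
    uint32_t  size;               // number of slots, power of two
    volatile uint32_t head;       // advanced by the NIC (producer)
    uint32_t  tail;               // advanced by the application (consumer)
};

void handle_message(const uint8_t *payload, uint32_t len);  // application-defined

// Consume whatever the ikernel has queued up, in a batch, with no per-packet
// network-stack work on the host.
void poll_ring(custom_ring &rx) {
    while (rx.tail != rx.head) {
        msg_slot &m = rx.slots[rx.tail & (rx.size - 1)];
        handle_message(m.payload, m.len);
        rx.tail++;
    }
}
```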
So, another interesting fact here, which is to
a degree a downside of
this bump-in-the-wire architecture,
is something that turned out to be
an interesting twist in the story.
So, look at this: you have an FPGA;
the actual structure of the NIC looks like this.
There is an FPGA with two physical interfaces
that do the Ethernet and the physical layer.
Then there's the NIC, and then there is the CPU.
What happens here on the way is that from here to here,
it's basically lossy Ethernet.
There's nothing special about this interface for
the iNIC kind of infrastructure, for the bump-in-the-wire.
What it really means is that packets can be dropped.
Now, it's totally legitimate to drop
packets in regular Ethernet,
as you can imagine.
But, the problem, yes sir?
>> I'm sorry for interrupting. So, [inaudible]
that link between the CPU and the NIC,
it's not PCIe anymore?
>> Right. It is PCIe,
but at the hardware level,
you send packets very reliably,
but it's not RDMA.
So, there is a protocol at
the higher level of the stack;
you have Ethernet there.
>> We've got [inaudible] scheme running on the PCIe that will do
the flow control and ensure that the link is lossless.
>> Right. So, the PCIe link is lossless.
What is lossy is the fact that
the host may not be able to keep up with
processing the data coming from the NIC.
Again, ignore this part,
ignore this thing; there's nothing special, it's a regular NIC.
If you just send packets and the host
cannot keep up, it will start dropping.
That's exactly what's going to happen.
>> The check sum can be wrong.
>> What?
>> The check sum [inaudible].
>> You know, a checksum
being wrong is a really rare occasion,
but really what happens is that
you're firing 40 gig at the host and,
you know, it will not sustain that.
So, the buffers will get overloaded
and the story is over; it will start dropping.
But here's the catch:
the packets can be dropped,
and suddenly this whole nice abstraction of extending
application logic into the NIC breaks,
because suddenly there is an unreliable link
between one function in
the application and another function in the application.
That really is not good.
So, when we figured that out,
it was too late to fix it properly,
but it turns out that if you think hard,
what you can do is you still can
work around it for most applications
that we were interested in the first stage.
But, this is very clear that we need to
implement the proper flow control
between the iNIC itself and the side of the application.
Because otherwise, some applications may just not work.
So, for example, in the application where you
count the number of alerts:
if there are too many alerts and the host starts dropping,
this counter will be much
higher than the actual number
of alerts that the host will be able to handle.
All right. So, this is kind of
a nitty-gritty detail. Yes.
>> I'm interested. Are you saying you believe this is
fundamentally the right way to construct this?
Or are you saying this is essentially a bug, in the fact
that the FPGA is currently too far from the CPU, right?
There are two ways of addressing this.
One is to say that the current thing,
with the FPGA essentially on the far side of
the NIC, is a temporary hack that we do right
now, where we basically have to redigitize
[inaudible] the level of
processing in the FPGA and then lower it
again on the real [inaudible].
This is a broken hack, and the right
thing to do is to say, well,
the NIC really ought to have, if you like,
a co-processor interface at the right point in the NIC,
and then the FPGA should be plugged
into that co-processor interface.
Then we don't have all this stupid raising and
lowering of the data as it goes through the abstraction,
and therefore we wouldn't have this problem.
Or maybe you are saying,
I guess, that the right point for
that co-processor interface in
the NIC is actually still at the point
at which you might have to deal more
cleverly with flow control
between that processing and the CPU.
So, where are you on this?
>> So, it's a complicated question,
some of it I would be happy to take
offline and we've been thinking about it a lot.
Part of the answer is that
the newer generations of the NICs, and also Catapult,
as a matter of fact, already allow a direct link
from the FPGA to the CPU through PCIe.
So, effectively, you're not under
the spell of network interface at all.
So, you can really have real flow control there.
Is it a good idea for data path?
I don't know. Quite frankly,
so far we figured
that the ability for the CPU to drop packets is
actually essential to build
applications with high performance.
So, if you wanted to apply this
back pressure all the way down to the network
because some application gets stuck,
it's not really clear that this is
the best solution in the grand
scheme of things.
>> Why isn't dropping packets the
first thing you push in the FPGA?
>> Okay, that's a good point, because
the FPGA might not really
know that it needs to drop
packets because the host is already overloaded.
So, this is a subtle kind of question of flow control:
how far should it go?
Should it affect the FPGA as well, or should it
only come into effect when the FPGA wants to talk to the host?
So, again, this is
a really elaborate discussion that
probably deserves a little bit more time.
>> To elaborate on that: if you operate the link
between the FPGA and the NIC
at a rate slightly lower than the
link between the NIC and the CPU,
the FPGA will always know when to drop packets.
If [inaudible] the
link from the NIC to the CPU is 40 gigs,
the queue at the FPGA will always be empty,
and it should be possible to always know when to drop.
>> I'm not saying that I cannot know when to drop.
That's not the question.
The question, I think, is whether I should be dropping
earlier, on the FPGA, or I should be dropping on the host,
or maybe the fact that the FPGA
would sit in the right place on
the NIC would alleviate this problem.
Again, this goes back to the question about
raising and lowering the stack.
It's not clear that the solution of putting
the FPGA after the NIC
is going to alleviate this problem.
For example, the NIC mostly does not support TCP.
Okay, so RDMA, maybe, is a different story.
But again, there are a lot of things that could be done
differently, in a more efficient way, if
the FPGA would sit at the right point,
but I'm not sure that
this particular question would be addressed in this case.
Okay. So again, let me rush through
the evaluation part, because I think
that's what maybe is of a little bit of interest after all.
So, we ran a bunch of applications.
The first three are microbenchmarks: Echo,
a traffic generator that just does bursts of network sends.
Then we have this look-through memcached cache
and top-K heavy hitters implemented on the NIC.
So, I want to talk about
memcached because this is the most elaborate,
I think, example and the most interesting one.
So, I'll skip that and we'll get to memcached.
So, the memcached thing is
actually a humongously large cache,
as you can imagine, of 4K keys with values of 10 bytes each.
So, this is certainly a really,
really prototype-level thing, just to
understand how this can work.
We're not claiming
that this is the real implementation.
So, the idea is that this is a look-through cache.
It's a very simple implementation as I mentioned,
really. Hash table, basically.
It's populated upon GET,
and it's invalidated upon SET,
and the reason for this is that if
your SET requests get dropped, then your cache in
the NIC would be out of sync with the host.
Okay. So, therefore, we definitely need to
stay consistent with what
the host has, and this possibility of drops
is basically the reason why it works this way.
So, correctness is guaranteed
in terms of the consistency of the cache with the host.
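A hedged sketch of that cache policy, populate on a GET response, invalidate on a SET, so a dropped SET can only ever cost a miss rather than leave a stale value; the structure and names are illustrative, not the actual implementation.

```cpp
// Hedged sketch of the look-through cache policy on the NIC (illustrative names and sizes).
#include <cstdint>
#include <cstring>

constexpr int NUM_KEYS = 4096;    // the "humongously large" 4K-entry prototype cache
constexpr int VAL_LEN  = 10;

struct entry { uint64_t key; bool valid; char value[VAL_LEN]; };
static entry cache[NUM_KEYS];

static uint32_t slot(uint64_t key) { return key % NUM_KEYS; }

// GET request from the network: answer from the NIC on a hit, otherwise pass to the host.
bool handle_get(uint64_t key, char *out_value) {
    entry &e = cache[slot(key)];
    if (e.valid && e.key == key) { std::memcpy(out_value, e.value, VAL_LEN); return true; }
    return false;                 // miss: forward the request to the host
}

// GET response coming back from the host: populate the cache.
void handle_get_response(uint64_t key, const char *value) {
    entry &e = cache[slot(key)];
    e.key = key; std::memcpy(e.value, value, VAL_LEN); e.valid = true;
}

// SET from the network: invalidate and pass through. If the SET is later dropped
// before the host applies it, the NIC can at worst miss; it never serves a value
// that is inconsistent with the host.
void handle_set(uint64_t key) { cache[slot(key)].valid = false; }
```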
Okay. So, just one important point here to understand.
The performance results that we got for zero hit rate
and 100% hit rate are remarkably different.
Alright. So, the host is
a really significant bottleneck here.
So, whatever the host needs to handle,
it will be 50 times slower than
what the NIC can do, which is basically line rate.
This is only 70% of line rate because there is
some bug in the shell that does
not allow us to generate packets quickly enough.
But fundamentally, the throughput here is bounded
by this formula involving the CPU throughput,
because what happens is that when the CPU becomes
overloaded it starts dropping, and at that point
it doesn't matter whether the NIC is
fast or not, the packet gets dropped.
So, you actually have to apply back
pressure at the application level.
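To spell out my reading of that bound (the exact form on the slide may differ): with hit rate h, only the miss fraction (1 - h) reaches the host, so the sustainable throughput is roughly T <= min(T_line, T_host / (1 - h)); at a zero hit rate you are stuck at the host's throughput, and as h approaches one you approach line rate.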
So, this is the theoretical curve
here that says this is how this should look like.
But the measured curve is apparently very
similar, because that's how
we regulate the load generator.
Okay. So, it should not come as
a surprise, and then the question becomes, "Okay,
you have a very fast cache,
but it's pretty tiny, so what can you possibly do?"
So, here, apparently, first of all,
there are FPGAs and NICs with large memory,
even large fast memory.
But also, Zipf is a really nice thing.
The skewed distribution of accesses is going to allow us
to manage with a really small cache.
With a cache of one percent of the size, we would be able
to get a 50% or 60% hit rate,
which is nice, and it still
gives us a reasonable speedup here.
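As a back-of-the-envelope check of that claim, here is a small sketch that estimates the hit rate of a cache holding the hottest one percent of keys under an assumed Zipf popularity distribution; the key count and exponent are assumptions, not the measured workload.

```cpp
// Hedged estimate of the hit rate for a small cache under a Zipf popularity
// distribution; the key population and exponent are illustrative assumptions.
#include <cmath>
#include <cstdio>

int main() {
    const long   total_keys = 1000000;            // assumed key population
    const long   cached     = total_keys / 100;   // cache the hottest 1% of keys
    const double s          = 0.95;               // assumed Zipf exponent

    double hot = 0.0, all = 0.0;
    for (long i = 1; i <= total_keys; ++i) {
        double w = 1.0 / std::pow(static_cast<double>(i), s);  // popularity of the i-th hottest key
        all += w;
        if (i <= cached) hot += w;
    }
    // With these assumptions the estimate lands around 60%, in the same ballpark
    // as the 50-60% hit rate quoted for a cache of one percent of the keys.
    std::printf("estimated hit rate: %.0f%%\n", 100.0 * hot / all);
    return 0;
}
```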
Okay. So, I'm going to skip to the lessons
learned and really three slides
maybe, for the bigger picture.
Sorry for taking two minutes more. I'm sorry about that.
So, what are the lessons learned?
Expected lessons for software people
doing some hardware stuff.
Hardware development sucks to the degree that it's just
insane and it burns out students,
professors, and it's really, really,
really, really not convenient.
I think it's an opportunity for
those of us who actually got into
this to introduce something new and useful.
That's what we've done with our debug interface.
I didn't have time to talk about it.
But the bigger picture is,
the world is changing and there will be a lot of
programmable stuff lying around in your machine,
and programmable NICs are just one of those;
there are also GPUs and a bunch of others.
Although they're all nice and programmable, eventually,
you are faced with a really, really,
really tough challenge which is
the programmability wall, right?
Go and program a system with a bunch of accelerators,
each one having its own constraints and
low-level programming frameworks
and whatnot, without any abstraction.
So, doing this is really, really tough.
So, the goal that we set out
to achieve several years ago was to break this wall.
Okay. The way we are going to break this wall is that
we're going to break up
the CPU-Centric Operating System architecture
in a way that would allow us to get
rid of the control, and of the CPU being
the only and single point of contact, in order to
do operating system stuff and
I/O and get operating system services.
With the idea of building
an Accelerator-Centric Operating System
with the operating system services actually spread
out over all this programmable devices.
We have done a little bit of work on that for GPUs,
with the file system and networking
and RDMA support from GPUs.
We are now doing that with NICs and
the next frontier is storage, and then basically,
the idea is to build
a coherent architecture that would allow something like
a unikernel, with
different library OSes
running on each one of those devices and providing
a coherent view of the state of
the system without necessarily
having the CPU involved in all that.
So, this is the bigger picture.
This is where we're heading to.
With that, you can
easily derive the future work that we have.
Apparently, some of it is
to improve what we already have,
but there are also other things;
in particular, the last bullet is something
of particular interest here:
the use of the NIC to
communicate with and invoke tasks on
other devices, and vice versa.
So with that, I would like to thank you.