00:00:12.799
All righty. I'm Kirk Haines, and I'm here to talk
00:00:18.680
to you all about Vertebra. First, a little bit about myself: I've
00:00:25.320
been doing Ruby for about seven years, and I've been making a living off of Ruby for pretty much the whole time I've been
00:00:31.119
doing it. For most of that time I did web stuff. I did a lot of
00:00:38.760
websites for businesses, for a lot of mutual-fund things, using IOWA,
00:00:45.840
which a couple of you may have used. I also wrote Swiftiply and Analogger,
00:00:53.320
which are two tools for making your stuff faster. And over the last
00:00:59.680
half year or so I've been working for Engine Yard on
00:01:05.920
Vertebra. Now, cloud computing: we hear a lot about it these days. If you follow the right people on
00:01:12.759
Twitter, there's constantly talk about cloud computing, how it's going to bring down costs, etc., and
00:01:20.920
every day more and more people are moving toward cloud computing for their deployments. Engine Yard is obviously
00:01:26.400
really committed to the whole cloud concept. One of the great things about clouds is that they're made up of
00:01:34.479
lots of relatively inexpensive computational units, little
00:01:40.799
machines, so a cloud provides a huge array of resources for your software to do its
00:01:46.360
thing. That's also one of the bad things about clouds: you have
00:01:52.200
all these machines, and how do you manage them? How do you distribute work to them? How do you deal with that?
00:01:59.560
One of the ways that has been done in the past, and it's a pretty common way
00:02:06.600
to do it, is scripted SSH: you have something that goes out and
00:02:12.000
iteratively fires off its stuff to all your different machines. It works, but for some things it doesn't work
00:02:19.640
so well, and when you start getting large numbers of machines it can get bogged
00:02:25.200
down. That's kind of where Vertebra comes in. Vertebra is designed to be a framework for fault-tolerant
00:02:32.200
services running inside your cloud infrastructure. Now, Vertebra is
00:02:40.280
comprised — did I unplug myself? there we go, okay — Vertebra is
00:02:49.159
comprised of a few different layers.
00:02:54.840
Now, at the bottom layer you've got the Vertebra protocol, and that protocol
00:02:59.879
is built on top of XMPP. There's a whole bunch of design reasons why XMPP was chosen as the
00:03:07.480
protocol. You can read through these, and the URL at the bottom points to a
00:03:15.959
more complete document that explains all the design reasons, but what it all boils down to is that XMPP has been around for
00:03:21.959
quite a while, it's a solid, mature specification, and it had a bunch of features that were useful. So
00:03:29.519
the decision was made to build this thing on top of XMPP. And XMPP is kind of
00:03:37.360
ugly. This is an actual capture of one full exchange for one operation that
00:03:45.280
occurred between two agents running inside Vertebra, and you can see there's a lot of back and forth, a
00:03:51.519
lot of data. It's a very chatty protocol. Now, one of the upsides of that
00:03:57.879
chatty protocol is that it gives us a lot of information to build
00:04:04.640
fault tolerance in, because of that back-and-forth nature. If we go back to that protocol for a second,
00:04:12.200
you can see that we're using IQ packets, and in the XMPP specification an IQ
00:04:18.120
set which gets sent out somewhere has to have a result that comes back. There's a
00:04:23.560
handshake that occurs, and so we can use that handshake along with a couple of other
00:04:28.880
features: we retry packets —
00:04:33.960
if we don't get a response in enough time, we retry them — and we suppress duplicates —
00:04:39.080
we keep track of the packets we get, and if we get duplicates, we suppress those suckers. If we do that, then what
00:04:46.080
we have is a really fault-tolerant system. The goal of that system is that if you have two agents
00:04:53.680
out there running and the network drops off in between them while there's an
00:04:59.039
operation going on between those two agents, when that network comes back up, that operation should be able to just
00:05:04.520
pick up from right where it left off and complete without any intervention, without
00:05:11.800
any issues. That's what we're aiming for with this thing.
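The retry-and-suppression scheme described here can be sketched in plain Ruby. This is an illustrative model only — the class and method names (DedupingReceiver, RetryingSender) are invented, not Vertebra's actual API:

```ruby
require 'set'

# Receiver side: remember packet ids so a retried (duplicate) packet
# is dropped rather than processed twice.
class DedupingReceiver
  def initialize
    @seen = Set.new
  end

  def handle(packet_id)
    return :duplicate if @seen.include?(packet_id)
    @seen << packet_id
    :processed
  end
end

# Sender side: keep re-sending until an acknowledgement comes back,
# up to a retry limit (standing in for the IQ set/result handshake).
class RetryingSender
  def initialize(max_tries = 3)
    @max_tries = max_tries
  end

  def deliver(packet_id)
    @max_tries.times do
      acked = yield(packet_id)   # one send plus a (possibly lost) ack
      return :delivered if acked
    end
    :undeliverable
  end
end
```

If the first acknowledgement gets lost, the sender simply retries, and the receiver quietly suppresses the duplicate, which is the property that lets an interrupted operation pick up where it left off.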
00:05:19.400
Now, sitting underneath all of this, you have to have an XMPP server, and the one that we chose for that is ejabberd.
00:05:24.600
Again, there are several technical reasons for that, and I've given a URL where you can go and read them,
00:05:30.039
but what it boils down to is that ejabberd is fairly fast, it's solid, and it's
00:05:35.800
been around a long time — if I remember right, its antecedents are from 1999, something like that. And being written in
00:05:43.720
Erlang, it's actually extensible, so later on, if we wanted to, we could take portions of Vertebra and move them right
00:05:51.800
back into the XMPP server. Now, the next section of Vertebra
00:05:59.800
is the agents, and an agent is nothing more than a process that's running out
00:06:04.919
on some machine somewhere and encapsulates a set of resources that
00:06:10.919
it's making available to the network. So, for example, you might
00:06:16.360
have an agent that handles all of your gem operations, so it's going to offer operations for listing your gems,
00:06:22.880
for installing them, for deleting them, whatever. You might have an agent
00:06:28.440
that handles logging, so you can write a log and it will go and store
00:06:34.639
it someplace for you, and if you want to query logs, it can query them for you. These agents sit out on all your
00:06:41.240
machines in your cloud, however many you've got, just waiting to do
00:06:47.440
something. Now, once you have all those agents out there, you need a way to talk to them — a way to
00:06:53.880
tell them what you want. So what you add is
00:06:59.840
another item that I've called a client agent. A client agent is really no different from any of these other agents, except that its purpose is
00:07:08.240
to act as a communications conduit for you. Now, if you're
00:07:14.639
going to use one of those client agents, and you want to, for instance, run
00:07:19.800
an operation to query the logs that have been written, that client agent needs a way to know which of
00:07:27.639
those agents out in the cloud are valid recipients for that operation. So we add in one other piece here, and this
00:07:34.840
piece is called Herault. Now, Herault was originally written in Ruby, but the
00:07:41.639
decision was made to rewrite it in Erlang for performance and scalability. Herault is just like any other
00:07:49.039
agent out there, but it has a few important jobs that are central to the
00:07:54.120
way Vertebra works. The first of those is advertising. When an agent starts up —
00:08:01.680
and one thing that, in my spiel, I forgot to mention a few minutes ago is that all of those agents sitting
00:08:08.000
out there, all those processes, are Ruby processes. In Vertebra right now, our agent infrastructure is all
00:08:14.639
written in Ruby, so every one of those is a Ruby process. Now,
00:08:19.680
when one of those things starts up, the agent looks at all of the actors —
00:08:25.000
which you can think of as classes of functionality — it looks
00:08:31.319
at all of the actors that it has defined for it to run, and it gathers up all the
00:08:36.640
resources that those actors offer and advertises them to Herault. This is a capture of the operations that take
00:08:44.320
place there. What it does is send a packet over to Herault saying that we're
00:08:51.000
going to advertise this list of resources. Herault takes that information, sticks it in a Mnesia database
00:08:58.279
where it's running, and then just sends a response back that's essentially an acknowledgment
00:09:04.160
that it has received this information.
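The advertise step can be modeled in a few lines of Ruby. This is a sketch, not Vertebra's real code: the Actor class, the resource names, and the in-memory Directory (standing in for Herault's Mnesia-backed registry) are all invented for illustration:

```ruby
# An actor is a unit of functionality that declares the resources it provides.
class Actor
  attr_reader :provides

  def initialize(provides)
    @provides = provides
  end
end

# Stand-in for Herault's registry: resource => list of agent JIDs.
class Directory
  def initialize
    @registry = Hash.new { |h, k| h[k] = [] }
  end

  def advertise(jid, resources)
    resources.each { |r| @registry[r] << jid }
    :ack                   # Herault just acknowledges receipt
  end

  def providers(resource)
    @registry[resource]
  end
end

# On startup the agent walks its actors and gathers every resource they offer.
def gather_resources(actors)
  actors.flat_map(&:provides).uniq
end
```

An agent configured with, say, a gem actor and a log actor would send the combined, de-duplicated resource list to the directory in a single advertise operation.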
00:09:10.240
So, coming back to our client agent: that's all fine and great — we have all these agents out there that have advertised what they're
00:09:15.800
doing, but there still needs to be a way to get that information back, to
00:09:22.800
query it, and that's the second part of Herault's job: Herault offers discovery. So
00:09:29.360
what a client agent will do — and I see I stuck the wrong image in there — what a client agent will do is
00:09:35.720
send a discover op off to Herault, saying, 'I want to
00:09:42.320
discover everybody that offers this operation.' Herault will go look in its database for all of the
00:09:49.079
agents that have previously advertised that this is what they do, and then it sends a response back. It can be, like
00:09:54.440
in this example, that there's just one item in that list, or there could be a hundred agents out there that do it; it
00:10:00.200
doesn't matter — Herault will send back whatever there is to the client agent. When that happens, the
00:10:08.120
client agent can go ahead and fire that operation off to those server
00:10:13.519
agents out there and get the ball rolling.
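The discovery step looks roughly like this in miniature. Again a hedged sketch — the DiscoveryService class and its API are hypothetical, mirroring the flow just described rather than Vertebra's actual protocol:

```ruby
# The directory answers "who advertised this resource?" with the full list,
# whether that's one agent or a hundred; the caller decides what to do next.
class DiscoveryService
  def initialize
    @advertised = Hash.new { |h, k| h[k] = [] }
  end

  # Called as agents advertise (see the advertise step above).
  def record(jid, resource)
    @advertised[resource] << jid
  end

  # Called by a client agent before dispatching an operation.
  def discover(resource)
    @advertised[resource].dup
  end
end
```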
00:10:18.800
Now, a problem comes in, though, because you don't want just anybody able to fire off any operation that any
00:10:24.959
of your agents have defined. What happens when some bozo gets in there and starts installing malicious
00:10:31.079
gems, or something worse? So Herault has a third duty, and that's
00:10:36.600
authorization. What Herault offers is a
00:10:43.200
security-authorization operation. When one of
00:10:49.120
those client agents dispatches an operation out to a server agent, the server agent looks at who it was that
00:10:55.920
dispatched that operation, and it goes off to Herault — it invokes that security-authorization operation —
00:11:02.959
and it asks — let me go back there — it asks Herault, 'Is this guy allowed to do this
00:11:10.880
operation?' Herault sends back either an acknowledgment or a
00:11:17.279
disacknowledgment, and then, depending upon that response, the
00:11:22.680
server agent can either drop the communication or go on with it.
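The authorization round trip reduces to a simple check before the work happens. This sketch uses an invented ACL-as-hash layout and invented class names; the real decision lives in Herault, not in the agent:

```ruby
# Stand-in for Herault's authorization duty: does this JID get this op?
class AuthorizationService
  def initialize(acl)
    @acl = acl                      # { jid => [permitted operations] }
  end

  def authorized?(jid, operation)
    Array(@acl[jid]).include?(operation)
  end
end

# The server agent consults the authorization service before performing
# a dispatched operation, and drops the request if the answer is no.
class GuardedServerAgent
  def initialize(authz)
    @authz = authz
  end

  def dispatch(from_jid, operation)
    return :dropped unless @authz.authorized?(from_jid, operation)
    :performed
  end
end
```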
00:11:30.000
Now, back to my original example of the log-query operation: if
00:11:37.160
you're querying logs, you probably don't want a whole bunch of servers and a
00:11:42.839
whole bunch of agents all spitting back the same information to you from your query; you probably only want one of
00:11:49.000
those guys out there in your network to do it, and you want to leave the other ones alone. So there's one more
00:11:55.760
component that goes into dispatching operations with Vertebra that plays into that, and it's
00:12:01.360
called scope. Right now we've defined two scopes. One of them is a scope of 'all',
00:12:07.079
and 'all', just like its name suggests, sends to every single one of the agents
00:12:12.600
out there, collects all of their responses, and delivers them to you. The other scope is a 'single'
00:12:18.760
scope, and single scope just sends out to a single one of those agents at random,
00:12:24.440
and if that agent blows up — if it returns an exception, something of that nature — then the system will pick another
00:12:31.160
one at random, send the query out, and keep doing that until it either gets a response back from somebody or
00:12:37.199
exhausts its list. Those aren't the only scopes that can be defined — we could define a lot more, and there are several in
00:12:42.519
the works — but that's what we've got right now.
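The two scopes can be sketched as a small dispatch helper. This is a hypothetical illustration of the semantics just described, not Vertebra's real dispatcher:

```ruby
# :all fans the operation out to every matching agent and gathers every
# response; :single tries agents one at a time, in random order, until one
# succeeds or the list is exhausted.
def dispatch(agents, scope)
  case scope
  when :all
    agents.map { |agent| yield(agent) }       # collect all responses
  when :single
    agents.shuffle.each do |agent|
      begin
        return yield(agent)                   # first success wins
      rescue StandardError
        next                                  # that agent blew up; try another
      end
    end
    raise "every agent failed"
  end
end
```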
00:12:48.560
Roll all this together, and that's Vertebra in a nutshell: it's just a way of running processes out there, getting stuff out to
00:12:56.600
them, and getting it back, in a relatively efficient, fault-tolerant way. While
00:13:01.800
we were developing this, we ran into a number
00:13:07.399
of interesting things. One of them: this thing is built on XMPP, and if you go back six
00:13:14.360
months, a year ago, the main library out there to do this in Ruby was XMPP4R,
00:13:20.680
and it works fine, but it's kind of slow, it's really heavy on memory use, and
00:13:27.279
it's threaded. You know, slow is relative, and you can make
00:13:33.760
arguments either way, but in general you want things to go as fast as you can. Memory use is a big consideration, though,
00:13:39.600
because if, say, you have a thousand of these running out there in your cloud and each one is using 10
00:13:44.800
megabytes more than it needs to, that comes to some real costs. And the threaded issue is mostly because, if
00:13:52.399
all of the internal operations are happening inside threads, it makes it really difficult to get in there and
00:13:58.480
finely control what's happening when; it just makes it difficult to
00:14:05.199
manipulate the internals. So we started looking for something else to use. Now, Aman — and he's out here somewhere —
00:14:13.399
wrote xmpp4em. What he did is take XMPP4R, rip out the
00:14:19.360
threaded portions, and make it evented using EventMachine. It did offer better performance, but it didn't really
00:14:25.560
help so much with the memory use. So we kept looking around, and we came across this thing called Loudmouth. Now,
00:14:31.279
Loudmouth is a C library, and it's written using GLib; it's event-based, using GLib's
00:14:39.040
event loop, it's pretty fast, it's pretty lightweight on memory, and it has Ruby bindings. It looked like
00:14:45.800
the answer, so we went with it, and the end result was a faster
00:14:51.800
product with the RAM use dramatically reduced: I saw agents with
00:14:57.360
footprints of 20 to 30 megabytes with XMPP4R, and with Loudmouth we're at 10
00:15:02.839
megabytes. In the intervening months since then, the original owners of
00:15:10.000
Loudmouth — a Swedish firm by the name of Imendio; Mikael Hallendal was
00:15:15.079
the main author — have left the project, so we've picked it up
00:15:20.560
and we are carrying it forward. So we had Loudmouth, and we were taking
00:15:29.279
Vertebra, which was originally written with XMPP4R and threaded, and we were trying to turn it into an evented system,
00:15:36.000
and we needed a way to schedule the work that would happen:
00:15:41.800
since it's an evented system, things have to happen fairly asynchronously,
00:15:47.040
without a lot of blocking. For an example: 'I have
00:15:53.079
scheduled an op to be sent, but don't send it unless the ejabberd connection is live and authenticated.'
00:15:59.240
That's one use case. We needed some way to encapsulate bodies of work. Now, EventMachine — I mentioned it
00:16:07.440
earlier — has an implementation of the deferrable pattern in it, and this is
00:16:14.800
actually a quote from the documentation: the deferrable pattern allows you to specify any number of Ruby code blocks
00:16:20.800
that will be executed at some future time when the status of the deferrable object changes. Now, that's close to what we needed; it's in the ballpark.
00:16:29.720
And — don't worry too much about the details of the code; I know it's kind of small, but I threw
00:16:35.360
it in just so you can see that the whole implementation of a deferrable is actually pretty small — this is
00:16:41.199
the EventMachine deferrable with all the comments stripped out.
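The gist of that pattern fits in a condensed re-implementation like the one below. To be clear, this is an illustrative toy, not EventMachine's EM::Deferrable itself; the class name is invented, and only the core behavior — blocks registered now, run when the status is set — is reproduced:

```ruby
class TinyDeferrable
  def callback(&blk)
    (@callbacks ||= []) << blk
    fire(@callbacks) if @status == :succeeded  # late registration fires now
  end

  def errback(&blk)
    (@errbacks ||= []) << blk
    fire(@errbacks) if @status == :failed
  end

  # Flipping the status runs every pending block for that outcome.
  def set_deferred_status(status, *args)
    @status, @args = status, args
    fire(status == :succeeded ? @callbacks : @errbacks)
  end

  private

  def fire(queue)
    (queue || []).each { |blk| blk.call(*@args) }
    queue && queue.clear
  end
end
```

A callback registered before the status changes waits; one registered after a success fires immediately with the same arguments.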
00:16:48.800
Now, remember that Loudmouth uses GLib, which has a separate event loop — it has its own event reactor in it. So I took the
00:16:56.759
EventMachine deferrable and changed it into a Vertebra deferrable, with just a few minor changes.
00:17:04.079
First of all, with the EventMachine deferrable, if there was a problem when you
00:17:09.480
defined your callbacks or your errbacks, it wouldn't give you any feedback, so
00:17:17.160
I changed it so that instead it'll throw a custom exception if there's an issue
00:17:23.959
there. The next issue was that when you set the deferred status on a deferrable —
00:17:30.880
and that status can be succeeded, deferred, or failed — the EventMachine
00:17:37.360
one doesn't return the results of the blocks that get called. I made my version
00:17:45.960
pass that information back, because sometimes you want to know without having to use some other
00:17:52.760
value-passing mechanism; you just want the value from those blocks to fall right back out. So I applied that change, and
00:17:58.960
then I also applied the change to just make it work with the internals of
00:18:04.440
GLib instead of EventMachine. And — crap, I put in the wrong slide there; I
00:18:10.320
showed you the internals of the EventMachine one — but you get the point. And it
00:18:16.240
was almost good enough. Almost good enough. Now, there's only one issue: when
00:18:22.120
you have some body of work that's scheduled, there are conditions that need to be met before that body of work
00:18:29.000
runs. The example I used earlier: we want to make sure that the ejabberd server is connected and
00:18:34.880
authenticated before we dispatch that work. So we took that deferrable, and
00:18:42.360
I made another class out of it that I called a synapse — probably should have been called a neuron,
00:18:48.880
but... All a synapse does is
00:18:54.039
provide another set of blocks that are called conditions, and those blocks do anything you want. What they need
00:19:01.240
to do, at the end, is either return true or false, or they can return the symbols :succeeded, :deferred, or
00:19:08.400
:failed, and a true is equal to a :succeeded. So what happens is, when
00:19:15.840
that block of code runs, if it returns true or :succeeded, then that condition has been met. If it returns
00:19:23.200
:deferred, then that condition didn't fail; it just hasn't been met yet, so
00:19:29.000
we'll come back and revisit it at a later time. And if it returns :failed, then something went wrong, and we need to call
00:19:35.559
the error callbacks instead of the success callbacks on the
00:19:42.039
deferrable. With that added in, it was the perfect thing for controlling our blocks of work. This is a
00:19:49.440
little snippet of code showing how it's used; it actually comes from
00:19:54.600
the client protocol. When we start — not a connection, but a transaction with a server — the
00:20:02.640
first thing we want to do is make sure that our connection is open
00:20:07.919
and authenticated, and make sure that we don't already have a conversation in process with that particular JID;
00:20:14.400
then, if those two conditions are met, we fire off the callback. You
00:20:20.280
know, it's as simple as that. It makes it really simple to define future units of work that may or may not get executed, depending upon a set of conditions.
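A toy version of the synapse idea captures the condition semantics described above. The class name and `tick` method are invented for this sketch; in the real system a reactor would re-poll deferred synapses, where here the scheduler pass is driven by hand:

```ruby
class ToySynapse
  def initialize
    @conditions, @callbacks, @errbacks = [], [], []
  end

  def condition(&blk); @conditions << blk; end
  def callback(&blk);  @callbacks  << blk; end
  def errback(&blk);   @errbacks   << blk; end

  # One scheduler pass over the conditions. Each condition returns
  # true/:succeeded (met), :deferred (not met yet, ask again on a later
  # pass), or :failed (something went wrong: run the errbacks).
  def tick
    @conditions.each do |c|
      case c.call
      when true, :succeeded then next
      when :deferred        then return :deferred
      else
        @errbacks.each(&:call)
        return :failed
      end
    end
    @callbacks.each(&:call)                  # all conditions met: do the work
    :succeeded
  end
end
```

So a unit of work like "send this op once the connection is authenticated" just waits, tick after tick, until its conditions come true.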
00:20:25.840
By using the same thing, it also becomes possible within
00:20:33.480
the agent framework to encapsulate long-running operations. Let's say you have a backup that you want to be able
00:20:40.280
to initiate with one of your agents. Backups can take a long time, and you don't want that long-running process
00:20:48.039
to block either your agent or whatever is talking to
00:20:54.039
it on the other end. So we can use the same thing, with just a little bit more sugar in there, to
00:21:01.760
make it really easy to wrap up long-running processes. An actor synapse just supplies a little bit of
00:21:09.000
sugar there: it provides a default condition that evaluates what's happening with that long-running operation — whether
00:21:15.400
it has completed and there are results to pass on, or whether it's still
00:21:20.480
running. What I have here is just sort of a contrived example of a
00:21:26.960
long-running operation: all it does is generate a string of some length, but it
00:21:32.120
does it over a period of seconds. When this runs, it doesn't block the
00:21:39.279
reactor while it's running, because it breaks the work into little bits of
00:21:45.600
processing inside a synapse, schedules them on the reactor queue, and just
00:21:51.840
churns through them as it's ready. That's just the start of the concurrency API.
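That chunked, non-blocking shape can be shown without any reactor at all. Here a plain array stands in for the reactor queue, and the function name is made up; the point is only that each small step re-schedules itself rather than holding the loop for the whole job:

```ruby
# Build a string of the given length in small chunks; each chunk of work
# re-enqueues itself until the job is done, mimicking how a synapse would
# schedule bits of processing on the reactor queue.
def build_string_in_chunks(length, chunk_size = 4)
  result = +""
  queue = []
  step = lambda do
    result << "x" * [chunk_size, length - result.size].min
    queue << step if result.size < length    # not done: schedule another pass
  end
  queue << step
  queue.shift.call until queue.empty?        # stand-in for the reactor loop
  result
end
```

Between any two steps, a real reactor would be free to service other events, which is exactly why the long-running operation never blocks it.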
00:22:00.440
There are other ways we're going to go with this to make it more friendly to our users: we might do something with
00:22:06.960
fibers; we might do something with automatically managed forked processes. That's kind of the future, but this is
00:22:13.000
kind of the start. Coming back to Loudmouth: there was one pain in the butt with it. Loudmouth uses GLib,
00:22:20.000
and the GLib bindings come from Ruby-GNOME, and Ruby-GNOME is a really heavy dependency — it's kind
00:22:25.440
of a pain in the butt to build on OS X machines. And then the other problem is
00:22:32.279
that, because it has its own event loop, it doesn't play well with an event
00:22:38.080
loop like EventMachine's, with both of them trying to run in the same Ruby process. You can do
00:22:44.559
it by interleaving the two using timers, but
00:22:51.600
it's ugly — it's really nasty. So there
00:22:57.679
had to be a better way. Now, there's an outside contributor, Raphaël Simon, who works for RightScale,
00:23:04.720
and RightScale uses EventMachine for a lot of their stuff. He has been looking at
00:23:09.840
Vertebra for some stuff that they're looking at doing internally, so he needed to interface the two, and he thought
00:23:16.120
there had to be a better way — and he made one. His requirements were that the GLib
00:23:23.279
event loop shouldn't ever be blocked, so that XMPP traffic coming into Loudmouth
00:23:28.520
is never held up while Ruby is off doing its thing, and that it should also
00:23:34.720
be possible for you to take some long-running process in Ruby and
00:23:40.000
easily spawn it out without blocking anything as
00:23:45.039
well. The solution he came up with was to take GLib — because it's a
00:23:50.279
self-contained C system — and run it in a separate thread from Ruby, but manage
00:23:56.559
that from the Ruby binding; and then, inside the Ruby bindings themselves, he's using EventMachine to
00:24:03.400
control all of the
00:24:08.919
communications. His answer to doing that was to use
00:24:14.840
pipes to communicate between the GLib thread and the Ruby
00:24:21.679
thread. What we have here is
00:24:28.440
sort of the flow of how things go from Ruby to Loudmouth. Ruby needs to talk to Loudmouth, so it
00:24:34.360
writes something into a pipe that's going into Loudmouth. The Loudmouth event loop picks up
00:24:42.240
the activity on that pipe, and it knows that when that happens it needs to pause its
00:24:47.600
own activities and go tell Ruby to pass the information over to it as to
00:24:53.720
what it needs; then it resumes its loop. It's a pretty elegant system. And then
00:25:00.039
the same thing works in the other direction: when Loudmouth needs to talk to Ruby, it does a very similar sort of
00:25:08.919
thing — it pushes that information out to Ruby, Ruby picks it up and signals back to
00:25:14.600
Loudmouth that it got it, and then Loudmouth can go on with what it's doing.
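The pipe handshake can be modeled with two plain Ruby threads and `IO.pipe`. This is a minimal stand-in for the Loudmouth/GLib thread and the Ruby/EventMachine thread — a sketch of the wake-up mechanism, not the actual binding code:

```ruby
ruby_to_lm_r, ruby_to_lm_w = IO.pipe   # Ruby -> Loudmouth direction
lm_to_ruby_r, lm_to_ruby_w = IO.pipe   # Loudmouth -> Ruby direction

lm_thread = Thread.new do
  request = ruby_to_lm_r.gets          # the "event loop" wakes on pipe activity
  # ... it would pause its own work here and service the request ...
  lm_to_ruby_w.puts("ack: #{request.chomp}")   # then answer and resume
end

ruby_to_lm_w.puts("send stanza")       # Ruby writes into the pipe
reply = lm_to_ruby_r.gets              # and is woken by the answer
lm_thread.join
```

Neither loop ever polls the other; each blocks only on its own pipe, so both sides stay responsive.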
00:25:21.039
So there are two threads running concurrently, but because Loudmouth is self-contained C — it's not
00:25:26.440
Ruby itself — there are no compatibility issues with 1.8.6; it
00:25:31.640
works fine with 1.8.6, and it should work fine with 1.9 as well — I haven't tried it there yet, but it should work
00:25:36.720
fine. And this was really nice because it eliminated that Ruby-GNOME dependency;
00:25:43.320
it did add an EventMachine dependency, but that's a lot lighter-weight, and the old model is still available in those Ruby
00:25:50.520
Loudmouth bindings. It's still pretty fast, the RAM usage is still low — it does
00:25:55.640
add about a megabyte to the footprint, but it's still pretty low — and it lets the agent code use EventMachine
00:26:01.520
for whatever else it needs without blocking anything. So it really worked out nicely. Then, putting
00:26:08.960
all that back together, coming back to that whole concurrency issue: now that Vertebra
00:26:17.360
employs EventMachine internally, EventMachine has some nice facilities
00:26:22.559
for spawning processes out and interfacing with them in an evented way:
00:26:27.840
you can do things that take a little bit of time without blocking anything in the process. And this is an example of
00:26:35.080
an action that does just that. What it does is spawn a du operation
00:26:42.159
out on the disk, and while that spins, the reactor goes back about its
00:26:48.159
business; then periodically this little bit of the action is checked, just to see whether that
00:26:56.279
external process has closed or not. If it hasn't closed, it goes back to sleep for a while and comes back to
00:27:01.440
check again, and when it has finally closed, the data is
00:27:06.640
returned.
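An EventMachine-free model of that spawn-and-poll pattern looks like the following. The helper name is made up, a plain loop stands in for the reactor's periodic check, and any command works in place of du:

```ruby
require 'io/wait'

# Start an external command, poll it at intervals instead of blocking on it,
# and collect the output once the process has closed its end of the pipe.
def poll_command(argv, interval = 0.05)
  io = IO.popen(argv)
  output = +""
  loop do
    if io.wait_readable(interval)                    # data ready, or EOF?
      chunk = io.read_nonblock(4096, exception: false)
      break if chunk.nil?                            # EOF: process finished
      output << chunk if chunk.is_a?(String)
    end
    # on a timeout, the reactor would go do other work before checking again
  end
  io.close
  output
end
```

Each timeout of `wait_readable` is the moment where a real reactor would go back about its business before checking the process again.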
00:27:14.039
Actually, if I can, I'll go into the realm of a live demo here and show you a quick —
00:27:20.399
this is just a simple operation — if I didn't kill my stuff before I started
00:27:26.039
my talk. And maybe I killed it and closed the wrong shell. Okay, maybe I
00:27:31.120
shouldn't have gone to live code. I killed my demo before I started my
00:27:36.960
talk. So, yeah — any
00:27:42.960
questions? I have no idea how long I just took, because I forgot to turn on Mike's little timer
00:27:50.679
here. I've got you — you're in the lights; I couldn't see you there. 'Who's using Vertebra
00:27:56.080
right now?' Right now, we are starting to work on using it
00:28:01.799
internally at Engine Yard — we're working on some stuff to do DNS provisioning with it — and RightScale,
00:28:08.720
as I mentioned, has been putting quite a few resources into it at their end, to explore using it for something
00:28:16.519
— I don't know what. So far, that's all that I know about that's
00:28:22.279
really using it; we've had interest from some other groups. 'How do
00:28:31.279
you see the relationship between things like Vertebra and Nanite?'
00:28:36.559
Vertebra and Nanite are conceptually similar. I
00:28:43.120
personally haven't done a lot of work with Nanite, but they're
00:28:48.919
conceptually similar. Nanite uses a simpler, more bare-bones sort of
00:28:54.600
approach that doesn't have as much of the infrastructure here, but they're
00:29:00.279
conceptually similar items. I can't
00:29:08.760
see — I think that — oh, there's one over there. 'Is there like a demo app or something we can just unpack and play with?'
00:29:17.159
There's not, but there are some examples within the Vertebra source.
00:29:22.279
I've put a whole bunch of examples inside our specs, inside our tests, so you can go
00:29:29.440
to GitHub — right there, engineyard/vertebra —
00:29:34.840
actually, that should be vertebra-rb. Go to the Engine Yard area on GitHub and you'll find it, and
00:29:42.039
there are examples there. We released a 0.4 last week, and we have a 0.4.1 that fixes a
00:29:51.039
bunch of bugs; if it hasn't been pushed up while I've been here, it should be coming up any time now.
00:29:59.000
'What's the security model like — are you using PKI, public key?' No, there's no public key. On one
00:30:06.919
of those slides back there, one of those URL links — oh, there goes my timer, so I guess I'm about done; you're the
00:30:13.880
last one —
00:30:20.880
Herault manages all of the security information, and right now it
00:30:27.760
gets that information just via a configuration file. And then, because
00:30:33.080
ejabberd can encrypt all the communications via SSL everywhere, we rely on ejabberd's security
00:30:41.799
model, as far as the JIDs and the passwords there, to ensure that when an
00:30:48.080
agent receives a request from somebody, it actually came from who ejabberd is
00:30:54.440
saying it came from — when it goes to Herault to ask for
00:30:59.480
authorization. And since the timer went off, I guess I'll wrap up. Thank you very much.