July 19, 2006

Web Services Serialization Format

Linden Lab's upcoming web services format is simple, consistent, robust, readable, flexible, and covers all the information necessary to faithfully represent REST systems.
We will represent all data with two container types for structure and a set of atomic types serialized to XML.

Atomic types:
* undefined
* boolean
* integer
* real
* uuid
* string
* binary
* date
* uri

Containers:
* array
* map

Attributes and Data

Attributes are only used for encoding parser and formatting instructions. The data in the elements is always data.

Root Element

The root element is llsd. The root must have only one child element which can be any container or atomic type.

Atomic Types

Each atomic type represents one value with type information. An atomic does not have a name, but may have attributes to specify format or processing considerations for the parser. Consumers of atomics are encouraged to massage the data into the preferred native representation, but further serialization should honor the original type information if possible.

undefined

The undefined type is a placeholder to indicate something is there, but it has no value, and will be interpreted as a default value for any other atomic or container type at runtime.

Serialization example

<undef />

boolean

A true or false value.

Serialization examples

<!-- true -->
<boolean>1</boolean>
<boolean>true</boolean>
<!-- false -->
<boolean>0</boolean>
<boolean>false</boolean>
<boolean />

integer

A signed integer value with a representation of 64 bits.

Serialization examples

<integer>289343</integer>
<integer>-3</integer>
<integer /> <!-- zero -->

real

A 64 bit double as defined by IEEE.

Serialization examples

<real>-0.28334</real>
<real>2983287453.3848387</real>
<real /> <!-- exactly zero -->

uuid

A 128 bit unsigned integer.

Serialization examples

<uuid>d7f4aeca-88f1-42a1-b385-b9db18abb255</uuid>
<uuid /> <!-- null uuid '00000000-0000-0000-0000-000000000000' -->

string

A simple string of any character data which is intended to be human comprehensible.

Serialization examples

<string>The quick brown fox jumped over the lazy dog.</string>
<string>540943c1-7142-4fdd-996f-fc90ed5dd3fa</string>
<string /> <!-- empty string -->

binary data

A chunk of binary data. The serialization format is allowed to specify an encoding. Parsers must support base64 encoding. Parsers may support base16 and base85.

Serialization examples

<binary encoding="base64">cmFuZG9t</binary> <!-- base 64 encoded binary data -->
<binary>dGhlIHF1aWNrIGJyb3duIGZveA==</binary> <!-- base 64 encoded binary data is default -->
<binary /> <!-- empty binary blob -->
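
The decoding rules above can be sketched in a few lines of Python. This is only an illustration, not the actual parser: the function name decode_binary is made up for this example, and base85 is left out because the text does not say which base85 variant is meant.

```python
import base64
import xml.etree.ElementTree as ET


def decode_binary(xml_text):
    """Decode one serialized <binary> element into bytes (illustrative helper)."""
    elem = ET.fromstring(xml_text)
    # base64 is the default when no encoding attribute is given.
    encoding = elem.get("encoding", "base64")
    text = (elem.text or "").strip()
    if encoding == "base64":
        return base64.b64decode(text)
    if encoding == "base16":
        # Optional encoding; accept lower-case hex digits as well.
        return base64.b16decode(text, casefold=True)
    raise ValueError("unsupported binary encoding: " + encoding)
```

An empty element decodes to an empty byte string, which matches the "empty binary blob" case.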

date

A specific point in time. Intervals or relative dates are not supported. The serialization and parser only understand ISO-8601 numeric encoding in UTC. The time may be omitted which will be interpreted as midnight at the start of the day.

Serialization examples

<date>2006-02-01T14:29:53Z</date>
<date /> <!-- epoch -->
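
A minimal sketch of how a consumer might read these values, assuming that "epoch" means the Unix epoch (1970-01-01T00:00:00Z), which the text does not state explicitly. The name parse_date is hypothetical.

```python
from datetime import datetime, timezone

# Assumption: the empty <date /> element maps to the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_date(text):
    """Parse the text content of a <date> element into an aware UTC datetime."""
    text = (text or "").strip()
    if not text:
        return EPOCH
    # Full timestamp first, then the date-only form (midnight at start of day).
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError("not an ISO-8601 UTC date: " + text)
```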

uri

A link to an external resource. The data is expected to conform to [http://www.ietf.org/rfc/rfc2396.txt rfc 2396] for interpretation, meaning, serialization, and deserialization.

Serialization examples

<uri>http://sim956.agni.lindenlab.com:12035/runtime/agents</uri>
<uri /> <!-- an empty link -->

Containers

A container is a special data type which can contain any other data type, including other containers.

map

A map of key and value pairs where key ordering is unspecified and keys are unique. The key is always interpreted as a character string and any character string is acceptable. If there are any elements in the map, each is serialized as a key followed by an atomic or container value. For every key, there must be exactly one value. Well-formed and valid serialized maps may nevertheless contain non-unique keys. When such a map is deserialized, the implementation should choose one of the value objects, but that choice is not specified.

Serialization example

<map>
 <key>foo</key>
 <string>bar</string>
 <key>agent info</key>
 <map>
  <key>agent_id</key>
  <uuid>93c73b16-cd86-434d-8b4a-76e12eee950a</uuid>
  <key>name</key>
  <string>testtest tester</string>
 </map>
</map>

array

An ordered collection of data members. Any member can be any atomic or container type.

Serialization example

<array>
 <real>7343.0194</real>
 <array>
  <map>
   <key>offset</key>
   <integer>9847</integer>
  </map>
  <string>da boom</string>
 </array>
</array>
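
To make the container rules concrete, here is a rough Python sketch of a parser for a subset of the format. It is not the production parser: the function names are invented for this example, binary and date are omitted for brevity, and the choice that the last duplicate key wins is just one of the outcomes the map rules allow.

```python
import uuid
import xml.etree.ElementTree as ET


def parse_value(elem):
    """Convert one atomic or container element into a Python value."""
    tag = elem.tag
    text = elem.text or ""
    if tag == "undef":
        return None
    if tag == "boolean":
        return text.strip() in ("1", "true")   # empty element is false
    if tag == "integer":
        return int(text.strip() or "0")        # empty element is zero
    if tag == "real":
        return float(text.strip() or "0")      # empty element is exactly zero
    if tag == "uuid":
        stripped = text.strip()
        return uuid.UUID(stripped) if stripped else uuid.UUID(int=0)
    if tag == "string":
        return text                            # whitespace is preserved
    if tag == "uri":
        return text.strip()
    if tag == "array":
        return [parse_value(child) for child in elem]
    if tag == "map":
        children = list(elem)
        if len(children) % 2 != 0:
            raise ValueError("every key must be followed by one value")
        result = {}
        for key_elem, value_elem in zip(children[0::2], children[1::2]):
            if key_elem.tag != "key":
                raise ValueError("expected <key>, got <%s>" % key_elem.tag)
            # Duplicate keys: this sketch keeps the last value seen.
            result[key_elem.text or ""] = parse_value(value_elem)
        return result
    raise ValueError("unsupported element <%s>" % tag)


def parse_llsd(xml_text):
    """Parse a whole document: an <llsd> root with exactly one child."""
    root = ET.fromstring(xml_text)
    if root.tag != "llsd" or len(root) != 1:
        raise ValueError("root must be <llsd> with exactly one child")
    return parse_value(root[0])
```

Fed the array example above wrapped in an llsd root, this returns a nested Python list whose second member is a list holding a dict and a string.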

xml-llsd DTD

<!DOCTYPE llsd [
<!ELEMENT llsd (DATA)>
<!ELEMENT DATA (ATOMIC|map|array)>
<!ELEMENT ATOMIC (undef|boolean|integer|real|uuid|string|date|uri|binary)>
<!ELEMENT KEYDATA (key, DATA)>
<!ELEMENT key (#PCDATA)>
<!ELEMENT map (KEYDATA*)>
<!ELEMENT array (DATA*)>
<!ELEMENT undef EMPTY>
<!ELEMENT boolean (#PCDATA)>
<!ELEMENT integer (#PCDATA)>
<!ELEMENT real (#PCDATA)>
<!ELEMENT uuid (#PCDATA)>
<!ELEMENT string (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT uri (#PCDATA)>
<!ELEMENT binary (#PCDATA)>
<!ATTLIST string xml:space (default|preserve) 'preserve'>
<!ATTLIST binary encoding CDATA "base64">
]>

Example Output

<?xml version="1.0" encoding="UTF-8"?>
<llsd>
<map>
<key>region_id</key>
<uuid>67153d5b-3659-afb4-8510-adda2c034649</uuid>
<key>scale</key>
<string>one minute</string>
<key>simulator statistics</key>
<map>
<key>time dilation</key><real>0.9878624</real>
<key>sim fps</key><real>44.38898</real>
<key>physics fps</key><real>44.38906</real>
<key>agent updates per second</key><real>nan</real>
<key>lsl instructions per second</key><real>0</real>
<key>total task count</key><real>4</real>
<key>active task count</key><real>0</real>
<key>active script count</key><real>4</real>
<key>main agent count</key><real>0</real>
<key>child agent count</key><real>0</real>
<key>inbound packets per second</key><real>1.228283</real>
<key>outbound packets per second</key><real>1.277508</real>
<key>pending downloads</key><real>0</real>
<key>pending uploads</key><real>0.0001096525</real>
<key>frame ms</key><real>0.7757886</real>
<key>net ms</key><real>0.3152919</real>
<key>sim other ms</key><real>0.1826937</real>
<key>sim physics ms</key><real>0.04323055</real>
<key>agent ms</key><real>0.01599029</real>
<key>image ms</key><real>0.01865955</real>
<key>script ms</key><real>0.1338836</real>
</map>
</map>
</llsd>
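
Output like the above could be produced by a small serializer. The following Python sketch is an assumption about how one might be written, not the code that generated this example; format_value and format_llsd are hypothetical names, and binary, date, and uri values are not handled.

```python
import uuid
from xml.sax.saxutils import escape


def format_value(value):
    """Serialize one Python value as an atomic or container element."""
    if value is None:
        return "<undef />"
    if isinstance(value, bool):  # must be tested before int
        return "<boolean>%s</boolean>" % ("true" if value else "false")
    if isinstance(value, int):
        return "<integer>%d</integer>" % value
    if isinstance(value, float):
        return "<real>%r</real>" % value
    if isinstance(value, uuid.UUID):
        return "<uuid>%s</uuid>" % value
    if isinstance(value, str):
        return "<string>%s</string>" % escape(value)
    if isinstance(value, (list, tuple)):
        return "<array>%s</array>" % "".join(format_value(v) for v in value)
    if isinstance(value, dict):
        parts = []
        for key, item in value.items():
            # Keys are always written as character strings.
            parts.append("<key>%s</key>%s" % (escape(str(key)), format_value(item)))
        return "<map>%s</map>" % "".join(parts)
    raise TypeError("cannot serialize %r" % type(value))


def format_llsd(value):
    """Wrap a single value in the llsd root element."""
    return '<?xml version="1.0" encoding="UTF-8"?><llsd>%s</llsd>' % format_value(value)
```

For instance, format_value({"sim fps": 44.38898}) yields a map holding one key and one real, in the same shape as the statistics above.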

Guidelines

XML Encoding

When possible, prefer using us-ascii or UTF-8 XML encoding.

\P/hoenix

May 28, 2006

Second Life IPC

Other than back-end asset storage and lsl generated calls, the interprocess communication in Second Life falls into one of three broad categories: the low latency messaging system over UDP with application state and packet reliability, remote procedure calls over HTTP in either XMLRPC or our own structured data RPC grammar, and a growing collection of representational state transfers of structured data.

The low latency datagram protocol

Nearly every process used in the operation of Second Life communicates using the low latency messaging system. The message system's primary characteristic is the packet layout which is specified in a template file to facilitate packing extremely compact messages. The message class is represented in 1 to 4 bytes, blocks of related data are specified as a fixed or variable count, and the data fields are typed information blocks of fixed or variable length.
This templatization allows for a fairly easy programmer API for compactly communicating data, but requires that all processes agree on exact encoding of the content in order for it to be decoded. Every time a new feature is added which originates on the client with effects in world, new data which needs to be sent to the client, or information a simulator needs to broadcast to its neighbors, the message template will likely need a change, which will require grid downtime. When an optional viewer update is provided, this is a feature or bug fix which required no template revision.
This scheme is not scalable in the long term. Second Life is growing to the point where multiple simulator and client versions will need to be running simultaneously on the grid while maintaining communications. Since a new messaging protocol is necessary, no one is eager to release a clear and complete protocol specification.
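
To illustrate why both ends must share the template, here is a toy Python sketch of template-driven packing. Everything in it is hypothetical: the template contents, the field type names "U32" and "Variable1", the one-byte message number, and the byte order are invented for the example and are not the real Second Life wire format.

```python
import struct

# Hypothetical template: message number 1, with one variable-count block whose
# entries hold a fixed-length "id" field and a variable-length "name" field.
TEMPLATE = {"number": 1, "fields": [("id", "U32"), ("name", "Variable1")]}


def pack_message(template, entries):
    """Pack entries using only the layout in the template (no field names on the wire)."""
    out = bytearray([template["number"], len(entries)])  # message number, block count
    for entry in entries:
        for name, kind in template["fields"]:
            if kind == "U32":
                out += struct.pack("<I", entry[name])
            elif kind == "Variable1":
                data = entry[name].encode("utf-8")
                out += bytes([len(data)]) + data           # one-byte length prefix
            else:
                raise ValueError("unknown field type: " + kind)
    return bytes(out)


def unpack_message(template, payload):
    """Decode a payload; this only works if the same template is used."""
    if payload[0] != template["number"]:
        raise ValueError("message number does not match template")
    count, offset, entries = payload[1], 2, []
    for _ in range(count):
        entry = {}
        for name, kind in template["fields"]:
            if kind == "U32":
                (entry[name],) = struct.unpack_from("<I", payload, offset)
                offset += 4
            elif kind == "Variable1":
                length = payload[offset]
                entry[name] = payload[offset + 1:offset + 1 + length].decode("utf-8")
                offset += 1 + length
            else:
                raise ValueError("unknown field type: " + kind)
        entries.append(entry)
    return entries
```

The payload carries no names or type tags, which is what keeps it compact and also why any change to the template breaks every process still running the old one.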

Remote procedure calls

The first remote procedure call we implemented for Second Life was the login script which accepts an XMLRPC method call from the client and generates a session for the resident. We were pleased with the results of this design since it increased flexibility, testability, and integration with third party languages, tools, and libraries.
The branch of development that was eventually released as 1.9 introduced the concept of structured data, embedded a web-server in the simulators, and started using an XMLRPC implemented in the code as a structured data grammar for remote procedure calls. So far, all agent region changes, instant messages, online notification, and a few other miscellaneous services are implemented as RPC mechanisms. This has worked well, but we are still worried about continued scalability since RPC is a reasonable way to request an operation, but not a great way to make a system query since the result cannot be easily cached.
This lack of easy caching has a price. Today, 20% of the central database CPU load is agent online status queries. We are currently replacing that data pipeline with a web based cache that should drastically reduce the load on the database, and once a few more queries are updated, that load will be eliminated.
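
The caching idea can be sketched as a small time-limited cache keyed by resource URL. This is an illustration only, not our web cache: the class name, the fetch callback, and the expiry policy are all assumptions made for the example.

```python
import time


class CachedResource:
    """Cache query results by URL for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # callable that performs the real query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._entries = {}         # url -> (time stored, value)

    def get(self, url):
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # still fresh: the backend is not touched
        value = self._fetch(url)
        self._entries[url] = (now, value)
        return value
```

Because the query is identified entirely by its URL, repeated requests for the same agent's status within the expiry window never reach the database.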

Representational state transfer of structured data

The agent presence queries and many upcoming features are all based on REST concepts using an XML serialization of structured data. This technique is going to be the focus of new inter-process communication work in Second Life, and will eventually be how all Second Life clients communicate with the grid.

\P/hoenix

March 29, 2006

Fault tolerant != Uncrashable

Some information about the system outage last week:

Last Thursday, around 1:30 AM SLT, the asset server crashed.  The asset server is an essential component of the SL cluster: first the residents noticed slowdowns and missing assets (textures, avatar appearance, etc.), then the entire grid had to be taken offline.  It was a long and painful night for the on-call responders.

The asset server is on a fault tolerant distributed filesystem.  On one hand, this makes the Thursday crash pretty mysterious.  We're not sure exactly why it went down.  On the other hand, the asset server's never crashed like this before, so it's been doing a fairly good job at surviving disk failures and the like.

We're still working on what caused the crash and how to prevent it from happening again.  Going forwards, we're also considering different configurations for system-critical data storage.

~~beez Linden

March 26, 2006

GDC 2006

As the years have passed I have been reaping diminishing returns from the lectures at the Game Developers Conference, so I spent most of my time there in roundtables. This offered a great opportunity to talk to the more interesting GDC participants with similar interests.

Wednesday

The Sex in Games roundtable was hosted by Brenda Brathwaite, who sometimes laments her new-found position as the 'woman who talks about sex in games.' One new thing I discovered in this roundtable is that everyone already knows Second Life is where it's at for sexual play in an online environment.
The most talked about online erotica environment I heard about is Naughty America, which provides a stylized place for adults to mingle.
At one point, I was put on the spot to provide a good idea for something new in online erotic entertainment. The best idea I could come up with was to provide a cross-world matchmaking service; however, it looks like sexingames.net is already targeting this kind of service.
The most interesting idea I heard was the rapture.net (sorry, I could not find a good URL) which I hear is positioning itself as a P2P multiplayer erotica online experience with persistent characters and toys instead of worlds. It would be cheaper to run than Second Life, and certainly has some interesting angles on charging for P2P content exchanges.

I also attended a historical perspective on sex in video games, mostly presented by Brenda again with Kyle Machulis providing an interesting meatspace demo at the end. Brenda broke down the types, some of the controversy, as well as the issues. Second Life was continually credited with holding the cutting edge on adult content.
One of the recurring themes in producing adult content in games is appropriate diversity, storytelling, and finding what turns people on. I believe that Second Life already has the diversity, and I invite residents to think of clever new ways to develop roles for people to play in the context of Second Life to provide a story and something hot.

The whole day reminded me of the famous Avenue Q song.


Thursday

I attended a roundtable on building the metaverse roadmap hosted by Bridget Agabra Goldstein and Jerry Paffendorf. Again, I was struck by how secretly popular Second Life has become outside these walls. Jerry and crew are trying to build a cross-business roadmap for where virtual worlds will be going in the next 5, 10, and 15 year timeframe. Check out their website for more details on their goals.
During the discussion, I kept thinking it would be useful to have some way to snapshot your own personal worldview, ala Google Zeitgeist, over time and analyze how that changes.

I also attended the Security and Privacy in Games roundtable, which was not nearly as interesting as the moderator, Elonka.

I also sat in on a talk on building highly physical game worlds hosted by Sean Blanchfield from Demonware. The whole talk reeked of an attempt from networking middleware to sniff out how to sensibly combine physics and the network, without any participants from the physics middleware providers.

\P/hoenix

February 09, 2006

Anatomy of a preview

I have been working on one of the larger branches of the code-base for a while now to deliver improved scalability, reliability, and resistance to attack. Most of the changes should be well below the radar of detectable changes other than some new improved functionality where agent bans work during region crossing. It just so happens that the estate tools and follow-cam were finished up around the same time, so we are going to preview them together.

This is a handy coincidence since things like follow-cam will pull some of the vehicle script writers into this preview and potentially reveal some of the harder to find corner cases that only appear under load. Though a significant amount of time has been spent trying to put together a test suite for testing load - nothing really imitates the runtime characteristic of the primary grid - agni.

For a full preview test, we usually pull a recent snapshot of the primary grid and import it into one of the two public test grids - siva and durga. That import step takes around 3 hours and does not include assets or inventory. The inventory is pulled from agni during the first login to the test grid and because of some of the mismatch between the snapshot and the inventory pull, the avatar always looks like a mashup of different AVs you've worn. We have a clever way of wiring the assets to get and cache assets that would normally come back 404 from the upstream asset server. That way, the asset server on a test grid can be tiny in comparison to the rack of isilons holding the primary asset array.
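
One way to read the asset wiring described above is as a read-through cache: a local miss, which would otherwise be a 404, is filled from the upstream server and kept. The Python sketch below is a guess at that shape, not the actual implementation; the class name and the fetch_upstream callback are hypothetical.

```python
class ReadThroughAssetStore:
    """Small local asset store that fills misses from an upstream server."""

    def __init__(self, fetch_upstream):
        self._local = {}                       # asset id -> asset bytes
        self._fetch_upstream = fetch_upstream  # callable; raises KeyError if missing

    def get(self, asset_id):
        if asset_id in self._local:
            return self._local[asset_id]
        # Local miss: instead of answering "not found", ask upstream and keep a copy.
        data = self._fetch_upstream(asset_id)
        self._local[asset_id] = data
        return data
```

Only assets that someone actually requests on the test grid end up stored locally, which is why the test grid's asset server can stay small.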

After the import, we have to tailor the world to fit the test. This time, we enabled several mainland regions as well as many private islands to test the new estate tools. Occasionally, we give residents a bunch of money or some free land; however, since the other day's deploy included a brand new land and linden purchasing UI widget which has not made it into this branch, we will probably not gift any land credits.

The deploy itself is a combination of tar, gzip, rsync, and dsh to get the entire linden development binaries out to every host. We try as much as possible to make every host exactly the same. This release vastly reduces the dependence on the userserver and its mutant half-sister the spaceserver. Once those are finally abolished, the only thing not part of the deploy will be the inventory, assets, and central database. A completely homogeneous grid vastly reduces operational cost, but is still expensive in developer time.

Once that's ready, and the QA staff has had a chance to approve basic functionality, the preview viewer binaries are pushed to the website. It's important to get a lot of people into the preview if possible, so timing is important. This time, we just did not resolve all of the bugs prior to yesterday's deploy, so we can't lure people into preview because agni is down.

See you in preview.

\P/hoenix

January 13, 2005

Preparing Inventory...

From December 29 - January 6, we had all sorts of login problems and I know a lot of people have wanted a detailed explanation.  Our explanation of "drive failure" IS true, but doesn't really tell the whole story.

On Dec 29, a hard disk in the central login database (a raid 5) died. This isn't normally a problem (it's happened before), but it does seriously degrade the read performance of the database.  Generally speaking, we'd have enough spare capacity for that not to be a problem, but this time we didn't - we'd been planning to replace that machine with a bigger one in a matter of days.  So, what happens when the database can't keep up with load?  In this case, lots of bad stuff - no one could login, some inventory transactions didn't get saved, other things as well.

For extra excitement, last time we swapped a spare disk into this particular model of RAID controller, it crashed and wiped out the entire database, so simply swapping out the bad disk while the system was running seemed like a bad idea.  For even more excitement, the primary replicant of this database, with hardware identical to the master, had recently failed catastrophically and we were still waiting for the replacement.  So, we decided to switch over to the secondary replicant, which was different hardware but *should* have been fast enough.

It wasn't.  That is to say, it was better than the degraded machine, and it kept up with the running grid well enough, but people still had a lot of trouble logging in.  Now this was a real head-scratcher: it made sense that login was slow (querying everyone's entire inventory every time they logged in seemed like a good idea, back when people had dozens of inventory items, not thousands as they often do today), but why did it fail entirely?  Surely there was a bug.  The hunt was on.

In fact, there were three.  One (found and fixed Dec 30 evening): the web server which runs the login CGI had a deadlock condition which, when logins got slow, would cause the majority to fail.  Two (found and fixed Dec 31 afternoon): when the database was slow, it was possible for the userserver to get behind processing login requests, so once you actually had your inventory, it didn't respond in time and the client gave up.  Three (found and fixed Dec 31 evening): if the login CGI took more than about 2 minutes, the connection to the userserver died, so again, once you had your inventory, you couldn’t talk to the userserver at all and your login soon timed out.

Once we got that stuff fixed, people could login reasonably, if not perfectly, reliably.  So, we decided to go ahead with the planned hardware upgrade, and on Jan 6 we switched to newer, more powerful, more reliable hardware.  Since then login has usually been pretty fast.

Why didn't we know about these problems in the login process ahead of time?  It's very difficult to test the behavior of the system under full user-driven load outside of production, and our attention has been focused squarely on large-scale login improvements for 1.6.  With that release, we're removing the monolithic inventory download at login, and replacing it with folder-by-folder, on demand download and caching. People with 30,000 inventory items should see a huge improvement, and login delays should become a thing of the past.

- Ian Linden

December 03, 2004

Twiddling my thumbs

So this is one of the problems with working with big chunks of data. It takes time to slosh them around. Right now, I'm waiting for a big mysqldump to get pushed over our DS3 to where I can use it to garbage collect our asset system. Normally, this would be done over the LAN, but our database machine is really loaded, so we took the mysqldump from an innobackup which we made earlier from a replicated mysql system that we were using for mysql testing, and that system is here in our office rather than at the colo.

I'm also waiting on the mysql backup to restore on a system which I'll set up as another replication host. It should be fast enough for us to do more than just accept replication like the other system does. It should be able to handle all the queries that Philip and Ian and everybody else does to gather stats on usage, performance, the economy, etc. in addition to the normal load of replication. So hopefully, Philip will be able to do more speculative queries and find out more interesting information to tell everybody about after I get that running.

Thus, I'm here, twiddling my thumbs, waiting for these things to get done. Rather than do something else and get distracted, I figured I'd do something short and post here as to what was going on. So excuse my blathering, and I hope you find it entertaining!

The biggest problem we have with mysql is that it requires TONS of disk I/O. Thus, we've been testing out a bunch of interesting pieces of hardware that will help extend the life of our database system as it is right now. Hopefully this will give us enough time to re-architect the system so that it won't be a centralized resource, and thus will scale much better. We've got BayDel, NetApp, and Apple FibreChannel disk arrays in-house and are pounding away at them to see which one wins. It's been interesting. This is the first time I've seen PC hardware push this kind of bandwidth to disk. It's pretty cool.

Recently, Ian and I went to LISA, the Large Installation System Administration conference in Atlanta. It was filled full of a bunch of geeks, and was thus an interesting place to hang out (for me, at least, since I'm a rather unrepentant geek). I attended some good talks about configuration management, and it left me inspired. Sadly, I've been engrossed in working on making our asset servers more reliable, testing the DB storage hardware, dealing with dead hardware, and building out a new mysql replication host, so I haven't been able to do much with that inspiration. Someday... Someday!!! It's always interesting to see what other people are doing, and there are some very interesting things going on amongst the LISA crowd. You just have to sit around and eavesdrop a bit to hear about what Genentech or CNN or UW are doing with some giant cluster or group of users. It also helps that a lot of people I know go there, so they always can introduce me to new people who are doing interesting things.

In my spare time, I've been working on trying to tie together our corporate infrastructure too. Right now, we have basically zero centralized authentication. The closest thing we have is a passwd file that is replicated out to every linux box every night. This works reasonably well, but we really need to get some better system going so that we can add and delete users/printers/groups/etc in a central place rather than having to add people to all the different systems we use: samba, linux, the email system, etc. I've been playing with an Apple XServe, running OS X Server, which has this thing called Open Directory. Open Directory can authenticate windows machines by pretending to be an Active Directory server with roaming profiles and printers and whatnot, OS X machines by doing its default Open Directory thing (basically LDAP+kerberos), and linux machines using pam_ldap and (optionally) kerberos. I'm hoping it will be able to tie together all of our authentication systems into one swell system so that we just have one password for everything, and one place where we can change it. So far, it's looking pretty good, but I've not been able to play with it all that much.

Well, it looks like things are getting close to finished here, so I will sign off. I hope you were entertained. Have fun!

-Hungry Linden

October 28, 2004

Ding dong the mail system is dead. Long live the new mail system.

The old mail system was one of those "crap, I need something right now, but it's okay, because I'll have time to fix it later" kind of affairs. Effectively, each sim node would relay outbound email through a central mail host, which would then handle external delivery. As you would expect, this has some scaling problems, pretty much all of which have been experienced in the last 2 weeks in Second Life.

The new system appears to work much better. Each host in our cluster (with a couple of exceptions) handles its own outbound email now. This means that while it's possible for a single region to have slow email response (due to hostile or poorly coded scripts), that region won't affect any other region. This is a Good Thing(tm).

We're using Postfix for our SMTP software; I'm rather fond of it, actually... even when it's under extremely heavy load, it doesn't fall over or spiral out of control and take down the server it's running on. The version we're using is old enough to have some bugs with large mail queues which aggravated the problem, which was annoying, but shouldn't be a problem in the new configuration. We'll be switching to a nice shiny new version that's much faster as we upgrade hardware to Debian `sarge', anyway, which will easily be able to handle the email load from a single region.

Sounds simple enough, but testing and verifying that this change wouldn't break any existing scripts or any of our code took a while, and was a rather tedious process. I'm happy it's deployed; now, on to the next problem!

October 21, 2004

When daemons attack!

Hello!

Most of you probably don't know me. I'm the newest member of the Ops team, and as such, my job is to keep things running, design and build the backend systems, networks, and security that support both our grids and our company, and try to take some of the hair-pulling out of the struggle that we all deal with when working with computers. I have some experience in working with these sorts of systems, having built There's clusters, and working at Inktomi with their web search farm, but I'm still learning how everything works here. Let me be the first to say that we have a _very_ smart group of people here in Ops and Development and the rest of Linden Lab, and I'm excited to join everybody here!

Yesterday morning, I had the pleasure of experiencing my first cluster meltdown while on gridmonkey duty. I was paged twice, once for a swap space problem at 1:30am on one of our critical central servers (the spaceserver), and then a page that all of our sims were down at 6:30am. Yikes! It looks like the spaceserver got hit by a similar problem to what happened at 1:30am, but this time, the sims weren't able to talk to it because it was way more bogged down than it was earlier. And when the sims can't talk to their spaceserver, they crash.

Not only did the grid crash, but it crashed many times in a row, generating a giant flood of email, since we actually collect information every time a sim crashes and send it in email to all of us so that we can figure out the problem and fix it. This caused even more excitement, as our mailserver got crushed under the load of incoming email. I got paged awake again, and began trying to figure out what was going on. Luckily, some other folks were awake as well, and they know more about the internals of the system than I do, so we were able to work our way through all of the problems that presented themselves:

- throttling mail relays
- too many mysql db connections
- some sim state save failures
- a bug in one of our tools that uploads sim states

Luckily, most of these were reasonably quick to identify and work around, but it took a while to run everything that needed to be run to get the grid up again. And we came up with some good ideas that should help mitigate or prevent future problems like this. So hopefully, we won't have any more outages like we did yesterday. In any case, this is a rather new and unusual failure for us, and as Philip has said, we're working on getting rid of single points of failures like this in our systems, so as time goes on, our exposure to these kinds of problems will be even less and less.

Anyways, I thought you all might like an update from the backstage crew of Linden Lab, to hear our struggles with hardware and software, and to hopefully feel some sympathy for us and our efforts to keep everything going for you folks. Have fun!

-Hungry Linden

October 13, 2004

An asset server is a terrible thing to have burst into flames..

It seems appropriate, somehow, that the first post into the Ops blog should be in reference to hardware going up in smoke.

Ops has been working on getting new, more powerful asset servers online to solve the various and sundry problems that Second Life has been having involving the asset server being very slow under certain loads. Sadly, this is a somewhat involved process: our new asset servers have shiny new RAID cards that we didn't already have a driver for (the state of 3rd party Linux drivers being a topic for a separate post), and so there was a _lot_ of make kernel... wait... boot.. swear... make again... wait... boot... swear louder... etc.

After spending most of yesterday on that, I was in the colo this morning prepping one of the new servers and installing new RAID hardware into one of our existing (but currently unused) servers. No big deal, totally normal ops work. The hardware install seemed to be going fine. Then I turned the box back on.

At that point, not one, not two, but all three of the power supplies started spewing smoke. Which, of course, caused the smoke alarm to go off.

I've never seen the facilities folks move that fast before.

That pretty much set the tone for today, which then consisted of figuring out how to bootstrap our shiny new RAID controllers without being able to get a driver that worked for our shiny automated installation system.

It's not all bad news, though. The new asset server hardware is installed, and Debian is running on it. Just a bit more configuration and testing, and then hopefully the asset woes will be a thing of the past.