Concepting a new IPC system

Alberto Mardegan

22 gennaio 2011

This is going to be a long post, and probably obscure for anyone not familiar with D-Bus. When I started thinking about this idea, I decided that I wouldn't write about it until I had some code ready; but a recent thread in the D-Bus mailing list urged me to publish my ideas, in the hope that some people might be able to see through all this vapourware and possibly (but here maybe I'm hoping too much) encourage my employer to let me work on this as a R&D project. It's about an IPC system with somehow similar functionality as D-Bus, but whose main describing features would be:

low latency
local only
message data passing happens via shared memory
simple
lock-free
match rules are tags lists
peer-to-peer, with central server acting as a baby sitter :-)
client API as compatible with libdbus as possible

Well, if any of the point above captured your curiosity, read on! I hope it won't be too disappointing at the end. Oh, before we continue, let's give it a name: let's temporarily call it T — once (and if ever) this project sees the light we'll surely find a better name. So, let's walk through some of the concepts, and hope not to lose too many readers along the way. :-)

The big picture

Every client is connected to the T-daemon via socket. The T-daemon maintains the list of the registered service names, monitors their lifetime, maintains the match rules, and very little more. It doesn't come into the picture when a message is passed between two clients, not even when the message has multiple recipients. All the information about the active services which is needed to route a message, such as the list of services, their address and the match rules, all these things reside in shared memory blocks maintained by the T-daemon. The service name concept from D-Bus lives in T as well, almost unchanged: service names are unique across the net, a client can register any number of them at any time, and their appearing and disappearing from the net is advertised by the T-daemon with messages to all interested clients (but in a slightly finer way than D-Bus does). Service names are typically used to route method calls. Normally, messages are created on shared memory blocks, and a very short message containing the location of the block is sent into the receiver(s)'s FIFO (or alternate transport chosen by the receiver, such as a POSIX message queue). Messages are tagged with a list of labels, specified by the sender. On the other hand, clients register match rules to the T-daemon, specifying what labels they are interested in (or better said, one client can specify multiple sets of labels). A match happens if one set of labels defined in a rule is a subset of the message's labels. Share memory areas used for message data passing are not reference counted, and might not always be explicitly freed by the clients; instead, the T-daemon runs a garbage collector which will unlink the oldest ones.

If all this seems cloudy, hairy and dark at the same time, it's because I described it that way. :-) In the next paragraphs I'll try to cover all these concepts in greater detail, at least to the level of detail that got me positive about T's feasibility. But it's very likely that I'll forget writing about some issues which I considered long time ago; so, in case there are some parts of this idea which seem unfeasible to you, please let me know in a comment.

Tags and match rules

A tag, or label, is a string. A match rule is an unordered list of tags, plus an optional sender. If all the tags listed in a match rule are also listed in one message, then the message is delivered to the client which registered that rule. For the sake of my fellow D-Bus experts readers, here is how I would convert a D-Bus match rule into a T match rule: "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='Foo',path='/bar/foo',destination=':452345.34',arg2='bar'" would be a T match rule listing these tags (one tag per line): type='signal' interface='org.freedesktop.DBus' member='Foo' path='/bar/foo' destination=':452345.34' arg2='bar'
and having org.freedesktop.DBus as sender.

This way of handling match rules is simpler and more powerful than D-Bus': for instance, with D-Bus if you want to send a message to clients listening on interface A or on interface B, you need to send the same message twice, once per interface (which is what I had to do in libaccounts-glib, for instance). But with T, you could just tag your message with both A and B, and get it delivered to both clients at once. This taglist form of match rules would also be able to cover the functionality offered by the argX and argXpath components of the D-Bus matches, which are uh, a bit hackish. :-) Moreover, the tagged approach would allow for more fine-grained matches than what D-Bus currently provides with its NameOwnerChanged signal; to explicitly mention a case which forces D-Bus clients to broadly listen to this signal, I recall the Telepathy ChannelDispatcher service, which needs to keep track of service names matching org.freedesktop.Telepathy.Client.* — and not being able to install such a rule, is being waken up whenever NameOwnerChanged is emitted (at least until argXprefix becomes available). In T, one could just install a match rules having the labels T.NewService and org.freedesktop.Telepathy.Client, and be done with it (more about this case in the paragraph "Registering a service name").

Messaging protocol

Clients in the T network exchange messages in a p2p way. The sender creates a shared memory area and writes the contents of the message into it. Once it's done, it sends the handle of this area to the destination(s), by using the transport mechanism which the destination itself has requested when registering the service or match rule. In a humanly readable way, we could imagine the T-address of a client to be fifo://my_fifo or mq://my_message_queue (or something else). The receiver reads the shared memory handle from the input channel and mmaps the area in read-only mode, and unmaps it when it doesn't need it anymore. The memory area is unlinked only at a later stage, by the garbage collector running in the T-daemon: the handle of the memory area has been written by the sender into another memory page which the T-daemon handled to the client when the client first entered the T network. So, the T-daemon knows about all memory areas created by any client, and can go through them and delete the old ones (the minimum validity time might even be chosen by the sender).

To determine what destinations should be reached by a message, a logic like this might be applied:

if the message has a destination name set on it (like in a D-Bus method), the message is delivered to it; all destinations, along with their T-address, are found on a shared memory block maintained by the T-daemon.
if the message has a sender name set on it (like in a D-Bus signal), the sending client reads the table of match rules set for this sender name — which, needless to say, resides on a shared memory area named after the service name and maintained by the T-daemon — and computes the list of destination addresses.
the sending client reads the table of generic match rules: this is another table maintained by the T-daemon, consisting of all match rules which do not have a sender set.

So, in a T net with N registered service names, we have N+1 shared memory objects holding match rules: one for each service, plus one for the rules which are not bound to a service. When sending a message, the list of the recipients is composed by the union of the recipients found according to the three rules above (so, in the worst case, three tables have to be read).

Registering to T

A T client enters the T net in a similar way as it enters D-Bus: it connects to the local socket of the T-daemon and after some pleasant handshaking it is given the handles to the following shared memory areas:

the table holding the list of the connected clients, with the service names they registered and their T-address.
the table holding the match rules which are not bound to a service.
an empty table where the client will write the handles of the shared memory areas it creates for the IPC messages.

The client is also registered into the connected clients table, with a unique service name assigned to it. The client keeps the socket connection to the T-daemon open as long as it intends to be part of the network. This connection is used to register match rules (only the T-daemon is allowed to write those shared memory tables) and service names.

Registering a service name

Clients can register any number of service names, at any time, by issuing a request to the T-daemon. The request will contain the desired service name (which must be unique), the T-address which will listen to incoming messages, and a list of labels. If the T-daemon accepts the request, it will prepare a message containing all these labels, plus a predefined label whose meaning is “a new service has been registered” and will send it to all the clients which have setup a matching rule. The same labels (plus another predefined one) will be set on a similar message which will be send when the service is unregistered.

In addition to the above, the T-daemon will tell the client the handle of the shared memory area where the T-daemon will write the match rules that the client must read when emitting signals from that service name.

Lock-free shared memory handling

During its lifetime, a given shared memory area can have many readers, but only one writer. All the match rule tables and the active clients table are maintained by the T-daemon, which is the only process allowed to modify them. All records are fixed size, and have one field which tells whether they are empty, filled or deleted. Records can only be added or marked for deletion, but never reused. When the table needs to be compacted, the T-daemon creates a new shared memory area with a table holding only the valid records from the original table and, once the new table is ready, its handle is written into a special header of the old table, and after this the old table is flagged as no longer valid, and unlinked. So, when clients need to read data from one table, they would follow this logic:

Check if the table is marked for deletion. If it is, get the handle of the new table and repeat this step; if the next one does not exist, get the new address by directly asking the T-daemon
Find your data, ignoring all records which are not filled or which are marked for deletion.
(under consideration — I don't think this is necessary) After reading the data, check if the record and the table are still valid. If not, go back to step 1.

To keep this efficient and to reduce the memory waste, string data will be accessed by index, with the actual data residing on another shared memory area. For instance, the labels and the service names will stay in their own tables, consisting of a column with the integer hash, one column with the unique numeric ID of the string (which will remain constant over time) and one column with the actual string (a maximum length for labels and service names will be enforced). So, clients which need to refer to string (to tag a message, for instance) will always use the unique integer ID. The concept is the same as glib quarks, except that it's living in shared memory. Match rules would also have a limit on the number of labels, so that their record can be of a fixed size.

The most delicate case is about the shared memory areas into where clients store the handles of the memory areas they allocated for their IPC messages. These tables are maintained by the clients (one table per client, typically), and the T-daemon accesses them with the steps described above. In addition to the status flag and the data (the T-handle of the resource), each record would have a field holding its expiry time (computed from the monotonic clock). The T-daemon GC will ignore records which haven't been expired yet; for the expired ones, it will unlink the shared memory area and inform the client (via the socket connection) of the deleted records, so that the client can update the table. But this is just an option: I didn't fully decide on how to implement this GC accounting. Note that a Linux implementation might simplify things here, and have the T-daemon directly inspect /dev/shm/ to find out the stale memory areas and unlink them, without the need for the clients to bookkeep them.

Performance considerations

Here's a few points I thought worth considering when trying to evaluate the possible performance of a T implementation.

Number of open file descriptors

The first concern you may have is the number of file descriptors used in the T network. First of all, we can forget about the descriptors of the shared memory areas, because they can safely be closed as soon as the memory area has been mmap-ed. Secondly, the number of file descriptors in a net consisting of N clients is a value directly proportional to N (because every client has a socket connection to the T-daemon), plus all the FIFOs. In the typical case, every client will just create one FIFO for its incoming channel, and other clients can write to it (though of course the client could open several different incoming channels for different match rules — but I'm not even sure I want to expose this via the client API). In the worst case, at a certain time we might have that the number of open descriptors is O(N²), which doesn't sound that good. This can be reduced back to O(N), if clients don't keep their communication channels open after having sent a message. One reasonable compromise would be to always close the descriptor after having sent a message except in the case where the message had a destination name set (that is the analogue of a D-Bus method call); this file descriptor could be kept open until we are sending a message to a different destination. If the client API will offer a concept of “proxy to a remote service”, similar to the glib DBusGProxy, then maybe the descriptors serving the T-address of the proxy could be kept open as long as the proxy object is alive.

Transport for IPC communication

For the IPC signaling, I've already mentioned the T-address: this could be a POSIX FIFO, message queue, UNIX domain socket or any other transport which can be used to deliver a hundred bytes to another process. Even though I didn't research this deeply, I would expect the POSIX FIFO to be the simplest to use, so it could be the default. POSIX message queues are said to be faster and provide lower latency, but on the other hand they seem to have some limits imposed on their availability (in Linux, the default number of queues is 256, and this is system-wide). One nice possibility is to use message queues only for processes which need lower latency: so, for instance, an implementation of Telepathy on a mobile phone could request the phone UI, the telepathy-mission-control and the SIM connection manager to use a message queue as their T-address, to achieve a better responsiveness on cellular call handling.

Number of match rules

Match rules are not all stored in a single table, but for every service name there is a set of rules which other clients have installed for it (and there's one table for the the rules without service name, which would only be used by programs like dbus-monitor); I wouldn't expect to find many items in these tables, so they could be read linearly. Moreover, we might add to the client API some flag to allow clients to request that a message be sent to its destination only, ignoring all sniffers.

Memory usage

The data of messages exchanged in the T network reside on shared memory blocks, with one block per message. I considered the possibility of using an allocator to optimize the usage of the shared memory, so that the same block could be segmented and used for different messages, but at least for the initial implementation I'd leave this out — it would introduce some complexity, for uncertain benefits.

Another point of concern is related to the garbage collector: if messages expire only after a timeout which could be of the order of several seconds (D-Bus default timeout is 30 seconds, IIRC, and something similar would probably be expected from T), one could easily get into the situation where processes are creating lots of messages in a rapid succession, which would be deleted only after several seconds. But there are several possible solutions for this, or at least to limit the damage: all the messages delivered to only one recipient (that is, most of the messages in a typical D-Bus session) could be marked with a special flag that tells the recipient that no one else needs access to the memory handle, so that it can unlink it as soon as it has opened it. And if a message is delivered to more than one recipient, it could be delivered to the T-daemon as well, which could remove the handle after a shorter timeout than the actual expiration.

If and when an implementation of T will be ready, we'll be able to make more considerations and optimizations. For instance, for smaller messages addressed to only one recipient it might be more convenient to pipe them directly into the transport, without passing the shared memory handle.

Security

Or, I'd rather say, unsecurity. Exposing all the data, as well as the routing tables, in shared memory is problematic. I'm not concerned about buggy clients: the T client library will be careful enough to open the shared memory areas in read-only mode, so that a process cannot corrupt other processes' data. But anyone could write a malicious client which could do much more harm than a malicious client in D-Bus. This can be mitigated by running the T-daemon as a different user, so that the shared memory areas containing the client list, the tags and the match rules can be trusted. But for data exchanged among clients on shared memory segments, I don't see a clear solution (unless clients also run as all different users, but that's quite impractical). One thing that I miss is the possibility for a process to create a file that only this process can write, while all other processes (running under the same user) can only read. This is because the process which created a file isn't any better, in the “eyes” of the Linux kernel, than any other process running with the same user and group. Even if the creator chmods the file to read-only, other processes can chmod it back to read-write. I'd really hope that someone tells me that I'm wrong on this.

For secure communication, a protocol could be created to allow two clients to open a socket connection between each other, so that data exchanged through it could not be seen or modified by a third process. Ideally, this could happen in an almost transparent way to the clients, which would just need to set one flag when creating one proxy object (if the API will expose such concept), and magically their messages will be routed through it.

Conclusion

Thanks for reading through. :-) I hope I managed to explain all the concepts clearly, and I'm really sorry if I didn't: anyway, I'll follow the comments to this blog post and if necessary provide better explanations. I'm especially curious of hearing opinions about the feasibility of this, or if you think that for some reason T is broken by design. I'm sure that during the implementation one will find more problems, but I cannot think of anything that would be unsolvable or that would force me to rethink this completely. Please don't ask me about numbers of performance measurements; as all good mathematicians, I don't take interest in numbers. ;-) Seriously though, I don't have any; if someone happens to have ready data about performance of FIFOs, sockets and message queues, those can be of help. If you ask me whether this will be faster than D-Bus, I have no doubt about that; but this is just because of the different nature of the network, and because we are trading off a few features which D-Bus has and which I don't consider so important (IPC with remote machines, security against malicious applications). For me, the biggest question mark follows the question whether this will ever be implemented. :-)