Monday, November 24, 2008

How Scalable are my Models?

Recently I have noticed increasing hype around the scalability of models, paired with understandable concerns about how it all hangs together. It is clear that, in the context of Eclipse, we are speaking about EMF, the Eclipse Modeling Framework. To answer the headline question we first need a common understanding of what scalability means in this context. We can group the things we can do with a model into two categories:
  • Use the model in main memory
  • Preserve the model state between sessions

For a model to be scalable, the resource consumption for its usage and preservation must not be a function of the model size.

Scalability does not necessarily imply that it is always darned fast to use or preserve a single model object. Rather it guarantees that performance and footprint are the same or at least similar, whether the object is the only one contained in a resource or it is contained in a huge object graph. Ideally the resource consumption should be a function of the size of the model change.

Some persistence approaches obviously need to violate this constraint. For example saving model changes to a text file will always write the whole model as opposed to only the changes (related enhancement requests in the EMF newsgroup showed me that this is not as obvious as I thought before). Even loading a single object usually requires deserializing a whole resource file.

Other persistence systems like relational databases, object-oriented databases or even proprietary random-access files are likely to provide for more scalable preservation of models. An EMF change recorder could be attached to your resource set and the resulting change description could be somehow transformed into a set of modifications, executed against the back-end in O(n) where n is the size of the change.
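The idea of writing only the change can be sketched without EMF at all. The following is a minimal illustration and not EMF's change recorder API: `ChangeLog`, `record` and `flush` are hypothetical names, and the back-end is assumed to be a simple map keyed by object id and feature name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a change recorder: remembers only what was modified. */
class ChangeLog
{
  // Key is "objectId#feature"; a later change to the same feature overwrites the earlier one.
  private final Map<String, Object> pending = new LinkedHashMap<>();

  void record(long objectId, String feature, Object newValue)
  {
    pending.put(objectId + "#" + feature, newValue);
  }

  /** Writes the pending changes to the back-end and returns the number of writes. */
  int flush(Map<String, Object> backEnd)
  {
    int writes = pending.size();
    backEnd.putAll(pending);
    pending.clear();
    return writes;
  }
}
```

The number of writes depends only on how many distinct features were touched, no matter how many entries the back-end already holds.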

While each model object is usually an instance of a generated subclass of EObjectImpl there are other ancestors available, too. Understanding the cost of an object can be essential when trying to measure and optimize the resource consumption. But even if we know the minimum size of a single object it should be clear that we are unable to achieve real scalability just by reducing the size of our objects. The only way to handle models with arbitrary numbers of contained objects is to selectively load the objects that are currently needed into main memory and make them reclaimable by the Java garbage collector immediately after usage. A system with such characteristics does not focus on the size of single objects anymore. So what is preventing our generated models from being scalable?

The default generation pattern of EMF creates subclasses of EObjectImpl for our model concepts. These generated classes contain member fields to store the values of references. At run-time these references strongly tie together our object graph. In EMF there are two basic types of references, containment references and cross references. Traditionally only cross references can be turned into proxies to be resolvable again later. As of EMF 2.4 containment references can become proxies as well, although this requires a non-default generation pattern and possibly the adaptation of the application to create and manage additional resources. It is important to note that turning an object into a proxy only sets its eProxyURI and nothing else. In particular it does not unset any attributes or references. As a result proxies are always bigger than their unproxied counterparts and they still carry their strong references to other objects! Go figure…

Now we could try to manually unset the nasty references that still prevent our objects from being garbage collected. But this can be a tedious and error-prone task. Especially bi-directional cross references can be hard to tackle because of the implicit inverse operations when unsetting one end. While it does not seem completely infeasible it remains questionable whether the EMF proxy mechanism is appropriate to make our models scale well. To sum up:

  • Containment relationships between objects in a resource usually prevent proxying.
  • Hence only complete resources look like candidates for unloading.
  • Detection of incoming references is expensive.
  • Proxying of incoming references does not automatically influence strong reachability.
  • Manual removal of strong references is at least inconvenient.

It seems as if we are stuck now, but let us step back to look at our model from a distance. In the end, our model is just a directed graph, the nodes are Java objects and the edges are strong Java references. And this last observation seems to be the root cause of our scalability problem! Imagine all these objects had a unique identifying value and all these associations were more like unconstrained foreign keys in a relational database system. We could point to objects without making them strongly reachable. Can we?
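The foreign-key idea can be sketched in a few lines of plain Java. This is only an illustration under assumptions: `Node`, `NodeRegistry` and the `loader` function are hypothetical names, and the loader stands in for whatever back-end can re-create an object from its id.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** A node that points to its target by id, like an unconstrained foreign key. */
class Node
{
  final long id;

  long targetId; // deliberately not a Java reference to the target object

  Node(long id, long targetId)
  {
    this.id = id;
    this.targetId = targetId;
  }
}

/** Hypothetical registry: resolves ids to nodes without keeping them strongly reachable. */
class NodeRegistry
{
  private final Map<Long, WeakReference<Node>> cache = new HashMap<>();

  private final LongFunction<Node> loader; // assumed to load a node from some back-end

  NodeRegistry(LongFunction<Node> loader)
  {
    this.loader = loader;
  }

  Node resolve(long id)
  {
    WeakReference<Node> ref = cache.get(id);
    Node node = ref == null ? null : ref.get();
    if (node == null)
    {
      // Never loaded, or already reclaimed by the garbage collector: load it again.
      node = loader.apply(id);
      cache.put(id, new WeakReference<>(node));
    }

    return node;
  }
}
```

Because the registry only holds weak references, a node stays in memory exactly as long as the application itself holds on to it.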

Yes, we can! EMF offers a different generation pattern called reflective delegation and a different run-time base class called EStoreEObjectImpl which can be used to implement models that transparently support the needed characteristics. Fasten your seat belt…

Reflective delegation changes the code that is generated for your implementation classes in three ways. Member fields are no longer generated for features. The getters and setters for single-valued features no longer access a member field’s value but rather delegate to the reflective eGet and eSet methods. And the getters for many-valued features return special EList implementations which also delegate to some reflective methods. With this generation pattern we can effectively remove all modeled state from our EObjects, including the unloved strong references. But where does it go instead?
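To make the pattern concrete, here is a simplified sketch of what such delegating accessors look like. It is not the code EMF generates: the real eGet and eSet take structural features rather than strings, and `FeatureStore`, `ReflectiveObject`, `Book` and `MapStore` are hypothetical names chosen for this illustration.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical stand-in for the place where all modeled state lives. */
interface FeatureStore
{
  Object get(Object owner, String feature);

  void set(Object owner, String feature, Object value);
}

/** Base class with reflective accessors; it has no fields for modeled state. */
abstract class ReflectiveObject
{
  private final FeatureStore store;

  ReflectiveObject(FeatureStore store)
  {
    this.store = store;
  }

  Object eGet(String feature)
  {
    return store.get(this, feature);
  }

  void eSet(String feature, Object value)
  {
    store.set(this, feature, value);
  }
}

/** What a generated class might look like: no member field for the feature. */
class Book extends ReflectiveObject
{
  Book(FeatureStore store)
  {
    super(store);
  }

  String getTitle()
  {
    return (String)eGet("title");
  }

  void setTitle(String title)
  {
    eSet("title", title);
  }
}

/** Trivial in-memory store, keyed by owner identity and feature name. */
class MapStore implements FeatureStore
{
  private final Map<Object, Map<String, Object>> data = new IdentityHashMap<>();

  public Object get(Object owner, String feature)
  {
    Map<String, Object> values = data.get(owner);
    return values == null ? null : values.get(feature);
  }

  public void set(Object owner, String feature, Object value)
  {
    data.computeIfAbsent(owner, key -> new HashMap<>()).put(feature, value);
  }
}
```

Note that this toy `MapStore` keeps strong references to its owners, so it only demonstrates the delegation, not the garbage collection benefit.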

Since we removed the state from our generated classes and the default base class EObjectImpl is not able to store modeled state it is obvious that we need a different base class, which can easily be achieved with the generator property Root Extends Class. While we could write our own implementation of InternalEObject it is usually sufficient to use or subclass EStoreEObjectImpl. Instances of this class delegate all their state access to an EStore which can be provided by the application. We only need to write our own EStore implementation with a dozen or so methods to fulfill the contract and ensure that each EStoreEObjectImpl instance points to an appropriate store instance. I have seen frameworks which maintain a separate store instance for each model object, others let all objects of a resource or a resource set share a single store and others (like CDO, explained later on) are even more complex. I think the right choice depends on how exactly the store is required to handle the object data. Before we dive into CDO’s approach we have to look at a tricky problem that all possible store implementations have to solve.

In addition to the modeled state of an object all stores have to maintain the eContainer and the eContainerFeatureID properties of an EObject. Although it is not immediately obvious, the EStore interface only provides methods to get these values but no methods to set them! Since our store needs to provide these values and the framework does not pass them in explicitly we must, whether we want to or not, derive these values implicitly from the modification method calls (those that can influence the containment) and our knowledge about the model (which are the containment references?). Solving this problem is typically not a one-hour task!
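A heavily simplified sketch of that derivation might look as follows. It is not the EStore interface: `ContainerTrackingStore` is a hypothetical class, features are identified by plain strings, and only single-valued features are handled (many-valued features, the container feature id and objects moved between containers are left out).

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical store that derives the container from modification calls. */
class ContainerTrackingStore
{
  private final Set<String> containmentFeatures; // our knowledge about the model

  private final Map<Object, Map<String, Object>> values = new IdentityHashMap<>();

  private final Map<Object, Object> containers = new IdentityHashMap<>();

  ContainerTrackingStore(Set<String> containmentFeatures)
  {
    this.containmentFeatures = containmentFeatures;
  }

  void set(Object owner, String feature, Object newValue)
  {
    Object oldValue = values.computeIfAbsent(owner, key -> new HashMap<>()).put(feature, newValue);
    if (containmentFeatures.contains(feature))
    {
      if (oldValue != null)
      {
        containers.remove(oldValue); // the old child is no longer contained
      }

      if (newValue != null)
      {
        containers.put(newValue, owner); // the new child is now contained by the owner
      }
    }
  }

  Object get(Object owner, String feature)
  {
    Map<String, Object> features = values.get(owner);
    return features == null ? null : features.get(feature);
  }

  /** There is no matching setter: the value is derived inside set(). */
  Object getContainer(Object object)
  {
    return containers.get(object);
  }
}
```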

Now let us look at how the CDO Model Repository framework faces the problem. Here are some of the requirements for objects in CDO:

  • Loadable on demand, even across containment relationships
  • Garbage collectable, if not used anymore
  • Replaceable by newer versions (passive update) or older versions (temporality)
  • Easily and efficiently transferable through a network wire

These led to a considerably complex design which I am trying to strip down here a bit:

CDO’s implementation of EObject subclasses EStoreEObjectImpl and shares the same store instance with all objects in the resource set that come from the same repository which, together with the virtual current time, is represented by a CDOView. CDO’s implementation of EStore is stateless other than knowing its view. The modeled state of an object is stored in CDORevision instances which represent the immutable states of an object between commit operations. The revisions internally store the CDOIDs of target objects instead of strong references to them. Each object stores a strong reference to the revision that is active at the time configured in the view. A view softly or weakly caches objects keyed by their CDOID. The revisions are cached separately in the CDOSession, by default with a two-level cache (configurable fixed-size LRU cache plus memory-sensitive cache to take over evicted revisions). Since revisions are immutable they can be shared among different local views.
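The two-level caching idea can be illustrated with standard Java classes. This is a sketch under assumptions and not CDO's implementation: `TwoLevelCache` is a hypothetical class that combines an access-ordered LinkedHashMap as the fixed-size LRU level with SoftReferences as the memory-sensitive level.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-level cache: fixed-size LRU level plus a memory-sensitive level. */
class TwoLevelCache<K, V>
{
  private final Map<K, SoftReference<V>> softLevel = new HashMap<>();

  private final LinkedHashMap<K, V> lruLevel;

  TwoLevelCache(final int capacity)
  {
    // accessOrder = true keeps the least recently used entry at the head of the map.
    lruLevel = new LinkedHashMap<K, V>(16, 0.75f, true)
    {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest)
      {
        if (size() > capacity)
        {
          // The evicted entry is taken over by the memory-sensitive level.
          softLevel.put(eldest.getKey(), new SoftReference<>(eldest.getValue()));
          return true;
        }

        return false;
      }
    };
  }

  void put(K key, V value)
  {
    softLevel.remove(key);
    lruLevel.put(key, value);
  }

  V get(K key)
  {
    V value = lruLevel.get(key);
    if (value == null)
    {
      SoftReference<V> ref = softLevel.remove(key);
      value = ref == null ? null : ref.get();
      if (value != null)
      {
        lruLevel.put(key, value); // promote back into the LRU level
      }
    }

    return value;
  }

  boolean isInLruLevel(K key)
  {
    return lruLevel.containsKey(key);
  }
}
```

Entries that fall out of the LRU level survive only as long as the garbage collector does not need the memory, which is the behaviour the paragraph above describes for evicted revisions.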

With this design neither the framework nor the objects and revisions keep strong references to other objects or revisions and the garbage collector is able to do its job as soon as the application releases its strong references. The reflective delegation causes each access to a model property to go through the store, which uses the revision of the object to determine the CDOID of the target object. This id is then used to look up the target object in the view cache. If the object is missing, either because it was never loaded or it has already been garbage collected, the needed revision is looked up in the session cache. The revision always knows the class of the object so that the view can create a new EObject instance and wire it with the revision. If revisions are missing from the session cache they are loaded from the repository server.

I kept quiet about a certain aspect to avoid complicating things at the beginning. Notice that not only the framework but also the application is creating new EObject instances to populate the model. Usually this happens through calls to EFactory methods which are unable to provide the new object with the appropriate EStore pointer. It becomes obvious that CDO objects (like all EStoreEObjectImpls without a singleton EStore) generally operate in one of two basic modes, which we call TRANSIENT and PERSISTENT respectively. In the context of repository transactions and remote invalidation we further refined the super state PERSISTENT into the sub states NEW, CLEAN, DIRTY, PROXY and CONFLICT. The transitions are internally managed by a singleton CDOStateMachine.

In the TRANSIENT state, i.e. after the object was created but before it is attached to a view, the object has no CDOID and no revision. The store is by-passed and the values are stored in the eSettings array instead. The attach event of the state machine installs a temporary CDOID and an empty revision which is populated through a call-back to the object. During population the data values are moved from the eSettings array to the revision and at the same time the strong Java references are converted to CDOIDs. Finally the object state is set to NEW. The temporary CDOIDs of NEW objects are replaced after the next commit operation with permanent CDOIDs that the repository guarantees to be unique in its scope and all local references are adjusted accordingly.
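The attach and commit steps can be sketched as a tiny state machine. This is not the CDOStateMachine: `TrackedObject` and `LifecycleSketch` are hypothetical names, negative numbers standing for temporary ids is an assumption of this sketch, and so is the transition from NEW to CLEAN on commit. The DIRTY, PROXY and CONFLICT transitions are left out.

```java
import java.util.concurrent.atomic.AtomicLong;

/** The states named in the text; only three of them are used in this sketch. */
enum ObjectState
{
  TRANSIENT, NEW, CLEAN, DIRTY, PROXY, CONFLICT
}

class TrackedObject
{
  ObjectState state = ObjectState.TRANSIENT;

  Long id; // null while TRANSIENT; negative means temporary in this sketch
}

/** Hypothetical lifecycle manager covering only the attach and commit events. */
class LifecycleSketch
{
  private final AtomicLong temporaryIds = new AtomicLong();

  private final AtomicLong permanentIds = new AtomicLong();

  void attach(TrackedObject object)
  {
    if (object.state != ObjectState.TRANSIENT)
    {
      throw new IllegalStateException("Object is already attached");
    }

    object.id = temporaryIds.decrementAndGet(); // install a temporary id
    object.state = ObjectState.NEW;
  }

  void commit(TrackedObject object)
  {
    if (object.state == ObjectState.NEW)
    {
      object.id = permanentIds.incrementAndGet(); // replace it with a permanent id
      object.state = ObjectState.CLEAN; // assumed follow-up state
    }
  }
}
```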

Notice that no EObject/CDORevision pair is ever strongly reachable by anything other than the application. And the modeled state of an EObject can be atomically switched to older or newer versions by simply replacing the revision pointer. Since a revision does not store any Java references to other entities it’s easy to transfer its data over the wire. With this design it becomes feasible to traverse models of arbitrary sizes.

CDO provides some additional mechanisms to make such traversals even more enjoyable. The partial collection loading feature, for example, makes it possible to page in configurable element chunks of huge lists and the current EStore implementation is able to record model usage patterns which can be used for pre-fetching of revisions that are likely to be used soon.
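Paging in chunks of a huge list can be illustrated with a small read-only list. This is a sketch of the general technique and not CDO's partial collection loading: `ChunkedList` and its `chunkLoader` function are hypothetical, and the loader is assumed to return the elements for the index range from `fromIndex` (inclusive) to `toIndex` (exclusive).

```java
import java.util.AbstractList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical read-only list that pages in fixed-size chunks on first access. */
class ChunkedList<E> extends AbstractList<E>
{
  private final int size;

  private final int chunkSize;

  private final BiFunction<Integer, Integer, List<E>> chunkLoader; // (fromIndex, toIndex) -> elements

  private final Map<Integer, List<E>> chunks = new HashMap<>();

  ChunkedList(int size, int chunkSize, BiFunction<Integer, Integer, List<E>> chunkLoader)
  {
    this.size = size;
    this.chunkSize = chunkSize;
    this.chunkLoader = chunkLoader;
  }

  @Override
  public E get(int index)
  {
    if (index < 0 || index >= size)
    {
      throw new IndexOutOfBoundsException("Index: " + index);
    }

    int chunkIndex = index / chunkSize;
    List<E> chunk = chunks.get(chunkIndex);
    if (chunk == null)
    {
      // Load only the chunk that contains the requested element.
      int fromIndex = chunkIndex * chunkSize;
      int toIndex = Math.min(fromIndex + chunkSize, size);
      chunk = chunkLoader.apply(fromIndex, toIndex);
      chunks.put(chunkIndex, chunk);
    }

    return chunk.get(index - chunkIndex * chunkSize);
  }

  @Override
  public int size()
  {
    return size;
  }

  int loadedChunks()
  {
    return chunks.size();
  }
}
```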

If you are interested in learning more about CDO you are welcome in the wiki and the newsgroup. You are also invited to attend my proposed talk at EclipseCon 2009: “Scale, Share and Store your Models with CDO 2.0”

Saturday, November 22, 2008

Being at ESE, not in Thailand...

Usually in late November we go to different places in wonderful Thailand for vacation but this year the Eclipse Summit Europe was a month later than in all the other years. So I abstained from a big vacation and headed for the Summit in Ludwigsburg. I arrived there on Monday evening but my luggage did not. Never again will I ask for priority baggage when checking in with Air Berlin!

Arriving at the airport Bangkok

Fortunately my luggage was delivered to the Nestor hotel before midnight so I could fully concentrate on the fun of the conference. This fun caused an initial hangover on Tuesday, the symposia day. I missed the modeling symposium because I was told that it was already packed when I arrived. And I had not prepared a position paper with the required minimum of 2 pages. I’d vote for freeing committers and other persons who are known to be so deeply involved from such effort. Later I was even told that the papers had not been checked very strictly. Anyway, I had some really nice talks with different people on Tuesday.

Me and the jungle on Koh Phi Phi

And I was able to attend the BREDEX GUIdancer presentation, given by Alexandra Imrie, who did a really great job. Later I was amazed that she, coming from Liverpool, could speak German without even the slightest accent. They seem to have a nice tool to create, maintain and execute user interface tests and I am happy that they promised to consider providing me with a free license for my CDO Model Repository project. I also met Ibrahim Sallam from Objectivity, Inc. who is currently preparing the offer of free developer licenses for their wonderful and darned fast OO database system (if used in combination with our EMF/CDO stack). We scheduled a more detailed discussion about this effort for Thursday.

More jungle near Chumphon

In the evening we had dinner in a smaller group, this time in a small restaurant with the lovely local food, which I think is reason enough to have the summit in Schwabenland every year. Here you get the best Rostbraten, Spätzle and Maultaschen ever. Although we had a lot of fun I left early and went to bed to avoid another hangover during my CDO talk the next day.

Lovely Thai food

Wednesday started with (a nice breakfast and) six great talks, the most fascinating for me being the Aspect Weaving for OSGi one. Heiko Seeberger and Martin Lippert presented amazing stuff about their Equinox Aspects project. I promised myself to give it a try as soon as possible. Some time after lunch I headed for Cedric Brun’s interesting talk about Team Work with Models : Compare and Merge with EMF Compare. I appreciated that he finished in time because my talk about the most interesting new features for the upcoming version 2.0 of my CDO Model Repository was the next one to follow.

Dragonfly at a pond on Koh Samui

It seemed that I somehow managed to do both: give an initial impression to the newbies and make existing users look forward to the next release. And our next release will really be a major one. Our small team has already implemented 175 bugzillas since Ganymede, many of them being powerful new features. Special thanks go to Simon McDuff, who is spending a considerable part of his parental leave providing the CDO community with cool features and friendly support!

A really huge guy

To avoid repeating my earlier underestimates of talking time I focused on only a few architectural slides and some code snippets to demonstrate some of the most interesting new features:
  • External References
  • Distributed Transactions
  • Structured Resources
  • Resource Queries
  • Explicit Locking
  • Save Points
  • Configurable Passive Updates
  • Change Subscriptions
  • Query Framework

1, 2, 3, search for me!

I squeezed the large audience through my ten slides in only fifteen minutes, which proved to be a good decision because even the remaining twenty minutes were not enough to answer all of the questions. I was amazed by the great interest in CDO and particularly noticed the increasing concerns about the scalability of models. CDO transparently addresses this sort of issue, for example by loading and unloading single instances on demand or by partially loading huge lists of references. It is unbelievable, yet true, that we can easily traverse models of four gigabytes in size or more. Depending on the back-end type chosen we can reach load rates of up to thirty thousand objects per second! I believe that such characteristics, together with the well-thought-out APIs and our prompt support for the community, caused a lot of the hype we are currently experiencing.

Wonderful biota in Thailand

After my talk I enjoyed the presentation of Gilles Iachelini, Marc Hoffmann and Simon Eggler about “Eclipse on Rails: RCP at the Swiss Railway”. It reminded me of an excellent live presentation they gave to me alone during one of my business trips to Bern, Switzerland. Thank you guys, again! After that I missed the other presentations to have some more discussions in the hallways. Dinner, lots of wonderful wine and the chill-out in the Nestor lobby extended until five in the morning. As a consequence I missed the keynote on Thursday.

What the heck is that?

Ed Merks’ talk about The Unbearable Stupidity of Modeling clearly was one of the highlights of the whole summit! I’m glad that I was able to attend it. Many of the other talks that I marked as interesting in my schedule became victims of some more private discussions. The only exception was Tom Schindl’s presentation about Writing Datacentric applications with RCP+EMF+Databinding. He excited the audience with some really nice design ideas and, last but not least, a demonstration of how easy it is to distribute model changes across machine boundaries with CDO. A meeting with some guys from the automotive scene prevented me from having lunch but the results were so promising that I did not care. That’s why I’m always carrying some chocolate with me. We enjoyed it together.

Ah, a salesgirl and the wind!

As I mentioned earlier I also continued my discussion with Ibrahim about new licensing models for Objectivity’s OO database. They are currently not only exploring ways to provide free developer licenses for API and server runtimes but could also imagine providing us (the CDO project) with empty skeleton bundles (EPL licensed) to fake p2 at installation time. I’m really looking forward to seeing our existing integration with Objectivity as a back-end for CDO model repositories being open sourced in the near future. Unfortunately it appeared that the time ran even faster on Thursday and after a last refreshing beer in the lobby, where most of my Eclipse friends met a last time, I headed towards Stuttgart airport to catch my flight home to Berlin. My luggage arrived with me…

Waiting for the flight back home

Thank you all for making ESE one of the nicest events in 2008 and see you at EclipseCon 2009 in Santa Clara!

Sunday, November 2, 2008

How safe is a "thread safe" data structure?

I think it is obvious that no abstract data type can guarantee that any (non-trivial) sequence of invocations is atomic (i.e. not interruptible by other threads using the same instance of the ADT) through mechanisms internal to any ADT implementation. A good example is the common idiom to insert into a map without replacing:
  synchronized (map)
  {
    if (!map.containsKey("key"))
    {
      map.put("key", "value");
    }
  }
As a consequence each public statement about thread-safety of an ADT or one of its implementations is generally only a statement about the behaviour of *single invocations* of the public API. But this statement has some value of its own because the client needs to know whether it is expected to protect single invocations as well. Since the JavaDoc of HashMap states that it is not thread-safe, clients must protect the following, too, if concurrent access is possible:
  synchronized (map)
  {
    map.put("key", "value");
  }
Otherwise the map could be internally corrupted. Interestingly, in the case of the map ADT there is a special API/implementation pair that solves both problems, internal and to some degree external atomicity. A java.util.concurrent.ConcurrentHashMap guarantees atomicity of single invocations with a possibly higher concurrency than external synchronization. And the ConcurrentMap interface offers the putIfAbsent() operation which executes the "not-containsKey-put" sequence atomically (and even much faster because only one hash lookup is needed!). The following is thread-safe without external synchronization:
  String existingValue = map.putIfAbsent("key", "value");
  if (existingValue != null)
  {
    // New value has not been inserted!
  }
Sometimes we have a situation where two data structures together form a unit in the sense that either both of them or neither of them must be modified, so that the state is consistent at any time. An example is a bi-directional mapping. While a ConcurrentMap is a good choice for other multi-threaded scenarios, we usually don't use it for scenarios where multiple data structures are involved because we need external synchronization anyway.
A ConcurrentHashMap allows for higher concurrency than an externally synchronized map but it is more expensive than a completely unsynchronized HashMap. Or in other words, two ConcurrentHashMaps plus external synchronization are much more expensive than two HashMaps with external synchronization.
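A bi-directional mapping built from two plain HashMaps with external synchronization could be sketched like this. `BiMapping` is a hypothetical class name and the rule that a pair is rejected when either side is already mapped is an assumption of this example:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bi-directional mapping: both maps change together or not at all. */
class BiMapping<K, V>
{
  private final Map<K, V> forward = new HashMap<>();

  private final Map<V, K> backward = new HashMap<>();

  /** Adds the pair only if neither the key nor the value is mapped yet. */
  synchronized boolean putIfAbsent(K key, V value)
  {
    if (forward.containsKey(key) || backward.containsKey(value))
    {
      return false;
    }

    // Both updates happen under the same lock, so no thread can observe only one of them.
    forward.put(key, value);
    backward.put(value, key);
    return true;
  }

  synchronized V getValue(K key)
  {
    return forward.get(key);
  }

  synchronized K getKey(V value)
  {
    return backward.get(value);
  }
}
```

Since every access already goes through the lock of the `BiMapping` instance, replacing the HashMaps with ConcurrentHashMaps would add cost without adding safety.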

Apologies for re-iterating stuff that is known to so many already.
It is still Sunday morning...

Alexander and the Gordian Knot

'... What glory's due to him that could divide
Such ravelled interests; has the knot untied,
And without stroke so smooth a passage made,
Where craft and malice such impeachments laid?'
Edmund Waller ...to the King


Original medallion of Alexander.

Well, when I look outside, this first November Sunday is not so nice. Bugzilla is waiting but it is Sunday! Anyway, I read the newsgroup and found an interesting post about a topic I have discussed on different occasions with Ed Merks and other people in the recent past. I looked out of the window and decided to answer. Hold on. I looked out of the window again and said to myself "Is this the moment to start my own blog?". I always thought that I did not have much of importance to say in public, but then came the idea: I solve this problem by declaring "this blog is not intended to tell important things!". Looking at some other blogs this seems to make sense although I know that beauty is not the only thing in the eye of the beholder.

In the end this blog enables me to drag my name through the public mud myself before others do it. For example Ed recently blogged about my cool, EMF-based, model repository framework CDO and its relation to the world's financial crisis, as well as some security problems in Microsoft products. On the other hand, given the ten overly productive Chinese (is this PC??) locked in Ed's home office, writing all these articles, it is pretty unlikely that I can blame myself for the next disaster much earlier.

Maybe it is best instead to try writing some posts that are semi-important at least. I leave that to the eye of the aforementioned beholder, you. Watch out for my next article about thread safety...

A super nova knot.

Ahh, apologies that I also leave it to you to find a relation between the Gordian knot and modern software development, or not. I need to go, the sun comes out, and it is Sunday...