Monday, November 24, 2008

How Scalable are my Models?

Recently I noticed increasing hype around the topic of model scalability, paired with understandable concerns about how well the two fit together. It is clear that, in the context of Eclipse, we are speaking about EMF, the Eclipse Modeling Framework. To answer the headline question we first need to establish a common understanding of what scalability means in this context. We can summarize the things we can do with a model into two categories:
  • Use the model in main memory
  • Preserve the model state between sessions

For a model to be scalable, the resource consumption of its usage and preservation must not be a function of the overall model size.

Scalability does not necessarily imply that it is always darned fast to use or preserve a single model object. Rather it guarantees that performance and footprint are the same, or at least similar, whether the object is the only one contained in a resource or contained in a huge object graph. Usually the resource consumption should be a function of the size of the model change.

Some persistence approaches obviously have to violate this constraint. Saving model changes to a text file, for example, will always write the whole model as opposed to only the changes (corresponding enhancement requests in the EMF newsgroup showed me that this is not as obvious as I had thought). Even loading a single object usually requires deserializing a whole resource file.

Other persistence systems, like relational databases, object-oriented databases or even proprietary random-access files, are likely to provide for more scalable preservation of models. An EMF change recorder could be attached to your resource set and the resulting change description could be transformed into a set of modifications executed against the back-end in O(n), where n is the size of the change.
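
To illustrate the idea, here is a minimal sketch using EMF's ChangeRecorder. Only the recorder API is standard EMF; the back-end call in the comment is entirely hypothetical:

    import org.eclipse.emf.ecore.change.ChangeDescription;
    import org.eclipse.emf.ecore.change.util.ChangeRecorder;
    import org.eclipse.emf.ecore.resource.ResourceSet;

    public class ChangeBasedSaver
    {
      public ChangeDescription record(ResourceSet resourceSet, Runnable modelModifications)
      {
        // Start recording all changes in the resource set
        ChangeRecorder recorder = new ChangeRecorder(resourceSet);

        modelModifications.run(); // the application changes the model here

        ChangeDescription change = recorder.endRecording();

        // 'change' now holds one FeatureChange per modified feature, so a
        // back-end could persist it in O(size of the change), for example:
        // backEnd.write(change.getObjectChanges()); // hypothetical back-end API
        return change;
      }
    }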

While each model object is usually an instance of a generated subclass of EObjectImpl, there are other ancestors available, too. Understanding the cost of an object can be essential when trying to measure and optimize resource consumption. But even if we know the minimum size of a single object, it should be clear that we cannot achieve real scalability just by reducing the size of our objects. The only way to handle models with arbitrary numbers of contained objects is to selectively load the objects that are currently needed into main memory and make them reclaimable by the Java garbage collector immediately after use. A system with such characteristics no longer needs to focus on the size of single objects. So what is preventing our generated models from being scalable?

The default generation pattern of EMF creates subclasses of EObjectImpl for our model concepts. These generated classes contain member fields to store the values of references. At run-time these references strongly tie together our object graph. In EMF there are two basic types of references: containment references and cross references. Traditionally only cross references can be turned into proxies to be resolvable again later. As of EMF 2.4, containment references can become proxies as well, although this requires a non-default generation pattern and possibly adapting the application to create and manage additional resources. It is important to note that turning an object into a proxy only sets its eProxyURI and nothing else. In particular, it does not unset any attributes or references. As a result, proxies are always bigger than their unproxied counterparts and they still carry their strong references to other objects! Go figure…
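
A small snippet makes this visible. Apart from the made-up URI, everything here is standard EMF API:

    import org.eclipse.emf.common.util.URI;
    import org.eclipse.emf.ecore.InternalEObject;

    // Turning an object into a proxy only records the proxy URI...
    InternalEObject internal = (InternalEObject)someObject;
    internal.eSetProxyURI(URI.createURI("myscheme://repo/objects/42#frag")); // made-up URI

    System.out.println(internal.eIsProxy()); // true
    // ...but no attribute or reference has been unset. The object still
    // strongly references everything it referenced before, plus the URI.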

Now we could try to manually unset the nasty references that still prevent our objects from being garbage collected. But this can be a tedious and error-prone task. Especially bi-directional cross references can be hard to tackle because of the implicit inverse operations triggered when unsetting one end. While it does not seem completely infeasible, it remains questionable whether the EMF proxy mechanism is appropriate for making our models scale well. To sum up:

  • Containment relationships between objects in a resource usually prevent proxying.
  • Hence only complete resources look like candidates for unloading.
  • Detection of incoming references is expensive.
  • Proxying of incoming references does not automatically influence strong reachability.
  • Manual removal of strong references is at least inconvenient.

It seems as if we are stuck now, but let us step back and look at our model from a distance. In the end, our model is just a directed graph: the nodes are Java objects and the edges are strong Java references. And this last observation seems to be the root cause of our scalability problem! Imagine all these objects had a unique identifying value and all these associations were more like unconstrained foreign keys in a relational database system. We could point to objects without making them strongly reachable. Can we?

Yes, we can! EMF offers a different generation pattern called reflective delegation and a different run-time base class called EStoreEObjectImpl which can be used to implement models that transparently support the needed characteristics. Fasten your seat belt…

Reflective delegation changes the code that is generated for your implementation classes in three ways: member fields are no longer generated for features; the getters and setters for single-valued features no longer access a member field's value but rather delegate to the reflective eGet and eSet methods; and the getters for many-valued features return special EList implementations which also delegate to reflective methods. With this generation pattern we can effectively remove all modeled state from our EObjects, including the unloved strong references (see the sketch below). But where does it go instead?
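
Roughly, the generated accessors look like this under reflective delegation (Foo, name and bars are made-up model elements):

    public String getName()
    {
      // No member field: delegate to the reflective API
      return (String)eGet(FooPackage.Literals.FOO__NAME, true);
    }

    public void setName(String newName)
    {
      eSet(FooPackage.Literals.FOO__NAME, newName);
    }

    @SuppressWarnings("unchecked")
    public EList<Bar> getBars()
    {
      // Many-valued features return a special delegating EList implementation
      return (EList<Bar>)eGet(FooPackage.Literals.FOO__BARS, true);
    }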

Since we removed the state from our generated classes and the default base class EObjectImpl is not able to store modeled state, it is obvious that we need a different base class, which can easily be configured with the generator property Root Extends Class. While we could write our own implementation of InternalEObject, it is usually sufficient to use or subclass EStoreEObjectImpl. Instances of this class delegate all their state access to an EStore which can be provided by the application. We only need to write our own EStore implementation with a dozen or so methods to fulfill the contract and ensure that each EStoreEObjectImpl instance points to an appropriate store instance. I have seen frameworks which maintain a separate store instance for each model object; others let all objects of a resource or a resource set share a single store; and others (like CDO, explained later on) are even more complex. I think the right choice depends on how exactly the store is required to handle the object data. Before we dive into CDO's approach we have to look at a tricky problem that all possible store implementations have to solve.
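
Before looking at that problem, here is a heavily stripped-down impression of the store contract: a map-backed sketch, not CDO's implementation. Many-valued features, object creation and several other interface methods are omitted:

    import java.util.HashMap;
    import java.util.Map;

    import org.eclipse.emf.ecore.EStructuralFeature;
    import org.eclipse.emf.ecore.InternalEObject;

    public abstract class MapStore implements InternalEObject.EStore
    {
      // All modeled state lives here instead of in member fields of the EObjects
      private final Map<InternalEObject, Map<EStructuralFeature, Object>> data =
          new HashMap<InternalEObject, Map<EStructuralFeature, Object>>();

      public Object get(InternalEObject eObject, EStructuralFeature feature, int index)
      {
        return values(eObject).get(feature); // single-valued case only
      }

      public Object set(InternalEObject eObject, EStructuralFeature feature, int index, Object value)
      {
        return values(eObject).put(feature, value);
      }

      public boolean isSet(InternalEObject eObject, EStructuralFeature feature)
      {
        return values(eObject).containsKey(feature);
      }

      public void unset(InternalEObject eObject, EStructuralFeature feature)
      {
        values(eObject).remove(feature);
      }

      private Map<EStructuralFeature, Object> values(InternalEObject eObject)
      {
        Map<EStructuralFeature, Object> map = data.get(eObject);
        if (map == null)
        {
          map = new HashMap<EStructuralFeature, Object>();
          data.put(eObject, map);
        }
        return map;
      }

      // size(), add(), remove(), getContainer() etc. are omitted here;
      // containment is exactly the tricky problem discussed next.
    }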

In addition to the modeled state of an object, all stores have to maintain the eContainer and the eContainerFeatureID properties of an EObject. Although it is not immediately obvious, the EStore interface only provides methods to get these values but no methods to set them! Since our store needs to provide these values and the framework does not pass them in explicitly, we must, whether we want to or not, derive them implicitly from the modification method calls (those that can influence the containment) and from our knowledge of the model (which are the containment references?). Solving this problem is typically not a one-hour task!
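
One possible, heavily simplified approach (extending the map-backed sketch above, and not CDO's actual solution): whenever a containment reference is written, remember the container on the side:

    import org.eclipse.emf.ecore.EReference;

    // Additions to the MapStore sketch: derive container information from
    // writes to containment references.
    private final Map<InternalEObject, InternalEObject> containers =
        new HashMap<InternalEObject, InternalEObject>();

    private final Map<InternalEObject, EStructuralFeature> containingFeatures =
        new HashMap<InternalEObject, EStructuralFeature>();

    public Object set(InternalEObject eObject, EStructuralFeature feature, int index, Object value)
    {
      if (feature instanceof EReference && ((EReference)feature).isContainment() && value != null)
      {
        // Our knowledge of the model tells us that this write changes the containment of 'value'
        containers.put((InternalEObject)value, eObject);
        containingFeatures.put((InternalEObject)value, feature);
      }

      return values(eObject).put(feature, value);
    }

    public InternalEObject getContainer(InternalEObject eObject)
    {
      return containers.get(eObject);
    }

    public EStructuralFeature getContainingFeature(InternalEObject eObject)
    {
      return containingFeatures.get(eObject);
    }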

Now let us look at how the CDO Model Repository framework faces the problem. Here are some of the requirements for objects in CDO:

  • Loadable on demand, even across containment relationships
  • Garbage collectable, if not used anymore
  • Replaceable by newer versions (passive update) or older versions (temporality)
  • Easily and efficiently transferable through a network wire

These led to a considerably complex design which I am trying to strip down here a bit:

CDO’s implementation of EObject subclasses EStoreEObjectImpl and shares the same store instance with all objects in the resource set that come from the same repository, which, together with the virtual current time, is represented by a CDOView. CDO's implementation of EStore is stateless apart from knowing its view. The modeled state of an object is stored in CDORevision instances, which represent the immutable states of an object between commit operations. The revisions internally store the CDOIDs of target objects instead of strong references to them. Each object stores a strong reference to the revision that is active at the time configured in the view. A view softly or weakly caches objects keyed by their CDOID. The revisions are cached separately in the CDOSession, by default with a two-level cache (a configurable fixed-size LRU cache plus a memory-sensitive cache to take over evicted revisions). Since revisions are immutable, they can be shared among different local views.

With this design, neither the framework nor the objects and revisions keep strong references to other objects or revisions, and the garbage collector is able to do its job as soon as the application releases its strong references. The reflective delegation causes each access to a model property to go through the store, which uses the revision of the object to determine the CDOID of the target object. This id is then used to look up the target object in the view cache. If the object is missing, either because it was never loaded or because it has already been garbage collected, the needed revision is looked up in the session cache. The revision always knows the class of the object, so that the view can create a new EObject instance and wire it with the revision. If revisions are missing from the session's cache, they are loaded from the repository server.
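
In pseudocode, this resolution chain looks roughly as follows. The method names are purely illustrative and do not correspond to CDO's real internal API:

    // Illustrative pseudocode of the lookup chain, not CDO's actual code
    EObject resolve(CDOView view, CDOID targetID)
    {
      EObject object = view.getObjectFromCache(targetID); // soft/weak view cache
      if (object == null)
      {
        CDORevision revision = view.getSession().getRevisionFromCache(targetID);
        if (revision == null)
        {
          revision = view.getSession().loadRevisionFromServer(targetID);
        }

        object = createInstance(revision.getEClass()); // the revision knows the class
        wire(object, revision);                        // connect object and revision
        view.putObjectIntoCache(targetID, object);
      }

      return object;
    }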

I kept quiet about a certain aspect to avoid complicating things at the beginning. Notice that not only the framework but also the application creates new EObject instances to populate the model. Usually this happens through calls to EFactory methods, which are unable to provide the new object with the appropriate EStore pointer. It becomes obvious that CDO objects (like all EStoreEObjectImpls without a singleton EStore) generally operate in one of two basic modes, which we call TRANSIENT and PERSISTENT respectively. In the context of repository transactions and remote invalidation we further refined the hyper-state PERSISTENT into the sub-states NEW, CLEAN, DIRTY, PROXY and CONFLICT. The transitions are internally managed by a singleton CDOStateMachine:
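
A simplified sketch of this state set (transition details omitted; the real CDO type is more detailed):

    public enum State
    {
      TRANSIENT, // created, but not yet attached to a view

      // Sub-states of the PERSISTENT hyper-state:
      NEW,       // attached to a view, not yet committed
      CLEAN,     // in sync with the repository
      DIRTY,     // locally modified, not yet committed
      PROXY,     // revision currently not loaded
      CONFLICT   // locally modified and invalidated by a remote commit
    }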

In the TRANSIENT state, i.e. after the object has been created but before it is attached to a view, the object has no CDOID and no revision. The store is bypassed and the values are stored in the eSettings array instead. The attach event of the state machine installs a temporary CDOID and an empty revision, which is populated through a call-back to the object. During population the data values are moved from the eSettings array to the revision, and at the same time the strong Java references are converted to CDOIDs. Finally the object state is set to NEW. The temporary CDOIDs of NEW objects are replaced after the next commit operation with permanent CDOIDs that the repository guarantees to be unique in its scope, and all local references are adjusted accordingly.

Notice that no EObject/CDORevision pair is ever strongly reachable by anything other than the application. And the modeled state of an EObject can be atomically switched to older or newer versions by simply replacing the revision pointer. Since a revision does not store any Java references to other entities, it is easy to transfer its data over the wire. With this design it becomes feasible to traverse models of arbitrary size.

CDO provides some additional mechanisms to make such traversals even more enjoyable. The partial collection loading feature, for example, makes it possible to page in huge lists in configurable element chunks, and the current EStore implementation is able to record model usage patterns which can be used for pre-fetching revisions that are likely to be needed soon.

If you are interested in learning more about CDO you are welcome in the wiki and the newsgroup. You are also invited to attend my proposed talk at EclipseCon 2009: “Scale, Share and Store your Models with CDO 2.0”

14 comments:

  1. Highly interesting post! Thank you!

  2. Great article!
    Go ahead, provide that to Eclipse and get a free T-shirt! :)

  3. Yeah, great article! Do you know the www.terracotta.org technology? I think it could be an interesting complement for providing EMF/CDO scalability.

  4. Impressive article, Eike! I enjoyed reading it :D

    I assume that, with CDO 2.0's new "external references" feature, there are some peculiarities in what you explained regarding the backing EStore implementation. I suppose there would be an EStore instance per opened view, right? How are objects identified among repositories? Are CDOIDs generated uniquely per repository (no two identical CDOIDs in two different repositories), or are objects identified through a pair RepoID+CDOID? Furthermore, I could imagine an object referencing a second one in a different repository, where the reference could point to a specific state in time of the target object (by only specifying the virtual time when instantiating the CDOView). So the revision could include the timestamp of the target object, or the client could be the one to decide the timeframe to retrieve. So much to play around with!

    I feel very much attracted to the overall design of EMF and CDO! I can only recommend that anyone interested in programming take a look at the code and see what a flexible, extensible and reusable component should look like.

  5. Guys, thank you for your support!

    Jérôme, I remember that I looked at Terracotta years ago and found it really interesting. That said, I'm not sure how it would fit into CDO's architecture. Would you suggest it as an alternative to CDO?

    Geeklipse, yes, each view has its own EStore instance. CDOIDs are unique in the scope of a particular repository, so yes, to identify an object in a multi-repository network you need something like repo-uuid/cdoid. Back-end-wise, external references are currently only supported by Simon's ObjectivityStore, which is likely to be open-sourced soon. Support in the other IStores is a top priority ;-)

    I'm somewhat reluctant about the idea of coupling a reference to a target object with a certain version of that target. This approach would not work well with auditing views.

  6. Eike,
    No, I think it can complement CDO. I also played with Terracotta in order to make a simple proof of concept in my free time... It's very easy to use to scale a simple POJO model. But when I wanted to apply Terracotta to an EMF model, it was a little bit more complex. Indeed, in the source code generated by EMF the objects are held in ELists, which are not partial collections. To resolve this problem, it seems the solution is to use the EStore concept to handle containment and referencing in a way that allows for partialness. So this kind of problem has already been solved in CDO.

    It seems CDO is repository-centric, and I think (it's just an idea ;)) using Terracotta with a "CDO Core" (the layer in charge of lazy loading and partial collections) would make it possible to use a big EMF model in a basic file-centric way (without a central repository).

  7. Jérôme, I think I would have to know more technical details about Terracotta to judge how and where it could be useful in CDO. I have the feeling that CDO has solved some of the same problems that Terracotta is addressing. Yet I can't judge which of CDO's features, like (distributed) transactionality and temporality, might not be covered. If anybody knows more about these details and is willing to work on this issue, I'd be glad to take part in discussions ;-)

  8. Hey Eike,

    There is a whole lot to like about CDO. However, it seems to me that in its current incarnation it makes a problematic assumption: that of a reliable, low-latency network (see #1 and #2 of http://blogs.sun.com/jag/resource/Fallacies.html).

    Let's say someone's in the middle of editing a model on their laptop and the wireless suddenly goes out. The next time their program tries to access an object that's in the proxy state, the thread making the access will freeze up for however long the timeout is. If that thread happens to be the UI thread, which it usually is, their whole application becomes unresponsive. Couple this with the fact that the same access will probably be attempted over and over again in futility, and you have some very frustrated users with good reason to be frustrated.

    This is a serious problem, and because of it, I've resorted to maintaining strong references through an EContentAdapter to every object in the repository and re-reading them every time they become invalidated (so they spend as little time in the dangerous proxy state as possible). This is unfortunate, because it eliminates the touted benefits of scalability you outlined in your article, but in my eyes it's much better to avoid the network latency problem that is CDO's Achilles' heel.

    So I was wondering if you had any ideas for resolving this tension between scalability and tolerance to network unreliability. For me the latter is a much more important concern. Is it possible to address both needs? The fundamental problem I see is that in this system where everything is load-on-demand, the demand may come at a time when the data is not actually available (wireless connection goes dead), and therefore the assumption that a getFoo() method will return immediately with data fails. It's a pleasant dream that the clients of these objects can be agnostic about their network dependencies, but in the real world distributed editors need to be aware of such things so they can give appropriate feedback while remaining responsive.

    It appears that in CDO's current design, any piece of code which wants to access the data of a CDOObject needs to take care to make all getFoo() invocations outside of the UI thread, at least if the object is in a PROXY state. Is that how you would handle this problem?

    Tom

  9. Tom,

    CDO is mainly a "core" technology which includes the client (API and implementation), the wire protocol and the repository server. In addition we provide a small user interface part on top of the client side core API.

    I completely agree with you that the concept of a proxy that is able to resolve itself by demand-loading its state from a remote medium is inherently problematic if any of its clients expects method calls like getFoo() to return immediately. The same sort of issues arise with many other types of synchronous method calls, like stream.read(). In general you can almost never be sure that a method call into foreign code returns without blocking for an unpredictable and possibly indefinite time. The only thing we could do is make some parts of the core functionality asynchronous. But I can't see how this should work for methods like:

    Foo getFoo();

    I don't think that this is an option:

    Future getFoo();

    In the end the client usually calls such methods because it wants to work with the call result immediately after the call. If, and only if, this assumption is not true for a user interface, then I would think that this user interface could easily be changed to reflect that by wrapping all these calls and delegating them to an asynchronous executor service.
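
    For example, something along these lines (just a sketch; Foo and bar are placeholders):

        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<Foo> futureFoo = executor.submit(new Callable<Foo>()
        {
          public Foo call()
          {
            return bar.getFoo(); // may block on network I/O, but not the UI thread
          }
        });

        // Later: futureFoo.get(timeout, unit) with proper exception and
        // timeout handling, while the UI stays responsive in the meantime.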

    I think you understand what I mean. I'm seeing the problem of blocked UIs, I just don't think that this problem can be transparently solved within the core framework. As you mentioned, the application must be aware of the different characteristics of a networked, distributed medium.

    I'd be glad to be proved wrong, though ;-)

  10. Ultimately, one such simple thing and a cloud of technologies are producing a lot of cost savings for projects. The UML2 model and the EObjectImpl structure (a great structure, and easy to read) connect easily to transformation engines based on various DSLs like Xtext...
    This can really speed up development and is a great leap forward for 2011...
    Enterprise Architect
    Maneesh Innani

  11. Hi Eike,

    Have you tried persisting EObjects in an HttpSession, and have you run into scalability issues since EObjects do not implement Serializable?

    Ken

  12. Hi Ken,

    No, I haven't tried that, yet. And I don't see how scalability is related to the ability, or the lack of it, to use Java serialization with standard EObjects. Maybe you want to elaborate on your problem?

  13. Hi Eike,

    To achieve clustering (session replication) in a web container, the HTTP session objects (in our case, EObjects) need to implement Serializable.

    We store EditingDomains and ResourceSets in HTTP sessions, and we ran into issues when we wanted to scale, since EMF is primarily not meant for the web, I reckon, at least for now?

    So I am just wondering if CDO, Teneo, or any other custom code could help in any way in our case.

    Ken

  14. Ken, I've successfully used CDO in a web container. You wouldn't store your EObjects in the HttpSession but in CDO.
