- Use the model in main memory
- Preserve the model state between sessions
For a model to be scalable, the resource consumption for using and preserving it must not be a function of the overall model size.
Scalability does not necessarily imply that it is always darned fast to use or preserve a single model object. Rather, it guarantees that performance and footprint are the same, or at least similar, whether that object is the only one contained in a resource or part of a huge object graph. Ideally the resource consumption is a function of the size of the change.
Some persistence approaches obviously have to violate this constraint. For example, saving model changes to a text file always writes the whole model as opposed to only the changes (enhancement requests in the EMF newsgroup showed me that this is not as obvious as I thought). Even loading a single object usually requires deserializing a whole resource file.
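To make this concrete, here is a minimal round trip with the default XMI serialization; the file name and the model behind it are placeholders, and the point is simply that both directions always operate on the complete resource:

```java
import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.emf.ecore.resource.ResourceSet;
import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;
import org.eclipse.emf.ecore.xmi.impl.XMIResourceFactoryImpl;

public class WholeResourceRoundTrip
{
  public static void main(String[] args) throws Exception
  {
    ResourceSet resourceSet = new ResourceSetImpl();
    resourceSet.getResourceFactoryRegistry().getExtensionToFactoryMap()
        .put("xmi", new XMIResourceFactoryImpl());

    // Loading: to get at even one object, the complete file has to be parsed
    // and the complete object graph ends up in main memory.
    Resource resource = resourceSet.getResource(URI.createFileURI("library.xmi"), true);
    EObject root = resource.getContents().get(0);

    // ... change a single attribute of some object deep down in the tree ...

    // Saving: the whole containment tree is serialized again, no matter how
    // small the actual change was.
    resource.save(null);
  }
}
```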
Other persistence systems like relational databases, object-oriented databases or even proprietary random-access files are likely to provide more scalable preservation of models. An EMF change recorder could be attached to your resource set and the resulting change description could be transformed into a set of modifications executed against the back-end in O(n), where n is the size of the change.
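As a rough sketch of that idea: ChangeRecorder and ChangeDescription are the real EMF change API, while the Backend interface and the way the recorded feature changes are applied to it are assumptions about the application side:

```java
import java.util.Map;

import org.eclipse.emf.common.util.EList;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.change.ChangeDescription;
import org.eclipse.emf.ecore.change.FeatureChange;
import org.eclipse.emf.ecore.change.util.ChangeRecorder;
import org.eclipse.emf.ecore.resource.ResourceSet;

public class ChangeBasedPersistence
{
  public void persistChanges(ResourceSet resourceSet, Runnable modelModifications, Backend backend)
  {
    ChangeRecorder recorder = new ChangeRecorder(resourceSet); // attaches itself as an adapter
    modelModifications.run();                                  // the application changes the model
    ChangeDescription change = recorder.endRecording();

    // Translate the recorded per-object changes into back-end modifications.
    // The effort is proportional to the size of the change, not to the size of the model.
    for (Map.Entry<EObject, EList<FeatureChange>> entry : change.getObjectChanges())
    {
      backend.apply(entry.getKey(), entry.getValue());
    }
  }

  // Assumed application-side interface, not part of EMF.
  public interface Backend
  {
    void apply(EObject object, EList<FeatureChange> featureChanges);
  }
}
```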
While each model object is usually an instance of a generated subclass of EObjectImpl, there are other base classes available, too. Understanding the cost of an object can be essential when trying to measure and optimize resource consumption. But even if we know the minimum size of a single object, it should be clear that we cannot achieve real scalability just by reducing the size of our objects. The only way to handle models with arbitrary numbers of contained objects is to selectively load the objects that are currently needed into main memory and make them reclaimable by the Java garbage collector immediately after use. A system with such characteristics no longer focuses on the size of single objects. So what is preventing our generated models from being scalable?
The default generation pattern of EMF creates subclasses of EObjectImpl for our model concepts. These generated classes contain member fields to store the values of references. At run-time these references strongly tie together our object graph. In EMF there are two basic types of references, containment references and cross references. Traditionally only cross references can be turned into proxies that are resolvable again later. As of EMF 2.4 containment references can become proxies as well, although this requires a non-default generation pattern and possibly the adaptation of the application to create and manage additional resources. It is important to note that turning an object into a proxy only sets its eProxyURI and nothing else. In particular, it does not unset any attributes or references. As a result proxies are always bigger than their unproxied counterparts and they still carry their strong references to other objects! Go figure…
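A tiny illustration of that last point; the URI is made up and the object can be an instance of any generated class:

```java
import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.InternalEObject;

public class ProxyDoesNotShrink
{
  public static void turnIntoProxy(EObject object)
  {
    InternalEObject internal = (InternalEObject)object;
    internal.eSetProxyURI(URI.createURI("myscheme://repo/resource#id"));

    // The object now answers eIsProxy() == true ...
    assert internal.eIsProxy();

    // ... but none of its member fields were cleared: all attribute values
    // and, more importantly, all strong references to other EObjects are
    // still in place. The "proxy" is strictly bigger than it was before.
  }
}
```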
Now we could try to manually unset the nasty references that still prevent our objects from being garbage collected. But this is a tedious and error-prone task; a sketch of what it would look like follows the list below. Bi-directional cross references in particular can be hard to tackle because of the implicit inverse operations triggered when unsetting one end. While it does not seem completely infeasible, it remains questionable whether the EMF proxy mechanism is appropriate to make our models scale well. To sum up:
- Containment relationships between objects in a resource usually prevent proxying.
- Hence only complete resources look like candidates for unloading.
- Detection of incoming references is expensive.
- Proxying of incoming references does not automatically influence strong reachability.
- Manual removal of strong references is at least inconvenient.
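Just to illustrate how inconvenient the manual approach really is, here is a naive sketch that reflectively unsets the outgoing references of a single object before proxying it. It already runs into two of the issues listed: unsetting one end of a bi-directional reference implicitly modifies the opposite object, and incoming uni-directional references from elsewhere are not handled at all:

```java
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.EReference;

public class NaiveReferenceStripper
{
  // Naive and incomplete: only handles the outgoing references of this one object.
  public static void stripOutgoingReferences(EObject object)
  {
    for (EReference reference : object.eClass().getEAllReferences())
    {
      if (reference.isChangeable() && !reference.isDerived() && object.eIsSet(reference))
      {
        // For a bi-directional reference this implicitly updates the opposite
        // end as well, i.e. it modifies *other* objects as a side effect.
        object.eUnset(reference);
      }
    }

    // Incoming uni-directional references from other objects are not touched;
    // finding them requires an expensive search (e.g. a cross referencer).
  }
}
```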
It seems as if we are stuck now, but let us step back and look at our model from a distance. In the end, our model is just a directed graph: the nodes are Java objects and the edges are strong Java references. And this last observation seems to be the root cause of our scalability problem! Imagine all these objects had a unique identifying value and all these associations were more like unconstrained foreign keys in a relational database system. We could point to objects without making them strongly reachable. Can we?
Yes, we can! EMF offers a different generation pattern called reflective delegation and a different run-time base class called EStoreEObjectImpl which can be used to implement models that transparently support the needed characteristics. Fasten your seat belt…
Reflective delegation changes the code that is generated for your implementation classes in three ways. Member fields are no longer generated for features. The getters and setters for single-valued features no longer access a member field’s value but rather delegate to the reflective eGet and eSet methods. And the getters for many-valued features return special EList implementations which also delegate to some reflective methods. With this generation pattern we can effectively remove all modeled state from our EObjects, including the unloved strong references. But where does it go instead?
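Before answering that, here is roughly what the generated accessors look like under this pattern; the Employee example and its CompanyPackage literals are made up:

```java
// Excerpt from a generated implementation class under reflective delegation
// (Employee, name and orders are example names from a made-up model).
// Note: no member fields are generated for the features.

public String getName()
{
  return (String)eGet(CompanyPackage.Literals.EMPLOYEE__NAME, true);
}

public void setName(String newName)
{
  eSet(CompanyPackage.Literals.EMPLOYEE__NAME, newName);
}

@SuppressWarnings("unchecked")
public EList<Order> getOrders()
{
  // A special delegating EList implementation is returned; all of its
  // access goes through the reflective API as well.
  return (EList<Order>)eGet(CompanyPackage.Literals.EMPLOYEE__ORDERS, true);
}
```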
Since we removed the state from our generated classes and the default base class EObjectImpl is not able to store modeled state, it is obvious that we need a different base class, which can easily be achieved with the generator property Root Extends Class. While we could write our own implementation of InternalEObject it is usually sufficient to use or subclass EStoreEObjectImpl. Instances of this class delegate all their state access to an EStore which can be provided by the application. We only need to write our own EStore implementation with a dozen or so methods to fulfill the contract and ensure that each EStoreEObjectImpl instance points to an appropriate store instance. I have seen frameworks which maintain a separate store instance for each model object, others let all objects of a resource or a resource set share a single store, and others (like CDO, explained later on) are even more complex. I think the right choice depends on how exactly the store is required to handle the object data. Before we dive into CDO’s approach we have to look at a tricky problem that all possible store implementations have to solve.
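But first, to give an impression of the contract, here is a heavily simplified, map-backed EStore sketch. It is my own illustration, deliberately incomplete and not scalable itself; it only shows the shape of the interface:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.EStructuralFeature;
import org.eclipse.emf.ecore.InternalEObject;
import org.eclipse.emf.ecore.util.EcoreUtil;

// Declared abstract because the remaining list methods (size, contains, indexOf,
// toArray, hashCode, move, clear, ...) follow the same pattern and are omitted.
public abstract class SimpleMapStore implements InternalEObject.EStore
{
  private final Map<InternalEObject, Map<EStructuralFeature, Object>> data = new HashMap<>();

  private Map<EStructuralFeature, Object> features(InternalEObject object)
  {
    return data.computeIfAbsent(object, o -> new HashMap<>());
  }

  @SuppressWarnings("unchecked")
  private List<Object> list(InternalEObject object, EStructuralFeature feature)
  {
    return (List<Object>)features(object).computeIfAbsent(feature, f -> new ArrayList<>());
  }

  public Object get(InternalEObject object, EStructuralFeature feature, int index)
  {
    return feature.isMany() ? list(object, feature).get(index) : features(object).get(feature);
  }

  public Object set(InternalEObject object, EStructuralFeature feature, int index, Object value)
  {
    return feature.isMany() ? list(object, feature).set(index, value) : features(object).put(feature, value);
  }

  public void add(InternalEObject object, EStructuralFeature feature, int index, Object value)
  {
    list(object, feature).add(index, value);
  }

  public Object remove(InternalEObject object, EStructuralFeature feature, int index)
  {
    return list(object, feature).remove(index);
  }

  public boolean isSet(InternalEObject object, EStructuralFeature feature)
  {
    return features(object).containsKey(feature);
  }

  public void unset(InternalEObject object, EStructuralFeature feature)
  {
    features(object).remove(feature);
  }

  public EObject create(EClass eClass)
  {
    return EcoreUtil.create(eClass); // simplistic; a real store may need its own instances
  }

  public InternalEObject getContainer(InternalEObject object)
  {
    return null; // see the container problem discussed next
  }

  public EStructuralFeature getContainingFeature(InternalEObject object)
  {
    return null; // see the container problem discussed next
  }
}
```

An instance of a generated class whose root extends EStoreEObjectImpl can then be pointed at such a store, for example via eSetStore() or by overriding eStore() in a subclass.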
In addition to the modeled state of an object, all stores have to maintain the eContainer and the eContainerFeatureID properties of an EObject. Although it is not immediately obvious, the EStore interface only provides methods to get these values but no methods to set them! Since our store needs to provide these values and the framework does not pass them in explicitly, we must, whether we want to or not, derive them implicitly from the modification method calls (those that can influence the containment) and from our knowledge about the model (which features are the containment references?). Solving this problem is typically not a one-hour task!
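Continuing the map-backed sketch from above, the mutating methods could derive the container information as a side effect along these lines; the symmetric bookkeeping for remove(), unset() and clear() is omitted, and getting all of those corner cases right is where the real effort lies:

```java
import java.util.HashMap;
import java.util.Map;

import org.eclipse.emf.ecore.EReference;
import org.eclipse.emf.ecore.EStructuralFeature;
import org.eclipse.emf.ecore.InternalEObject;

// Sketch of the container bookkeeping inside a custom EStore; builds on the
// hypothetical SimpleMapStore shown above.
public abstract class ContainerTrackingStore extends SimpleMapStore
{
  private final Map<InternalEObject, InternalEObject> containers = new HashMap<>();
  private final Map<InternalEObject, EStructuralFeature> containingFeatures = new HashMap<>();

  @Override
  public Object set(InternalEObject object, EStructuralFeature feature, int index, Object value)
  {
    trackContainment(object, feature, value);
    return super.set(object, feature, index, value);
  }

  @Override
  public void add(InternalEObject object, EStructuralFeature feature, int index, Object value)
  {
    trackContainment(object, feature, value);
    super.add(object, feature, index, value);
  }

  private void trackContainment(InternalEObject container, EStructuralFeature feature, Object value)
  {
    // Only our knowledge of the model tells us which features are containments.
    if (feature instanceof EReference && ((EReference)feature).isContainment() && value instanceof InternalEObject)
    {
      containers.put((InternalEObject)value, container);
      containingFeatures.put((InternalEObject)value, feature);
    }
  }

  @Override
  public InternalEObject getContainer(InternalEObject object)
  {
    return containers.get(object);
  }

  @Override
  public EStructuralFeature getContainingFeature(InternalEObject object)
  {
    return containingFeatures.get(object);
  }

  // remove(), unset(), clear() etc. would have to undo this bookkeeping analogously.
}
```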
Now let us look at how the CDO Model Repository framework faces the problem. Here are some of the requirements for objects in CDO:
- Loadable on demand, even across containment relationships
- Garbage collectable, if not used anymore
- Replaceable by newer versions (passive update) or older versions (temporality)
- Easily and efficiently transferable through a network wire
These led to a considerably complex design which I am trying to strip down here a bit:
CDO’s implementation of EObject subclasses EStoreEObjectImpl and shares the same store instance with all objects in the resource set that come from the same repository, which, together with the virtual current time, is represented by a CDOView. CDO’s implementation of EStore is stateless apart from knowing its view. The modeled state of an object is stored in CDORevision instances, which represent the immutable states of an object between commit operations. The revisions internally store the CDOIDs of target objects instead of strong references to them. Each object stores a strong reference to the revision that is active at the time configured in the view. A view softly or weakly caches its objects, keyed by their CDOID. The revisions are cached separately in the CDOSession, by default with a two-level cache (a configurable fixed-size LRU cache plus a memory-sensitive cache that takes over evicted revisions). Since revisions are immutable they can be shared among different local views.
With this design neither the framework nor the objects and revisions keep strong references to other objects or revisions, and the garbage collector is able to do its job as soon as the application releases its own strong references. The reflective delegation causes each access to a model property to go through the store, which uses the revision of the object to determine the CDOID of the target object. This id is then used to look up the target object in the view cache. If the object is missing, either because it was never loaded or because it has already been garbage collected, the needed revision is looked up in the session cache. The revision always knows the class of the object, so the view can create a new EObject instance and wire it with the revision. If revisions are missing from the session cache they are loaded from the repository server.
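To make the chain easier to follow, here is a condensed pseudo-Java sketch of it. All names (SimplifiedView, SimplifiedSession, SimplifiedRevision, the use of Long as an id) are invented stand-ins for the real CDO types, not its actual API:

```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EStructuralFeature;
import org.eclipse.emf.ecore.InternalEObject;

// Simplified illustration of the lookup chain described above; not CDO's real API.
public abstract class SimplifiedView
{
  private final SimplifiedSession session;

  // Objects are cached per view with soft references, keyed by their id.
  private final Map<Long, Reference<InternalEObject>> objects = new HashMap<>();

  public SimplifiedView(SimplifiedSession session)
  {
    this.session = session;
  }

  // Called (indirectly) by the stateless store for every property access.
  public Object getValue(InternalEObject object, EStructuralFeature feature, int index)
  {
    SimplifiedRevision revision = revisionOf(object); // the object's current immutable state
    Object value = revision.get(feature, index);

    if (value instanceof Long) // references are stored as ids, never as Java references
    {
      return resolveObject((Long)value);
    }

    return value;
  }

  private InternalEObject resolveObject(Long id)
  {
    Reference<InternalEObject> ref = objects.get(id);
    InternalEObject target = ref != null ? ref.get() : null;

    if (target == null) // never loaded, or already garbage collected
    {
      SimplifiedRevision revision = session.getRevision(id); // session cache, then repository server
      target = instantiate(revision);                        // new EObject, wired with its revision
      objects.put(id, new SoftReference<>(target));
    }

    return target;
  }

  // Each object holds a strong reference to its active revision.
  protected abstract SimplifiedRevision revisionOf(InternalEObject object);

  // Creates an instance of revision.getEClass() and points it at the revision.
  protected abstract InternalEObject instantiate(SimplifiedRevision revision);

  public interface SimplifiedSession
  {
    SimplifiedRevision getRevision(Long id); // two-level cache, falls back to the server
  }

  public interface SimplifiedRevision
  {
    EClass getEClass();
    Object get(EStructuralFeature feature, int index);
  }
}
```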
I kept quiet about a certain aspect to avoid complicating things at the beginning. Notice that not only the framework but also the application is creating new EObject instances to populate the model. Usually this happens through calls to EFactory methods which are unable to provide the new object with the appropriate EStore pointer. It becomes obvious that CDO objects (like all EStoreEObjectImpls without a singleton EStore) generally operate in one of two basic modes, which we call TRANSIENT and PERSISTENT respectively. In the context of repository transactions and remote invalidation we further refined the hyper state PERSISTENT into the sub states NEW, CLEAN, DIRTY, PROXY and CONFLICT. The transitions are internally managed by a singleton CDOStateMachine:
In the TRANSIENT state, i.e. after the object was created but before it is attached to a view, the object has no CDOID and no revision. The store is by-passed and the values are stored in the eSettings array instead. The attach event of the state machine installs a temporary CDOID and an empty revision which is populated through a call-back to the object. During population the data values are moved from the eSettings array to the revision and at the same time the strong Java references are converted to CDOIDs. Finally the object state is set to NEW. The temporary CDOIDs of NEW objects are replaced after the next commit operation with permanent CDOIDs that the repository guarantees to be unique in its scope and all local references are adjusted accordingly.
Notice that no EObject/CDORevision pair is ever strongly reachable by anything other than the application. And the modeled state of an EObject can be atomically switched to older or newer versions by simply replacing the revision pointer. Since a revision does not store any Java references to other entities it’s easy to transfer its data over the wire. With this design it becomes feasible to traverse models of arbitrary sizes.
CDO provides some additional mechanisms to make such traversals even more enjoyable. The partial collection loading feature, for example, makes it possible to page huge lists in configurable element chunks, and the current EStore implementation can record model usage patterns, which can be used to pre-fetch revisions that are likely to be needed soon.
If you are interested in learning more about CDO you are welcome to visit the wiki and the newsgroup. You are also invited to attend my proposed talk at EclipseCon 2009: “Scale, Share and Store your Models with CDO 2.0”