reducing dict footprints

In pkgcore back around 2006/2007, we ran into a bit of an issue.  Specifically, during certain usage scenarios it was possible to get all package instances loaded into memory, including their cache contents- essentially we’d get 24000 some package instances in memory, each referencing a dictionary holding cache values.

Obviously, we ran into some memory issues.  Via guppy, we did the usual analysis and leveled __slots__ generally all over the place (most instances were immutable, so this was a good thing also).  For the cache entries, I came up with something of a trick.

For the cache dicts, they always had a guranteed set of keys- they didn’t necessarily always have a key/value pair, but the possible keys were fully known.  If you’re familiar w/ the implementation of hashes, implicitly they have some unused space in them- this is intentional, and usually a good thing (perfect hash functions are rather rare after all).  With python dictionaries however, it’s pretty easy to have the dictionary consuming a fair bit more memory than is actually used by it’s contents due to the bounds of it’s resizing.

What we did is actually pretty simple- slotted instances store their values w/in an array that roughly follows the PyObject struct bits.  This is how they get the memory savings- they don’t use a dictionary for storing instance attributes, they store it w/in that trailing array itself.  In addition to that, they add a getter to the class definition- in a way, it’s essentially a single dict that’s stored in the class itself (to get the getter) instead of a dict per instance.  What I wound up doing is just creating some functionality to automatically create such a class, one that mimics the mapping/dictionary protocol, but has a limited set of keys that can be stored to it.   The end result is snakeoil.obj.make_SlottedDict_kls; given a set of allowed keys, it generates and returns just such a class.

What prompted this posting is that at the time of writing that code (python2.4 or 2.5), there was no sys.getsizeof, so we couldn’t track exactly how much of a saving it leveled (we knew it was a 25% reduction of pmerge invocations, but couldn’t pin down the per instance saving).  In writing out docs for snakeoil, I did some testing using the lovely sys.getsizeof: for a dict w/ a thousand keys, it’s a reduction from 49432 bytes per instance to 8048- roughly 84%.   Around 100 items, it’s near 95% if memory serves- and for the case of only a couple of keys (3 in this testing), it’s a 74% reduction.  Note these stats are from a 64bit machine.

While the differing storage definitely helps, there is a non-obvious memory optimization afoot here- python interns strings up to a certain length, then stops trying (I thought the value was a couple of chars, but in looking at 2.6.5 it looks to be 1 char… weak sauce).  This means that it’s definitely possible to wind up with the two dictionaries that have the same keys, but each key is a seperate string in memory.  With the slotteddict approach, you’re guranteed to get an interned string- so you pay the memory cost of the keys only once, rather than (worst case) per instance.

There are two caveats to using this class:

  1. The allowed set of keys is locked- if you create one of these classes with allowed keys “1”, “2”, “3”,  you cannot do instance[“4”] = some_value; that throws a KeyError. This isn’t an implementation quirk, it’s a raw restriction of cpython’s __slots__ implementation.
  2. These instances are not serializable- this however is an arbitrary limitation, patches welcome. Only reason this is there is I didn’t think of that when I first implemented it- this is a limitation I’ll be removing before snakeoil 0.4 is released.

Presuming those limitations aren’t a problem for people using it, I’d suggest taking a hard look at using it- in snakeoil we have an extension that is used to render this basically at builtin dict speeds in addition.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: