Canonical Transforms And Caching
Atelier stores only the original uploaded photo and applies transformations to it in real time, caching the results to speed up subsequent render requests.
To do this, two protocols must be defined: one for producing canonical descriptions of transformations, and one for caching the results.
Canonical Representation of Transformations
Any transformation applied to an image, be it a translation, rotation, convolution, or anything else, has three parts: the operation itself, a series of arguments to the operation, and a source image to work from.
We can represent the source image as the combination of its SHA-1 sum, rendered as hexadecimal ASCII, and its length, rendered as ASCII digits, joined by a '+', such as: "f065c6dc45a9cbd4868612eaf36b293b381d11d9+10003". This has the benefit of being reasonably hard to collide with deliberately, and extremely unlikely to collide in general use.
Clearly, we cannot know the SHA-1 sum of the result of a transformation without first performing the transformation. Since we want to determine whether the result is already cached without performing the transformation, we instead produce a canonical representation of the transformation as a string and calculate the identifier of that string. Because the representation contains a source image identifier along with all the information required to perform the transformation, the transformation description is sufficient to identify the resulting image.
Sequences of transformations are far more interesting than single ones (and often more desirable, since applying the whole chain in one pass avoids repeated re-encoding artifacts in lossy compression formats), so we consider a chain of transformations to be a single transformation. A transformation is therefore a sequence of lines, which will automatically be prefixed by an instruction to load the source image. This prefix+sequence is the transformation which will be hashed in order to construct an identifier for the output of the transformation.
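As a minimal sketch of how such an identity could be computed, the following assumes the length is the byte count of the file written in decimal. The function name image_identity is hypothetical and not part of Atelier.

```python
import hashlib


def image_identity(data: bytes) -> str:
    """Return '<sha1 in hex>+<length in decimal>' for the given bytes.

    Assumption: the length is the number of bytes in the file.
    """
    digest = hashlib.sha1(data).hexdigest()
    return f"{digest}+{len(data)}"
```

The same function can be applied to the bytes of an uploaded photo or, as described next, to the text of a transformation.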
Each line in the transformation is a series of colon-separated fields. Each field is individually escaped using %XX escapes (XX being hexadecimal, as in %0a for a newline) for any character other than [A-Za-z0-9.+-,/ ]. Since it's unlikely that anything but freeform text input will need escaping, the transforms will, for the most part, be quite readable by humans.
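The escaping rule might be sketched as follows. This is an illustration under assumptions the text does not state: fields are encoded as UTF-8 before escaping, and the hexadecimal digits are lowercase. The names SAFE, escape_field and make_line are hypothetical.

```python
import string

# Characters that pass through unescaped: letters, digits, and . + - , / space.
SAFE = set(string.ascii_letters + string.digits + ".+-,/ ")


def escape_field(text: str) -> str:
    """Escape one field, replacing every unsafe byte with %XX."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 input encoding
        char = chr(byte)
        # Bytes above 127 map to characters outside SAFE, so they are escaped.
        out.append(char if char in SAFE else f"%{byte:02x}")
    return "".join(out)


def make_line(*fields: str) -> str:
    """Join individually escaped fields with colons to form one line."""
    return ":".join(escape_field(field) for field in fields)
```

Note that ':' and '%' are both outside the safe set, so a field can never introduce a stray separator or an ambiguous escape.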
Here is an example of a transform which will convert an image into its thumbnail:
    loadfile:8c7f0cd746d181506e1552c4acaf496230efdb47+2906667
    crop:440:540:1732:1302
    knownfilter:skylight
    desaturate
    scale:128:96
    watermark:top:right:sans:7:Copyright Daniel Silverstone 2005
Here is the corresponding transform for converting an image to its "medium size" for display:
    loadfile:8c7f0cd746d181506e1552c4acaf496230efdb47+2906667
    crop:440:540:1732:1302
    knownfilter:skylight
    desaturate
    scale:640:480
    banner:bottom:24px:white:50
    text:bottom:left:sans:10:black:Copyright Daniel Silverstone 2005%0aEly Cathedral and Chimneys
Clearly the scripts are identical up to and including the desaturate command, so we can separate that shared prefix out and use its result as an intermediate. Since that intermediate would be generated automatically, we would choose to store it losslessly, as TIFF, PNG, PNM, or similar.
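Finding the shared intermediate amounts to taking the longest common run of leading lines. A small sketch using the two example transforms follows; the function name shared_prefix and the list names are hypothetical.

```python
def shared_prefix(first, second):
    """Return the leading lines that two transforms have in common."""
    common = []
    for line_a, line_b in zip(first, second):
        if line_a != line_b:
            break
        common.append(line_a)
    return common


THUMBNAIL = [
    "loadfile:8c7f0cd746d181506e1552c4acaf496230efdb47+2906667",
    "crop:440:540:1732:1302",
    "knownfilter:skylight",
    "desaturate",
    "scale:128:96",
    "watermark:top:right:sans:7:Copyright Daniel Silverstone 2005",
]

MEDIUM = [
    "loadfile:8c7f0cd746d181506e1552c4acaf496230efdb47+2906667",
    "crop:440:540:1732:1302",
    "knownfilter:skylight",
    "desaturate",
    "scale:640:480",
    "banner:bottom:24px:white:50",
    "text:bottom:left:sans:10:black:Copyright Daniel Silverstone 2005%0aEly Cathedral and Chimneys",
]

# The first four lines (loadfile through desaturate) form the intermediate.
INTERMEDIATE = shared_prefix(THUMBNAIL, MEDIUM)
```

Because the intermediate is itself a transformation, it gets its own identifier and can be cached like any other result.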
Thus we now have a canonical representation of a transformation, and by generating the ID of the transform we know what file to cache the result under.
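Putting the pieces together, the ID of a transform could be derived roughly as shown here. The text does not say how lines are joined or encoded, so this sketch assumes each line ends with a newline and the string is encoded as UTF-8. The function name transform_identity is hypothetical.

```python
import hashlib


def transform_identity(source_identity, steps):
    """Return the identity under which a transform's output would be cached.

    source_identity: '<sha1>+<length>' of the source image.
    steps: already-escaped transform lines, without the loadfile prefix.
    """
    # The loadfile instruction is prefixed automatically.
    lines = [f"loadfile:{source_identity}", *steps]
    # Assumptions: newline-terminated lines, UTF-8 encoding.
    data = ("\n".join(lines) + "\n").encode("utf-8")
    return f"{hashlib.sha1(data).hexdigest()}+{len(data)}"
```

Whatever joining convention is chosen, every process must use the same one, otherwise identical transforms would hash to different cache entries.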
Cache Protocol
The disk cache for Atelier will be a couple of layers of directories containing files named by sha1sum+length (the file identity described earlier). For simplicity the directory layering will be XX/YY/<identity>, where XXYY is the first four characters of the identity.
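The mapping from an identity to its location in the cache is simple enough to show directly. In this sketch the function name cache_path and the example cache root are hypothetical.

```python
from pathlib import PurePosixPath


def cache_path(cache_root, identity):
    """Return <cache_root>/XX/YY/<identity> for the given identity."""
    return PurePosixPath(cache_root) / identity[:2] / identity[2:4] / identity


# Example with a made-up cache root:
EXAMPLE = cache_path(
    "/var/cache/atelier",
    "f065c6dc45a9cbd4868612eaf36b293b381d11d9+10003",
)
```

Here EXAMPLE is /var/cache/atelier/f0/65/f065c6dc45a9cbd4868612eaf36b293b381d11d9+10003.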
In order to allow for multiple processes to access the cache at any one time there must be a protocol defined for accessing it. Since each individual element of the cache is a simple file, it should be sufficient to consider the file's absence as a lack of the entry in the cache and its presence as a canonical declaration of the content of that file.
So, reading from the cache is simple. Attempt to open the desired file for reading. If you succeed, the content is yours for reading. If you fail, the file is absent.
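A reader following this rule never checks for existence first; it just tries to open the file. A minimal sketch, with the hypothetical name cache_read:

```python
import os


def cache_read(cache_root, identity):
    """Return the cached bytes for identity, or None if the entry is absent."""
    path = os.path.join(cache_root, identity[:2], identity[2:4], identity)
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except FileNotFoundError:
        # Absence of the file means the entry is not in the cache.
        return None
```

A caller that receives None would perform the transformation and then write the result into the cache.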
Writing to the cache is slightly more complex, because we obviously don't want multiple processes writing to the same place; that would cause confusion and upset.
For this purpose there is, at the top level of the cache alongside the XX directories, a directory called 'tmp'. Inside that directory a process may create a file under a unique name (one suggestion is to use libuuid to generate it). Having prepared the result of a previously uncached transform there, the process can then rename() the temporary file to its final name in the cache. If another process has already done this, it's not a problem; we simply duplicated work for a short period of time.
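The write side might look like the sketch below. It uses Python's uuid module as a stand-in for libuuid and os.replace() as the rename step; the function name cache_write is hypothetical, and creating the XX/YY directories on demand is an assumption the text does not cover.

```python
import os
import uuid


def cache_write(cache_root, identity, data):
    """Store data under identity by writing to tmp/ and renaming into place."""
    tmp_dir = os.path.join(cache_root, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    # A random UUID gives each writer its own temporary file name.
    tmp_path = os.path.join(tmp_dir, uuid.uuid4().hex)
    with open(tmp_path, "xb") as handle:  # "x": fail if the name exists
        handle.write(data)

    final_dir = os.path.join(cache_root, identity[:2], identity[2:4])
    os.makedirs(final_dir, exist_ok=True)  # assumption: created on demand
    final_path = os.path.join(final_dir, identity)
    # If another process got there first, this replaces its file with an
    # equivalent one, so the only cost is the duplicated work.
    os.replace(tmp_path, final_path)
    return final_path
```

Since tmp/ lives inside the cache, the temporary and final names are normally on the same filesystem, which is what allows the rename to happen in one step so that readers never see a partially written file.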
Now, if the cache were append-only, this would be the end of it. However, the cache should be periodically cleaned out to ensure that it does not run amok and fill up the filesystem on which it is placed. This should be done by periodically (once per hour, perhaps) scanning the cache directories for files, ordering those files by st_atime, and unlink()ing them least-recently-accessed first until the disk usage of the cache falls below some predetermined limit. The cleaner should then check the tmp/ directory for files not accessed within some smaller timeframe (e.g. one hour) and unlink() them.
Thus it is important to remember that you cannot stat() a file in the cache and then expect it to be present when you open() it. If you need the content of a cache file for more than one open(), first copy it into the tmp/ directory and work from that copy instead.
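One possible shape for the cleaning pass is sketched here. It approximates disk usage by summing st_size, stops once usage is at or below the limit, and tolerates files vanishing underneath it. The function name clean_cache and its parameters are hypothetical.

```python
import os
import time


def clean_cache(cache_root, max_bytes, tmp_max_age=3600.0, now=None):
    """Evict least-recently-accessed entries, then stale tmp/ files."""
    now = time.time() if now is None else now
    tmp_dir = os.path.join(cache_root, "tmp")

    entries = []
    for dirpath, dirnames, filenames in os.walk(cache_root):
        if dirpath == cache_root and "tmp" in dirnames:
            dirnames.remove("tmp")  # tmp/ is handled separately below
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except FileNotFoundError:
                continue  # removed by another process meanwhile
            entries.append((info.st_atime, info.st_size, path))

    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in sorted(entries):  # oldest access time first
        if total <= max_bytes:
            break
        try:
            os.unlink(path)
            removed.append(path)
        except FileNotFoundError:
            pass
        total -= size

    if os.path.isdir(tmp_dir):
        for name in os.listdir(tmp_dir):
            path = os.path.join(tmp_dir, name)
            try:
                if now - os.stat(path).st_atime > tmp_max_age:
                    os.unlink(path)
                    removed.append(path)
            except FileNotFoundError:
                pass
    return removed
```

This approach depends on the filesystem recording access times; on mounts that suppress or coarsen atime updates, the eviction order will be less accurate.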