
File System Compression in HFS+: Space savings and performance gain?

Many modern operating systems offer compression at the individual file level.  This is most useful when it is transparent, allowing all programs and utilities to benefit from compression without any special programming.  Contrast this with compressed file or archive formats such as zip, bzip2, and gzip, which applications typically cannot read directly and which therefore cannot be described as transparent.  As we have discussed previously, the HFS+ file system includes transparent per-file compression as of the Mac OS X 10.6 Snow Leopard release.

Benefits of compression

First and foremost, compression saves disk space, and this is the primary benefit.  A secondary benefit is reduced disk I/O, which is the slowest operation on a computer.  Disk I/O is orders of magnitude slower than any CPU instruction because of the physical movement involved: seek time, rotational latency, and transfer time (the last two being dictated by the rotational speed of the disk).  Minimizing disk I/O therefore offers great potential for overall performance improvement: with less data to read, loading a compressed file may actually be quicker than loading the equivalent uncompressed file.

These space and I/O savings do not come without cost.  The file must be compressed when it is first written (and again on each update), and it must be decompressed every time it is read.  This can be CPU intensive, particularly when complex algorithms (very space efficient, but computationally expensive) are used.  With compression we are effectively spending CPU cycles to save disk I/O and disk space.

Compression improves performance only if the time to read the reduced data from disk plus the CPU time to decompress it is less than the time to read the equivalent uncompressed file.  An older single-CPU, single-core computer may be slower with compression, but multi-core and multi-CPU machines are now commonplace and represent the path forward for computer performance.  That is to say, recent computers tend to have “CPU cycles to spare”, whereas traditional disk drive technology has comparatively not kept pace.
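The break-even condition described above can be made concrete with a very rough cost model.  This is only a sketch: the seek time, disk throughput, and decompression throughput used here are hypothetical round numbers, not measurements, and the model ignores caching and any overlap between I/O and CPU work.

```python
def read_time(bytes_on_disk, seek_s, disk_bytes_per_s):
    """Time to fetch a file: one seek plus a sequential transfer."""
    return seek_s + bytes_on_disk / disk_bytes_per_s


def compare(size, ratio, seek_s, disk_bytes_per_s, decompress_bytes_per_s):
    """Return (uncompressed_s, compressed_s, compression_wins).

    size  -- uncompressed file size in bytes
    ratio -- compressed size / uncompressed size (0.72 means 28% saved)
    decompress_bytes_per_s -- hypothetical CPU decompression throughput,
                              measured in uncompressed bytes produced per second
    """
    plain = read_time(size, seek_s, disk_bytes_per_s)
    packed = (read_time(size * ratio, seek_s, disk_bytes_per_s)
              + size / decompress_bytes_per_s)
    return plain, packed, packed < plain


# Hypothetical machine: 12 ms seek, 40 MB/s disk, 5 MB file, 28% savings.
fast_cpu = compare(5_000_000, 0.72, 0.012, 40e6, 200e6)
slow_cpu = compare(5_000_000, 0.72, 0.012, 40e6, 20e6)
print("fast CPU wins:", fast_cpu[2])
print("slow CPU wins:", slow_cpu[2])
```

Under this simplified model, compression wins whenever the decompression throughput exceeds the disk throughput divided by the fraction of space saved.  That matches the intuition above that spare CPU cycles are what make compression pay off.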

HFS+

Individual file compression is new in the Snow Leopard (10.6) release of Mac OS X.  The feature set is quite limited compared to other file systems such as NTFS.  For instance, files can only be compressed with the ditto command in the terminal, and there is no integration with the GUI.  In addition, per the man page for ditto, the current functionality is recommended for read-only system files, not for end-user data files.  It is expected that Apple will build upon this functionality in future releases of Mac OS X.  Even so, the functionality provided today helps produce the reduced disk footprint of Snow Leopard and likely improves performance as well.

In the real world

Let’s check some of our assumptions and see whether compression really does improve performance.  We will use afsctool, a custom compression tool provided by brkirch, to analyze the space savings.  (Unlike the Mac OS X ditto command, afsctool can also compress a file in place.)  We will test a variety of file sizes and compare compressed and uncompressed read performance.

First we confirm the size of the uncompressed version of a medium-sized PDF file.

[Mac-Book-Pro]$ ./afsctool -v test-medium.pdf
/Users/user1/compression testing/test-medium.pdf:
File is not HFS+ compressed.
File content type: com.adobe.pdf
File data fork size (reported size by Mac OS X Finder): 5168509 bytes / 5.2 MB (megabytes) /
 4.9 MiB (mebibytes)
Number of extended attributes: 0
Total size of extended attribute data: 0 bytes
Approximate overhead of extended attributes: 0 bytes
Approximate total file size (data fork + resource fork + EA + EA overhead + file overhead):
5169400 bytes / 5.2 MB (megabytes) / 4.9 MiB (mebibytes)

We will then use ditto to compress the file as shown below.

[Mac-Book-Pro]$ ditto --hfsCompression test-medium.pdf test-medium-compressed.pdf

If we check the size with the ls -al command, we see that the reported size is the same for both files.

-rw-r--r--   1 tplatt  tplatt    5168509 Nov 25 21:57 test-medium-compressed.pdf
-rw-r--r--   1 tplatt  tplatt    5168509 Nov 25 21:57 test-medium.pdf

However, when we check the actual on-disk size with afsctool, we see the difference:

[Mac-Book-Pro]$ ./afsctool -v test-medium-compressed.pdf
/Users/user1/compression testing/test-medium-compressed.pdf:
File is HFS+ compressed.
File content type: com.adobe.pdf
File size (uncompressed data fork; reported size by Mac OS 10.6+ Finder):
 5168509 bytes / 5.2 MB (megabytes) / 4.9 MiB (mebibytes)
File size (compressed data fork - decmpfs xattr; reported size by Mac OS 10.0-10.5 Finder):
 3735071 bytes / 3.7 MB (megabytes)
 / 3.6 MiB (mebibytes)
File size (compressed data fork): 3735087 bytes / 3.7 MB (megabytes) / 3.6 MiB (mebibytes)
Compression savings: 27.7%
Number of extended attributes: 0
Total size of extended attribute data: 0 bytes
Approximate overhead of extended attributes: 536 bytes
Approximate total file size (compressed data fork + EA + EA overhead + file overhead):
 3736352 bytes / 3.7 MB (megabytes) / 3.6 MiB (mebibytes)
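The savings figure that afsctool reports can be checked by hand from the two data fork sizes in the output above.  The snippet below is just that arithmetic; the function name is made up for illustration.

```python
uncompressed = 5_168_509  # data fork size reported by afsctool, in bytes
compressed = 3_735_087    # compressed data fork size, in bytes


def savings_percent(original, packed):
    """Percentage of the original size that compression removed."""
    return (1 - packed / original) * 100


saved_bytes = uncompressed - compressed
print(f"{saved_bytes} bytes saved, "
      f"{savings_percent(uncompressed, compressed):.1f}%")
```

This agrees with the 27.7% that afsctool prints.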

You can see that a substantial disk space savings of almost 28% was achieved.  The on-disk size is now 3,735,087 bytes, or 3.7 MB.  We should expect this to require significantly less disk head movement, and therefore to result in better performance.  The reduced disk read time should more than offset the CPU overhead of decompressing the file.  To test this, we’ll first purge the disk cache and then simply time the output of the file via cat.  Purging the disk cache is an important step; otherwise the file may be served from the disk cache (memory) rather than read from disk.

[Mac-Book-Pro]$ purge
[Mac-Book-Pro]$ time cat test-medium.pdf 1>/dev/null
real    0m1.238s
user    0m0.001s
sys    0m0.025s
[Mac-Book-Pro]$ purge
[Mac-Book-Pro]$ time cat test-medium-compressed.pdf 1>/dev/null
real    0m0.192s
user    0m0.001s
sys    0m0.077s
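To put the two timings side by side, the short calculation below derives the speedup, the change in system CPU time, and the effective read rate (uncompressed bytes delivered per second of wall-clock time).  It uses only the numbers printed above; the variable and function names are illustrative.

```python
size = 5_168_509                      # uncompressed size in bytes (both files)
plain_real, packed_real = 1.238, 0.192  # "real" seconds from time
plain_sys, packed_sys = 0.025, 0.077    # "sys" seconds from time


def effective_mb_per_s(nbytes, seconds):
    """Uncompressed megabytes delivered per second of wall-clock time."""
    return nbytes / 1e6 / seconds


speedup = plain_real / packed_real
sys_growth = packed_sys / plain_sys
print(f"speedup: {speedup:.1f}x, sys time: {sys_growth:.1f}x")
print(f"uncompressed: {effective_mb_per_s(size, plain_real):.1f} MB/s")
print(f"compressed:   {effective_mb_per_s(size, packed_real):.1f} MB/s")
```

For this single run the compressed read finished roughly 6.4 times sooner, while system CPU time roughly tripled.  One run is not a benchmark, so these ratios should be read as indicative only.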

You can see that our hypothesis was correct.  Reading the compressed file was substantially quicker, but it did require more CPU time.  A 5.2 MB file is quite large, and therefore a large amount of transfer time is involved.  Will we see significant savings with other file sizes?  I ran tests on a variety of files, repeating each test 3 times and averaging the results.  Times are in seconds.
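The procedure of purging the cache, reading the file, repeating, and averaging could be scripted along the following lines.  This is a minimal sketch and not the script used for the tests: the function names are hypothetical, and the cache-purge command is passed in by the caller because it is platform specific (on Mac OS X it would be purge, which may need elevated privileges on some releases).

```python
import subprocess
import time
from statistics import mean


def time_read(path, chunk_size=1 << 20):
    """Read the whole file and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass
    return time.perf_counter() - start


def average_read_time(path, runs=3, purge_cmd=None):
    """Average several timed reads, optionally purging the cache first.

    purge_cmd -- argument list for a cache-purging command, e.g. ["purge"];
                 if None, no purge is attempted and cached reads are timed.
    """
    samples = []
    for _ in range(runs):
        if purge_cmd:
            subprocess.run(purge_cmd, check=True)
        samples.append(time_read(path))
    return mean(samples)
```

A call such as average_read_time("test-medium.pdf", purge_cmd=["purge"]) would then mirror the manual steps shown earlier.  Without the purge step the numbers mostly measure memory, not disk.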

There was a significant performance improvement from compression in all cases but one.  The 45 KB text file showed a performance DECREASE of 36%.  Why would this be?  To understand it, you must understand how files are stored on a block device like a hard drive.  (The reason is still being worked out; experimentation is in progress, so check back later for details.)  Therefore, there is NO advantage to compression in this case, because the extra CPU overhead of decompression only adds to the total read time.

After this revelation, you may wonder why this is the case for the 45 KB text file but not for the 8 KB executable or the 80 byte text file.  Surely these both involve a lengthy read from disk?  These files are so small that their compressed contents are stored as an extended attribute (the 8 KB executable) or as an inline attribute (the 80 byte text file).  This means that the file contents are retrieved when the file’s metadata is retrieved, which eliminates a significant amount of head seek time.  (A normal file retrieval requires first accessing the metadata, somewhere on the disk, and then the actual file contents, which are typically elsewhere; hence a head seek from the metadata position to the file content position is needed.)

Summary

Apple has decided to compress the majority of the Snow Leopard system files, and it is clear why: there are both space savings and performance (load time) gains to be had, at least with traditional hard disks (see caveats below).  These system files are generally small and read-only.  Compressing everything on the hard disk would likely not be a wise choice, as it would negatively affect performance for frequently updated files (swap files, database files) and for already compressed data (zipped files).  When determining what to compress, the workload and typical uses of the machine must be taken into account.
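The point about already compressed data is easy to demonstrate with zlib, the library behind the zlib level 5 setting mentioned in the caveats.  The sketch below compresses some synthetic text and then compresses the result a second time.  The word list and sizes are arbitrary choices for illustration, and Python's zlib module stands in for the file system's own compressor.

```python
import random
import zlib


def savings(data, level=5):
    """Return (fraction of space saved, compressed bytes) for zlib."""
    packed = zlib.compress(data, level)
    return 1 - len(packed) / len(data), packed


# Synthetic "text file": random words from a small vocabulary (fixed seed).
rng = random.Random(0)
words = [b"file", b"system", b"compression", b"disk",
         b"block", b"seek", b"cache", b"fork"]
text = b" ".join(rng.choice(words) for _ in range(20000))

first_pass, packed = savings(text)
second_pass, _ = savings(packed)
print(f"text: {first_pass:.0%} saved")
print(f"already compressed: {second_pass:.0%} saved")
```

The second pass spends CPU time for essentially no gain, which is why zipped files and similar data are poor candidates for file system compression.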

Caveats

  • I cheated a bit on the 404 MB PDF file, as the Apple-provided ditto command will not compress a file that large.  I used afsctool with the -c and -5 (zlib level 5 compression) parameters to achieve the compression.
  • These tests were run on a MacBook Pro with a relatively slow 5,400 RPM hard drive.  Using a faster disk (7,200 RPM or 10,000 RPM) could produce different results, as the transfer time component would be reduced.
  • Only read performance was tested, as updates and writes to compressed files are not currently supported by the operating system.  The additional overhead of compressing frequently updated files could hurt performance.
  • Much of the performance gain is due to the limitations of rotational disk technology.  An SSD (Solid State Drive) would exhibit different characteristics and may not show a performance increase.  Further developments in SSDs, which show great potential in performance, noise reduction, and power consumption, will likely shape the future of mass storage drastically, but as of now they remain problematic and very expensive.
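As a rough illustration of how spindle speed enters the picture for the drives mentioned in the second caveat, the average rotational latency is the time for half a revolution.  The calculation below covers only that one component; seek time and transfer rate depend on the specific drive and are not modeled.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the target sector: half of one revolution, in ms."""
    ms_per_revolution = 60_000 / rpm
    return 0.5 * ms_per_revolution


for rpm in (5400, 7200, 10000):
    print(f"{rpm} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

Going from 5,400 to 10,000 RPM cuts this component from about 5.6 ms to 3.0 ms, so the balance between disk time and decompression time would shift on a faster drive.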