Saturday, May 11, 2013

OpenAFS for Windows 1.7 IFS: Reaching maturity?

Earlier this week OpenAFS for Windows 1.7.24 was released.  My last post on the IFS was back in early November 2012 after the 1.7.18 release.   In the last six months and 348 commits there have been a number of significant improvements.

I/O Processing

The November blog entry finished with an enumeration of "to do" items that were necessary but unlikely to be implemented in the short term.  Contrary to what was written, a new I/O processing pathway was implemented and made available in 1.7.22.  The primary benefit of the new I/O pathway implementation is sustained throughput during file system store operations.  The prior implementation would stall due to lost races between the afs redirector and afsd_service fighting for ownership over file extents.   In addition to throughput improvements, the new I/O processing pathway permits applications to store data to the file server and bypass both the Windows System Cache and the AFSCache file by specifying the FILE_FLAG_NO_BUFFERING flag when opening the file with CreateFile.

In the last six months a number of bugs were corrected that could result in data corruption when a mixture of file writes were issued to the same file with and without FILE_FLAG_NO_BUFFERING.  One common scenario in which data corruption has been observed is when downloading files via Firefox or Internet Explorer.  In order to make the most efficient use of clock time, the browser begins downloading a file from the web server as soon as the user selects the URL.  At this point the browser does not know where to write the data so it stores it in process memory while the user is presented a File Save ... dialog.  Once the browser knows where the data should be written it creates the file in "no buffering" mode and instructs that all of the data cached in memory be written at once to disk.  The file is then closed and re-opened with normal buffering behavior.  The data corruption would occur when the Windows File Cache received a request to store data in the middle of a 4K page but did not recognize that it must first read the prior contents of the file into memory.  The error was due to a failure to set the ValidDataLength on the file during non-cached (non-buffered) I/O write operations.
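
The role of ValidDataLength in this bug can be illustrated with a deliberately simplified model. This is a sketch of the decision only, not the Windows cache manager's actual logic; the function name and the single-page simplification are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/*
 * Simplified model (an assumption, not Windows source code): before a
 * cached write that covers only part of a 4K page, the cache must read
 * the page's prior contents if the page starts below the file's valid
 * data length.  Only the first page touched by the write is considered.
 */
static bool partial_write_needs_read(uint64_t offset, uint64_t length,
                                     uint64_t valid_data_length)
{
    uint64_t page_start = offset - (offset % PAGE_SIZE);
    bool covers_whole_page = (offset % PAGE_SIZE == 0) && (length >= PAGE_SIZE);

    if (covers_whole_page)
        return false;               /* nothing old survives in this page */
    return page_start < valid_data_length;
}
```

In this model, a 100 byte write at offset 6144 into a 1MB file requires a read when the valid data length is 1MB. If the earlier non-buffered writes left the valid data length at 0, the model skips the read, which corresponds to the corruption described above.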

Reparse Points and Symlinks

The March entry entitled Symbolic Links on Windows described in significant detail the challenges of working with Symlinks on Windows.   Over the last six months the management of Symbolic Links via the AFS redirector has been altered.  Instead of representing AFS symlinks with the Microsoft assigned reparse point tag value for OpenAFS, the AFS redirector now uses the reparse point tag used to represent NTFS Symbolic Links.  The benefits of this are many:

Reparse Points to Files

There are many applications that are not reparse point aware.  As described in Microsoft's Symbolic Link Effects on File System Functions, if a directory entry's attributes include the FILE_ATTRIBUTE_REPARSE_POINT flag, the attributes, timestamps and size refer to the reparse point object and not the target of the reparse point, even though a normal CreateFile request will open the target object.  Applying the reparse point size to the target object's stream is likely to result in an incorrect end-of-file determination.

What is most unfortunate is that all versions of .NET and all versions of Java through 1.6 ignore the FILE_ATTRIBUTE_REPARSE_POINT flag.  Of course, from the perspective of application developers that use .NET and Java, the problem is not Microsoft's to solve but a flaw in AFS.   It is viewed as a flaw in AFS because the OpenAFS AFS SMB gateway didn't support reparse points and it did not expose symlinks to the applications.  As a result .NET and Java simply worked.

In 1.7.24 a new registry option has been provided that disables the reporting of Symbolic Links to Files as reparse points.  The 0th bit of the ReparsePointPolicy value, when set, activates this behavior.  When this policy is activated, directory entries for symbolic links to files do not contain the FILE_ATTRIBUTE_REPARSE_POINT flag and their timestamps and file size are those of the target file.
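
The effect of the policy bit can be sketched as follows. This is a minimal illustration, not the redirector's code: the POLICY_FILES_AS_FILES name, the entry_info structure, and the function are hypothetical, and only the FILE_ATTRIBUTE_REPARSE_POINT value (0x400) comes from the Windows SDK.

```c
#include <stdint.h>

#define FILE_ATTRIBUTE_REPARSE_POINT 0x00000400u  /* value from the Windows SDK */
#define POLICY_FILES_AS_FILES        0x00000001u  /* hypothetical name for bit 0 */

/* Hypothetical subset of what a directory entry reports. */
typedef struct {
    uint32_t attributes;
    uint64_t size;
    uint64_t last_write_time;
} entry_info;

/*
 * Decide what to report for a symlink whose target is a file.
 * With bit 0 set, report the target's size and timestamp and hide the
 * reparse point flag; otherwise report the link object itself.
 */
static entry_info report_symlink_to_file(uint32_t policy,
                                         entry_info link, entry_info target)
{
    if (policy & POLICY_FILES_AS_FILES) {
        entry_info out = target;
        out.attributes &= ~FILE_ATTRIBUTE_REPARSE_POINT;
        return out;
    }
    return link;
}
```

An application that ignores the reparse flag then sees the target's real size and no longer truncates the stream.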

Other improvements

There has been a broad range of application compatibility improvements to the network provider interfaces, optimizations of the garbage collection operations, compatibility fixes for IBM AFS 3.6 file servers for those that still use them, and dozens of other small tweaks.

The Summer months will be spent on Windows 8.1 support and a rewrite of the Authentication Group and Process tracking.  The AuthGroup changes are desired so that Reparse Point Policies can be applied at run time to independent groups of applications.


Credits

The OpenAFS for Windows client is the product of Your File System, Inc., Kernel Drivers, LLC, and Secure Endpoints, Inc.  To support the development of the OpenAFS for Windows client, please purchase support contracts or make donations.  The recommended donation is $20 per client installation per year.

Monday, March 25, 2013

IOZone Performance Measurements of OpenAFS

The I/O processing pathways were rewritten for the OpenAFS 1.7.22 release. One industry standard method of measuring I/O performance in a file system independent manner is the iozone benchmark developed and maintained by Don Capps of NetApp.

http://www.iozone.org/ 

This blog post will compare the iozone results for OpenAFS 1.5.75 which uses the SMB to AFS gateway service and OpenAFS 1.7.23 which uses the new AFS redirector. 

The test environment includes a Lenovo Thinkpad W701ds workstation running Win7-64 as the client system. 8GB RAM, dual Core i7 x920 2.00GHz processors (8 cores total), Windows Experience ratings: 
  • Processor: 7.2 
  • Memory: 7.4 
  • Graphics: 5.8 
  • Gaming: 6.5 
  • Disk: 5.9 
The connection to the file server is a 1Gbit wired network through a 10Gbit switch.  The file server is OSX 10.6.8 Server running on a 2010 Mini Server using iSCSI attached storage sharing a single 1Gbit network interface.  The OpenAFS file server is version 1.6.2 using Demand Attach. The AFS cache manager configuration includes: 
  • BlockSize 1 (4KB) 
  • CacheSize 0x200000 (2GB) 
  • ChunkSize 21 (2MB) 
  • RxUdpBufSize 0xc00000 
 All iozone tests were performed using "-Rac output.wks -g 2G".  


Write Performance Comparisons

One of the big complaints with the OpenAFS SMB to AFS gateway is the poor write throughput.  The iozone output for 1.5.75 demonstrates the limitations.  Although the peak throughput for small files (about 1MB) reaches the 30,000 KBytes/second mark, the sustained throughput for larger files is below 16,000 KBytes/second.

[Figure: OpenAFS 1.5.75 (SMB) Write Performance]

The 1.7.23 AFS Redirector does a much better job.  The peak throughput increases with both the record size and the file size.   Depending on the record size the throughput ranges from 30,000 KBytes/second to 65,000 KBytes/second.  This is more than double the peak throughput of the SMB to AFS gateway.

[Figure: OpenAFS 1.7.23 (RDR) Write Performance]

Read Performance Comparisons

1.5.75 read performance is quite inconsistent.   Although there are peak throughput values above 200,000 KBytes/second the majority of record sizes are read at speeds in the 80,000 to 100,000 KBytes/second range.

[Figure: OpenAFS 1.5.75 (SMB) Read Performance]

The 1.7.23 AFS Redirector is faster by a factor of ten.   The majority of record sizes demonstrate read throughput in the 800,000 KBytes/second to 1,000,000 KBytes/second range.

[Figure: OpenAFS 1.7.23 (RDR) Read Performance]

Conclusions

One of the primary goals of converting OpenAFS from an SMB gateway to a legacy file system redirector was a significant improvement in I/O throughput.  The improvements on the read pathway have certainly been obtained.  The 2x improvement in the write path is good but there is certainly room for further improvement.

Sunday, March 24, 2013

Symbolic Links on Windows

Over the last month I have learned more about symlinks on Windows than I ever wanted to know.  As many readers are aware, I am the lead developer of the OpenAFS client for Windows and the AFS name space supports two symbolic link type objects:
  • Mount Points: a directory entry that refers to the root directory of an AFS volume.
  • Symlinks: a directory entry that refers to any absolute or relative target path; traditionally in POSIX notation.
The original AFS client for Microsoft Windows was implemented as an SMB 1.2 to AFS gateway service and it pre-existed Windows 2000, the first version of Microsoft Windows to include NTFS 3.0 and support for reparse points.  Due to the lack of native OS support, AFS specific command-line tools "fs mkmount", "fs lsmount", "fs rmmount" and "symlink make", "symlink list", and "symlink remove" were provided.

In 2007, Peter Scott and I began work on a Windows Installable File System for AFS.  Technically, the new AFS client is a legacy file system redirector driver which has access to the same functionality and flexibility as NTFS.  In Windows Vista and beyond Microsoft added support for symbolic links to files and directories within NTFS.  They implemented this functionality by combining a directory object or a file object with Reparse Point Data.  The data consists of a Reparse Point Tag value (assigned by Microsoft) and a tag specific data structure.

Microsoft assigns reparse tag values and then includes them in future versions of the ntifs.h header file in the DDK.  If you are developing a file system driver for Windows and wish to have a reparse point tag allocated to your driver, follow the instructions at Microsoft's Reparse Point Tag Request page.  Microsoft is likely to assign only a single Reparse Point Tag value for your driver.  Therefore, I recommend that you request a tag value without the "high latency" or "name surrogate" bits set.  You can always combine those bits with your assigned tag value.   The DDK ntifs.h header includes macros to test various bits:
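
A sketch of how those bit tests behave is shown below. The macros are redefined locally so the example compiles without the DDK; in a driver you would include ntifs.h instead. The bit positions are stated here as assumptions: 0x80000000 for the Microsoft bit, 0x40000000 for the high latency bit, and 0x20000000 for the name surrogate bit. The IO_REPARSE_TAG_SYMLINK value 0xA000000C is the standard NTFS symbolic link tag.

```c
#include <stdint.h>

/* Local stand-ins for the ntifs.h macros (bit values are assumptions). */
#ifndef IsReparseTagMicrosoft
#define IsReparseTagMicrosoft(_tag)     (((_tag) & 0x80000000UL))
#endif
#ifndef IsReparseTagHighLatency
#define IsReparseTagHighLatency(_tag)   (((_tag) & 0x40000000UL))
#endif
#ifndef IsReparseTagNameSurrogate
#define IsReparseTagNameSurrogate(_tag) (((_tag) & 0x20000000UL))
#endif

#define TAG_NTFS_SYMLINK 0xA000000CUL   /* IO_REPARSE_TAG_SYMLINK */
#define TAG_OPENAFS_DFS  0x00000037UL   /* IO_REPARSE_TAG_OPENAFS_DFS */
#define TAG_SURROGATE    0x20000000UL   /* name surrogate bit */

/* Combine a vendor tag with the name surrogate bit, as suggested above. */
static uint32_t with_name_surrogate(uint32_t tag)
{
    return (uint32_t)(tag | TAG_SURROGATE);
}
```

With these definitions the NTFS symlink tag tests as both a Microsoft tag and a name surrogate, while the OpenAFS tag is a name surrogate only after the bit is combined in.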
Reparse Points are a generic mechanism for turning a directory or file object into a reference to something else.  The IsReparseTagMicrosoft() macro is important because it determines which data structure will be set on the file system object.  A Microsoft Tag will use the REPARSE_DATA_BUFFER structure whereas a non-Microsoft Tag will use the REPARSE_GUID_DATA_BUFFER structure.  The latter structure can be customized by the driver vendor.  I recommend defining a structure that contains a driver specific sub-tag value and a union of purpose specific values.  In fact, this is what we did for the AFS redirector.

//
// Reparse tag AFS Specific information buffer
//


#define IO_REPARSE_TAG_OPENAFS_DFS 0x00000037L

#define IO_REPARSE_TAG_SURROGATE   0x20000000L

//  {EF21A155-5C92-4470-AB3B-370403D96369}
DEFINE_GUID (GUID_AFS_REPARSE_GUID,
        0xEF21A155, 0x5C92, 0x4470, 0xAB, 0x3B, 0x37, 0x04, 0x03, 0xD9, 0x63, 0x69);

 
#define OPENAFS_SUBTAG_MOUNTPOINT 1
#define OPENAFS_SUBTAG_SYMLINK    2
#define OPENAFS_SUBTAG_UNC        3

#define OPENAFS_MOUNTPOINT_TYPE_NORMAL   L'#'
#define OPENAFS_MOUNTPOINT_TYPE_RW       L'%'

typedef struct _AFS_REPARSE_TAG_INFORMATION
{
    ULONG SubTag;
    union
    {
        struct
        {
            ULONG  Type;
            USHORT MountPointCellLength;
            USHORT MountPointVolumeLength;
            WCHAR  Buffer[1];
        } AFSMountPoint;

        struct
        {
            BOOLEAN RelativeLink;
            USHORT  SymLinkTargetLength;
            WCHAR   Buffer[1];
        } AFSSymLink;

        struct
        {
            USHORT UNCTargetLength;
            WCHAR  Buffer[1];
        } UNCReferral;
    };
} AFSReparseTagInfo;


The motivation behind using reparse points with the AFS redirector stems from limitations of the SMB to AFS gateway. The global AFS name space consists of millions of individual volumes scattered across hundreds or thousands of AFS cells maintained by different organizations. The entire name space can be thought of as being rooted at /afs with /afs/<cellname>/ referring to the volume "root.cell" in the cell whose volume location database servers can be found via a DNS SRV query that assumes a one-to-one mapping between the cell name and DNS domain name.  That is too much information but the point is that when the UNC path \\afs\your-file-system.com\ is evaluated by an AFS client the subset of the AFS name space it refers to is unlikely to be a single volume.  This is really important because the Win32 GetVolumeInformationByHandleW and GetDiskFreeSpaceEx APIs permit an application to query properties of the volume such as the amount of free space, the volume name, serial number, and system flags.

An SMB share UNC path is assumed to refer to a single volume.  The SMB 1.2 server does not return different volume information for different paths.  It always returns the volume information associated with the root of the share.  For AFS this is a nightmare.  Each AFS volume will have a unique name and id.  They will also have an assigned quota, have a certain number of bytes free, and can be either read-only or read-write.  Since the AFS name space and the potential associated storage is infinite but a single volume has finite constraints what should the GetVolumeInformation and GetDiskFree API families return when given an AFS path?  In the SMB world, AFS claims there is only one volume "AFS", it is read-write, the size of the volume is 2TB and there is always 1TB free.

This lying by the SMB to AFS gateway results in some awkward behaviors.
  • Attempts to open a file for write, create a file, truncate a file, or create or remove a directory on a read-only volume return ERROR_WRITE_PROTECT even though the volume properties indicate that it is read-write.  This results in awkward error messages from applications such as the Explorer Shell which checks the FILE_READ_ONLY_VOLUME flag to determine whether operations such as New ..., Rename, Delete, etc. should be removed from menus when the active directory is part of a read-only volume.
  • Since the volume size is hard coded to be 2TB with 1TB free, it is not possible for applications to create files that are larger than 2TB.
  • But worse, the Windows SMB client believes that there is 1TB free.  It can accept vast amounts of data from the application before it discovers that in fact there is no room on the file server to store it.  When the space suddenly disappears the application and the user will receive a "Delayed Write Error" which effectively means "I know I promised you that I would safely store your data for you but I misplaced it and you can't have it back."  In other words, a fatal data loss occurs which more often than not will result in application failure and perhaps a monetary loss.
  • Mount point and symlink objects are not exposed to Windows applications.  The applications believe that there are only directories and files.  This has some really negative consequences.  When an attempt is made to delete a directory object via the Explorer Shell, the shell will delete not only the directory entry but all of the contents of the directory tree below it.  If the directory entry were exposed as a reparse point, only the reparse point would be removed leaving the target intact.  Instead, the Explorer Shell attempts to delete everything.  When a symlink refers to a file, the symlink should be removed but the target should be left alone.  Finally, rename operations should be performed on the mount point or symlink and not on the target object.
When Peter and I designed the AFS redirector one of the goals was to address these shortcomings.  Implementing reparse points for AFS mount points and symlinks was key because reparse point attributes on directory objects are the indication to an application that the directory entry and its target may not be in the same volume; therefore, the volume and disk free information must be fetched.  Of course, not all applications properly pay attention to reparse point attributes.  Application authors frequently assume that a UNC path or a network drive letter mapping must be to an SMB 1.2 share and therefore can only refer to a single volume.  I am tempted to produce a wall of shame for applications that get it wrong.  However, the failure of application authors to implement the correct behavior in their applications is not a reason for a file system to fail to make the data available to them.

Up until the 1.7.21(00) release the AFS redirector exposed mount points and symlink data using the Microsoft assigned IO_REPARSE_TAG_OPENAFS_DFS tag value and the AFSReparseTagInfo structure wrapped by the REPARSE_GUID_DATA_BUFFER structure.  In principle this should have been fine.  Applications should not need to parse the reparse data in order to properly interpret a reparse point.  The file attributes of the reparse point object indicate whether it's a file or a directory.  The high latency bit of the reparse point tag indicates if the target object is located in a Hierarchical Storage Management system that might not be able to answer queries about the target object in a reasonable period of time.  Unfortunately, many applications decide to ignore the FILE_ATTRIBUTE_REPARSE_POINT flag when it is returned by a GetFileAttributes or GetFileAttributesEx call even though these APIs explicitly return information about a reparse point and not the target.  Some applications follow this behavior when the reparse point tag is not recognized which usually means when IsReparseTagMicrosoft() returns false.  Others do it always.

What happens when the FILE_ATTRIBUTE_REPARSE_POINT bit is discarded and the rest of the file attributes are assumed to apply to the target file?  In addition to the file attributes field the GetFileAttributesEx and FindFirstFile families of functions also return the file size.  Now the file size does not have much meaning when the object is a directory but when the target of the reparse point is a file using the wrong file size can be catastrophic.  File contents can be truncated when read or overwritten when written.  Applications will be mighty confused when they continue to append data to a file but believe the file size never changed.  They will be even more confused when they attempt to delete a file only to find that either the reparse point or the target file was deleted but not both.  Regardless, bad things happen and that leaves end users with a bad taste in their mouths.

For the 1.7.22(00) release I decided to significantly flesh out the reparse point handling.  For starters, I had been working with Rex Conn on adding knowledge of AFS Reparse Points to Take Command.  Take Command (and its predecessor 4NT) have long had excellent support for AFS.  Take Command distinguishes in the directory list symlinks to files, symlinks to directories and junctions.  It does so for AFS as well as NTFS.  When Take Command 15 is combined with OpenAFS 1.7.22 users can not only view the target information for AFS mount points and symlinks but can also create them if the Take Command process has the SeCreateSymbolicLinkPrivilege which permits the CreateSymbolicLink API to create a symlink to a directory or a file.

CreateSymbolicLink encapsulates the following operations:
  1. Determine the type of the target object (file or directory)
  2. Create either a directory or a file object to match the target type
  3. Construct the REPARSE_DATA_BUFFER structure using the IO_REPARSE_TAG_SYMLINK tag
  4. Issue the FSCTL_SET_REPARSE_POINT to assign the reparse data to the directory or file
  5. Close the handle to the file or directory
In other words, CreateSymbolicLink only creates Microsoft symlinks.  Since the tag type is in the data structure it is fairly easy for a file system driver to accept both the IO_REPARSE_TAG_SYMLINK data and the file system specific data.  Once implemented it became possible for the Take Command MKLINK command to be used to create symlinks within AFS volumes.
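
Step 3 above can be sketched in portable C. This is an illustration of the symbolic link layout of REPARSE_DATA_BUFFER as documented by Microsoft, not code from CreateSymbolicLink or the AFS redirector. The structure and function names are hypothetical, and the sketch is limited to relative targets, where the same string is assumed to serve as both substitute name and print name (absolute targets use an NT-style substitute name, which is not handled here).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IO_REPARSE_TAG_SYMLINK 0xA000000Cu
#define SYMLINK_FLAG_RELATIVE  0x00000001u
#define MAX_REPARSE_BYTES      16384u   /* 16KB limit on reparse data buffers */

/* Fixed-size prefix of the symlink form of REPARSE_DATA_BUFFER (20 bytes). */
typedef struct {
    uint32_t ReparseTag;
    uint16_t ReparseDataLength;     /* bytes following the 8-byte header */
    uint16_t Reserved;
    uint16_t SubstituteNameOffset;  /* byte offsets into the path buffer */
    uint16_t SubstituteNameLength;
    uint16_t PrintNameOffset;
    uint16_t PrintNameLength;
    uint32_t Flags;
} symlink_reparse_header;

/*
 * Build the buffer for a relative symlink.  target holds UTF-16 code
 * units without a terminator.  Returns bytes written, or 0 on error.
 */
static size_t build_relative_symlink_reparse(uint8_t *out, size_t out_size,
                                             const uint16_t *target,
                                             size_t target_chars)
{
    size_t max_chars = (MAX_REPARSE_BYTES - sizeof(symlink_reparse_header)) / 4;
    size_t name_bytes, total;
    symlink_reparse_header h;

    if (target_chars == 0 || target_chars > max_chars)
        return 0;
    name_bytes = target_chars * sizeof(uint16_t);
    total = sizeof h + 2 * name_bytes;      /* substitute name + print name */
    if (total > out_size)
        return 0;

    h.ReparseTag = IO_REPARSE_TAG_SYMLINK;
    h.ReparseDataLength = (uint16_t)(total - 8);
    h.Reserved = 0;
    h.SubstituteNameOffset = 0;
    h.SubstituteNameLength = (uint16_t)name_bytes;
    h.PrintNameOffset = (uint16_t)name_bytes;
    h.PrintNameLength = (uint16_t)name_bytes;
    h.Flags = SYMLINK_FLAG_RELATIVE;

    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, target, name_bytes);
    memcpy(out + sizeof h + name_bytes, target, name_bytes);
    return total;
}
```

On Windows the resulting buffer is what step 4 would pass to FSCTL_SET_REPARSE_POINT. Because the tag sits in the first four bytes, a driver can inspect it and accept either this layout or its own.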

For the longest time I resisted squatting on Microsoft's tag and data structure but as long as FSCTL_GET_REPARSE_POINT returns the IO_REPARSE_TAG_OPENAFS_DFS data many applications do the wrong thing.  There simply wasn't any choice from the perspective of application compatibility.  As a result in the 1.7.23(00) release AFS Symlinks will be exposed using the IO_REPARSE_TAG_SYMLINK instead of the IO_REPARSE_TAG_OPENAFS_DFS tag.  Only AFS Mount Points will be exposed using the IO_REPARSE_TAG_OPENAFS_DFS tag.

With this change not only can Take Command understand AFS symlinks but so can the Explorer Shell, the Cygwin POSIX environment, the PowerShell Community Extensions, and anything else that can manipulate NTFS symlinks.  Even Hermann Schinagl's Link Shell Extension.

One might think that everyone might be happy at this point except that end users are still faced with applications that do not know how to properly interpret Microsoft Reparse Points.  One example is Microsoft's own .NET.  In Microsoft's How to: Iterate Through a Directory Tree (C# Programming Guide) the author explains:

  NTFS file systems can contain reparse points in the form of junction points, symbolic links, and hard links. The .NET Framework methods such as GetFiles and GetDirectories will not return any subdirectories under a reparse point. This behavior guards against the risk of entering into an infinite loop when two reparse points refer to each other. In general, you should use extreme caution when you deal with reparse points to ensure that you do not unintentionally modify or delete files. If you require precise control over reparse points, use platform invoke or native code to call the appropriate Win32 file system methods directly.

That is not the only thing that .NET does.  It also hides the FILE_ATTRIBUTE_REPARSE_POINT bit in the file attributes from applications and returns the file size of the reparse point data.  As a result parsing a file stream through a symlink to a file results in the data truncation bug.   If the .NET team truly wanted to hide reparse points from application developers, they should have substituted the file attribute information for the target files in all directory enumeration output.  Providing compatibility for broken applications such as this should not be the responsibility of a file system.  However, applications are more important to end users than file systems and if the applications do not work, the file system will be replaced (or never adopted in the first place.)   As a result a future version of the Windows AFS client will probably include a mechanism for requesting that Symlinks to Files be reported as Files and not IO_REPARSE_TAG_SYMLINK reparse points.

While on the subject of Symlinks and Windows I would also like to discuss other approaches to implementing symlinks on Windows that have been implemented over the years.  As I mentioned, Cygwin supports Microsoft IO_REPARSE_TAG_SYMLINK reparse points as Symlinks.

$ ls -l af*
lrwxrwxrwx 1 Administrators None 9 Sep 19  2012 afs -> //afs/all

However, "ln -s target link" cannot be used to create IO_REPARSE_TAG_SYMLINK reparse points.  This is because "ln -s" creates Cygwin specific symlink objects in the file system.  Instead of using reparse points, Cygwin writes a file that begins with a cookie "!<symlink>", followed by a Unicode BOM and the target path in Unicode.  The file has the FILE_ATTRIBUTE_SYSTEM attribute set as an indicator that the file might be a Cygwin symlink.
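
Recognizing such a file from its leading bytes might look like the sketch below. It assumes the cookie is the ASCII string "!<symlink>" and that the BOM is the UTF-16 little-endian pair 0xFF 0xFE; the function name is hypothetical and this is not Cygwin's own code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the buffer starts with the assumed Cygwin symlink cookie
 * followed by a UTF-16LE byte order mark, otherwise 0.
 */
static int looks_like_cygwin_symlink(const unsigned char *data, size_t len)
{
    static const char cookie[] = "!<symlink>";
    size_t cookie_len = sizeof cookie - 1;      /* exclude the trailing NUL */

    if (len < cookie_len + 2)
        return 0;
    if (memcmp(data, cookie, cookie_len) != 0)
        return 0;
    return data[cookie_len] == 0xFF && data[cookie_len + 1] == 0xFE;
}
```

A caller would check the FILE_ATTRIBUTE_SYSTEM attribute first and only then read the start of the file, since the attribute alone is merely a hint.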

On Windows Server, Microsoft provides both a POSIX environment, Interix, and an NFSv3 implementation.  Interix implements symlinks similarly to Cygwin except that the cookie is "IntxLNK\1" and the format of the target path is different.  The NFS implementation, by contrast, identifies its symlinks by use of an extended attribute, "NfsSymlinkTargetName", which stores the target path.

There is one more type of link object in Windows which is sometimes interpreted as a symlink.  That is the Windows Shortcut .LNK file which is interpreted by the Windows Shell.  One thing that is quite odd is that Cygwin at the present time is capable of writing .LNK files but is not capable of creating IO_REPARSE_TAG_SYMLINK reparse points.
[Update: Corinna Vinschen of Cygwin indicates the reason is that POSIX paths can be stored in .LNK files but IO_REPARSE_TAG_SYMLINK fields require the use of Windows file paths and foreknowledge of the target type.]

Microsoft Windows Reparse Points are an extremely powerful and flexible mechanism for implementing file system specific control points.  Much more powerful than the traditional POSIX symlink although much more complex.  An example of a tool that is more powerful because of its reparse point awareness is Microsoft's "Robust File Copy for Windows" tool better known as RoboCopy.   RoboCopy can be configured to exclude junction points (/XJ) by which they mean reparse points; exclude junction points for directories but not files (/XJD); exclude junction points for files (/XJF); and even copy the symlink instead of the target (/SL).   All of these switches work with the Windows AFS client.

My final comment for this post is that evaluating AFS directories which contain symlinks is an extremely expensive operation.  Unlike the POSIX equivalents, a Windows directory enumeration always returns the WIN32_FIND_DATA structure for each directory entry which contains the file attributes.  A reparse point to a directory must have the FILE_ATTRIBUTE_DIRECTORY bit set and a reparse point to a file must not.  All of the other fields of the WIN32_FIND_DATA structure can be determined from the reparse point itself but AFS does not have a method of hinting to the client what the type of the target object is.  As a result, the target path must be evaluated for each and every directory listing.  A directory such as /afs/andrew.cmu.edu/ which contains more than 30,000 relative symlinks to directories will require nearly twice that number of RPCs to the file server to complete the directory enumeration.  Something to think about when planning your AFS name space.

Thursday, March 14, 2013

JPSoftware's Take Command and OpenAFS

I have been a user of Rex Conn's replacement command processors since the early days of 4DOS.  When I switched to OS/2 and began work on OS/2 C-Kermit, 4OS2 was there for me.  When I added REXX language support to OS/2 C-Kermit, 4OS2 added it as well.  When I moved to Windows NT, there was 4NT waiting for me.  In 2003 I began my work on OpenAFS for Windows (WinAFS) which at the time was implemented as a local SMB server proxy to the AFS name space.  Before I started work on the WinAFS client, the only method of accessing the AFS name space was by use of Windows drive letter mappings.  It wasn't possible to consistently access the AFS name space via a UNC path.  It wasn't until the OpenAFS 1.3.66 release in July 2004 that it became possible to live entirely in a UNC \\AFS\cellname\path\ world except that the Microsoft command processor (cmd.exe) does not permit UNC paths to be the current directory.  4NT on the other hand supported UNC paths as the current directory for years and it was a natural fit.  Drive letter mappings suddenly became no longer necessary for my day to day activities.

For those readers that are not long time AFS users there are some important things to understand about the AFS name space.  With a Windows file share, the UNC path \\server\share\ refers to a single on-disk volume on the specified machine.  With AFS, the UNC path \\afs\cell\ instead refers to the root directory of a volume named root.cell in the specified AFS cell.  AFS UNC paths are location independent and do not signify on which physical machines the data is stored.  In fact, root.cell is in most cases a geographically replicated volume. In addition to directories and files, AFS supports mount points and symlinks as first class file system types.  An AFS mount point is an object that refers to the root directory of another AFS volume and symlinks can refer to any absolute or relative file path.

The AFS name space can therefore be viewed as a directed graph of volumes joined to other volumes where each volume contains a directory tree.  Volumes can be either read/write or read-only snapshots of a read/write volume.  Volumes can be assigned quotas or can be permitted to grow to fill the entire partition on which they are stored.  AFS volumes can be migrated from server to server while in use and the amount of free space can change as a result of the volume being moved.  The AFS name space is therefore a challenge to use when it is accessed via the SMB protocol.

SMB file shares were designed prior to the existence of NTFS Junctions and NTFS Symlinks (added in Vista and Server 2008).  The assumption is that there is only one volume on one partition located at the other end of a UNC path.  Obtaining the free space is most often performed using GetDiskFreeSpace which can only refer to root directories and not GetDiskFreeSpaceEx which can refer to arbitrary paths.  Even the MSDN documentation for these APIs states that the reason to use the Ex version is to avoid unnecessary arithmetic whereas the most important reason for using the Ex version in my opinion is that it works with complex name spaces constructed by NTFS junctions and AFS mount points.
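
The "unnecessary arithmetic" is the multiplication of the values GetDiskFreeSpace hands back (sectors per cluster, bytes per sector, and number of free clusters). A minimal sketch of that calculation follows; the function name is hypothetical, and the widening to 64 bits is the detail that is easy to get wrong.

```c
#include <stdint.h>

/*
 * Convert GetDiskFreeSpace style results into a byte count.
 * The first operand is widened to 64 bits so the product cannot wrap
 * at 4GB the way a 32-bit multiplication would.
 */
static uint64_t free_bytes_from_clusters(uint32_t sectors_per_cluster,
                                         uint32_t bytes_per_sector,
                                         uint32_t free_clusters)
{
    return (uint64_t)sectors_per_cluster * bytes_per_sector * free_clusters;
}
```

GetDiskFreeSpaceEx returns the byte counts directly, and it accepts a directory path rather than only a root, which is what matters for name spaces stitched together from many volumes.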

Since the AFS name space is made up of a potentially infinite number of volumes joined together via mount points and volumes can sometimes be read/write and other times be read-only, how should the WinAFS SMB server respond when it is asked to report the total disk space and total free disk space?  It is impossible to provide an accurate value for either of these.  As a result the AFS SMB server would simply lie.  It would report an arbitrarily large number for the partition size and the free space.  Free space was always reported even when there was absolutely none.

Which brings us back to JPSoftware and 4NT.  While it wasn't possible for arbitrary volume information to be obtained via the Win32 API, the AFS fs command obtains this information via the afs path ioctl interface.  In September 2005 Rex Conn added OpenAFS specific knowledge and functionality to 4NT 7.0:
  1. The command parser understands UNIX style inputs /afs/your-file-system.com/user/jaltman and automatically converts them to UNC notation \\afs\your-file-system.com\user\jaltman when the first component matches the AFS "NetbiosName".
  2. The command language contains @AFSCELL, @AFSMOUNT, @AFSPATH, @AFSSYMLINK, @AFSVOLID, @AFSVOLNAME functions which operate on paths and return AFS specific data.
  3. Free space computations use AFS volume information so it is accurate even when the Win32 GetVolumeInformation() call executed over SMB would not be.
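
The conversion described in item 1 can be sketched as follows. This is an illustration only and not Take Command's implementation; the function name is hypothetical, and the comparison against the NetbiosName is an exact, case-sensitive match for simplicity.

```c
#include <stddef.h>
#include <string.h>

/*
 * Convert "/afs/cell/path" to "\\afs\cell\path" when the first path
 * component equals netbios_name.  Returns 1 on success, or 0 if the
 * path does not match or the output buffer is too small.
 */
static int afs_posix_to_unc(const char *path, const char *netbios_name,
                            char *out, size_t out_size)
{
    size_t name_len = strlen(netbios_name);
    size_t path_len = strlen(path);
    size_t i;

    if (path[0] != '/' || strncmp(path + 1, netbios_name, name_len) != 0)
        return 0;
    /* The component must end right after the name. */
    if (path[1 + name_len] != '/' && path[1 + name_len] != '\0')
        return 0;
    if (path_len + 2 > out_size)    /* extra leading backslash plus NUL */
        return 0;

    out[0] = '\\';
    for (i = 0; i < path_len; i++)
        out[i + 1] = (path[i] == '/') ? '\\' : path[i];
    out[path_len + 1] = '\0';
    return 1;
}
```

With netbios_name set to "afs", the input /afs/your-file-system.com/user/jaltman yields \\afs\your-file-system.com\user\jaltman, while a path such as /afsx/foo is left alone.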
Over the last five years as the AFS Redirector has been developed 4NT (now called Take Command) has been a constant companion.  One of my favorite features of Take Command directory listings is its awareness of Reparse Points.  For example:
As you can see, directory listings expand the target of NTFS Junctions and Symlinks providing the target information.  I have for the longest time wanted this behavior for AFS.  Unfortunately, up until a late TC 14.03 build, Take Command did not understand how to parse the AFS Reparse Point data.  Now that it does we get the same useful output:

Although not shown, symlink to file targets are displayed as well.
 
With the release of Take Command 15.0 and OpenAFS 1.7.22 the circle has now been completed.  Not only can Take Command display AFS mount point and symlink targets, but Take Command's MKLINK command can be used to create symlinks to both files and directories, and the DEL and RMDIR commands can be used to remove them.

Take Command's GLOBAL command can either cross [/J] or not cross [/N] junctions as specified.

Finally, Take Command properly uses GetVolumeInformationByHandleW() to obtain volume information.  As a result the built-in AFS functions operate even when AFS is accessed via an NTFS directory symlink.

I recommend Take Command for any user of OpenAFS that relies upon the command shell.

For further information on Take Command visit the JP Software web site at http://jpsoft.com/.

Monday, November 5, 2012

OpenAFS Windows IFS Thirteen Months Later

On 18 September 2011, I discussed the first OpenAFS release that included a native installable file system redirector.  It is often said that it takes ten developer years to shake out all of the bugs and performance glitches in a new file system.  The last year has certainly seen its fill of BSODs, deadlocks, hiccups, and application interoperability issues.  Today, I am releasing version 1.7.18.  Over the last thirteen months more than 750 changes have been implemented improving performance, stability, and application compatibility.  This post will highlight some of the challenges and lessons learned in the process.

Antimalware Filter Driver compatibility
The vast majority of problems that end users have experienced with the AFS redirector have been related to interactions with Anti-Virus and other forms of content scanners which install filter drivers on the system.  Life would be much easier if there was a standard set of hooks that these products could use to scan files and deny access, quarantine, or otherwise alter the normal application data access patterns.  Unfortunately that is not the case and learning what works and what doesn't has often been left to trial and error.

Since AFS is a network file system that relies upon credentials that are independent of the local operating system, there are added complexities.  For example, when Excel opens a spreadsheet file it uses the AFS tokens which are available to the active logon session.  The anti-virus service, on the other hand, is running as an NT service as the SYSTEM or other account in a different logon session.  As such, it does not have access to the user's AFS tokens unless the request to scan the file content is performed by borrowing the File Object from Excel or impersonating the Excel process's security context.  Most anti-virus products do impersonate the calling thread or borrow the File Object but not all do.  Versions of Microsoft Security Essentials prior to 2.0 did not and it was a significant problem for OpenAFS.

Anti-virus scanners can choose to scan during the CreateFile operation and during the CloseHandle operation (aka File Cleanup).  The challenge here for the AFS redirector is that it must hold various locks in order to protect the integrity of the data and provide cache coherency with the file server managed data versions.  Anti-virus scanners can hijack the thread performing the CreateFile or Cleanup and inherit the locks that are already held, or they can spawn a worker thread to re-open the file, perform a scan, and close it again while the application initiated CreateFile or Cleanup is blocked.  Any locks that are held across CreateFile or Cleanup which are required by the anti-virus worker thread will result in a deadlock.  Failure to hold the locks can result in data corruption.  Sophos and Kaspersky were two of the most challenging products to learn to interact with safely.

Microsoft periodically organizes File System Filter Driver PlugFests which give file system developers, anti-virus vendors, vendors of encryption products and content scanners, and others an opportunity to test their forthcoming products against Microsoft's upcoming operating system releases.  The PlugFest is also an opportunity for third-party vendors to perform interoperability testing with each other.  It was unfortunate that, due to increased secrecy regarding the development of Windows 8 and Server 2012, Microsoft was unable to hold a PlugFest for more than a year.  But in 2012 there were two events, in February and August.

The February PlugFest was the first opportunity to interop with a broad range of vendors since the release of 1.7.1.  At that event every Interop session was a painful experience.  During that week 1.7.7 was scheduled to be released but it had to be pulled because of the many problems (deadlocks, BSODs, and data corruption) that were identified during the interop testing sessions.

This past August's experience was the complete opposite.  The code that would become the 1.7.17 release including Windows 8 and Server 2012 specific functionality was tested.  Other than a minor error that was uncovered during the first interop session with Microsoft's own anti-virus engine used in Security Essentials and Windows Defender there was not a single hiccup the rest of the week.  As it turns out, the AFS redirector was the only non-Microsoft file system to implement all of the required new interfaces for Windows 8.

Application Compatibility
Of course, compatibility with deployed applications is the goal.  Whenever possible, applications should be unaware that their data is being stored in AFS as opposed to Windows built-in file systems such as NTFS and CIFS.  This challenge is made more complicated by the fact that most applications do not implement feature tests for optional file system APIs.  Instead they just assume that every feature implemented by NTFS or CIFS will be available everywhere.  Whether the file system is treated as local or remote is often decided by whether or not UNC path notation is used.  Things should become easier for non-Microsoft file systems now that Microsoft has introduced ReFS, a new file system that does not implement many features of NTFS including transactions, short names, extended attributes or alternate data streams; none of which are implemented by the AFS redirector.

Still, it is worth noting that the AFS redirector is a very complete implementation of the NTFS and CIFS feature set including support for CIFS Pipe Services such as WKSSVC and SRVSVC and a full implementation of the Network Provider API.  Both the Pipe Services and the Network Provider API are used by applications to browse the capabilities of the network file system and the available resources such as server and share names.  The Network Provider API is also responsible for managing drive letter to UNC path mappings and path name normalization.  One example of a Network Provider incompatibility was the failure to implement network performance statistics which resulted in periodic 20 second delays from within the Explorer Shell.

Reparse Points
One of the most significant visible changes between the SMB gateway interface and the native AFS redirector is the use of file system Reparse Points to represent AFS Mount Points and Symlinks.  Unlike POSIX symlinks, which are unstructured data, a Windows File System Reparse Point is a tagged structured data type.  Microsoft maintains a registry of all of the tag values and which organization they are assigned to.  More than 50 reparse point tags have been registered and OpenAFS is the proud assignee of IO_REPARSE_TAG_OPENAFS_DFS (0x00000037L).  The OpenAFS Reparse Tag Data has three sub-types (Mount Point, Symlink, UNC Referral) which are used to export the target information for each.
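
To make "tagged structured data" concrete, here is a minimal Python sketch that splits a raw reparse point buffer into its tag, optional GUID, and payload.  It assumes the generic Windows layout for reparse buffers (a 32-bit tag, a 16-bit data length, 16 reserved bits, and, for tags without the Microsoft bit set, a 16-byte GUID before the data).  The function name parse_reparse_header is made up for illustration, and the sketch deliberately does not decode the OpenAFS sub-type payload, whose layout is not described here.

```python
import struct
import uuid

IO_REPARSE_TAG_OPENAFS_DFS = 0x00000037
MICROSOFT_TAG_BIT = 0x80000000  # set in tags owned by Microsoft


def parse_reparse_header(buf: bytes):
    """Return (tag, guid_or_None, payload) from a raw reparse buffer."""
    # ULONG ReparseTag; USHORT ReparseDataLength; USHORT Reserved
    tag, data_len, _reserved = struct.unpack_from("<IHH", buf, 0)
    offset = 8
    guid = None
    if not tag & MICROSOFT_TAG_BIT:
        # Non-Microsoft tags carry a GUID identifying the owner.
        guid = uuid.UUID(bytes_le=buf[offset:offset + 16])
        offset += 16
    payload = buf[offset:offset + data_len]
    return tag, guid, payload


if __name__ == "__main__":
    # Synthetic buffer: the GUID and payload below are placeholders,
    # not real OpenAFS values.
    fake_guid = uuid.UUID("12345678-1234-5678-1234-567812345678")
    raw = struct.pack("<IHH", IO_REPARSE_TAG_OPENAFS_DFS, 4, 0)
    raw += fake_guid.bytes_le + b"abcd"
    print(parse_reparse_header(raw))
```

An application that does not recognize the tag can still read this header; it is only the payload that requires format-specific knowledge.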

When the SMB gateway was used, the entire AFS name space appeared to applications as a single volume exported as a single Windows File Share.  It was not possible for Windows to report volume information (quota, readonly status, etc.) or detect out of space conditions prior to the application filling the Windows page cache.  Now that reparse points are in use, Windows applications can recognize that a path might have crossed from one volume to another.  Tools such as robocopy that are Junction (aka Reparse Point) aware can perform operations without crossing volume boundaries.

While this is a major improvement in capability, it is also a dramatic change in behavior for applications.  Some applications rely upon the assumption that a Windows File Share can only refer to a single volume and further assume that any file path using UNC notation is a path to a Windows File Share.  Such applications can become confused when they query the volume information of \\afs\example.org\ and are told that the volume is READ_ONLY when the full target path \\afs\example.org\user\j\johndoe\ is not.  This is a deficiency in the application and not a fault of the file system.

One downside of the reparse point model is that applications need to understand the format of the structured data to make use of it.  Tools such as JPSoftware's Take Command are reparse point aware but cannot at present properly display the target information.  The same is true for Cygwin and related tools.

Authentication Groups
The SMB gateway client associated credentials with Windows account usernames (or SIDs).  The AFS redirector tracks process creation and associates credentials with Authentication Groups (AG).   Each process inherits an AG from the creating thread and can create additional AGs to store alternate sets of credentials.  When background services such as csrss.exe and svchost.exe execute tasks on behalf of foreground processes they impersonate the credentials of the requesting thread.  By impersonating the caller, the background thread informs the AFS redirector which credentials should be used.

Sometimes a mistake is made and the background service fails to impersonate the caller and instead attempts to rely upon the service's own credentials to perform its job.  This is the case with conhost.exe when it attempts to access or manipulate the contents of the "Command Prompt.lnk" shortcut.  As a result the contents of cmd.exe shortcuts are ignored when initiating command prompt console sessions.

When Will 1.8 Ship?
Users frequently ask "when will 1.8 ship?  I don't want to deploy the new OpenAFS client until it is production quality."  The reason that the OpenAFS client is 1.7.x and not 1.8.x has less to do with stability than it has to do with the rate of change and unfinished work.  The Windows platform has new releases issued every one to two months whereas the rate of issue for the servers and UNIX clients is one every six to twelve months.  The rate of change to support new features or improve compatibility and performance on Windows is significantly higher.  Nearly 1/3 of all patches contributed to OpenAFS.org are new functionality for Windows.  Please do not focus so much on the version label.

1.8 will be issued when the rate of change in the Windows client drops to the point where a new release each month is no longer desirable.  The two most significant areas of work that need to be addressed before a 1.8 release are in the Kerberos bindings and the Installer.  At present, the 1.7.x binaries are built directly against the MIT KFW 3.2 libraries.  This permits OpenAFS to work with KFW 3.2 and the KFW translation layer provided by Heimdal 1.5.  However, the KFW 3.2 API does not permit fine-grained control over the use of DES encryption types nor is it guaranteed to work with future KFW releases from MIT.  The installer requires ease of use improvements.  The user should not be prompted when files are in use but should always be prompted to provide a cell name unless the installation is an upgrade.

What Comes After 1.8?
With large scale deployment comes operational experience.  The AFS Redirector design has been shown to have weaknesses that result in a larger than desired in-kernel memory footprint.  There are four areas in which a redesign would be desirable:

1. The File Control Blocks (FCB) and the Object Information Control Blocks (OICB) are bound to one another even though they could very well have different life spans.  An FCB must exist as long as there is an open HANDLE.  Multiple open handles for the same file system object refer to the same FCB.  The FCB contains metadata about the file object that is specific to the file system in-kernel.  It tracks the allocated file size, the list of data extents that are present in-kernel, etc.  For each FCB there must exist an OICB which contains the AFS specific metadata associated with the file object including AFS data version, AFS FileID, etc.  While an OICB must exist for an FCB, it does not have to be the other way around.

The mutual binding of the OICB and the FCB makes garbage collection more difficult than it needs to be.  Some of the race conditions that were fixed in the 1.7.18 release were the result of this complexity.  One of the important goals of a redesign is to break this mutual dependency and instead only maintain a reference from the FCB to the OICB and not the other way around.  Doing so will permit FCBs to be garbage collected when the last handle is closed and OICB objects to be garbage collected when their active reference counts reach zero.  The garbage collection worker thread will hold fewer locks and have a smaller impact on file system performance.

2. The Directory Entry Control Blocks (DECB) also maintain a reference to the OICB.  In fact, each time a directory is enumerated to satisfy FindFirst/FindNext API requests, not only is a DECB allocated but an OICB is as well.  Permitting the OICB to be allocated only when a FCB is allocated instead of as part of directory enumeration will reduce the in-kernel memory footprint.

3. Directory enumeration is currently performed for the entire directory not only when the directory object is opened by an application but also when a FindFirst API is issued for a non-wildcard search.   The vast majority of FindFirst searches are non-wildcard searches for explicit names.  Instead of populating the full contents of the directory in-kernel, the memory footprint can be further reduced by pushing those queries to the afsd_service process.

4. File data is exchanged between the afsd_service and the Windows page cache by sharing a memory-mapped backing store between the AFS Redirector and the afsd_service.  The control over specific file extents is managed by a reverse ioctl interface between the redirector and the user-land service.  This protocol is racy and can result in inefficient exchanges of control.  Replacing the existing protocol with one that tracks extent request counts and active reference counts will reduce wasteful exchanges and improve data throughput.
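
The one-way reference proposed in the first item can be modeled in a few lines of Python.  This is only a toy illustration of the object lifetimes, not redirector code: the class names, fields, and the collectable helper are all hypothetical, and locking is ignored entirely.

```python
class ObjectInfo:
    """Stands in for an OICB: AFS specific metadata for one object."""

    def __init__(self, file_id):
        self.file_id = file_id
        self.refs = 0  # active references held by FCBs, directory entries, ...


class FileControlBlock:
    """Stands in for an FCB: exists only while handles are open."""

    def __init__(self, info):
        self.info = info   # one-way reference: FCB -> OICB, never the reverse
        self.handles = 0
        info.refs += 1     # the FCB holds one reference on its OICB

    def open(self):
        self.handles += 1

    def close(self):
        """Return True when the last handle closed and the FCB can be freed."""
        self.handles -= 1
        if self.handles == 0:
            self.info.refs -= 1  # release the FCB's reference on the OICB
            return True
        return False


def collectable(info):
    """An OICB may be collected once nothing references it."""
    return info.refs == 0


if __name__ == "__main__":
    info = ObjectInfo("hypothetical-fid")
    fcb = FileControlBlock(info)
    fcb.open()
    fcb.open()
    print(fcb.close(), collectable(info))  # one handle still open
    print(fcb.close(), collectable(info))  # last handle closed
```

Because the OICB never points back at the FCB, the collector in this model only has to inspect a single counter to decide whether an OICB can be freed.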

These proposed changes are a significant undertaking and they will not appear in the 1.7.x/1.8.x release series. 

Credits
The OpenAFS for Windows client is the product of Your File System, Inc., Kernel Drivers, LLC, and Secure Endpoints, Inc.  To support the development of the OpenAFS for Windows client, please purchase support contracts or make donations.  The recommended donation is $20 per client installation per year.

Saturday, November 3, 2012

I want my Windows IFS OpenAFS Client to be fast

In 2008 I wrote "I want my OpenAFS Windows client to be fast", which described the options I used to tune the Windows OpenAFS client that used the SMB server gateway.  As of this writing the current release of OpenAFS for Windows is 1.7.18 which is based upon a native Windows Installable File System, AFSRedir.sys.  This post is an update describing the configuration values I use with the native redirector interface.

The most important options related to throughput fall into two categories:

How much data can I cache?
CacheSize
Stats

How Fast Can I Read and Write?
BlockSize
ChunkSize
Daemons
RxUdpBufSize
SecurityLevel
ServerThreads
TraceOption

All of these options are described in Appendix A of the Release Notes.  Here are the values I use:

CacheSize = 4GB (64-bit)  1GB (32-bit)
Stats = 60,000 (64-bit)  30,000 (32-bit)

BlockSize = 4
ChunkSize = 21 (2MB)
RxUdpBufSize = 12582912
SecurityLevel = 1 (when I need speed I use "fs setcrypt" to adjust on the fly)
ServerThreads = 32
TraceOption = 0 (no logging)
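
As a quick sanity check on two of these values, the following Python snippet converts them to bytes and MiB.  It assumes, as the "(2MB)" annotation above suggests, that ChunkSize is a power-of-two exponent, and that RxUdpBufSize is expressed in bytes; consult Appendix A of the Release Notes for the authoritative units.

```python
MIB = 1024 * 1024

# ChunkSize is treated here as an exponent: chunk bytes = 2 ** ChunkSize.
chunk_size = 21
chunk_bytes = 1 << chunk_size

# RxUdpBufSize is treated here as a byte count.
rx_udp_buf_size = 12582912

print(chunk_bytes, chunk_bytes // MIB)          # 2097152 2
print(rx_udp_buf_size, rx_udp_buf_size // MIB)  # 12582912 12
```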

Non-performance-related options that I use:

DeleteReadOnly = 0 (do not permit deletion of files with the ReadOnly attribute set)
FollowBackupPath = 1 (mount points from .backup volumes search for .backup volumes)
FreelanceImportCellServDB = 1 (add share names for each cell in CellServDB file)
GiveUpAllCallbacks = 1 (be nice to file servers)
HideDotFiles = 1 (add the Hidden attribute to files beginning with a dot)
UseDNS = 1 (query DNS

Sunday, October 2, 2011

Heimdal: Now Playing on Windows Near You

Today, Heimdal 1.5.1 was announced including support for Microsoft Windows.  Asanka Herath gave an excellent presentation on the design plans at the 2010 AFS and Kerberos Best Practices Workshop.  The Heimdal port began in December 2008 in response to several motivations:
  1. Several large Secure Endpoints clients were experiencing significant upgrade problems with MIT Kerberos for Windows due to backward compatibility problems between versions 2.6.x and 3.x.  The problems were due to what is affectionately known as DLL Hell.  Applications built against old versions of KFW do not work with newer versions and vice versa because the list of function exports and the ordinal bindings changed.  To make matters worse, it isn't possible to have more than one version of KFW installed on a system at any given time.  This is because KFW libraries must be installed in a directory listed in the system PATH environment variable.  To address this problem Secure Endpoints issued a proposal to MIT in July 2008 that KFW be converted to use Windows Side-by-side Assemblies.  This proposal along with others to improve Network Identity Manager went over like a lead balloon at the Kerberos Consortium.
  2. Secure Endpoints began work on incorporating Hardware Security Modules such as Thales' nShield into a Kerberized Certificate Authority that could be approved of by The Americas Grid Policy Management Authority.  TAGPMA requires that all certificate authorities store their keys in hardware.  This naturally led us to wonder if we could do the same for a Kerberos Key Distribution Center (KDC).  Heimdal already supported the OpenSSL crypto library which could be used with the nShield HSM.  Asanka presented our ideas at the 2009 AFS and Kerberos BPW.
  3. Finally, OpenAFS needed a number of changes to Kerberos and GSS-API in order to be able to implement the rxgk security class.  There have been numerous presentations on the need for rxgk over the years. Love gave a talk in 2007, Simon gave one in 2010, and another in 2011.  In fact, the rxgk work began back in 2004 at an AFS hackathon in Sweden.  Implementing rxgk requires that all supported platforms provide a Kerberos Crypto Framework (RFC 3961) and the GSS Pseudo-Random Function (RFC 4401).  MIT Kerberos doesn't export a 3961 compatible crypto framework in any version and with the failure to put any resources behind the Windows product there was no GSS PRF support.  The OpenAFS development community has found the Kerberos Consortium quite difficult to work with whereas Heimdal welcomed the proposed changes with open arms.  Heimdal redesigned their repository layout to make it possible for OpenAFS to import core functionality such as the cross-platform compatibility library libroken, the hcrypto library, and the rfc3961 framework.  This in turn permits OpenAFS developers to focus on building a best of breed distributed file system and avoid the need to build and support a Kerberos v5 and GSS-API implementation.  Heimdal is more than just a Kerberos implementation which will permit OpenAFS to more easily support non-Kerberos authentication mechanisms once rxgk is deployed.
The Secure Endpoints distribution of Heimdal is more than just a port to Microsoft Windows.  In order to properly address the needs of existing KFW users and developers, the Heimdal distribution includes a set of KFW 3.x compatible DLLs that act as a shim layer that converts requests issued using the MIT API and forwards them to the Heimdal assembly for processing.

For developers, Secure Endpoints is now distributing a Kerberos Compatibility SDK that will permit applications to be developed which can work seamlessly regardless of whether Heimdal or MIT Kerberos is installed on the system.  OpenAFS and all future Secure Endpoints applications such as Network Identity Manager and the Kerberized Certificate Authority will be built against this SDK.  Applications built against the SDK first search for a compatible Heimdal assembly.  If an assembly is not installed on the system, KFW DLLs are searched for in the PATH and manually loaded.

One important difference between Heimdal and KFW relates to how credential caches and keytabs are implemented.  Instead of compiling all supported cache and keytab types into the Heimdal libraries, Heimdal loads credential caches and keytabs as registered plug-ins.  This permits weak cache and keytab implementations to be removed on systems where they shouldn't be supported and permits new implementations to be developed independently of the Heimdal distributions.  This functionality is going to become very useful for OpenAFS users on Microsoft Windows now that OpenAFS 1.7.x includes native authentication groups.  For the first time it will be possible to develop secure Kerberos credentials cache and keytab implementations whose contents become accessible to processes that are impersonating other processes, something that has only been possible with the Microsoft Kerberos SSP up to this point.

All in all, the release of Heimdal for Microsoft Windows is an important step forward.