The SAS7BDAT file format

This page gives some notes on the SAS7BDAT file format, which is currently the main format used for storing SAS datasets across all platforms. The format is proprietary to the SAS Institute, but having successfully reverse-engineered most of the old PC-SAS v6 file format back in the 1990s, I've always wanted to spend a bit of time figuring out the internals of the more complicated SAS7BDAT format.

Why do this? Well, partly out of simple intellectual curiosity, but I think this sort of information will also become increasingly useful as initiatives like CDISC open up the pharmaceutical clinical trials programming industry to third-party tools and systems. Being able to at least read some of the structure from native SAS7BDAT files will allow tighter integration of modern systems with legacy SAS systems. The eventual plan is to publish a C++ class here for opening, reading and possibly modifying SAS7BDAT files.

If you're interested in this information, please leave a comment - any further insights and contributions are also welcome. As you'll see if you read on, the information here is quite sketchy for now. I'll be adding to it on a regular basis as time allows me to narrow down more of the details.

Data Files

Technique

First, one creates a dataset with no variables, no observations and no label:

    data vars0obs0lab0; stop; run;

Next, one creates a dataset with one variable, but still with no observations and no label:

    data vars1obs0lab0; x = .; delete; run;
The differences between these two files will reveal where and how the list of variables is stored within the file. Eventually one can build up a picture of where all the important bits of information are stored, so that they can be read and decoded without needing a SAS installation to read them.

Limitations

Initial thoughts

    The CONTENTS Procedure

    Data Set Name: D.VARS0OBS0LAB0                    Observations:         0
    Member Type:   DATA                               Variables:            0
    Engine:        V8                                 Indexes:              0
    Created:       16:31 Wednesday, April 2, 2008     Observation Length:   0
    Last Modified: 16:31 Wednesday, April 2, 2008     Deleted Observations: 0
    Protection:                                       Compressed:           NO
    Data Set Type:                                    Sorted:               NO
    Label:

    -----Engine/Host Dependent Information-----

    Data Set Page Size:       4096
    Number of Data Set Pages: 1
    First Data Page:          1
    Max Obs per Page:         3616
    Obs in First Data Page:   0

    The CONTENTS Procedure

    Data Set Name: D.VARS1OBS0LAB0                    Observations:         0
    Member Type:   DATA                               Variables:            1
    Engine:        V8                                 Indexes:              0
    Created:       16:31 Wednesday, April 2, 2008     Observation Length:   8
    Last Modified: 16:31 Wednesday, April 2, 2008     Deleted Observations: 0
    Protection:                                       Compressed:           NO
    Data Set Type:                                    Sorted:               NO
    Label:

    -----Engine/Host Dependent Information-----

    Data Set Page Size:       4096
    Number of Data Set Pages: 1
    First Data Page:          1
    Max Obs per Page:         501
    Obs in First Data Page:   0

Both these files, which were created on a Windows machine, are 5,120 bytes long. As PROC CONTENTS tells us that the file structure includes something called a 'data set page', and that each of these files only has one of them, and each is 4,096 bytes long, we can infer that the file also contains an extra 1,024 bytes. Therefore, it seems very likely that the first 1,024 bytes of each file is a generic header, which we would expect to contain things that only occur once per file, like the dataset name, creation date, modification date, etc.
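The size arithmetic above can be written out as a small Python sketch (Python is used here purely for illustration; the function name `inferred_header_size` is hypothetical). It assumes, as the text does, that a file is one header followed by whole data set pages, which is a working hypothesis rather than a documented fact.

```python
def inferred_header_size(file_size, page_size, page_count):
    """Return the bytes left over once the data set pages are accounted for.

    Assumes the layout <header><page 1>...<page N>, which is the working
    hypothesis in the text, not a documented property of the format.
    """
    pages_total = page_size * page_count
    if pages_total > file_size:
        raise ValueError("the pages alone are larger than the file")
    return file_size - pages_total


# Windows files described above: 5,120 bytes on disk, one 4,096-byte page.
print(inferred_header_size(5120, 4096, 1))  # 1024
```

The same function can be pointed at any other file once its size on disk and the two PROC CONTENTS figures are known.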
The fact that PROC CONTENTS reports 'First Data Page' and 'Obs in First Data Page' strongly suggests that some repeating metadata is stored, not in the general header, but in the first data pages. This makes sense, because we know that a dataset can have thousands of variables, each with a long name, a label, and so on, and all that metadata couldn't possibly fit into a fixed-size general header. A working assumption for now, then, is that the list of variable metadata is stored in the first data page, rather than in the general header. The actual data then starts part-way through the first data page, which is why it holds fewer observations than the other data pages.

It's interesting to note that a 4,096 byte data page can only hold 3,616 observations, even when the observation length is 0 (this is probably spurious - PROC CONTENTS has probably narrowly avoided a divide-by-zero and is reporting some sort of best guess). Is there a per-observation overhead, or is it per-page? The second dataset, with an observation length of 8, can store 501 observations per page, or 501 * 8 = 4,008 bytes. With even one byte per observation of overhead, 501 observations wouldn't fit (501 * 9 = 4,509, which is more than 4,096). With no overhead at all, though, 512 observations would fit (4,096 / 8 = 512). It seems likely, then, that there is some sort of per-page header block of around 88 bytes (4,096 - 4,008) - possibly less, but more than 80 bytes, otherwise an extra observation could have been fitted in.
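The bounds on the per-page overhead can be checked with a short Python sketch. The function name `page_overhead_bounds` is hypothetical, and the calculation assumes a fixed per-page header with no per-observation overhead, as argued above.

```python
def page_overhead_bounds(page_size, obs_length, max_obs):
    """Return (low, high) such that the per-page overhead h obeys low < h <= high.

    Assumes max_obs == floor((page_size - h) / obs_length), i.e. one fixed
    header block per page and nothing extra per observation. This is the
    hypothesis from the text, not a confirmed fact about the format.
    """
    if obs_length <= 0:
        raise ValueError("observation length must be positive")
    used = max_obs * obs_length
    high = page_size - used              # overhead if the page is packed exactly
    low = page_size - used - obs_length  # at this overhead one more observation fits
    return low, high


# Windows dataset with one 8-byte variable: 501 observations per 4,096-byte page.
print(page_overhead_bounds(4096, 8, 501))  # (80, 88)
```

Datasets with other observation lengths should narrow the interval further, since each one gives a different pair of bounds on the same overhead.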
Platform independence

    The CONTENTS Procedure

    Data Set Name: D.VARS0OBS0LAB0                    Observations:         0
    Member Type:   DATA                               Variables:            0
    Engine:        V8                                 Indexes:              0
    Created:       11:26 Wednesday, April 2, 2008     Observation Length:   0
    Last Modified: 11:26 Wednesday, April 2, 2008     Deleted Observations: 0
    Protection:                                       Compressed:           NO
    Data Set Type:                                    Sorted:               NO
    Label:

    -----Engine/Host Dependent Information-----

    Data Set Page Size:       8192
    Number of Data Set Pages: 1
    First Data Page:          1
    Max Obs per Page:         7256
    Obs in First Data Page:   0

    The CONTENTS Procedure

    Data Set Name: D.VARS1OBS0LAB0                    Observations:         0
    Member Type:   DATA                               Variables:            1
    Engine:        V8                                 Indexes:              0
    Created:       11:26 Wednesday, April 2, 2008     Observation Length:   8
    Last Modified: 11:26 Wednesday, April 2, 2008     Deleted Observations: 0
    Protection:                                       Compressed:           NO
    Data Set Type:                                    Sorted:               NO
    Label:

    -----Engine/Host Dependent Information-----

    Data Set Page Size:       8192
    Number of Data Set Pages: 1
    First Data Page:          1
    Max Obs per Page:         1005
    Obs in First Data Page:   0

Clearly, each SAS platform has its own preferred values for the size of the general header and data pages. However, each platform is also flexible enough to read dataset files with non-native values. The size of the header is therefore very likely to be stored somewhere in the header - that will be a key value to find. The size of the data pages is probably also stored in the header.

The size of the general header on Unix seems to be 8,192 bytes, instead of 1,024 on Windows. That's something of a puzzle - clearly, if my reasoning is right so far, there can be header information above 1,024 bytes in a Unix file that isn't necessary in a Windows file. My reasoning may well be wrong, or it may just be that the part of the header above 1,024 bytes on Unix isn't used - something else to investigate.

The per-page overhead seems to be larger in these datasets - 1,005 8-byte observations (8,040 bytes) leave 152 bytes spare in each 8,192-byte page.
The size of the per-page header, if there is one, is therefore variable and must also be stored somewhere in the dataset, probably in the general header.

Dataset Header

The string 'SAS FILE' is fairly uninformative - it's likely that this is present as a simple check that the file being opened really is a SAS file, and indeed, changing the string to 'DDD FILE' in a hex editor causes SAS to reject the file with 'ERROR: File T.VARS0OBS0LAB0.DATA is not a SAS data set'. The string 'DATA ' identifies this particular SAS file as a dataset - the same location reads 'CATALOG ' in a formats catalog file.

Next, things get more interesting. In the Windows dataset, the next sixteen bytes starting at offset 0xa4 give the dataset's creation and modification timestamps, stored as little-endian eight-byte floating-point values. This can be confirmed in two ways - first, modify one or two bytes in a hex editor and re-run PROC CONTENTS to see if the reported creation / modification timestamps have changed; and second, convert the floating-point values back to numerics in SAS:

    data _null_;
      dt = input('41d6b0eb1e6851ec', hex16.);
      put dt= datetime20.;
    run;

    dt=02APR2008:16:31:54

The timestamp of 02APR2008:16:31:54 matches that given by PROC CONTENTS earlier. Note that the bytes from the Windows dataset have to be reversed into big-endian order in the input() function for this to work - the SAS HEX16. informat expects big-endian byte order on all platforms, including those which are natively little-endian.

Looking at the hex dump for the Unix version of this file, however, things are different. There are four zero bytes at offset 0xa4, and the creation and modification dates start at offset 0xa8. Furthermore, the floating-point values are big-endian. The zero bytes are probably there for data-alignment reasons, and the variable endianness is probably intended to maximise performance when a dataset is read on its native platform.
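The timestamp decoding can also be done outside SAS. The following is a minimal Python sketch, assuming the eight bytes are an IEEE 754 double holding seconds since the SAS epoch of 1 January 1960; the function name `decode_sas_datetime` is hypothetical, and the caller is assumed to already know the file's byte order.

```python
import struct
from datetime import datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)  # SAS datetime values count seconds from here


def decode_sas_datetime(raw, little_endian):
    """Decode 8 raw header bytes as a double of seconds since 1 January 1960."""
    fmt = "<d" if little_endian else ">d"
    (seconds,) = struct.unpack(fmt, raw)
    return SAS_EPOCH + timedelta(seconds=seconds)


# The hex string from the SAS example is in big-endian order; reversing it
# byte by byte gives the order the bytes would have in the Windows file.
big_endian_bytes = bytes.fromhex("41d6b0eb1e6851ec")
windows_bytes = big_endian_bytes[::-1]

print(decode_sas_datetime(windows_bytes, little_endian=True))
print(decode_sas_datetime(big_endian_bytes, little_endian=False))
```

Both calls print the same instant, within a second of the 02APR2008:16:31:54 shown by SAS; the stored value carries a fractional second that the datetime20. format does not display.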
The platform-specific byte order suggests that large SAS datasets moved from a Unix box to Windows, for example, should be converted to native Windows format to improve performance.

The data-alignment and endianness must be specified somewhere early on in the header, so that SAS can determine how to read the rest of it. Comparing the Unix and Windows files, the first 32 bytes are identical in each, but a few bytes starting from offset 0x23 are different - specifically, bytes 0x23, 0x25 and 0x26 all differ just in bit 0, and byte 0x28 is 0x01 in the Unix file and 0x04 in the Windows file.

Further experimentation with larger files suggests that the two bytes starting at offset 0xc9 (in the Windows files) hold the dataset page size, and the next four bytes starting at offset 0xcb hold the number of dataset pages. In the Unix files, the same fields occur four bytes further on, and are big-endian instead of little-endian.
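The byte-by-byte comparison of two files used throughout these notes can be sketched in a few lines of Python. The function name `differing_offsets` is hypothetical, and the two buffers below are synthetic stand-ins rather than real SAS7BDAT headers; only the values at offset 0x28 are taken from the text.

```python
def differing_offsets(a, b, limit=None):
    """Return a list of (offset, byte_a, byte_b) where the two buffers differ.

    Only the common prefix of the buffers is compared, optionally capped
    at `limit` bytes (for example, the size of the general header).
    """
    n = min(len(a), len(b))
    if limit is not None:
        n = min(n, limit)
    return [(i, a[i], b[i]) for i in range(n) if a[i] != b[i]]


# Synthetic 64-byte buffers standing in for the start of each header.
# With real files, the bytes would come from open(path, "rb").read().
unix_like = bytearray(64)
windows_like = bytearray(64)
unix_like[0x28] = 0x01      # value reported above for the Unix file
windows_like[0x28] = 0x04   # value reported above for the Windows file

for offset, x, y in differing_offsets(bytes(unix_like), bytes(windows_like)):
    print(f"0x{offset:02x}: 0x{x:02x} vs 0x{y:02x}")
```

Run against a pair of real files, the same loop lists every offset worth investigating, which is essentially the technique applied by hand with a hex dump.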