|
--- |
Using PROC SQLPROC SQL is provided as part of base SAS and is a very comprehensive implementation of this standard data query and data manipulation language. You can find more information about SQL here and SAS's own documentation for PROC SQL is here (version 8) and here (version 9). Having used PROC SQL more and more over the years, I'm now of the opinion that it should be your first choice when writing the bulk of your data manipulation and reporting code. Data step code still has its place, of course, but I would recommend using PROC SQL whenever possible, and data step only when necessary. So, why is SQL so useful? The main advantage is that it is so compact. Even allowing for the verbose nature of most SQL statements, you can still achieve more in a few lines of SQL than you can in twice as many lines of plain data steps and PROC SORTs. Data steps are quite fragile, in that they often require that their input dataset is sorted in a specific order; this means that, whenever you write a BY statement in a data step, you have to either (a) add a preceding PROC SORT to get the data into the right order, or (b) analyse the preceding steps to ensure that your input data is guaranteed to be in the correct order. PROC SQL doesn't have this requirement - you never need to worry about the ordering of your input data. Example 1 proc sort data=indata.demog out=demog; by subjid; run; proc sort data=indata.trt out=trt; by subjid; run; data merged; merge demog (in=a) trt; by subjid; if a; run; Using SQL, this reduces to a single statement: proc sql; create table merged as select a.*, b.trtcode from demog.a left join trt.b on a.subjid = b.subjid ; quit; Even using my indenting style, PROC SQL gets the job done in six lines of code instead of 11, reducing the chance of errors and making the code easier to read and easier to maintain. More significant than that, though, is the difference in feel between the two examples; the SQL is cleaner and more specific about its intentions - merge ('join' in SQL terminology) the variable TRTCODE from TRT onto DEMOG. The data step code, although it works for now, introduces an insidious bug - being inherently lazy, I didn't bother to restrict the merge to only take variable TRTCODE from TRT. Let's say someone later adds a variable called VISIT_DATE to TRT - the datastep code will take that variable from TRT and overwrite the existing VISIT_DATE in DEMOG - disaster! The SQL version, being more explicit about its intentions (despite also being much shorter) is safe from this bug. Example 2 proc sort data=indata.demog out=demog; by subjid; run; proc sort data=indata.trt out=trt; by subjid; run; data no_dem; merge demog (in=a) trt (in=b); by subjid; if b and not a; run; proc means data=no_dem noprint; class trtcode; output out=result (where=(trtcode ne '') keep=trtcode n) n=n; run; There are several potential problems here: there are two unnecessary PROC SORTs, and the merge relies on the statement if b and not a being coded correctly. This particular solution to the problem required 15 lines of code. Here's the SQL version: proc sql; create table result as select distinct trtcode, count(distinct subjid) as n from trt where subjid not in (select subjid from demog) group by trtcode ; quit; Six lines of code, just one-third of the non-SQL version. And again, the SQL solution is not only much more concise, it's more concise precisely because it is much more explicit and straight-to-the-point about what it's setting out to achieve. All-in-all, a vast improvement in readability, robustness and ease of maintenance. Learning PROC SQL As with most things, the best way to learn SQL is to simply start playing with it and trying a few things out. You'll soon find that you can write very powerful searching, merging and summarising queries with just a few lines of code. Give it a try! PROC SQL tip * Count number of visits per subject; proc sql; create table out1 as select subjid, count(visit_date) as visits from in1 group by subjid ; quit; * Remove duplicates; proc sort data=in1 nodupkey; by subjid; run; The novice programmer has added the PROC SORT to "remove duplicates" because, by default, SQL outputs one row for every row in the input, even when you apply a summarisation function like count(). If you don't want this, just add the distinct keyword to the select statement; you can then get rid of the extra PROC SORT: * Count number of visits per subject; proc sql; create table out1 as select distinct subjid, count(visit_date) as visits from in1 group by subjid ; quit; |
Comments
View comments