1         Data modification

This part includes: renaming variable, different types of merges between files, reshaping data (wide to long, long to wide in simple macro)

 

1.1       Reorder variables

 

PROC SQL;

CREATE TABLE      reorderedfile AS

SELECT        var1,   var2,   var3

FROM                        filename;

QUIT;

 

 

1.2       Renaming variable

There is a trick to rename variables when you use the function rename depending on the position of rename in codes.

 

·         PUT RENAME IN THE POSITION OF SOURCE FILE:

DATA target; set source (rename= (var1=newvarname));

RUN;

 

In the case above, the var1 will be renamed before assign to the new dataset. In this case, you can directly use the newvarname to do further work inside of the code above.

 

·         PUT RENAME IN THE POSITION OF TARGET FILE:

DATA target rename= (var1=newvarname)); set source;

RUN;

 

In the case above, the var1 will be renamed after assign to the new dataset. In this case, you CAN NOT use the newvarname to do further work just inside of the code above.

 

1.3       Inserting rows into an existing table

Now suppose w need to add extra record into a existing data (table). Here is the SAS SQL code (if it is characteristic variable, you need to put   . If it is numeric variable, you don’t need   ):

 

·         PROC SQL PROCESS:

PROC SQL;

INSERT INTO           libname.SASfilename

SET                            varname1= “name”,

                                    varname2=value,

                                    varname3=value,

                                    varname4= “name”              /*notice no semi colon here now if there is a second record need to be inserted*/

SET                            varname1= “name”,

                                    varname2=value,

                                    varname3=value,

                                    varname4= “name”              /*notice no semi colon here now at the end of the record*/

;

QUIT;

 

A concise version for the SQL code above is:

 

INSERT INTO           libname.SASfilename

                                    (varname1, varname2, varname3, varname4)

VALUES                    (“name”, value, value, “name”)

VALUES                    (“name”, value, value, “name”)

;

 

 

·         DATA STEP PROCESS:

 

DATA libname.SASfilename

                                    varname1= “name”;

                                    varname2=value;

                                    varname3=value;

                                    varname4= “name”;

OUTPUT;                  

                                    varname1= “name”;

                                    varname2=value;

                                    varname3=value;

                                    varname4= “name”  ;

            OUTPUT;

            STOP;

MODIFY libname.SASfilename;

RUN; 

 

 

After you run the code in your SAS program, you will get log confirms the operation:

 

2 rows were inserted into libname.SASfilename.

 

1.4        Deleting rows

 

·         PROC SQL PROCESS

 

PROC SQL;

DELETE FROM                    libname.SASfilename

WHERE                                 varname IN (“name”)

;

QUIT;

 

·         DATA STEP PROCESS

 

DATA                         libname.SASfilename;

MODIFY                     libname.SASfilename;

WHERE                     varname IN (“name”);

REMOVE;

RUN;

 

1.5       Making value changes

 

·         PROC SQL PROCESS

 

PROC SQL;

UPDATE                    libname.SASfilename

SET                            varname1=new value

WHERE                     varname2= “name”;                         /*this is the condition to change value in varname1*/

QUIT;

 

·         DATA STEP PROCESS

 

DATA                         libname.SASfilename;

MODIFY                     libname.SASfilename;

IF                                             varname1= “name” THEN DO;

                                                Varname2=new value;

                                    REPLACE;

                                    END;

RUN;

 

 

1.6        Merging dataset

I found merging in PROC SQL has better functionality and understandable steps to do certain merging process. Here are some functions:

 

·         CROSS JOIN

 

PROC SQL;

CREATE TABLE tablename AS

SELECT                    *

FROM                        filename1 CROSS JOIN filename2

;

QUIT;

 

The code above will make each row in the first table join with each row in the second named as Cartesian product.

 

·         FULL JOIN

Sort is not required when you use PROC SQL for the join means convenience!

 

PROC SQL;

CREATE TABLE tablename AS

SELECT                    COALESCE (file1.key, file2,key) AS KEY,

                                    var1,

var2

FROM                        file1 FULL JOIN file2

ON                              file1.key = file2.key

;

QUIT;

 

·         LEFT JOIN

Will keep the match rows plus the unmatched rows from the first/left file and only join the match rows from the second/right file

 

PROC SQL;

SELECT                    *

FROM                        file1 LEFT JOIN file2

ON                              file1.key = file2.key

;

QUIT;

 

·         RIGHT JOIN

Will keep the match rows plus the unmatched rows from the second/right file and only join the match rows from the first/left file

 

PROC SQL;

SELECT                    *

FROM                        file1 RIGHT JOIN file2

ON                              file1.key = file2.key

;

QUIT;

 

·         INNER JOIN

Will only join matched rows from both sources.

 

PROC SQL;

SELECT                    *

FROM                        file1 INNER JOIN file2

ON                              file1.key = file2.key

;

QUIT;

 

·         SET OPERATOR—UNION

Difference between join and set operator: join is to align rows and accrete columns; set operators align columns and accrete rows. UNION set operator will include all observations from both files into column format (like append) except duplicate observations will be eliminated.

 

PROC SQL;

SELECT                    *

FROM                        file1

UNION

SELECT                    *

FROM                                    file2

;

QUIT;

 

·         INTERSECT

Will only select observations appear in both file

 

PROC SQL;

SELECT                    *

FROM                        file1

INTERSECT

SELECT                    *

FROM                                    file2

;

QUIT;

 

·         EXCEPT

Will only select observations appear in first file but not in the second file.

 

PROC SQL;

SELECT                    *

FROM                        file1

EXCEPT

SELECT                    *

FROM                                    file2

;

QUIT;

 

 

1.7       Check two consecutive values

 

·         EXAMPLE1

 

DATA newdataset;  

SET sourcedataset;

ARRAY a[*] a1-a35;

ARRAY b[*] b1-b35;

ARRAY c[*] c1-c35;

DO LOOP=1 TO dim(PSA);

*check two consecutive value, if satisfy criteria, select relevant value. Here if two consecutive values are all greater than 0.2, then record the first value of the two consecutive values;

IF a[loop]>0.2 and a[loop+1]>0.2 then c[loop]=b[loop];

END;

RUN;

 

·         EXAMPLE2

 

DATA newdataset1;           

SET sourcedataset;

ARRAY v[*] v1-v6;

ARRAY a[*] a1-a6;

DO loop=2 to dim(v);

*check consecutive value, if satisfy criteria, select relevant value. Here if consecutive values are equal, then record the value of the second consecutive variable to the new variable and assign missing to the corresponding old variable. So, if there are three consecutive variables have the same value, only the second variable will be assigned to missing;

IF         v[loop]=v[loop-1] THEN DO;

a[loop]=v[loop];

v[loop]=.;

END;

END;

RUN;

 

 

DATA newdataset2;           

SET sourcedataset1;

ARRAY v[*] v1-v6;

ARRAY a[*] a1-a6;

*check consecutive value, if satisfy criteria, select relevant value. Here if consecutive values are equal, then record the second value of the consecutive values to the new variable and assign missing to the rest corresponding old variable starting from the second consecutive variables;

DO loop1=1 to dim(v)-1;

IF a[loop1] NE . THEN DO;

IF a[loop1]=v[loop1+1] THEN v[loop1+1]=.;

END;

END;

RUN;

 

 

1.8       Edit cutoff value in a series time (for example days)

 

DATA newfilename; SET filename;

ARRAY day{49} day_1-day_49;

ARRAY d{49} d1-d49;

x=1;

DO i=1 TO 49;

*assign value of that day and after that day to new variables based on cut off time (study day);

IF x<= Study_Day THEN DO;

d{i}=day{i};x=x+1;

END;

/* x=x+1;

If x<Study_Day then d{i}=0;

Else if xStudy_Day then d{i}=1;

End;

Drop x i;

*/

END;

RUN;

 

1.9       Get lowest value from row position

 

DATA newfilename;

SET filename;

ARRAY a[*] a1-a35;

b=9999;

DO loop=1 TO dim(a);

IF a[loop] NE . THEN DO;

IF a[loop]<=b THEN b=a[loop];

END;

END;

DROP loop;

RUN;

 

1.10  Sum value for duplicate records (macro)

 

%MACRO summed(in_file=,out_file=);

PROC SQL;

CREATE TABLE &out_file AS

SELECT DISTINCT var1, sum(var2) as summed_var2, sum(var3) as summed_var3  from &in_file

GROUP BY group_var1, group_var2;

QUIT;

%MEND summed;

 

 

1.11  Column percentage calculation for a series of variables (can be applied to categorical data by excluding or specifying specific categories)  (macro)

 

*very important to set var_1-var_3(whatever is your end of variable) as global;

*to get number of each type, total number of types, percentage;

 

%MACRO iter (cat=,file=, n=);

%DO i=1 %TO &n;

%GLOBAL count_&i;

%GLOBAL count&i;

%GLOBAL percent&i;

PROC SQL NOPRINT;

SELECT count(var_&i) INTO: count_&i FROM &file  WHERE var_&i in (&cat); *specify what category you want to consider into counts;

 

SELECT  count(var_&i) INTO: count&i FROM &file  WHERE var_&i in (1,2,3,4);

QUIT;

/*%put &&count_&i;*/

%LET percent&i= %sysevalf(&&count_&i / &&count&i);

%END;

%MEND iter;

 

DATA base1;

LENGTH varname $15.;

INPUT varname  @@;

DATALINES ;

percent1        percent2        percent3       

;

RUN;

%MACRO summary1(percentagefile=, group=);

DATA &percentagefile;SET base1;

FORMAT percent 10.6 ;

Percent=SYMGET(varname);

group=&group;

RUN;

%MEND summary1;

 

 

 

1.12  Long format to wide format  (macro)

 

/*         file--is the source file you in long format.

Id-- is the sequential number happened and appears the same for each record in your data (like time1, time2, time3).

Subgroup-- is the group variable in your data.

Outfile--is the output file in wide format.*/

 

%MACRO LONGTOWIDE(file=, id=, subgroup=, outfile= );

PROC CONTENTS data=&file

                                                out=cont  noprint;

                        run;

PROC SQL NOPRINT;

SELECT DISTINCT name

INTO: varname1-:varname999

FROM cont (where=(NAME notin ("&id", "&subgroup")));

QUIT;

PROC SORT data=&file;BY &id;run;

%DO i=1 %TO &sqlobs;

%PUT &i &&varname&i;

PROC TRANSPOSE data=&file out=&&varname&i prefix=&&varname&i;

  BY &id;

             ID &subgroup;

             VAR &&varname&i;

RUN;

%END;

DATA &outfile;SET %str(&varname1);    RUN;

%DO j=2 %TO &sqlobs;

DATA &outfile;MERGE &outfile &&varname&j;            RUN;

DATA &outfile;SET &outfile;*drop _name_; RUN;

%END;

%MEND LONGTOWIDE;

 

/*examples*/

DATA long;

INPUT year  famid  faminc spend debt $3. ;

CARDS;

96 1  40000 38000 yes

97 1  40500 39000 yes

98 1  41000 40000 no

96 2  45000 42000 yes

97 2  45400 43000 no

98 2  45800 44000 no

96 3  75000 70000 no

97 3  76000 71000 no

98 3  77000 72000 no

;

RUN;

 

%LONGTOWIDE(file=long, id=year, subgroup=famid,outfile=wide);

 

PROC PRINT data=wide; RUN;

 

1.13  Wide format to long format with two variables in wide format (macro)

/*         file—is the source file you in long format.

Subgroup—is the group variable in your data.

Var1—is the first variable which has a series.

Var2—is the second variable which has a series.

n—is the series number.

Outfile—is the output file in wide format.*/

/* you need to rename your series variable name to var1_1-var1_28(28 is the end of your series of variable) , var2_1-var2_28*/

 

%MACRO WIDETOLONG(file=, subgroup=, var1=, var2=,n=, final=);

PROC SORT data=&file;by &subgroup;RUN;

PROC TRANSPOSE data=&file out=long_&var1 prefix=&var1 ;

            BY &subgroup;

VAR &var1._1-&var1._&n;

DATA long_&var1;set long_&var1;

&var1=sum(of &var1.1,&var1.2 );

DROP &var1.1 &var1.2; RUN;

RUN;

PROC TRANSPOSE data=&file out=long_&var2 prefix=&var2 ;

             BY &subgroup;

VAR &var2._1-&var2._&n;

RUN;

DATA long_&var2;set long_&var2(rename=(&var2.1=&var2));

;RUN;

DATA &final;MERGE long_&var1 long_&var2;BY &subgroup;RUN;

%MEND WIDETOLONG;

 

/*example*/

Wide Data set:

Obs

Subject_Code

Bleeding_1

Bleeding_2

Bleeding_3

IUD_1

IUD_2

IUD_3

1

01-001

0

0

0

0

0

0

2

01-001

1

1

1

0

0

0

3

01-002

1

0

0

0

0

0

4

01-002

0

1

1

0

0

0

 

%WIDETOLONG(file=type1_b_s_group2_IUD, subgroup=subject_code,

var1=bleeding , var2=IUD,n=3, final=test);

 

test (final long format):

 

Obs

Subject_Code

_NAME_

_LABEL_

bleeding

IUD

1

01-001

IUD_1

Bleeding_1

1

0

2

01-001

IUD_2

Bleeding_2

1

0

3

01-001

IUD_3

Bleeding_3

1

0

4

01-002

IUD_1

Bleeding_1

1

0

5

01-002

IUD_2

Bleeding_2

1

0

6

01-002

IUD_3

Bleeding_3

1

0