Best practices for importing a delimited file in SAS using the DATA step

Statistics
Coding
Author

Vinh Nguyen

Published

July 23, 2013

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

{sas eval=FALSE} proc import datafile="/path/to/my_file.txt" out=work.my_data dbms=dlm replace ; delimiter="|" ; guessingrows=32000 ; run ;

PROC IMPORT isn't a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

{sas eval=FALSE} filename my_pipe pipe "gunzip -c my_file.txt.gz" ;

because SAS needs "random access" to the fileref (i.e., to determine the variable type). PROC IMPORT also isn't suitable when you have a very large data set where one of the columns might contain an element that has a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover, missover (don't go to next line if line ends early), dsd (allow empty field) and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters long) options in the infile statement to avoid unnecessary errors.

Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

{sas eval=FALSE} filename my_pipe pipe "gunzip -c my_file.txt.gz" ; data my_data ; infile my_file dlm="|" dsd truncover missover lrecl=50000 ; length x2 $50 x3 $25 ; informat x4 date9. ; format x4 date9. ; input x1 x2 $ x3 $ x4 ; run ;