Solved: Proc Import Puts An Unwanted Apostrophe Before a Variable

jdadams

Dear Folks:

I'm using Proc Import to read in a semicolon delimited file, but the output file places an apostrophe (and sometimes more) in front of a variable, which happens to be a name. My Proc Import statement follows below, and after that I include a snippet of raw data starting at the top of this variable length file. I apologize for raising what may be a basic question, but I honestly do not know the answer. In my output, claremont h reads as 'claremont h, and later on the same line, claremont reads as 'claremont, and so on. Perhaps I should have used Infile-Input instead of Proc Import. Thanks for any help that you can provide.

proc import datafile = 'G:\PubDB_CSV_2015\public.individual.csv'
out = individual
dbms = dlm REPLACE;
delimiter = ';';

guessingrows = 500;
getnames = YES;
datarow = 2;
run;

individual_id;name_hash;first_name;first_initial;second_initial;last_name;individual_field

111;claremont h;heather;h;a;claremont;MEDI

113;braley g;"";g; ;braley;MEDI

jdadams

I wanted to update my original message on whether Proc Import placed an unwanted apostrophe before a variable. It turns out that, while this can happen with Excel (not SAS) as a flag indicating text to follow, Excel did not do this with my data. Instead, the data turned out to be valid.

The variables were last names, and the data were sorted so that observations with apostrophes in front appeared first. They say that truth can be stranger than fiction,and that was the case here. By sorting cases with apostrophes first, Dutch names were displayed. Since there are almost 600,000 names, there are very many Dutch names. For example, 't Hart is a Dutch surname, and there are others, like 't Hooft. (The characters 't stand for Het, which is Dutch for the article 'The'.)

My international data just happened to show these cases. So the problem in this case turned out to be Dutch Names and my lack of familiarity with them.

View solution in original post

Tom

Your example file works fine.

options parmcards=csv;
filename csv temp;

parmcards4;
individual_id;name_hash;first_name;first_initial;second_initial;last_name;individual_field
111;claremont h;heather;h;a;claremont;MEDI
113;braley g;"";g; ;braley;MEDI
;;;;

proc import file=csv dbms=csv out=want replace;
  delimiter=';';
run;

proc print;
run;

I would say most likely your original file actually has the single quotes. Excel uses that character to force interpretation of a cell as character instead of numbers. So you probably cannot SEE the single quote when looking at the CSV file using EXCEL. Instead look at the CSV file using a TEXT editor. Or just use SAS to look at the file.

data _null_;
  infile 'G:\PubDB_CSV_2015\public.individual.csv' ;
  input;
  if index(_infile_,"'") ;
  examples+1;
  if examples > 10 then stop;
  list;
run;

jdadams

Hello Tom:

Thanks for the suggestion. I'll look into this further using Infile-Input which worked in your rendering of the example data. Thanks for your help.

jdadams

I wanted to update my original message on whether Proc Import placed an unwanted apostrophe before a variable. It turns out that, while this can happen with Excel (not SAS) as a flag indicating text to follow, Excel did not do this with my data. Instead, the data turned out to be valid.

The variables were last names, and the data were sorted so that observations with apostrophes in front appeared first. They say that truth can be stranger than fiction,and that was the case here. By sorting cases with apostrophes first, Dutch names were displayed. Since there are almost 600,000 names, there are very many Dutch names. For example, 't Hart is a Dutch surname, and there are others, like 't Hooft. (The characters 't stand for Het, which is Dutch for the article 'The'.)

My international data just happened to show these cases. So the problem in this case turned out to be Dutch Names and my lack of familiarity with them.

ballardw

Thank you @jdadams for a clear description of the data issue. Learning such things can be long row to hoe.

Had a similar issue with phone numbers once upon a time. We had code to clean up a phone number field that was not stored in a clearly defined form. The same number could appear as (123) 555-1234, 123 555-1234, 1235551234 plus things like extensions, with or without something like EXT or other indicator (as well as other stuff sometimes). So we had code that stripped all the non-digit characters and selected the first 10 to get the 1235551234 version. Turned out that we had some international numbers in the data and the first 2 characters were the nation code. So we had numbers originating in Brazil that we called as if they were in Colorado.

jdadams

Dear @ballardw Thanks for your comment on my apostrophe data problem. It means a lot to me given your leadership role in the SAS Community. I think this episode tells an interesting story, that you should check your data as well as the software that accesses it.

Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

Re: Proc Import Puts An Unwanted Apostrophe Before a Variable

The 2025 SAS Hackathon Kicks Off on June 11!

SAS Training: Just a Click Away