Simplifying SAS Viya Part 4: Altering the DATA Step to Run in CAS

Welcome back to my Simplifying SAS Viya series! So far, we have broken down the two main servers in SAS Viya (Compute and CAS), defined caslibs and how to use them, and discussed how to load data into memory. Once data has been loaded into memory, it becomes available for use on the CAS server. That’s great news… but what does that actually mean?

It means we can access in-memory data in a caslib using the DATA step, procedures, or in other SAS Viya applications. This post focuses on the DATA step. I’ll break down how the DATA step processes data on the CAS server, highlight helpful options, and explain considerations when using the SUM statement or grouping data with BY variables.

Before You Begin

Before diving into new ways to work with the DATA step in CAS, here’s a friendly reminder: you do not have to use CAS to manipulate your data if you’re happy with the current state of your programs. In that case, keep the programs as-is - they will run on the Compute server. After data manipulation is complete, load the table into memory for use in other Viya applications.

If you are ready to modify your programs to leverage CAS… read on!

Considerations and Helpful Options

Before running a DATA step on the CAS server:

Start a CAS session.
Ensure there are in-memory tables in a caslib connected to a libref. These steps have been detailed in my previous posts.

The examples in this post were created in SAS Viya for Learners. Before running the examples, I created a CAS session named CJsSession. CASUSER is a caslib with an in-memory table named CHOC_ENTERPRISE.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

After this has been done, both the input and output tables must specify a caslib and the DATA step must be fully CAS enabled. Some restrictions can cause the DATA step to run on Compute instead of CAS. The full list is detailed here.

The SESSREF= option on the DATA statement can be used to specify to only run the DATA step in CAS. If it is not CAS enabled, an error will be generated.

Example:

The following DATA step attempts to create EU_Orders in the WORK library, rather than a caslib. With the SESSREF= option, the step fails and produces an error.

data work.EU_Orders /sessref= CJsSession;
     set casuser.choc_enterprise;
     where continent="Europe";
run

Once the WORK library is changed to the caslib CASUSER, the DATA step runs in CAS as expected.

data casuser.EU_Orders /sessref=CJsSession;
     set casuser.choc_enterprise;
     where continent="Europe";
run;

Another useful system option to generate in-depth messages in the log is MSGLVL=i. When added before the same code as the previous example, the log contains an additional note confirming the WHERE clause ran in CAS.

options msglevel=i;

data casuser.EU_Orders /sessref=CJsSession;
     set casuser.choc_enterprise;
     where continent="Europe";
run;

Behind the Scenes

When the DATA step runs in CAS, data is distributed among multiple threads. CAS automatically determines how many rows of data go to each thread, and the DATA step runs independently on each. The number of threads varies by environment. Use PUT _THREADID_= _N_ to visualize how the data has been distributed.

Example:

data casuser.EU_Orders;
     set casuser.choc_enterprise end=eof;
     where continent="Europe";
     if eof then put _threadid_= _N_=;
run;

Partial results:

The rows in CASUSER.CHOC_ENTERPRISE were distributed to multiple threads- about 500 rows per thread. Each thread completes processing at different times, which is why the threads are not in sequential order in the log. As threads complete processing, they return to their original position on a linked list to maintain row order. This distribution of data is known as multithreaded processing.

SUM Statement

Distributing data works great for speed, but how might this effect a program with an accumulating column?

The sum statement is considered a full table operation, meaning the entire table must be analyzed to produce an accurate result.

In the program below, we would typically expect one row with the total number of orders placed in France. However, this program produces 64 rows. This is because the data was distributed among 64 threads and each thread calculated the sum independently. Results from each thread were returned to the controller, compiled and output to the CASUSER.EU_ORDERS data set.

data casuser.EU_Orders;
     set casuser.choc_enterprise end=eof;
     where continent="Europe";
     if country_nm="France" then FranceOrders+1;
     if eof=1 then output;
     keep FranceOrders;
     if eof then put _threadid_= _N_=;
run;

Partial results:

There are a few ways to remedy these unexpected results:

Option 1: Run a second, single-threaded DATA step.

On the CAS server, the DATA step runs multithreaded by default, but it can also run single threaded. To run the DATA step single threaded, add SINGLE=YES to the DATA statement. When working with large data, the recommended approach is to write the first DATA step multithreaded, and the second single threaded to summarize the condensed results.

The following DATA step uses the summarized EU_ORDERS table created above and creates EU_ORDERS_SUM to generate the final total of France orders. Because the data was already summarized to 64 rows, running a second step single threaded is very efficient (0.01 seconds).

data casuser.EU_Orders_Sum / single=yes;
     set casuser.eu_orders end=eof;
     TotalFranceOrders+FranceOrders;
     keep TotalFranceOrders;
     if eof=1 then output;
     if eof then put _threadid_= _N_=;
run;

Option 2: Run the original DATA step single-threaded.

In the original DATA step, add / SINGLE=YES to the DATA statement. This avoids writing a second DATA step, but may increase run time.

The following DATA step uses the original CHOC_ENTERPRISE table and creates EU_ORDERS with one row totaling the number of France orders in one step. Notice the step took longer than the previous method (0.07 seconds). While this may not be a significant time savings on this data size, it can vary with larger tables.

data casuser.EU_Orders_Single / single=yes;
     set casuser.choc_enterprise end=eof;
     where continent="Europe";
     if country_nm="France" then FranceOrders+1;
     if eof=1 then output;
     keep FranceOrders;
     if eof then put _threadid_= _N_=;
run;

Option 3: Use the Compute server.

Remember you can still run the DATA step on the Compute server without modifying code.

Grouping Data with One BY Variable

When grouping data on the CAS server, there is no need to presort (yay!). Data is distributed to multiple threads, however, each BY group remains together and is processed on the same thread.

When using multiple variables on the BY statement, the first variable determines how rows are distributed to threads. If the first BY variable has a format, CAS uses the formatted value for grouping. Also, note that the descending option is not supported for the first BY variable when running in CAS.

The following PROC SQL step confirms there are four distinct values in the country_nm column.

proc sql;
select distinct country_nm
     from casuser.choc_enterprise;
quit;

The following DATA step contains a BY statement. Only four threads are used, one for each BY group value in country_nm. The DATA step produces four rows with the total number of orders placed in each country.

data casuser.choc_country;
     set casuser.choc_enterprise end=eof;
     by country_nm;
     if first.country_nm then CountryTotal=0;
     CountryTotal+1;
     if last.country_nm then output;
     keep country_nm CountryTotal;
     if eof then put _threadid_= _N_=;
run;

Multiple BY Variables Consideration

While all values within the first BY group stay together on one thread, the row order within a group can vary. This may lead to different results each time the DATA step runs.

Example:

The following DATA step is grouped by two variables: country_nm and city_nm. The first time the step is run, the first row contains the customer Miss Chanelle Gregory. On the second run, the first row contains the customer Rufus Tate.

data casuser.choc_city;
     set casuser.choc_enterprise end=eof;
     by country_nm city_nm;
     keep country_nm city_nm transaction_id customer_name;
     if eof then put _threadid_= _N_=;
run;

Run 1 partial results:

Run 2 partial results:

Solution: ADDROWID Option

To maintain row order, use the ADDROWID option when loading data into CAS. This adds a column to the output data set named _ROWID_ that increments in numeric value with each row. These values will remain unchanged.

The following DATA step reads the table CHOCOLATE_ENTERPRISE_REPORTING in the CHOC library and creates the CHOC_ROWID in-memory table in the CASUSER caslib. Because of the ADDROWID option, _ROWID_ is added as the last column.

data casuser.choc_rowid (addRowId=yes);
     set choc.chocolate_enterprise_reporting;
run;

When grouping CASUSER.CHOC_ROWID with multiple variables, _ROWID_ can be used to ensure the rows are read in order within the groups.

Example:

In the following DATA step, _ROWID_ is not used on the BY statement. When run multiple times, the order of the rows within the groups of country_nm and city_nm is not guaranteed. This can easily be seen as Harriet Ranks is the first row on the first run, but Rick Allen is the first row on the second run of the same DATA step.

data casuser.withoutrowid;
     set casuser.choc_rowid end=eof;
     by country_nm city_nm;
     keep country_nm city_nm transaction_id customer_name;
     if eof then put _threadid_= _N_=;
run;

Run 1 partial results:

Run 2 partial results:

When _rowid_ is added to the DATA step as the final variable on the BY statement, the original data set row order within the BY groups is maintained. This can be seen as Mr. Nick Riley is the first customer after the first and second run of the DATA step.

data casuser.withrowid;
     set casuser.choc_rowid end=eof;
     by country_nm city_nm _rowid_;
     keep country_nm city_nm transaction_id customer_name _rowid_;
     if eof then put _threadid_= _N_=;
run;

Run 1 partial results:

Run 2 partial results:

Conclusion

Throughout this post we covered:

What it means for a DATA step to be CAS enabled and helpful options.
What happens behind the scenes when the DATA step runs in CAS.
How to modify the DATA step when using a SUM statement in CAS.
How grouping works with one or more BY variables in CAS.

All examples were created in SAS Viya for Learners 4, a powerful and free tool for educators and students to explore SAS Viya capabilities. All syntax is applicable in other SAS Viya environments.

For more on SAS Viya, check out the previous posts in this series and the following SAS Viya course:

Find more articles from SAS Global Enablement and Learning here.