Monday, January 18, 2010

Concatenation of Text and Binary Data in T-SQL

Some month ago, I blogged about "String Split Functions" in SQL Server, today I'll write about the inverse part of this topic; The concatenation of text and binary data of several rows, using T-SQL (and the available extensions). As we'll see, just like string splitting, there are several ways to skin the cat.

Topics covered in this blog entry:
  • Classical, recursive @variable assignment within an SELECT statement
  • FOR XML
  • A WHILE loop and utilizing the BLOB data types WRITE() method
  • SQLCLR
Topics not covered in this blog entry:
  • Recursive @variable assignment within a CURSOR/WHILE-loop. This produces the same memory management as the SELECT based version but in a non-set-based manner.
  • SQLCLR as a aggregate function. Reason will be described later.

Test Data
Since we'll speak about performance in below topics, here are two simple tables containing two different kinds of data to be concatenated.

The below table contains 200 rows with 180,000 characters and bytes per row. This means 36,000,000 characters/bytes will be concatenated.
CREATE TABLE LargeData (
   Id INT NOT NULL IDENTITY(1,1)
      PRIMARY KEY CLUSTERED
   ,Txt VARCHAR(MAX)
   ,Bin VARBINARY(MAX)
);

-- fill the table with 200 rows 180,000 characters per row
WITH Numbers AS (
   SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) Num
   FROM master.sys.all_columns
)
INSERT INTO LargeData
   SELECT TOP(200)
      -- text data
      REPLICATE(CONVERT(VARCHAR(MAX), NEWID()), 5000)
      -- binary data
      ,CONVERT(VARBINARY(MAX), REPLICATE(CONVERT(VARCHAR(MAX), NEWID()), 5000))
   FROM Numbers;

As a second kind of source data we'll work with a table containing 50,000 rows but only 16 characters and bytes per row. This means 800,000 characters/bytes to be concatenated.
CREATE TABLE ManyData (
   Id INT NOT NULL IDENTITY(1,1)
      PRIMARY KEY CLUSTERED
   ,Txt VARCHAR(16)
   ,Bin BINARY(16)
);

-- fill the table with 200 rows 180,000 characters per row
WITH Numbers AS (
   SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) Num
   FROM master.sys.all_columns c1, master.sys.all_columns c2
)
INSERT INTO ManyData
   SELECT TOP(50000)
      -- text data
      LEFT(CONVERT(VARCHAR(36), NEWID()), 16)
      -- binary data
      ,CONVERT(VARBINARY(MAX), NEWID())
   FROM Numbers;

For sake of brevity, I'll provide only one sample statement in the following topics. You can find the complete SQL script as attachment at the end of this blog entry.

Classical @variable Assignment
The classical approach to concatenate data from rows into one column (or variable in our case) is an SELECT statement with a recursive variable assignment.
DECLARE 
   @txt VARCHAR(MAX) = ''
   ,@bin VARBINARY(MAX) = 0x;

SELECT 
      @txt += TextColumn
      ,@bin += BinaryColumn
   FROM AnyTable;
(The “+=”-Operator is a new feature of SQL Server 2008. If you work on SQL Server 2005, just replace it with a variable reassignment like “@txt = @txt + TextColumn”. The “+=”-Operator does exactly the same; it’s just a simplified way to write T-SQL.)

This is generally a good solution due to three facts. It’s easy to use. It’s a set-based solution. It’s fast for less data to be concatenated. On the other hand this kind of concatenation runs into problems if you have to handle many or large data, because all data are copied several times.

A look into Profiler shows a huge resource usage, especially for concatenating of many rows to be handled. A concatenation of only 800,000 bytes or characters caused over 180,000,000 read and 1,400,000 write operations.



FOR XML
This is a kind of tricking another build-in feature of SQL Server 2005 (and further versions). FOR XML is generally made to create XML output from a SELECT statement. However, it can also be used to concatenate data without any XML-style.

I learned this trick on SQLServerCentral.com. One article that shows the usage is Jeff Moden's article about Performance Tuning: Concatenation Functions and Some Tuning Myths. (Special thanks to Bluedog67 for this link :-) .)

Another good resource in web for T-SQL is Mladen Prajdić's page where this trick is shown in "SQL Server 2005: Concat values XML Style".

DECLARE @txt VARCHAR(MAX);
SELECT @txt = (SELECT Txt FROM LargeData FOR XML PATH(''));
Literally the text to be provided within the PATH(‘’) is made to provide the name of a parent XML element used for the output data. If you provide an empty string you get a concatenated text of all data returned by the query.

This is a very powerful solution to concatenate text data. Unfortunately it has to be used carefully if source data are unknown. Since this method works with XML, any kind of functional XML characters like “>” or “&” become masked by this method. Binary data cannot be concatenated by XML PATH().

A look into profiler shows a great resource usage and duration.



WRITE Method
SQL Server 2005 introduced the new BLOB data types VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX). As well as previous BLOB data types (TEXT, NTEXT and IMAGE), also the old BLOB functions like TEXTPTR or UPDATETEXT became marked as deprecated. As a replacement for those functions, SQL Server 2005 introduced a new method called WRITE() which is a data type method - not a usual T-SQL function. This means, it is not called to get a BLOB as a parameter but it is called on a BLOB data type, like methods in a object-oriented programming language.

DECLARE 
   @all VARCHAR(MAX) = ''
   ,@buffer VARCHAR(MAX)
   ,@position INT = 0
   ,@len INT
   ,@i INT = 0
   ,@max INT = (SELECT MAX(Id) FROM LargeData);

WHILE (@i < @max)
BEGIN
   -- get the next row value
   SELECT TOP(1)
         @i = Id
         ,@buffer = Txt
         ,@len = LEN(Txt)
      FROM LargeData
      WHERE Id > @i
      ORDER BY Id;

   -- add the text to the existing text
   SET @all.WRITE(@buffer, @position, @len);
   SET @position += @len;
END;
Unfortunately, there seem to be no way (I know?) to use the WRITE method in a set-based way. It is possible to use it within an UPDATE statement but it is called only once and not for each row of the source data.

A look into Profiler shows, the WRITE method much faster than the classical approach of variable reassignment. It's two times faster with ten times less resource usage for larger data and it's 60% faster with about four times less resource usage with many data to be concatenated.



However, is there a way to get this WRITE method faster than this? As I already wrote above, it can also be used in an UPDATE statement, so this should also be tried with a temp table.
-- Large Binary
DECLARE 
   @all VARBINARY(MAX) = 0x
   ,@i INT = 0
   ,@max INT = (SELECT MAX(Id) FROM LargeData);

DECLARE @buffer TABLE (Bin VARBINARY(MAX));

INSERT INTO @buffer
   SELECT 0x;

WHILE (@i < @max)
BEGIN
   -- get the next row value
   WITH cte AS (
      SELECT TOP(1)
         Id
         ,Bin
      FROM LargeData
      WHERE Id > @i
      ORDER BY Id
   )
   UPDATE b SET
         -- get the next id
         @i = cte.Id
         -- write into temp table
         ,Bin.WRITE(cte.Bin, DATALENGTH(b.Bin), DATALENGTH(cte.Bin))
      FROM @buffer b CROSS JOIN cte
END;
SELECT @all = Bin FROM @buffer;

Sure, the SQL statement is a bit more complicated since we have to use a common table expression to get the (ORDERED) next row of source data, but what's about performance?

A look into Profiler shows, execution for many, small data didn't really change. On the other hand concatenation of large data became about 80% faster with only 30% resource usage. So this - more complicated - way to use the WRITE method might be a good investment if large data manipulation is needed.



SQLCLR
Well, if you don't consider to ever enable CLR in SQL Server this blog ends here :-).

Generally, SQLCLR provides two different ways to perform this concatenation.

The first way would be a user-defined aggregate function which gets the text or binary data to be concatenated. This solution would provide a very handy way to be used like this.
SELECT CONCATENATE_TEXT(Txt) FROM ManyData;
Unfortunately, this solution would require to set the resulting assembly to unsafe mode and server to be configured at TRUSTWORTHY level. Why? All SQLCLR functions (procedures, functions, ...) are static methods on a wrapper .NET class and aggregate functions are called again and again for each row to be proceeded. Within this function CLR only get the current value to be handled, what means CLR would have to remember the previous data anywhere in a static (shared) variable. Due to concurrency calls this would have to handled a custom thread-locking to store and get information. I find SQLCLR provides some powerful extensions but an unsafe assembly tastes bad. Due to this fact, I'll not provide this solution for now. (If anybody is interested in this, please leave a short note and I'll provide this in another blog entry.)

The second way to handle the concatenation is a simple user-defined scalar function which gets the SQL query to be executed. This requires a dynamic SQL approach to use the function, though, it is build-in thread safe because it doesn't require any class member variables.

Here's a C# method to concatenate text data.
[Microsoft.SqlServer.Server.SqlFunction(
   DataAccess=DataAccessKind.Read,
   IsDeterministic=false,
   IsPrecise=true,
   SystemDataAccess=SystemDataAccessKind.None
   )]
public static SqlChars ConcatText (
   // the sql statement to be executed
   [SqlFacet(IsFixedLength=false ,IsNullable=false ,MaxSize=-1)]
   SqlString sqlIn
   ) {

   // create a System.Text.StringBuilder to push all text data into.
   StringBuilder buffer = new StringBuilder();

   // connect to database and execute the provided SQL statement
   using (SqlConnection cn = new SqlConnection("Context Connection=true")) {
      cn.Open();
      using (SqlCommand cmd = new SqlCommand(sqlIn.Value, cn))
      using (SqlDataReader reader = cmd.ExecuteReader()) {
         while (reader.Read()) {
            // get the next string value from reader and append it to buffer
            string str = (string)reader[0];
            buffer.Append(str);
         }
      }
   }

   // return the concatenated string
   return new SqlChars(buffer.ToString());
}

Here's a C# method to concatenate binary data.
[Microsoft.SqlServer.Server.SqlFunction(
      DataAccess=DataAccessKind.Read,
      IsDeterministic=true,
      IsPrecise=true,
      SystemDataAccess=SystemDataAccessKind.None
      )]
   public static SqlBytes ConcatBinary(
      // the sql statement to be executed
      [SqlFacet(IsFixedLength = false, IsNullable = false, MaxSize = -1)]
      SqlString sqlIn
      ) {

      // create a System.IO.MemoryStream to push all text data into
      MemoryStream stream = new MemoryStream();

      // connect to database and execute the provided SQL statement
      using (SqlConnection cn = new SqlConnection("Context Connection=true")) {
         cn.Open();
         using (SqlCommand cmd = new SqlCommand(sqlIn.Value, cn))
         using (SqlDataReader reader = cmd.ExecuteReader()) {
            while (reader.Read()) {
               // get the next binary value from reader and append it to buffer
               byte[] data = (byte[])reader[0];
               stream.Write(data, 0, data.Length);
            }
         }
      }
      // important! reset the stream position to the beginning
      stream.Seek(0, SeekOrigin.Begin);

      return new SqlBytes(stream);
   }

The call from T-SQL is equal to any other scalar function call.
DECLARE @bin VARBINARY(MAX);
SELECT @bin = dbo.ConcatBinary('SELECT Bin FROM LargeData');

Due to the fact that SQLCLR (.NET in general) does not know ANSI text like VARCHAR all non-Unicode data have to be translated into Unicode and back into ANSI. Therefore I extended this tests with two additional executions into NVARCHAR variables instead of VARCHAR.

A look into Profiler shows, only XML transformation is slightly faster than SQLCLR, without any restrictions for functional characters and it fully supports binary concatenation. In case of large text data concatenation into a NVARCHAR variable SQLCLR is even slightly faster than XML.



Conclusion
Classical concatenation by recursive variable assignment is a nice way to handle less, small data. In case of larger or many data it appears to be not the best solution.

New WRITE method is a nice extension for SQL Server, especially when working with larger data.

FOR XML PATH('') is a very powerful trick for text data concatenation. Use it (really!), as long as you keep in mind, it is a kind of a hack and should be used carefully.

We all (including me!) know, there are very less useful ways to use SQLCLR. But concatenation of data as well as string splitting are some of those and might be considered if this is a common task in a SQL Server 2005/2008 environment.

Attachments

Post Scriptum
Thanks again to Bluedog67, who showed me another blog about string concatenation by Anith Sen at:
Concatenating Row Values in Transact-SQL
It also covers the CLR aggregate function.

... didn't know this blog entry before...

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.