Efficient extraction of data using binary search and ordering information

I'm quite excited and honoured to host a guest entry from Yusuf Moosa Motara covering his talk at ZaCon (a video of which can be found here, and the slides here).

Caveat

I make no claims about the novelty of this technique. For all I know, it's been widely known for some time, but I could find no record of it. The target server is SQL Server in this text, but the technique is applicable to other implementations. If nobody else bothers to claim it, I hereby name the technique after myself: nothing impresses random chicks quite as much as having an obscure and totally-irrelevant-to-their-lives technique named after oneself, after all.

On second thought -- call it whatever you want to ;).

Background

Some injection points are worse candidates than others. A difficult injection point is found in the ORDER BY clause of a SQL statement, since there can be no further statement after an ORDER BY and our only real "in" is the ability to put in expressions, though those expressions may do nothing more than change the resulting order. Such a change of ordering does provide us with a 1-bit channel to slip data along. We could do so a simple manner:

[...] ORDER BY case when (select top 1 substring(username,1,1) from
users)=<91>a<92> then ID else Address end

... but a quick examination will tell us that the time taken to extract data with this technique is around 0.5*/n/*/m/, where /n/ is the length of the data, and /m/ is the size of the alphabet, assuming that (on average) the correct character is found after half the search-space has been traversed. Reordering the character test order in accordance with the frequencies that we expect to find in the text can yield better results, but not results that are significantly better.

A better way

We can do more than directly compare letters: we can get ordering information out. This allows us to use a binary search algorithm to find the correct text efficiently. In other words, we can use a query similar to:

[...] ORDER BY case when (select top 1 cast(username as varbinary(255)) from
users where username not in ('')) __OP__ cast('__CANDIDATE__' as varbinary(255))
then ID else Address end

The subselect provides us with a way to get the first row of data. After getting the column data for that row, we can modify the "not in" clause to include the retrieved data, thus shifting us to the next row. __OP__ is either ">" or "<", depending on whether we would like to search the upper or lower space. __CANDIDATE__ is the string that we are comparing against.

Obtaining __CANDIDATE__ is a simple enough matter of converting text to a number that reflect the ordering of the text. Unfortunately, SQL makes things difficult by supporting concepts like collation; therefore, the lexicographic ordering of the text in your language may not align with the ordering of the text in SQL. We can solve this problem by using the same injection point and asking the SQL server to order our alphabet for us. For a ~90-character alphabet, this takes ~300 queries. We will make up for this setup cost within the first two or three records.

SQL Server doesn't play well with strings. If there are trailing spaces, they aren't counted towards the string length; if there are apostrophes, comparisons become more interesting (for example, "mooo" < "moo''a", but "mooo" > "moo''z" [ed: note single and double apostrophes look similar in that example]). To get around both of these (and a few other string-related idiocies), we cast to binary and compare as byte-sequences.

A binary search searches a range, and we should set our initial range. A lower bound of 0 will do, and the upper bound should be set to the largest /n/-valued sequence, where /n/ is the length of the string to be retrieved, as determined by a preliminary length-seeking binary search using the same injection point.

The alphabet used must cover all of the possible letters. Failure to do this will cause the binary search to fail; more importantly, it will make it difficult to proceed to the next record unless one can identify that record by ROWID or some other method.

Web-specific Complications

Certain sequences, such as "\e<zmo~~" or "Lor,w&#$Jo" get detected, incorrectly, by WAFs as being XSS vectors. This illustrates the stupidity of blacklist-based security precautions. The only workaround I can think of at present is to remove the "<" and "&"/"#" characters from the alphabet, and ensure that we skip past sequences that contain such things by modifying our SQL injection query.

As the number of retrieved items grows, the length of the query grows as well. Therefore, this technique is of limited use when faced with a GET injection. A solution to this might be to use adaptive queries: when the strings "Alice", "Alison", "Albert", "Alyssa", and "Bart" have been retrieved, it might be possible to change the seen strings to "Al%" and "Bart", thus reducing the length of the query. I leave this extension as an exercise to the reader.

Pay-off

A binary search finds its target in log2(n)+1 probes, at worst. In real-world terms, this leads to an average of a 6x increase in efficiency (and hence speed) for pulling out data, as compared to the simplistic method of exploiting the ordering side-channel: basically, the difference between waiting 10 minutes for a long text blob or 1 hour for the same blob.

Notes

Though we apply the technique to free-form text extraction here, it would be trivial to apply it to the extraction of GUIDs, or numeric data. Numeric data extraction would be a particularly interesting exercise, since a binary search works by continually narrowing the search-space. It might be sufficient to get a "ballpark" figure in some cases, and thus obtain better than O(log2(n)) efficiency.

I have created a small proof-of-concept, consisting of a toy ASP.Net application that is susceptible to injection, a SQL Server database, and the extraction tool coded in Python3. If the code differs from the explanatory text (which is the text that you are reading right now, of course!), please place your trust in the code.

Further work

There are a number of optimisations that could be done to the existing proof-of-concept. More importantly, the idea of using remote comparison functions to efficiently draw out previously-inaccessible data via binary search is an intriguing one, and one which deserves more attention. Andrew MacPherson has pointed out to me that a similar technique is used by the Metasploit framework when trying to determine the EIP in some cases, and I am grateful for the knowledge; I am sure that there are even more applications for the idea just waiting to be found!

Tuesday, December 1. 2009
Posted by Dominic White in Security
Comments (0) | Trackbacks (0)

Name
Email
Homepage
In reply to
Comment
	Remember Information? Subscribe to this entry