So I found a bunch of binary data and in the datastore and some of it turned out to be xml which had certain interesting bits of information stored in them. Even though it is relatively straightforward to run a sql profiler while you click around in the console application to find the specific thing you are looking for, I wanted to see if there is some way we can figure out all of the information that is ‘hiding’ in those pieces of XML. It would be nice to have this more or less fully automated as I expect different versions might hold a different set of properties. Also, stuff like job results will probably have some variation in their xml, and they will only show up once you have run one of the corresponding built in tasks.
WARNING: The following code examples should probably be considered a health hazard; once they have been seen, they cannot be unseen. I am somewhat aware that T-SQL is probably not the most practical tool for what I am trying to achieve. I am fully aware of the fact that I’m an idiot for trying anyway. Here goes nothing.. Proceed at your own risk.
Since not all binary columns had XML in them first we need a way to verify if it actually is xml, not some binary data or a just a plain old string. My initial thought was to use TRY CATCH constructions around a CAST or CONVERT or something along those lines. Fortunately SQL Server 2012 introduced the TRY_CONVERT statement which looked like a good fit.
Result:
It seems that SQL server has no problem converting a string to the xml data type. Although probably correct as it would be considered to be a stand-alone text node or something along those lines it does not work well for what I want to achieve here. I could not find a way to alter the behavior of CONVERT or CAST functions such that a ‘normal’ string fails conversion (other than containing at least a ‘<‘ character).
After some searching (and then some more) I stumbled upon the sp_xml_preparedocument procedure which Microsoft tells us:
“Reads the XML text provided as input, parses the text by using the MSXML parser (Msxmlsql.dll), and provides the parsed document in a state ready for consumption. This parsed document is a tree representation of the various nodes in the XML document: elements, attributes, text, comments, and so on.”
which sounds promising. Double check:
Result:
That looks to do more or less what I want, I’ll roll with it for now. So we have a way to locate all columns of type ‘image’ and check if the contents are more or less XML. Since they do not necessarily have the same structure, even if they are in the same column, the next step is to select all rows for each of these and somehow extract all nodes. I started looking for a way to enumerate the elements in an arbitrary XML document and the first search result is this piece of code (thanks! Aaronaught).
Happily hacking away in SSMS to get this query working, I finally plug in this last bit of code, remove typos etc, and BOOM! 90k results :D oh boy. Lot’s of duplicates in there so just this one more DISTINCT keyword in the right place aaaand…
Huh? ok, check for typos, fat fingers, undo a couple of times, retry.. nada. WTF?! Try previous, much simpler query.. nothing. No errors, just instantly nothing. And then it hit me; I check the memory and yeah..
A parsed document is stored in the internal cache of SQL Server. The MSXML parser (Msxmlsql.dll) uses one-eighth the total memory available for SQL Server. To avoid running out of memory, run sp_xml_removedocument to free up the memory.
Restart SQL, tray again.. the query runs for 20 seconds and returns 1082 glorious unique rows. At that exact moment this random spotify playlist I had on in the background starts to play Comptine d’un Autre Été – L’Après Midi by Yann Tiersen. I stare at the results for a moment and they look pretty convincing to me. Time to switch to Rammstein, clean this thing up a bit, add some comments and present you with FrankenQuery™
I think even the wordpress sourcecode parser is trying to tell me to stop this nonsense, but nonetheless the query works (at least on my machine) and now we finally have a table with all “hidden properties” in these xml strings trying to hide as binary blobs. It might be helpful in deepening our understanding of the inner workings of AM a bit. This way we can always run it again when a new version comes out. Output from my development environment:
And the radio is playing.. as if the universe is talking to me:
Morgenstern ach scheine
auf die Seele meine
Wirf ein warmes Licht
auf ein Herz das bricht
Sag ihr dass ich weine
Denn du, du bist hässlich
Du bist einfach hässlich