by Adam Tuttle
As so many of us do, I moonlight when time permits and there's a new toy I want to buy. (When isn't there?) In this article I will describe a ColdFusion error message that I ran into while working on a client's website, how I attempted to debug it, and how I ultimately found out it was a bug in ColdFusion, and helped to resolve it.
In order to understand some of the decisions I made, you'll have to know some of the project's context. This client has a rather large, rather old code base – let's say 20,000 templates, most written circa ColdFusion 4.5 – and rather shallow pockets. They wanted me to fix a single small problem they were having with their home-rolled e-commerce system: Orders were being authorized by the payment gateway, but not stored in the database. (As with a lot of small clients, they can't always afford to go back and clean up ugly code when it works.) Aside from the sort of unkempt code you might expect in this situation, the architecture is antiquated – entirely custom-tag based – and riddled with problems like cryptic nested switches.
For these reasons, I decided that the best possible solution for the client's needs was to fix the problem template as a black box, without making any changes to the existing code, database tables, or processes outside of it. In a general sense, that decision ended up serving me well.
As mentioned, occasionally orders placed on the site were being authorized via the payment gateway, but not stored in the orders table, which told me that we had an order-of-operations problem. If preliminary order data is stored in the database, it can always be authorized later, even if it has to be done in meatspace, using such antiquated technology as telephones. However, if the credit card is authorized first and then the order information is lost due to a timeout or other error before being stored to the database, it's gone forever, and so is the sale. That's money left on the table.
I knew I was going to have to rewrite this step in the process, and I knew I had a few primary objectives:
After rewriting the sub-sub-sub-custom-tag in question to store the order data in a new orders_temp table and then authorize and move the data to the real (pre-existing) orders table, I realized that there were quite a few steps that didn't affect the content returned to the user. Several of those steps involved querying the database, a typical speed bottleneck. I wrapped those queries up in a CFThread so that they could run in the background and the page would return to the user a bit faster.
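The reordered flow can be sketched roughly as follows. Only the orders_temp and orders table names come from the project. The datasource, the columns, the variables, the cf_authorizePayment tag, and the background query are all hypothetical stand-ins, so treat this as an outline under those assumptions rather than the client's actual code.

```cfml
<!--- Hypothetical names throughout: request.dsn, the column list,
      orderId / customerId / orderTotal, and cf_authorizePayment. --->

<!--- 1. Store the order BEFORE contacting the payment gateway. --->
<cfquery datasource="#request.dsn#">
    INSERT INTO orders_temp (order_id, customer_id, order_total)
    VALUES (
        <cfqueryparam value="#orderId#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#customerId#" cfsqltype="cf_sql_integer">,
        <cfqueryparam value="#orderTotal#" cfsqltype="cf_sql_decimal" scale="2">
    )
</cfquery>

<!--- 2. Authorize the card. This custom tag is a stand-in for the real
      gateway call; it is assumed to set caller.authResult. --->
<cf_authorizePayment orderId="#orderId#" amount="#orderTotal#" result="authResult">

<!--- 3. On approval, move the row into the real orders table. --->
<cfif authResult.approved>
    <cftransaction>
        <cfquery datasource="#request.dsn#">
            INSERT INTO orders (order_id, customer_id, order_total)
            SELECT order_id, customer_id, order_total
            FROM orders_temp
            WHERE order_id = <cfqueryparam value="#orderId#" cfsqltype="cf_sql_varchar">
        </cfquery>
        <cfquery datasource="#request.dsn#">
            DELETE FROM orders_temp
            WHERE order_id = <cfqueryparam value="#orderId#" cfsqltype="cf_sql_varchar">
        </cfquery>
    </cftransaction>
</cfif>

<!--- 4. Work that does not affect the response runs in the background.
      Values passed as cfthread attributes are read from the attributes
      scope inside the thread. The query itself is a placeholder. --->
<cfthread action="run" name="postOrderWork" orderId="#orderId#">
    <cfquery datasource="#request.dsn#">
        UPDATE customers
        SET last_order_id = <cfqueryparam value="#attributes.orderId#" cfsqltype="cf_sql_varchar">
        WHERE customer_id = (
            SELECT customer_id FROM orders
            WHERE order_id = <cfqueryparam value="#attributes.orderId#" cfsqltype="cf_sql_varchar">
        )
    </cfquery>
</cfthread>
```

If the request dies anywhere after step 1, the order is still sitting in orders_temp and can be authorized later by hand.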
When I tested this solution, I hit the error in question:
Of course I Googled it, and scoured the LiveDocs for clues, finding nothing. With nowhere else to turn, I asked for advice on StackOverflow, a new programming Q&A site that's commonly described as "Experts-Exchange, without the evil", and a site that I usually find very useful for situations like this. If you Google the error message above, the only semi-relevant results you'll find are my StackOverflow question and a few pages that steal its content and try to make it look like their own. (And, soon, this article.) Unfortunately, despite the best efforts of the responders there, I wasn't making any headway.
I decided to at least try to find the line of code that was causing the error, even if I couldn't figure out a way to fix it. With no other options, I resorted to a binary search. I commented out the first half of the template (approximately) and launched the page. When the error was still there, I knew it had to be something in the second half of the page. The next step was to comment out half of the remaining code – the third quarter of the page – and to keep halving what was left until I could pinpoint the error. This process can be tedious and frustrating, but it generally works, and it's a lot faster than commenting out a few lines at a time in a linear search.
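In CFML, that halving looks something like the sketch below. The section markers are placeholders standing in for real code, not anything from the client's template. CFML comments use three dashes and can be nested, so a block that already contains comments can still be wrapped in one.

```cfml
<!--- Pass 1: first half disabled. The error is still there,
      so the culprit must be in the second half.
    <!--- section A (placeholder for the first quarter of the page) --->
    <!--- section B (placeholder for the second quarter) --->
--->

<!--- Pass 2: third quarter disabled as well.
      Error gone: the problem is in section C.
      Error still there: the problem is in section D.
    <!--- section C (placeholder for the third quarter) --->
--->

<!--- section D (placeholder): the only code still running --->
```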
Unfortunately, it didn't really work for me in this case. I was only able to positively narrow down the problem to my CFThread tag, but that was a large chunk of the page. If I commented the entire block out – CFThread tags and all – the error went away. If I commented out just the contents of the CFThread, the error went away. If I copied the contents of the CFThread to a separate template and ran that template, the error went away.
Wait... what?
So, with an empty CFThread tag:
... there was no error. With the thread tag content un-commented, but the tags themselves commented out:
... there was no error. But with all of it uncommented:
... I got an error! How is that even possible?!
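In outline, the three cases had the shapes below. The thread name and the body are placeholders, not the client's code, and these are three alternative versions of the same spot in the template, not one file.

```cfml
<!--- Case 1: thread tags in place, body commented out. No error. --->
<cfthread action="run" name="postOrderWork">
    <!--- ...queries and custom tag calls, commented out... --->
</cfthread>

<!--- Case 2: thread tags commented out, body runs inline. No error. --->
<!--- <cfthread action="run" name="postOrderWork"> --->
    <!--- ...the same queries and custom tag calls, live... --->
<!--- </cfthread> --->

<!--- Case 3: body live inside the thread. Error. --->
<cfthread action="run" name="postOrderWork">
    <!--- ...the same queries and custom tag calls, live... --->
</cfthread>
```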
At this point, I attempted to binary search the thread contents in the same manner as before, hoping to find something in there that caused the error. No luck. In fact, things only got weirder. I would comment out random blocks here and there – all the queries, all custom tag calls, etc. – and sometimes it would work, sometimes it wouldn't. I thought maybe I had been awake too long and called it a night, hoping that I could see something new with fresh eyes in the morning.
Nope! As a matter of fact, I fought this error off and on for over a month before, on the advice of Sean Corfield, I sent a desperate email to Adam Lehman, the Product Manager for ColdFusion and the yet-to-be-released Bolt IDE. I explained the situation to Adam, sent a stack trace of the error and a zip with the code in question, crossed my fingers, knocked on wood, threw some salt over my shoulder, rubbed my rabbit's foot, held onto a horseshoe, and hit send.
While I was waiting for a response, I had an epiphany. If the code would run in a separate template, could I simply CFInclude it into the thread? Yes, I could, and yes, this got the error message to go away. Wahoo!
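A minimal sketch of that change, assuming the thread body has been moved verbatim into its own file (the file name and thread name here are made up):

```cfml
<!--- The long block of queries and custom tag calls used to sit
      directly between these tags. It now lives in a separate
      template, and the thread body is a single include. --->
<cfthread action="run" name="postOrderWork">
    <cfinclude template="inc_postOrderWork.cfm">
</cfthread>
```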
As luck would have it, Adam responded within a day or two to let me know he was passing my request off to the engineering team, and another day later I had my answer: A bug in ColdFusion!
The bug is similar to another known issue with ColdFusion – that CFCs can only contain so many lines of code before they won't compile. The issue I was experiencing was that there were just too many lines of code inside the CFThread. This makes me wonder if it was ironically a stack overflow error, but there was no mention of a stack overflow in the Java stack trace.
It fits the symptoms perfectly and explains everything I saw. For the record, the workaround is to encapsulate the contents of the thread in some manner. You can use a CFInclude, as I did, a UDF, or even a custom tag.
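The UDF and custom tag variants follow the same idea: keep the thread body down to a single call. Both are sketched below with hypothetical names (runPostOrderWork, cf_postOrderWork, orderId). They are alternatives, so a real template would use one or the other.

```cfml
<!--- Variant A: wrap the work in a UDF defined on the page
      (or in an included library) and call it from the thread. --->
<cffunction name="runPostOrderWork" returntype="void" output="false">
    <cfargument name="orderId" type="string" required="true">
    <!--- the queries and custom tag calls that used to sit
          inside the cfthread body go here --->
</cffunction>

<cfthread action="run" name="postOrderWorkA" orderId="#orderId#">
    <cfset runPostOrderWork(attributes.orderId)>
</cfthread>

<!--- Variant B: move the work into a custom tag, here a hypothetical
      postOrderWork.cfm somewhere on the custom tag path. --->
<cfthread action="run" name="postOrderWorkB" orderId="#orderId#">
    <cf_postOrderWork orderId="#attributes.orderId#">
</cfthread>
```

In each case the bulk of the code sits outside the cfthread body itself, which is what the workaround requires.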
The Adobe engineers thanked me for submitting my findings, and let me know that they had since resolved the bug. They offered a hotfix if necessary, but with such an easy workaround and my client using third-party hosting, I didn't bother. I do expect that the fix will be included in the next cumulative hotfix for ColdFusion 8.0.1 (if there is one), and in ColdFusion 9, a.k.a. Centaur. Looking back, the answer was almost obvious, and you can bet I'll think about the workaround if and when I run into similar issues in the future.
I hope that this explanation of my thought process and the steps I took to work through the issue helps you debug your applications in the future. And when it comes right down to it, if you think you're doing everything right, and you've done your due diligence in ensuring that your CFML adheres to spec and doesn't violate any rules or guidelines in the Developer's Guide, don't be afraid to send it up the chain to Adobe. It could be a bug in ColdFusion.
Adam Tuttle is a Certified Advanced ColdFusion Developer and Senior Programmer/Analyst for the Wharton Learning Lab at the University of Pennsylvania, where he develops innovative technology-based simulations and learning materials for enhancing the classroom experience, primarily through the use of ColdFusion and Flex. He is also a self-proclaimed Mango Blog evangelist and blogs at fusiongrokker.com.