I am working on a project where I depend on the excellent HTML parsing component HtmlAgilityPack.
However, as others have also found, the component sometimes throws a StackOverflowException when parsing certain complex documents, which in CLR 2.0 means the instant and unrecoverable death of the process (from an internal FailFast).
Rewriting the parsing mechanism to replace the recursive methods that overflow the stack would be a good deed, but that code is central to the parsing, so I went for a workaround for now.
Since StackOverflowException is an unrecoverable exception, there is no way to catch it and recover. You cannot use a separate Application Domain either, as the FailFast call will kill the whole process.
The only way to handle a StackOverflowException is to make sure it does not happen in the first place, either by rewriting the code that causes it or, if you don't have time to go too deep into the third-party code (like me), by applying a workaround.
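To give an idea of what the "rewrite" option could look like, here is a minimal sketch of turning a recursive tree writer into an iterative one. It is not taken from HtmlAgilityPack: SketchNode and IterativeWriter are hypothetical names made up for this illustration, and the real HtmlNode is far richer. The point is only that the pending work moves from the call stack to a heap-allocated Stack<T>.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical minimal tree node for illustration; NOT HtmlAgilityPack's HtmlNode.
public class SketchNode
{
    public string Name;
    public List<SketchNode> Children = new List<SketchNode>();

    public SketchNode(string name) { Name = name; }
}

public static class IterativeWriter
{
    // Writes <name>...</name> for every node without recursing.
    // Pending work lives in a heap-allocated Stack<T>, so the depth of the
    // document no longer consumes the thread's call stack.
    public static string Write(SketchNode root)
    {
        var output = new StringBuilder();
        var pending = new Stack<KeyValuePair<SketchNode, bool>>();
        pending.Push(new KeyValuePair<SketchNode, bool>(root, false));

        while (pending.Count > 0)
        {
            KeyValuePair<SketchNode, bool> item = pending.Pop();
            SketchNode node = item.Key;
            bool closing = item.Value;

            if (closing)
            {
                output.Append("</").Append(node.Name).Append(">");
                continue;
            }

            output.Append("<").Append(node.Name).Append(">");

            // Schedule the closing tag first so it is emitted after all children.
            pending.Push(new KeyValuePair<SketchNode, bool>(node, true));

            // Push children in reverse so they pop in document order.
            for (int i = node.Children.Count - 1; i >= 0; i--)
                pending.Push(new KeyValuePair<SketchNode, bool>(node.Children[i], false));
        }

        return output.ToString();
    }
}
```

With this shape, a chain of nodes nested a hundred thousand levels deep only grows the Stack<T> on the heap, where a recursive writer would need one call frame per level.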
First get the source code for HtmlAgilityPack.
Then add this class (which I found here) somewhere in the main project (named HtmlAgilityPack):
using System;
using System.Runtime.InteropServices;

public class StackChecker
{
    public unsafe static bool HasSufficientStack(long bytes)
    {
        var stackInfo = new MEMORY_BASIC_INFORMATION();

        // We subtract one page for our request. VirtualQuery rounds UP to the next page.
        // Unfortunately, the stack grows down. If we're on the first page (last page in the
        // VirtualAlloc), we'll be moved to the next page, which is off the stack! Note this
        // doesn't work right for IA64 due to bigger pages.
        IntPtr currentAddr = new IntPtr((byte*)&stackInfo - 4096);

        // Query for the current stack allocation information.
        VirtualQuery(currentAddr, ref stackInfo, (IntPtr)sizeof(MEMORY_BASIC_INFORMATION));

        // If the current address minus the base (remember: the stack grows downward in the
        // address space) is greater than the number of bytes requested plus the reserved
        // space at the end, the request has succeeded.
        return (currentAddr.ToInt64() - stackInfo.AllocationBase.ToInt64()) >
            (bytes + STACK_RESERVED_SPACE);
    }

    // We are conservative here. We assume that the platform needs a whole 16 pages to
    // respond to stack overflow (using an x86/x64 page-size, not IA64). That's 64KB,
    // which means that for very small stacks (e.g. 128KB) we'll fail a lot of stack checks
    // incorrectly.
    private const long STACK_RESERVED_SPACE = 4096 * 16;

    // The length parameter and the return value are SIZE_T, i.e. pointer-sized.
    [DllImport("kernel32.dll")]
    private static extern IntPtr VirtualQuery(
        IntPtr lpAddress,
        ref MEMORY_BASIC_INFORMATION lpBuffer,
        IntPtr dwLength);

    // Address and size fields are pointer-sized so that the layout matches the
    // native structure in both 32-bit and 64-bit processes.
    [StructLayout(LayoutKind.Sequential)]
    private struct MEMORY_BASIC_INFORMATION
    {
        internal IntPtr BaseAddress;
        internal IntPtr AllocationBase;
        internal uint AllocationProtect;
        internal IntPtr RegionSize;
        internal uint State;
        internal uint Protect;
        internal uint Type;
    }
}
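What's happening here is that the kernel32!VirtualQuery function gives us the base address of the memory allocation that contains the stack. Because the stack grows downward, the distance from the current stack address to that base tells us how much space is left on the stack. Read more on that.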
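To make the comparison concrete, here is the same arithmetic as a small pure function, with the addresses passed in instead of queried from the OS. StackMath and HasRoom are names I made up for this sketch, and the addresses in the example below are invented round numbers, not values from a real process.

```csharp
using System;

public static class StackMath
{
    // Same safety margin as in StackChecker: 16 pages of 4 KB = 64 KB.
    private const long StackReservedSpace = 4096 * 16;

    // Pure-arithmetic version of the check. Both addresses are supplied by the
    // caller here; in StackChecker they come from a local variable's address
    // and from VirtualQuery.
    public static bool HasRoom(long currentAddr, long allocationBase, long bytes)
    {
        // The stack grows downward, so what is left is the distance to the base.
        long remaining = currentAddr - allocationBase;
        return remaining > bytes + StackReservedSpace;
    }
}
```

Assume a stack allocation based at 0x00100000 and a request for 4 KB, so the threshold is 4096 + 65536 = 69632 bytes. At a current address of 0x001F0000 there are 983040 bytes left and HasRoom returns true. At 0x00110800 only 67584 bytes are left, which is below the threshold, so HasRoom returns false.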
Then add the following code at the beginning of the methods involved in deep recursion (such as HtmlNode.WriteTo(TextWriter outText) and HtmlNode.WriteTo(XmlWriter writer)):
if (!Kernel.StackChecker.HasSufficientStack(4 * 1024))
    throw new NearStackOverflowException("The document is too complex to parse");
As you can see, I have subclassed Exception and introduced a NearStackOverflowException type for this purpose, but you can handle that however you like. Either way, this exception can be caught and the process can be saved.
It does not really solve the problem, but it provides a workaround that keeps HtmlAgilityPack useful until the real problem gets solved.
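For completeness, here is a minimal sketch of what such an exception type and the catching side could look like. Only the name NearStackOverflowException and the message come from the snippet above; the constructors, the GuardDemo class and its methods are assumptions for illustration. So that the sketch runs anywhere, a simple depth limit stands in for the real StackChecker.HasSufficientStack call.

```csharp
using System;

// Assumed shape of the custom exception; the post itself only names the type.
public class NearStackOverflowException : Exception
{
    public NearStackOverflowException() { }
    public NearStackOverflowException(string message) : base(message) { }
    public NearStackOverflowException(string message, Exception inner) : base(message, inner) { }
}

public static class GuardDemo
{
    // Stand-in for StackChecker.HasSufficientStack: a plain depth limit,
    // used here only so the example does not depend on kernel32.
    private const int MaxDepth = 1000;

    // Hypothetical recursive method, guarded at the top of each call in the
    // same way the check is placed at the top of HtmlNode.WriteTo.
    public static int Descend(int depth, int target)
    {
        if (depth >= MaxDepth)
            throw new NearStackOverflowException("The document is too complex to parse");

        return depth == target ? depth : Descend(depth + 1, target);
    }

    // Returns true when the work completed and false when the guard fired.
    public static bool TryDescend(int target)
    {
        try
        {
            Descend(0, target);
            return true;
        }
        catch (NearStackOverflowException)
        {
            // Unlike StackOverflowException, this one is an ordinary exception:
            // the process survives and the caller can skip the document.
            return false;
        }
    }
}
```

Here TryDescend(10) completes normally, while TryDescend(5000) hits the guard at depth 1000 and reports failure instead of taking the process down.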
Happy scraping!