Mangled <noscript> content with the browser XMLSerializer
By Thomas Sileo • 5 min read
While working on bigbrowser, which indexes web pages via a browser extension, I noticed the content extractor was including HTML <link> tags in the extracted text content:
<link rel="stylesheet" href="/styles/noscript.css">
Rest of the extracted content ...
Wait, how did HTML make it through the content extraction process?
bigbrowser also stores the browser-rendered HTML used for the extraction, and in this case it looked like this:
<noscript>
</noscript></head><body>&lt;link rel="stylesheet" href="/styles/noscript.css"&gt;
That’s a very unusual HTML snippet. Why would the <body> contain HTML-entity-escaped <link> tags?
Investigation
The first obvious step is to look at the page source in the browser using View Page Source:
<noscript>
<link rel="stylesheet" href="/styles/noscript.css">
</noscript>
This looks like valid HTML.
Since our extension is capturing the rendered HTML, I also looked at the HTML through the Web Developer Tools. Same valid HTML.
Since the HTML document looks valid, the mangling must be coming from the web extension!
A perfect storm
1. noscript element behavior
It has to be related to how <noscript> is handled during serialization, right?
This tag is meant to embed HTML for browsers when scripting is disabled (see the HTML spec):
- When scripting is disabled, <noscript> content is parsed as normal HTML
- When scripting is enabled, browsers parse <noscript> as raw text (no need to parse HTML that won’t be rendered)
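You can observe the scripting-disabled behavior directly, because DOMParser always parses with scripting disabled. A minimal sketch:
// DOMParser is a scripting-disabled context, so <noscript> content
// is parsed as real elements rather than raw text
const parsed = new DOMParser().parseFromString(
  '<noscript><link rel="stylesheet" href="style.css"></noscript>',
  "text/html"
);
console.log(parsed.querySelector("noscript").children.length); // 1: an actual <link> element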
2. XMLSerializer behavior
The web extension captures the browser-rendered document using XMLSerializer:
new XMLSerializer().serializeToString(document)
The <noscript> tags get serialized as:
<noscript>&lt;link rel="stylesheet" ...&gt;</noscript>
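You can see the escaping in isolation from the console of any scripting-enabled page (a sketch; it assumes the page actually has a <noscript> element):
// In a scripting-enabled page, the <noscript> content is a raw text
// node, so XMLSerializer escapes it as character data
const ns = document.querySelector("noscript");
console.log(ns.firstChild.nodeType === Node.TEXT_NODE); // true
console.log(new XMLSerializer().serializeToString(ns));
// <noscript xmlns="http://www.w3.org/1999/xhtml">&lt;link rel="stylesheet" ...&gt;</noscript>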
3. Double serialization
To fully capture a rendered HTML page, bigbrowser does a second pass over the DOM in order to inline CSSOM styles.
CSSOM (CSS Object Model) is the API used to manipulate stylesheets via JavaScript; it’s like a “DOM for CSS”, totally separate from the actual DOM.
// Step 1: the DOM gets serialized
const serializedDOM = new XMLSerializer().serializeToString(document);
// Step 2: later on, the serialized DOM is parsed to inject the styles discovered in the CSSOM
const doc = new DOMParser().parseFromString(serializedDOM, "text/html");
// Step 3: once the styles are inlined, the DOM gets serialized a second time
const processedHtml = new XMLSerializer().serializeToString(doc);
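For context, reading styles out of the CSSOM looks roughly like this (a minimal sketch, not bigbrowser’s actual code; collectCssText is a hypothetical helper):
// Walk the CSSOM and collect every rule's text, the kind of style
// data that gets inlined during the second pass
function collectCssText() {
  const chunks = [];
  for (const sheet of document.styleSheets) {
    try {
      // Accessing cssRules throws for cross-origin stylesheets
      for (const rule of sheet.cssRules) {
        chunks.push(rule.cssText);
      }
    } catch (e) {
      // Skip stylesheets we are not allowed to read
    }
  }
  return chunks.join("\n");
}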
Now to the crazy part:
- From the browser, a JavaScript-enabled context, HTML within a <noscript> tag is serialized with HTML entities
- During the second step, the HTML parser sees the escaped &lt;link...&gt;, but DOMParser is a non-JavaScript-enabled context!
- This trips the HTML5 parser, as raw text is not allowed in <head>; error recovery kicks in and:
  - closes the <head> section
  - starts the <body> section and throws the text there
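You can see this error recovery in isolation, without any <noscript> involved (a minimal sketch):
// Non-whitespace text inside <head> is a parse error: the parser
// closes <head>, opens <body>, and moves the text there
const d = new DOMParser().parseFromString(
  "<head><title>t</title>oops</head><body><p>hi</p></body>",
  "text/html"
);
console.log(d.head.outerHTML); // <head><title>t</title></head>
console.log(d.body.outerHTML); // <body>oops<p>hi</p></body>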
You can reproduce it just by opening the developer tools and running this snippet (if you dare):
const afterFirstSerialize = `<!DOCTYPE html>
<html>
<head>
<title>Test</title>
<noscript>&lt;link rel="stylesheet" href="style.css"&gt;</noscript>
</head>
<body>
<p>Hello world</p>
</body>
</html>`;
// This is the result of Step 1 (the serialized page)
console.log(afterFirstSerialize);
// Step 2: Re-parse (to inline CSSOM styles)
const doc = new DOMParser().parseFromString(afterFirstSerialize, "text/html");
// Step 3: Serialize again
const afterSecondSerialize = new XMLSerializer().serializeToString(doc);
console.log(afterSecondSerialize);
You will see:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Test</title>
<noscript></noscript></head><body>&lt;link rel="stylesheet" href="style.css"&gt;
<p>Hello world</p>
</body></html>
The Fix
The fix is actually quite simple: remove <noscript> elements before the first serialization:
function extractDOM() {
  const docClone = document.cloneNode(true);
  // Remove all noscript elements, as they get mangled during serialization:
  // the first serialization converts the noscript content to escaped text,
  // and when the page is later re-parsed to inline CSSOM styles, the parser
  // runs in a scripting-disabled context that cannot handle text nodes
  // inside <head>. Error recovery moves the text into a newly-created
  // <body>, and we end up with "HTML converted to HTML entities" in the body.
  docClone.querySelectorAll("noscript").forEach((el) => el.remove());
  return {
    serializedDOM: new XMLSerializer().serializeToString(docClone),
  };
}
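To double-check the fix, you can run the cleaned output through the same re-parse/re-serialize round trip (a sketch, run from the page’s console):
// With <noscript> removed, the double serialization no longer
// relocates escaped markup into <body>
const { serializedDOM } = extractDOM();
const reparsed = new DOMParser().parseFromString(serializedDOM, "text/html");
console.log(new XMLSerializer().serializeToString(reparsed)); // head/body structure stays intact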
This whole thing really feels like the perfect storm, and I wasn’t expecting HTML parsers to behave this way when encountering text within the <head> section of a document.
Funny enough, I had to use HTML entities while writing this post to properly display the <noscript> tags!
Also note that I am using Firefox and that I haven’t tried to see if Chrome or other browsers share the same behavior.
