Mangled <noscript> content with the browser XMLSerializer

By Thomas Sileo

Investigating <noscript> serialization weirdness.

While working on bigbrowser, which indexes web pages via a browser extension, I noticed the content extractor was including HTML <link> tags in the extracted text content:

<link rel="stylesheet" href="/styles/noscript.css">
Rest of the extracted content ...

Wait, how did HTML make it through the content extraction process?

bigbrowser also stores the browser-rendered HTML used for the extraction, and it looked like this:

<noscript>
    </noscript></head><body>&lt;link rel="stylesheet" href="/styles/noscript.css"&gt;

That’s a very unusual HTML snippet: why would the <body> contain HTML-entity-escaped <link> tags?

Investigation

The first obvious step is to look at the page source in the browser using View Page Source:

<noscript>
    <link rel="stylesheet" href="/styles/noscript.css">
</noscript>

This looks like valid HTML.

Since our extension is capturing the rendered HTML, I also looked at the HTML through the Web Developer Tools. Same valid HTML.

Since the HTML document looks valid, the mangling must be coming from the web extension!

A perfect storm

1. noscript element behavior

It has to be related to how <noscript> is handled during serialization, right?

This tag is meant to embed HTML for browsers where scripting is disabled (see the HTML spec):

  • When scripting is disabled, <noscript> content is parsed as normal HTML
  • When scripting is enabled, browsers parse the <noscript> contents as raw text (no need to parse HTML that won’t be rendered), as the snippet below shows
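
You can see the second behavior directly in the DOM. A minimal sketch, assuming you run it in a scripting-enabled page containing the <noscript> snippet from above:

// With scripting enabled, the parser stores the <noscript> contents as a
// single raw text node instead of real elements:
const ns = document.querySelector("noscript");
console.log(ns.childNodes.length);             // 1
console.log(ns.firstChild.nodeType);           // 3, i.e. Node.TEXT_NODE
console.log(ns.firstChild.textContent.trim()); // <link rel="stylesheet" href="/styles/noscript.css">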

2. XMLSerializer behavior

The web extension captures the browser-rendered document using XMLSerializer:

new XMLSerializer().serializeToString(document)

The <noscript> tags get serialized as:

<noscript>&lt;link rel="stylesheet" ...&gt;</noscript>
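
You can reproduce this escaping in isolation. Here’s a minimal sketch where I build the text node by hand, mirroring what the scripting-enabled parser produces:

// XMLSerializer escapes the contents of text nodes, and with scripting
// enabled the <noscript> contents are exactly that: a text node.
const noscript = document.createElement("noscript");
noscript.textContent = '<link rel="stylesheet" href="/styles/noscript.css">';
console.log(new XMLSerializer().serializeToString(noscript));
// <noscript xmlns="http://www.w3.org/1999/xhtml">&lt;link rel="stylesheet" href="/styles/noscript.css"&gt;</noscript>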

3. Double serialization

To fully capture a rendered HTML page, bigbrowser does a second pass over the DOM in order to inline CSSOM styles.

CSSOM (CSS Object Model) is used to manipulate stylesheets via JavaScript; it’s like a “DOM for CSS”, totally separate from the actual DOM.
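
For example, a rule added from JavaScript only ever exists in the CSSOM (a minimal sketch; it assumes the page has at least one same-origin stylesheet, otherwise insertRule throws):

// This changes the rendering immediately, but the new rule will never
// show up in the page's HTML source:
document.styleSheets[0].insertRule("body { background: pink; }", 0);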

// Step 1: the DOM gets serialized
const serializedDOM = new XMLSerializer().serializeToString(document);

// Step 2: later on the DOM is parsed to inject the style discovered in CSSOM
const doc = new DOMParser().parseFromString(serializedDOM, "text/html");

// Step 3: once the styles are inlined, the DOM gets serialized a second time
const processedHtml = new XMLSerializer().serializeToString(doc);
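
The post doesn’t show the inlining itself, but step 2 plausibly looks something like this sketch (inlineCSSOM is a hypothetical helper, not bigbrowser’s actual code):

// Hypothetical helper: collect every rule the CSSOM knows about from the
// live page and inject them as a <style> element into the re-parsed doc.
function inlineCSSOM(doc) {
  let cssText = "";
  for (const sheet of document.styleSheets) {
    try {
      for (const rule of sheet.cssRules) {
        cssText += rule.cssText + "\n";
      }
    } catch (e) {
      // Cross-origin stylesheets throw a SecurityError on cssRules access
    }
  }
  const style = doc.createElement("style");
  style.textContent = cssText;
  doc.head.appendChild(style);
}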

Now to the crazy part:

  • In the browser, a JavaScript-enabled context, the HTML within a <noscript> tag is a raw text node, so it is serialized with HTML entities
  • During the second step, the HTML parser sees &lt;link...&gt;, but DOMParser is a non-JavaScript-enabled context, so the <noscript> contents are parsed as normal HTML and the entities decode into a plain text node!
  • This trips the HTML5 parser, as a text node is not allowed directly in <head>; error recovery kicks in and:
    • closes the <head> section
    • starts the <body> section and throws the text there, where the second serialization escapes it right back into entities

You can reproduce it just by opening the developer tools and running this snippet (if you dare):

const afterFirstSerialize = `<!DOCTYPE html>
<html>
<head>
  <title>Test</title>
  <noscript>&lt;link rel="stylesheet" href="style.css"&gt;</noscript>
</head>
<body>
  <p>Hello world</p>
</body>
</html>`;

// This is the result of Step 1 (the serialized page)
console.log(afterFirstSerialize);

// Step 2: Re-parse (to inline CSSOM styles)
const doc = new DOMParser().parseFromString(afterFirstSerialize, "text/html");

// Step 3: Serialize again
const afterSecondSerialize = new XMLSerializer().serializeToString(doc);
console.log(afterSecondSerialize);

You will see:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head>
  <title>Test</title>
  <noscript></noscript></head><body>&lt;link rel="stylesheet" href="style.css"&gt;


  <p>Hello world</p>

</body></html>

The Fix

The fix is actually quite simple: remove the <noscript> elements before the first serialization:

function extractDOM() {
  const docClone = document.cloneNode(true);
  
  // Remove all noscript elements, as they get mangled during serialization.
  // During the first serialization, the noscript content is escaped into
  // text. Once it's re-parsed in order to inline CSSOM styles, the parser
  // (running in a scripting-disabled context) cannot handle text nodes
  // inside <head>: the text gets moved into a newly-created <body>, and
  // after the second serialization we end up with "HTML converted to
  // HTML entities" in the body.
  docClone.querySelectorAll("noscript").forEach((el) => el.remove());

  return {
    serializedDOM: new XMLSerializer().serializeToString(docClone),
  };
}
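
With that in place, the rest of the pipeline stays the same (hypothetical usage, mirroring the three steps above):

// With the noscript elements gone, the re-parse in step 2 no longer has a
// raw text node in <head> to trip over:
const { serializedDOM } = extractDOM();
const reparsed = new DOMParser().parseFromString(serializedDOM, "text/html");
const processedHtml = new XMLSerializer().serializeToString(reparsed);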

This whole thing really feels like the perfect storm, and I wasn’t expecting HTML parsers to behave this way when encountering text within the <head> section of a document.

Funny enough, I had to use HTML entities while writing this post to properly display the <noscript> tags!

Also note that I am using Firefox; I haven’t checked whether Chrome or other browsers share the same behavior.