Explaining the description generator

The SEO Framework was initially named AutoDescription because its sole purpose was generating descriptions.

Since its inception, we have continued to uphold that name. Moreover, we don’t want you to spend hours on SEO; that’s what our plugin is for. So, we’ve developed and integrated symbolic and logical interpretation AI that works with every language you throw at it. It is intelligent: we can’t anticipate what you will write, but we know that any description that comes out is excellent. It is better than what Google can generate, so keep that option enabled in TSF!

Of course, we assume you write accordingly: “Introduction – Context – Conclusion.”

As of TSF v4.2.7, the generator works as described below. For the sake of simplicity, we’re describing how it generates descriptions for posts and pages only.

The generated description is not stored in the database; it is regenerated every time the page loads, unless you use a caching plugin. The generation is fast: it takes under a millisecond on even the most complex pages.

The sequences

The generator works in three sequences:

The getter obtains the content information;
The parser converts what the getter obtained to a description;
The renderer formulates the parsed content into a safe and usable string.

Step one, the getter

first takes your content (or excerpt, when available);
strips sole link paragraphs (often used for Twitter and Youtube embeds);
strips all shortcodes, which is why some page builders don’t work with this;
strips all headers, images, scripts, lists, forms, etc. (i.e. plausible non-canonical and non-sentence content) (see full list below at † Footnote);
removes all leftover HTML;
converts the content into a single line;
converts non-break spaces and tabs to spaces;
converts reverse solidus (\) to its HTML entity (&#92;);
converts sequential spaces to single spaces.

Now, we have clean and workable content for our parser. We left some steps out of this explanation because the full process is more convoluted.
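The getter steps above can be sketched roughly as follows. This is a simplified illustration in Python, not TSF's actual PHP code: the function name `get_clean_content`, the regular expressions, and the shortened tag list are all assumptions made for this example.

```python
import re

def get_clean_content(content: str, excerpt: str = "") -> str:
    """Hypothetical sketch of the getter; not TSF's real implementation."""
    # Take the excerpt when available, otherwise the content.
    text = excerpt or content
    # Strip paragraphs that hold nothing but a link (typical for embeds).
    text = re.sub(r"(?im)^\s*(<p>)?\s*https?://\S+\s*(</p>)?\s*$", "", text)
    # Strip shortcodes such as [gallery ids="1,2"].
    text = re.sub(r"\[/?[a-zA-Z][\w-]*[^\]]*\]", "", text)
    # Strip non-sentence elements with their content (abbreviated tag list).
    text = re.sub(r"(?is)<(h[1-6]|script|style|ul|ol|form)\b[^>]*>.*?</\1>", "", text)
    # Remove all leftover HTML tags, keeping their content.
    text = re.sub(r"<[^>]+>", "", text)
    # Convert the content into a single line.
    text = re.sub(r"[\r\n]+", " ", text)
    # Convert non-breaking spaces and tabs to plain spaces.
    text = text.replace("\u00a0", " ").replace("\t", " ")
    # Convert the reverse solidus to its HTML entity.
    text = text.replace("\\", "&#92;")
    # Collapse sequential spaces into single spaces.
    text = re.sub(r" {2,}", " ", text)
    # Trimming here is only a convenience of this sketch.
    return text.strip()
```

For instance, a post containing a heading, a gallery shortcode, and a YouTube link paragraph would come out as a single line holding only the sentence content.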

† Footnote: Stripped HTML Tags

For description generation, the following HTML tags will have their elements (including content) removed entirely. This means that Hello<code>example</code>World becomes HelloWorld.
— address, area, aside, audio, blockquote, button, canvas, code, datalist, del, dialog, dl, fieldset, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, iframe, input, label, map, menu, meter, nav, noscript, ol, object, output, pre, progress, s, script, select, style, svg, table, template, textarea, ul, and video.

The following HTML tags have their content kept but have added spaces around them. This means that Hello<p>example</p>World becomes Hello example World. See the ‡ footnote below for details.
— article, br, blockquote, details, div, hr, p, and section.

All other unlisted HTML elements will have their content kept as well, but without further processing. This means that Hello<b>World</b> becomes HelloWorld.
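The three groups of tags in this footnote can be illustrated with a short Python sketch. It is a single-pass approximation under stated assumptions: the tag tuples are abbreviated subsets of the lists above, and `strip_tags_once` is a hypothetical name, not a function from the plugin.

```python
import re

# Abbreviated, illustrative subsets of the tag lists in this footnote.
REMOVE_WITH_CONTENT = ("code", "h1", "h2", "pre", "script", "style", "table", "ul")
SPACE_AROUND = ("article", "br", "details", "div", "hr", "p", "section")

def strip_tags_once(html: str) -> str:
    """Hypothetical single-pass sketch of the three tag groups."""
    drop = "|".join(REMOVE_WITH_CONTENT)
    space = "|".join(SPACE_AROUND)
    # Group 1: remove the element entirely, content included.
    html = re.sub(rf"(?is)<({drop})\b[^>]*>.*?</\1\s*>", "", html)
    # Group 2: keep the content, but replace the tags with spaces.
    html = re.sub(rf"(?i)</?(?:{space})\b[^>]*>", " ", html)
    # Group 3: any other tag is dropped without adding a space.
    html = re.sub(r"<[^>]+>", "", html)
    # Tidy up the spaces introduced above.
    return re.sub(r" {2,}", " ", html).strip()
```

With this sketch, a `code` element disappears together with its content, a `p` element leaves spaces where its tags were, and an inline element such as `b` leaves its content glued to the neighboring text.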

‡ Footnote: HTML Passes

Not all elements are alike: some break the content over to new lines (e.g., the paragraph tag <p>), and some are inline phrases (e.g., the bold tag <b>).

The getter can interpret HTML intelligently. For performance, we don’t run a browser engine inside your website; we use regular expressions instead. Our bespoke regular expressions can interpret your content, but only in layered passes: the getter parses all tags from the outside to the inside. When the content is deeply nested, not everything might be parsed.

<section>                        <!-- pass 1 -->
	<div>                        <!-- pass 2 -->
		<p>Hello</p><p>World</p> <!-- pass 3 -->
	</div>                       <!-- pass 2 -->
</section>                       <!-- pass 1 -->

Becomes after 1 pass:

<div>                        <!-- pass 2 -->
	<p>Hello</p><p>World</p> <!-- pass 3 -->
</div>                       <!-- pass 2 -->

Becomes after 2 passes:

<p>Hello</p><p>World</p> <!-- pass 3 -->

And finally after 3 passes:

Hello World

When you set the HTML parsing method to “Fast (max. 2 passes)”, the parser won’t process pass 3, and the above HTML won’t have the <p> tags processed, but only removed, resulting in HelloWorld. If you use more passes, the above HTML will be parsed into Hello World, with a proper space in between.

It is unlikely your site requires an accurate parsing method, because most content isn’t this complex. In any case, the parser will stop passing when no more HTML is left over, or when no changes were registered between passes.
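The layered passes can be modeled with a small Python sketch. It assumes a shortened list of block-level tags and a hypothetical `parse_in_passes` function with a `max_passes` parameter; the plugin's real regular expressions are more involved. Each pass unwraps the outermost matching elements and pads their content with spaces, and whatever tags remain after the last pass are removed without spacing.

```python
import re

# Shortened list of block-level tags; an assumption for this sketch.
BLOCK = r"article|blockquote|details|div|p|section"
BLOCK_ELEMENT = re.compile(rf"(?is)<({BLOCK})\b[^>]*>(.*?)</\1\s*>")

def parse_in_passes(html: str, max_passes: int) -> str:
    """Hypothetical model of outside-in parsing with a pass limit."""
    text = html
    for _ in range(max_passes):
        # Stop when no HTML is left over.
        if "<" not in text:
            break
        # Unwrap the outermost block elements, padding content with spaces.
        peeled = BLOCK_ELEMENT.sub(r" \2 ", text)
        # Stop when a pass registers no changes.
        if peeled == text:
            break
        text = peeled
    # Tags that were never processed are only removed, without spacing.
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

Run against the nested example above, two passes leave the paragraph tags unprocessed, so the words end up joined; three passes separate them with a space.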

Step two, the parser

converts all HTML entities to human-readable entities (&middot; → ·);
takes a number of characters from the start of the content, excluding the last word or punctuation on that boundary:
- for Open Graph descriptions, it’s 200 characters;
- for Twitter descriptions, it’s 200 characters;
- for meta descriptions, it depends on the WordPress language set:
  - 148 characters for Assamese (অসমীয়া);
  - 158 characters for German (Österreichisch Deutsch, Schweiz Deutsch, & Deutsch);
  - 148 characters for Gujarati (ગુજરાતી);
  - 100 characters for Malayalam (മലയാളം);
  - 70 characters for Japanese (日本語);
  - 82 characters for Korean (한국어);
  - 120 characters for Tamil (தமிழ்);
  - 70 characters for Chinese (繁體中文, 香港中文版, & 简体中文);
  - 160 characters for all other languages.
texturizes the content as WordPress would, creating sentence structure and balance;
breaks up the content into 3 distinctive parts (it’s actually 6, but let’s keep it “simple”):
1. sentence after leading punctuation, but including opening punctuation, marks, and ¡¿, until first punctuation or end of content;
2. the last punctuation character, but ignores connecting punctuation with a word-boundary (e.g., the ' in it's is ignored);
3. the last part, which consists of the words that follow the last punctuation character:
  - if there are more than three of these words, it uses the complete sentence;
  - if there are fewer than three of these words, it trims those words;
  - if there are no words after the last punctuation character, it does nothing.
trims leading spaces and non-closing punctuation;
when no closing punctuation is found at the end of the sentence, it adds ... (converted to … in the renderer);
converts the content back to machine-readable.
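A rough Python sketch of the core of these parser steps follows. Several things are assumptions of the example rather than facts about the plugin: the names `char_limit` and `trim_excerpt`, matching WordPress locales by language prefix, the exact set of trimmed separator characters, and the handling of exactly three trailing words, which the list above leaves open. The texturizing step and the six-part split are omitted.

```python
import html
import re

# Meta description limits from the list above, keyed by language prefix.
# Matching locales by prefix is an assumption of this sketch.
META_LIMITS = {"as": 148, "de": 158, "gu": 148, "ml": 100,
               "ja": 70, "ko": 82, "ta": 120, "zh": 70}

def char_limit(kind: str, locale: str = "en_US") -> int:
    """Return the character budget for a description type."""
    if kind in ("opengraph", "twitter"):
        return 200
    return META_LIMITS.get(locale.split("_")[0], 160)

def trim_excerpt(text: str, limit: int) -> str:
    """Hypothetical sketch: cut to the limit and tidy the sentence end."""
    # Convert HTML entities to human-readable characters.
    text = html.unescape(text).strip()
    if len(text) > limit:
        # Cut at the limit and drop the last word on that boundary.
        text = text[:limit].rsplit(" ", 1)[0]
    # Find the last closing punctuation that is followed by a space or the
    # end of the text, so the dot in "v1.2" is treated as connecting.
    match = re.match(r"(?s)(.*[.!?])(?:\s+(.*))?$", text)
    if match:
        trailing_words = (match.group(2) or "").split()
        # Fewer than three trailing words are trimmed; exactly three are
        # kept here, which is a choice made by this sketch.
        if 0 < len(trailing_words) < 3:
            text = match.group(1)
    # Trim dangling separators; the character set is an assumption.
    text = text.rstrip(" ,;:-")
    # Add an ellipsis when the text does not end on closing punctuation.
    if not text.endswith((".", "!", "?")):
        text += "..."
    # Convert the content back to machine-readable entities.
    return html.escape(text, quote=False)
```

A typical call would be `trim_excerpt(clean_content, char_limit("meta", "de_DE"))`, where `clean_content` stands for whatever the getter produced.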

Now we are left with your description, or we have nothing at all. Because we may add symbols to make sentences or inviting content, it can consist of more characters than we would typically recommend.

Step three, the renderer

texturizes the description (converting ... to …, transforming quote tags, etc.);
converts HTML tags to readable HTML, so no intended tags are lost. E.g., <code> remains <code>;
removes all leftover HTML (for secure display, just in case);
converts “Wordpress” to “WordPress”, to keep Mr. Mullenweg happy;
finally trims all spaces.

Now we have secure content that can be used anywhere on your site. For a final secure touch-up, we escape the description for the meta tag’s attribute output.
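The renderer and the final escaping can be sketched in Python as well. This is a minimal illustration that assumes plain-text input and skips the step that preserves intended tags; `render_description` and `meta_tag` are hypothetical names, and only the ellipsis conversion stands in for texturizing.

```python
import html
import re

def render_description(description: str) -> str:
    """Hypothetical sketch of the renderer; assumes plain-text input."""
    # Texturize: only the ellipsis conversion is modeled here.
    text = description.replace("...", "\u2026")
    # Remove all leftover HTML, just in case.
    text = re.sub(r"<[^>]+>", "", text)
    # Correct the capitalization of WordPress.
    text = text.replace("Wordpress", "WordPress")
    # Finally, trim all surrounding spaces.
    return text.strip()

def meta_tag(description: str) -> str:
    """Escape the rendered description for use in an attribute value."""
    safe = html.escape(render_description(description), quote=True)
    return f'<meta name="description" content="{safe}" />'
```

Escaping quotes at the very end is what keeps a description containing a double quote from closing the `content` attribute early.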

Known difficulties

For now, PHP is a single-threaded scripting language, which means it can only do one thing at a time. We don’t want your site visitors to wait for a hidden tag to load, so we are strict and careful about what we integrate. That is why our software is performant and why it takes many months to release updates.

With that, the description generator is almost perfect, but not quite. For example, our generator isn’t aware of your intent, nor does it calculate the pixels used. So, if you want to make it perfect, consider whether it’s worth your time to fill in the meta tag yourself. The people we work with don’t have that time, and so we made this plugin: The SEO Framework was originally called “AutoDescription.”
