The transforms in this category access the Document Object Model (DOM) of the current document:

Transform     Description
attribute     Searches the current HTML document for an element that matches the query terms and returns the named attribute
query         Searches the current HTML document for an element that matches the query terms and returns the text of the element using textContent
queryInner    Searches the current HTML document for an element that matches the query terms and returns the text of the element using innerText

attribute:selector:index:attribute-name

The following information is intended for advanced users.

Returns the value of the HTML attribute with the given attribute-name from the index-th HTML element that matches the selector. All the parameters are required.

The attribute transform is intended to allow ORA users to write templates that access attributes, such as HREF values, that ORA does not extract. The selector and index parameters have the same purpose as for the query transform.

The selector parameter must be a valid CSS selector. It is beyond the scope of this help page to explain CSS Selectors.

The attribute transform has a couple of unusual characteristics:

  • The attribute transform may only be used with a special Field named "DOM". If you attempt to use it with any other Field, the result will be an empty string.
  • The attribute transform cannot be tested on the OraSettings page. It needs the HTML of the collection page for which it is intended, and that HTML is not available from the OraSettings page. When run there, the transform inspects the HTML of the OraSettings page itself.

Example

To return the HREF attribute of the first "A" (link) element on the page: [DOM:attribute:a:1:href]. If the page has an A element, the result is the value of its HREF attribute.

Selectors are usually more involved than the selector used in the example above.
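
As a rough model of what the example above does, the following Python sketch collects the elements with a given tag name and returns one attribute of the index-th match. It is not ORA's implementation: the function name attribute_of is hypothetical, only bare tag-name selectors are handled (ORA accepts any CSS selector), and returning an empty string when nothing matches is an assumption.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the attributes of every start tag with the given name."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag.lower()
        self.matches = []  # one dict of attributes per matching element

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == self.tag:
            self.matches.append(dict(attrs))


def attribute_of(html: str, tag: str, index: int, name: str) -> str:
    """Hypothetical stand-in for [DOM:attribute:tag:index:name]."""
    collector = TagCollector(tag)
    collector.feed(html)
    collector.close()
    # The index is 1-based: 1 means the first matching element.
    if index < 1 or index > len(collector.matches):
        return ""  # assumption: no match yields an empty string
    return collector.matches[index - 1].get(name.lower()) or ""


page = '<p><a href="/records/42">Record</a> <a href="help.html">Help</a></p>'
first_href = attribute_of(page, "a", 1, "href")
second_href = attribute_of(page, "a", 2, "href")
missing_href = attribute_of(page, "a", 3, "href")
```

With this sample page, the first call picks the HREF of the first link and the second call picks the HREF of the second link. The page has no third link, so the last call falls through to the empty-string branch.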

HREFs

To convert an HREF attribute value to a full URL, pass the attribute result to the hrefToUrl transform:

[DOM:attribute:a:1:href:hrefToUrl]
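
An HREF value is often relative to the page that contains it. To illustrate the kind of conversion involved, the sketch below uses urljoin from Python's standard library to resolve HREF values against a page URL. This is an assumption about what hrefToUrl does, not a description of its internals, and the page URL and file names are made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the template runs on.
page_url = "https://www.example.com/collections/census/record.html?id=42"

# A root-relative HREF replaces the whole path.
root_relative = urljoin(page_url, "/images/scan-001.jpg")

# A document-relative HREF is resolved against the page's directory.
doc_relative = urljoin(page_url, "scan-002.jpg")

# An HREF that is already a full URL is returned unchanged.
absolute = urljoin(page_url, "https://cdn.example.org/scan-003.jpg")
```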

query:selector:index

The following information is intended for advanced users.

Returns the text of the index-th HTML element that matches the selector, read from the element's textContent property. The index is optional and defaults to 1, which returns the text of the first matching element.

The query and queryInner transforms are very similar and only differ in how the text of the element is interpreted. You may have to experiment to see which one is better suited to your usage.
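
One well-known difference between the two properties is that textContent includes the contents of script and style elements, while innerText returns only text that is rendered. The Python sketch below imitates just that one difference. It is a simplified model, not a browser: real innerText also takes CSS visibility and line breaks into account, and the class and function names here are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text, optionally skipping <script> and <style> contents."""

    NON_RENDERED = {"script", "style"}

    def __init__(self, skip_non_rendered: bool) -> None:
        super().__init__()
        self.skip_non_rendered = skip_non_rendered
        self.skip_depth = 0  # > 0 while inside a script or style element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NON_RENDERED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NON_RENDERED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_non_rendered and self.skip_depth > 0:
            return
        self.parts.append(data)


def extract_text(html: str, skip_non_rendered: bool) -> str:
    extractor = TextExtractor(skip_non_rendered)
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)


fragment = "<div><style>h2 { color: red; }</style><h2>Part One</h2></div>"
like_text_content = extract_text(fragment, skip_non_rendered=False)
like_inner_text = extract_text(fragment, skip_non_rendered=True)
```

For this fragment, the textContent-like result includes the style rule in front of "Part One", while the innerText-like result is just "Part One". If a selector matches a container with embedded styles or scripts, this is the kind of difference to look for.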

The query transform uses the Document.querySelectorAll() method. It is intended to allow ORA users to write templates that access text that ORA does not extract.

The selector parameter must be a valid CSS selector. It is beyond the scope of this help page to explain CSS Selectors.

The query transform has a couple of unusual characteristics:

  • The query transform may only be used with a special Field named "DOM". If you attempt to use it with any other Field, the result will be an empty string.
  • The query transform cannot be tested on the OraSettings page. It needs the HTML of the collection page for which it is intended, and that HTML is not available from the OraSettings page. When run there, the transform inspects the HTML of the OraSettings page itself.

Examples

  1. Return the text of the first "H2" element on the page:
    [DOM:query:h2]

    If the page has an H2 element with the text "Part One", the result is "Part One".

  2. Return the text of the second "H2" element on the page:
    [DOM:query:h2:2]

    If the page has two H2 elements, "Part One" and "Part Two", the result is "Part Two".

Selectors are usually more involved than the selectors used in the examples above.

queryInner:selector:index

The following information is intended for advanced users.

Returns the text of the index-th HTML element that matches the selector, read from the element's innerText property. The index is optional and defaults to 1, which returns the text of the first matching element.

The query and queryInner transforms are very similar and only differ in how the text of the element is interpreted. You may have to experiment to see which one is better suited to your usage.

The queryInner transform uses the Document.querySelectorAll() method. It is intended to allow ORA users to write templates that access text that ORA does not extract.

The selector parameter must be a valid CSS selector. It is beyond the scope of this help page to explain CSS Selectors.

The queryInner transform has a couple of unusual characteristics:

  • The queryInner transform may only be used with a special Field named "DOM". If you attempt to use it with any other Field, the result will be an empty string.
  • The queryInner transform cannot be tested on the OraSettings page. It needs the HTML of the collection page for which it is intended, and that HTML is not available from the OraSettings page. When run there, the transform inspects the HTML of the OraSettings page itself.

Examples

  1. Return the text of the first "H2" element on the page:
    [DOM:queryInner:h2]

    If the page has an H2 element with the text "Part One", the result is "Part One".

  2. Return the text of the second "H2" element on the page:
    [DOM:queryInner:h2:2]

    If the page has two H2 elements, "Part One" and "Part Two", the result is "Part Two".

Selectors are usually more involved than the selectors used in the examples above.

Advice

The attribute, query, and queryInner transforms, which are always applied to the "DOM" pseudo-field, are examples of "giving users enough rope". They are a doorway into a technical world that most end users will find confusing, daunting, tedious, or worse. It's best to avoid these transforms. I recommend using them only when absolutely necessary, and only when I think some of the issues I describe below can be avoided.

The rest of this section will mostly focus on the query and queryInner transforms, but it applies equally well to the attribute transform.

The first challenge associated with using the query and queryInner transforms is defining the selector parameter, which is a CSS selector. CSS selectors are very powerful, but they are intended for use by programmers or HTML and CSS authors.

To use the query and queryInner transforms, in addition to having some familiarity with CSS selectors, you also have to inspect the HTML page to see how it is structured. The Developer Tools facility built into most browsers is invaluable for that task.

If you right-click on a part of a web page and choose the "inspect" item from the context menu, that will open the Developer Tools panel and highlight an HTML node. That is the typical starting point for writing a selector that matches the current node, but it is only a starting point. You have to review the HTML node in context to determine the appropriate CSS selector value. There are many selectors that will work, but many will be too loose or too tight.

  • A "too loose" selector will select a lot of other nodes, and that will make the selector fragile. The selector may work for the exact current page, but won't work for pages associated with other records in the same collection.

    For example, if your query or queryInner transform uses a selector that matches 28 DIV elements, and you specify index 17, the 17 may be correct on the current page, but on some other page, you may need 16, or 18, or some other index value.

  • A "too tight" selector will select only one or a handful of nodes on the current page, but may not match any nodes on any other pages. The issue here is that web servers often customize the HTML and CSS, and a slight change by the server will invalidate CSS selectors that depend on those values.

    For example, you may see an HTML element where the class includes a prefix or suffix with a value that seems computed, such as class="rightSideCss_r1y3a6mc". If you write a selector that includes the suffix ("_r1y3a6mc"), the next time you visit the site, that suffix may change and your selector will no longer work.
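
Both failure modes can be made concrete with a small model. The Python sketch below does not run real CSS selectors. It imitates an index into a loose list of matches, a class selector that includes the computed suffix, and an attribute prefix selector of the form [class^="rightSideCss_"]. The page contents and the second class value are invented for illustration. Whether a prefix match is "just right" for a given site is a judgment call, not a rule.

```python
def class_selector_matches(class_attr: str, class_name: str) -> bool:
    """Models `.class_name`: the name must be one of the space-separated classes."""
    return class_name in class_attr.split()


def prefix_selector_matches(class_attr: str, prefix: str) -> bool:
    """Models `[class^="prefix"]`: the attribute value must start with prefix."""
    return class_attr.startswith(prefix)


# Too loose: the same 1-based index lands on different content when
# another page in the collection has one extra matching element.
page_a_divs = ["header", "menu", "person name", "footer"]
page_b_divs = ["header", "banner", "menu", "person name", "footer"]
index = 3
loose_on_a = page_a_divs[index - 1]
loose_on_b = page_b_divs[index - 1]

# Too tight: the full computed class name matches today but not after it changes.
today = "rightSideCss_r1y3a6mc"       # value from the example above
next_visit = "rightSideCss_k8p2w5zt"  # hypothetical regenerated value
tight_today = class_selector_matches(today, "rightSideCss_r1y3a6mc")
tight_later = class_selector_matches(next_visit, "rightSideCss_r1y3a6mc")

# Matching only the stable prefix survives the change.
prefix_today = prefix_selector_matches(today, "rightSideCss_")
prefix_later = prefix_selector_matches(next_visit, "rightSideCss_")
```

In this model, index 3 returns "person name" on the first page but "menu" on the second, and the full class name stops matching once the suffix changes, while the prefix match holds for both values.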

So, like Goldilocks, you need to find the "just right" selector between too loose and too tight. Doing so often requires a lot of technical knowledge and experience.