Finding Web Elements

Selenium goes beyond automating browsers to load a website; it can retrieve the required data and even perform actions on the website. Here, we walk through a practical use case, using Selenium to extract information from a website.

Setting up

For the successful implementation of browser automation using Selenium, the WebDriver software needs to be set up. The name WebDriver is generic: there are versions of WebDriver for all major browsers. We will now go through the steps to set up WebDriver for Google Chrome, which is called ChromeDriver. Other major browsers have similar steps:

  1. Install Selenium using a third-party installer such as pip, from the command line via this command: pip install selenium.
  2. Download the latest stable release of ChromeDriver from this website, selecting the appropriate version for your operating system.
  3. Unzip the downloaded chromedriver*.zip file. An application file named chromedriver.exe should appear. It is recommended that we place the .exe file in the main folder containing the code.

Now that we have completed the setup steps, we can continue. We have created this dynamic course search form webpage to run our scraper against. We can start by loading the example page.

To begin with, we import webdriver from Selenium and set a path to chromedriver.exe. Selenium does not contain its own web browser; it requires integration with third-party browsers to run. The selenium.webdriver module is used to drive various browsers, in this case Google Chrome. The webdriver.Chrome() method is provided with the path of chromedriver.exe so that it creates an object of the selenium.webdriver.chrome.webdriver.WebDriver class, called "driver" in this example, which will now provide access to the various attributes and methods of WebDriver. The executable chromedriver.exe is instantiated at this point, upon creation of the driver object. A terminal screen and an empty new Google Chrome window will now be loaded.

    from selenium import webdriver

    driver = webdriver.Chrome('YOUR_PATH_TO_chromedriver.exe_FILE')

The new Google Chrome window is then provided with a URL using the get() function from WebDriver. The get() method accepts the URL that is to be loaded in the browser. We provide our example website address as an argument to get(). The browser will then start loading the URL:

    form_url = "https://iqssdss2020.pythonanywhere.com/tutorial/form/search"
    driver.get(form_url)

This can be seen in the following screenshot:

As we can see above, a notice is displayed just below the address bar with the message "Chrome is being controlled by automated test software". This message also confirms the successful execution of the selenium.webdriver action, and the script can be extended with additional code to act on or automate the page that has been loaded.

Following successful execution of the code, it is recommended that we close and quit the driver to free up system resources. The close() method terminates the loaded browser window. The quit() method ends the WebDriver application.

    driver.close()
    driver.quit()
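If the script raises an exception before reaching close() and quit(), the browser window stays open. Wrapping the work in try/finally guarantees cleanup. Here is a sketch of the pattern with a stand-in object, since the pattern itself does not need a real browser (FakeDriver is our illustrative class, not part of Selenium):

```python
class FakeDriver:
    # Stand-in for webdriver.Chrome so the cleanup pattern can run anywhere.
    def __init__(self):
        self.running = True

    def quit(self):
        self.running = False

driver = FakeDriver()
try:
    pass  # scraping work goes here; any exception still triggers the finally
finally:
    driver.quit()  # always runs, so the browser never leaks
print(driver.running)  # → False
```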

Locating web elements

After the new Google Chrome window is loaded with the URL provided, we can find the elements that we need to act on. We first need to find the selector or locator information for those elements of interest. The easiest way to identify this information is to Inspect pages using developer tools. Place the cursor anywhere on the webpage, right-click to open a pop-up menu, then select the Inspect option. In the Elements window, move the cursor over the DOM structure of the page until it reaches the desired element. We then need to note details such as the HTML tag used for the element, its defined attributes, the values of those attributes, and the structure of the page.

Next, we need to tell Selenium how to find a particular element or set of elements on a web page programmatically and simulate user actions on these elements. We simply need to pass the information we identified in the first step to Selenium. Selenium provides various find_element_by methods to find an element based on the attribute/value criteria or selector value that we supply in our script. If a matching element is found, an instance of WebElement is returned; otherwise, a NoSuchElementException is thrown if Selenium is not able to find any element matching the search criteria. Selenium also provides various find_elements_by methods to locate multiple elements. These methods search for and return a list of elements that match the supplied values.

Here, we will provide an overview of the various find_element_by_* and find_elements_by_* methods, with some examples of their use.

  • find_element_by_id() and find_elements_by_id() methods:
    Return an element or a set of elements that have matching ID attribute values. The find_elements_by_id() method returns all the elements that have the same ID attribute value. Let's try finding the search button on the example website. Here is the HTML code for the search button, with an ID attribute value defined as search. We can find this code if we Inspect the site and reach this element in its DOM.

                    <input type="submit" id="search" value="Search" name="q" class="button" />              

    Here is an example that uses the find_element_by_id() method to find the search button. We will pass the ID attribute's value, search, to the find_element_by_id() method:

        search_button = driver.find_element_by_id("search")
  • find_element_by_name() and find_elements_by_name() methods:
    Return element(s) that have matching name attribute value(s). The find_elements_by_name() method returns all the elements that have the same name attribute value. Using the previous example, we can instead find the search button by its name attribute value rather than its ID attribute value, in the following way:

        search_button = driver.find_element_by_name("q")
  • find_element_by_class_name() and find_elements_by_class_name() methods:
    Return element(s) that have matching class attribute value(s). The find_elements_by_class_name() method returns all the elements that have the identical class attribute value. Using the previous example, we can instead find the search button using its class attribute value in the following way:

        search_button = driver.find_element_by_class_name("button")
  • find_element_by_tag_name() and find_elements_by_tag_name() methods:
    Find element(s) by their HTML tag name. The example page displays a search form which has several form fields to fill in. Each form field name is implemented using a <th> (table header cell) tag inside a <tr> (table row) tag, as shown in the following HTML code:

    We will use the find_elements_by_tag_name() method to get all the form field names. In this case, we will first find the table body, implemented as <tbody>, using the find_element_by_tag_name() method, then get all the <tr> (table row) elements by calling the find_elements_by_tag_name() method on the table body object. For each of the first 4 table rows, we then get its form field name using the <th> tag.

        table = driver.find_element_by_tag_name("tbody")
        entries = table.find_elements_by_tag_name("tr")
        for i in range(4):
            header = entries[i].find_element_by_tag_name("th").text
            print(header)
  • find_element_by_xpath() and find_elements_by_xpath() methods:
    Return element(s) that are found by the specified XPath query. XPath is a query language used to search for and locate nodes in an XML document. All major web browsers support XPath. Selenium can leverage powerful XPath queries to find elements on a web page. One of the advantages of XPath is that it works when we can't find a suitable ID, name, or class attribute value for the element. We can use XPath to find the element either in absolute terms or relative to an element that does have an ID or name attribute. We can also use defined attributes other than ID, name, or class in XPath queries, and we can match on partial attribute values using XPath functions such as starts-with(), contains(), and ends-with().

    For example, we want to get the second form field name, "Grade". This element is defined with a <th> tag, but does not have ID, name, or class attributes defined. Also, we cannot use the find_element_by_tag_name() method, as there are multiple <tr> and <th> tags defined on the page. In this case, we can use the find_element_by_xpath() method. To find the XPath of this element, we Inspect the example site, move the cursor over its DOM structure in the Elements window, and find the desired element. We then right-click and choose Copy XPath from the pop-up menu. We obtain the following XPath for this element:

        //*[@id="table"]/tbody/tr[2]/th

    This XPath indicates that the path to our desired element starts from the root, then proceeds to an element with a unique id (id="table"), and then continues until it reaches the desired element. Please note that XPath indices always start at 1 rather than 0, unlike those of built-in Python data structures. We then pass this XPath to the find_element_by_xpath() method as an argument:

        second_header = driver.find_element_by_xpath('//*[@id="table"]/tbody/tr[2]/th').text

    We typically use the XPath method when there is an element with a unique id on the path to the desired element. Otherwise, this method is not reliable.

  • find_element_by_css_selector() and find_elements_by_css_selector() methods:
    Return element(s) that are found by the specified CSS selector. CSS is a style sheet language used by web designers to describe the look and feel of an HTML document. CSS is used to define various style classes that can be applied to elements for formatting. CSS selectors find HTML elements based on their IDs, classes, types, attributes, values, and much more in order to apply the defined CSS rules. Similar to XPath, Selenium can leverage CSS selectors to find elements on a web page.

    In our previous example, in which we wanted to get the search button on the example site, we can use the following selector, where the selector is defined as the element tag along with the class name. This will find an <input> element with the "btn-default" class name. We then test it by automating a click on the search button object we found and checking whether it starts the search successfully.

        search_button = driver.find_element_by_css_selector("input.btn-default")
        search_button.click()
  • find_element_by_link_text() and find_elements_by_link_text() methods:
    Find link(s) using the text displayed for the link. The find_elements_by_link_text() method gets all the link elements that have matching link text. For example, we may want to get the privacy policy link displayed on the example site. Here is the HTML code for the privacy policy link, implemented as an <a> (anchor) tag with the text "privacy policy":

                    This is the <a id="privacy_policy" href="/tutorial/static/views/privacy.html">privacy policy.</a><br/>              

    Let's create a test that locates the privacy policy link using its text and checks whether it's displayed:

        privacypolicy_link = driver.find_element_by_link_text("privacy policy.")
        privacypolicy_link.click()
  • find_element_by_partial_link_text() and find_elements_by_partial_link_text() methods:
    Find link(s) using partial text. For example, on the example site, two links are displayed: one is the privacy policy link with "privacy policy" as text, and the other is the term conditions policy link with "term conditions policy" as text. Let us use this method to find these links using the "policy" text and check whether we have two of these links available on the page:

        policy_links = driver.find_elements_by_partial_link_text("policy")
        print(len(policy_links))
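Some of the lookup logic above can be rehearsed without a browser. The standard-library XML parser supports a small XPath subset, which is enough to mirror the tag-name row loop and to show that XPath indices are 1-based. The markup below is our own mock-up of a form table, not the live example page:

```python
import xml.etree.ElementTree as ET

# Mock-up of a form table like the one on the example page.
html = ("<table id='table'><tbody>"
        "<tr><th>Course</th></tr><tr><th>Grade</th></tr>"
        "<tr><th>Instructor</th></tr><tr><th>Term</th></tr>"
        "</tbody></table>")
root = ET.fromstring(html)

# Mirror of find_elements_by_tag_name("tr"): collect each row's header text.
headers = [row.find("th").text for row in root.find("tbody").findall("tr")]
print(headers)

# XPath indices start at 1, so tr[2] selects the SECOND row's header.
second_header = root.find("./tbody/tr[2]/th").text
print(second_header)
```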

Demo

This section will highlight two use cases to demonstrate the use of various find_elements_by methods. Most often we want to scrape data from tables or article text. The two demos therefore cover these use cases.

Scrape tables

Let's examine this dynamic table webpage. This page uses JavaScript to write a table to a <div> element of the page. If we were to scrape this page's table using traditional methods, we'd just get the loading page, without actually getting the data that we want. Suppose that we want to scrape all cells of this table. The first thing we need to do is complete the setup steps, as detailed in section 4.1. We then proceed to load the example page in our program, as shown below. We go through this loading procedure together once more so that you get used to it.

    from selenium import webdriver
    import time

    driver = webdriver.Chrome('YOUR_PATH_TO_chromedriver.exe_FILE')
    table_url = "https://iqssdss2020.pythonanywhere.com/tutorial/default/dynamic"
    driver.get(table_url)
    time.sleep(2)

We need to make our program pause for some time, in this case 2 seconds, after the get() function instead of immediately executing the next command, because we need to ensure that the webpage has been fully downloaded before the next command in the program runs.
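A fixed sleep wastes time when the page loads faster and fails when it loads slower. Selenium's WebDriverWait polls a condition instead. Its core idea can be sketched in plain Python (wait_until is our illustrative helper, not Selenium's API):

```python
import time

def wait_until(condition, timeout=2.0, poll=0.1):
    # Re-check a condition until it returns a truthy value or the timeout
    # expires, much like Selenium's WebDriverWait(driver, timeout).until(...).
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)

# A condition that is already true returns immediately, with no fixed delay.
print(wait_until(lambda: "page loaded"))
```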

Before we actually start, we have to think about how to store the scraped information in a convenient format, like a .csv file. We can use the Python file operation methods to accomplish this. Here's an implementation that creates a file object to write data to:

    file = open('C:\\Users\\JLiu\\Desktop\\Web_Tutorial\\table.csv', "w", encoding="utf-8")
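As a side note, a with-block closes the file automatically even if the scraper raises partway through, and pathlib handles platform-specific path separators. Here is a minimal sketch, writing to a temporary directory instead of the hard-coded Desktop path:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "table.csv"
    # The with-block guarantees the file is closed when the block exits,
    # even if an exception interrupts the scraping code inside it.
    with out_path.open("w", encoding="utf-8") as f:
        f.write("Course,Instructor\n")
    contents = out_path.read_text(encoding="utf-8")
print(contents)
```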

In order to scrape cells, we need to locate them in the DOM structure of the example table webpage. If we Inspect this page, we can see that the table is defined with a <tbody> tag inside a <table> tag. Each table row is defined with a <tr> tag, and there are multiple table rows. So in the program we scrape all the table rows and store them in a list called "entries".

    table_body = driver.find_element_by_xpath('//*[@id="result"]/table/tbody')
    entries = table_body.find_elements_by_tag_name('tr')

The first table row is the table header row; each of its fields is defined with a <th> (header cell) tag. So we loop over each cell of this first row and write the data in each cell to a string, separated by commas and ended with a newline. When the loop is over, we write this string to the .csv file as one row.

    headers = entries[0].find_elements_by_tag_name('th')
    table_header = ''
    for i in range(len(headers)):
        header = headers[i].text
        if i == len(headers) - 1:
            table_header = table_header + header + "\n"
        else:
            table_header = table_header + header + ","
    file.write(table_header)

In each of the other table rows, there are multiple data cells, and each data cell is defined with a <td> tag. We have to find each cell using find_elements_by methods and get its data. There is no way to scrape the whole table directly. Given this, the natural logic is to loop row by row and, within each row, cell by cell. So we need a double for loop in our script.

    for i in range(1, len(entries)):
        cols = entries[i].find_elements_by_tag_name('td')
        table_row = ''
        for j in range(len(cols)):
            col = cols[j].text
            if j == len(cols) - 1:
                table_row = table_row + col + "\n"
            else:
                table_row = table_row + col + ","
        file.write(table_row)
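One caveat with joining cells by hand: if a cell's text itself contains a comma, the column structure of the .csv is corrupted. The standard-library csv module quotes such cells automatically. A small sketch with made-up cell values:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# The middle cell contains a comma; csv.writer wraps it in quotes so the
# row still parses back into exactly three columns.
writer.writerow(["Intro to Stats", "Smith, Jane", "Fall"])
print(buf.getvalue())
```

The same csv.writer can write directly to the open file object instead of an in-memory buffer.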

Finally, we close the driver and the file:

    driver.close()
    file.close()

Scrape text

Let us examine this live website of an online article. The article on this page has many subsections, each of which has multiple paragraphs and even bullet points. Suppose that we want to scrape the whole text of the article. One interesting way to do it is to scrape all the subsections separately first and then concatenate them together. The advantage of doing it this way is that we can also get each subsection's text.

Let us Inspect this website. Move the cursor to the element of its DOM that defines the article content area. Under this <div> element, we can see that subsection headers have tag names all starting with "h", paragraphs have a <p> tag name, and bullet-point sections have a <ul> tag name. The elements with these tag names are all parallel with one another, rather than embedded in a hierarchical structure. This pattern dictates that we should not write a nested loop in our script to access them, for example, to access each paragraph under a subsection. Another point to note is that here we use a Python dictionary to store each subsection's text. For each key-value pair in this dictionary, the key stores the subsection title, and the value stores its paragraphs of text. So this is a convenient data structure for this use case. The following program implements our strategy to scrape the whole text of the article:

    # same as the set-up chunk of code
    ...
    journalAddress = "https://www.federalregister.gov/documents/2013/09/24/2013-21228/affirmative-action-and-nondiscrimination-obligations-of-contractors-and-subcontractors-regarding"
    # same as the set-up chunk of code
    ...
    time.sleep(2)

    articleObjects = driver.find_elements_by_xpath('//div[@id="fulltext_content_area"]/*')
    articleDictionary = dict()
    myKey = ""
    myValue_total = ""

The program above has put all web elements related to the article content into a list called "articleObjects". Since all of these web elements are parallel with each other rather than in a nested structure, we simply use a single for loop over each web element on the list and scrape its content into the right place in the dictionary we have created. If the tag name of a web element on the list starts with "h", then its content should be a subsection title; we scrape its content into a string variable "myKey". If the tag name of a web element on the list starts with "p" or "ul", then its content should be either a paragraph or a set of bullet points under that subsection title; we scrape its content and append it to a string variable "myValue_total". Once we meet the next subsection title, the program has appended all paragraphs and bullet points under the current subsection title and stored them in "myValue_total". At this point, we insert the key-value pair (the current subsection title as the key, and all the paragraphs and bullet points under it as the value) into the dictionary. After this, we replace the key, which is the current subsection title, with the next subsection title, and repeat the above steps.

    for i in range(len(articleObjects)):
        tagName = articleObjects[i].tag_name
        if tagName.startswith("h"):
            if myKey:
                articleDictionary[myKey] = myValue_total
                myValue_total = ""
            myKey = articleObjects[i].get_attribute("innerText")
        if tagName.startswith("p"):
            myValue = articleObjects[i].get_attribute("innerText")
            myValue_total = myValue_total + myValue
        if tagName.startswith("ul"):
            myBullets = articleObjects[i].find_elements_by_tag_name('li')
            for j in range(len(myBullets)):
                myBullet = myBullets[j].get_attribute("innerText")
                myValue_total = myValue_total + myBullet
    if myKey:
        # Flush the last subsection, which the loop alone would leave unsaved.
        articleDictionary[myKey] = myValue_total
    driver.close()
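The grouping logic in this loop can be exercised without a browser. Here is a pure-Python mirror of it, where (tag, text) pairs stand in for the WebElements (group_sections is our illustrative helper, not part of the scraper above). Note the final flush after the loop, so the last subsection is not lost:

```python
def group_sections(nodes):
    # nodes: flat list of (tag_name, text) pairs, in document order.
    sections = {}
    key, value = "", ""
    for tag, text in nodes:
        if tag.startswith("h"):
            if key:  # a new header closes the previous subsection
                sections[key] = value
                value = ""
            key = text
        elif tag.startswith("p") or tag.startswith("ul"):
            value += text
    if key:
        sections[key] = value  # flush the final subsection after the loop
    return sections

print(group_sections([("h2", "Intro"), ("p", "First."), ("h2", "Scope"), ("p", "Second.")]))
# → {'Intro': 'First.', 'Scope': 'Second.'}
```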

After the loop is done, we have scraped all the subsections separately and stored them in a dictionary. Finally, we just need to loop over each key-value pair in this dictionary and concatenate their contents as we go.

    article = ''
    for key, value in articleDictionary.items():
        article = article + key + '\n\n' + value + '\n\n***************\n\n'
    print(article)

NoSuchElementException

When a web element is not found, Selenium throws a NoSuchElementException. The reason for a NoSuchElementException can be any of the following:

  • The way of locating a web element we have adopted doesn't identify any element in the HTML DOM.
  • The way of locating a web element we have adopted doesn't uniquely identify the desired element in the HTML DOM and currently finds some other hidden or invisible element.
  • The way of locating a web element we have adopted is unable to identify the desired element because it is not within the browser's viewport.
  • The way of locating a web element we have adopted identifies the element, but it is invisible due to the presence of the attribute style="display: none;".
  • The WebElement we are trying to locate is within an <iframe> tag.
  • The WebDriver instance is looking for the WebElement before the element is present/visible within the HTML DOM.

The solution to a NoSuchElementException can be any of the following:

  1. When the element we locate does not exist in the DOM, use a try-except exception handler to avoid termination of the program:
        from selenium.common.exceptions import NoSuchElementException

        try:
            elem = driver.find_element_by_xpath("element_xpath")
            elem.click()
        except NoSuchElementException:
            pass

This solution addresses inconsistency in the DOM among seemingly identical pages.

  2. When the page loads, for some reason we may be taken to the bottom of the page, but the element we need to scrape is at the top of the page and thus out of view. In this situation, we can locate the element in the DOM first and then use the execute_script() method to scroll the element into view:
        elem = driver.find_element_by_xpath("element_xpath")
        driver.execute_script("arguments[0].scrollIntoView();", elem)
  3. In case the element has the attribute style="display: none;", remove that attribute through the execute_script() method:
        elem = driver.find_element_by_xpath("element_xpath")
        driver.execute_script("arguments[0].removeAttribute('style')", elem)
        elem.send_keys("text_to_send")
  4. Adopt a way of locating a web element which uniquely identifies the desired WebElement. The preferable method is find_element_by_id(), since the id attribute uniquely identifies a web element.

  5. To check whether the element is within an <iframe>, traverse up the HTML to locate the respective <iframe> tag and use the switch_to.frame() method to shift to the desired iframe through any of the following approaches:

    driver.switch_to.frame("iframe_name")
    driver.switch_to.frame("iframe_id")
    driver.switch_to.frame(1)  # 1 represents the frame index

We can switch back to the main frame by using one of the following methods:

    driver.switch_to.default_content()
    driver.switch_to.parent_frame()

A better way to switch frames is to use WebDriverWait() for the availability of the intended frame, with expected_conditions set to frame_to_be_available_and_switch_to_it, as in the following examples:

    WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "id_of_iframe")))  # through frame ID
    WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, "name_of_iframe")))  # through frame name
    WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "xpath_of_iframe")))  # through frame XPath
    WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "css_of_iframe")))  # through frame CSS selector
  6. If the element is not present/visible in the HTML DOM immediately, use WebDriverWait with expected_conditions set to the proper method, as follows:
  • To wait for presence_of_element_located:
    element = WebDriverWait(driver, 20).until(expected_conditions.presence_of_element_located((By.XPATH, "element_xpath")))
  • To wait for visibility_of_element_located:
    element = WebDriverWait(driver, 20).until(expected_conditions.visibility_of_element_located((By.CSS_SELECTOR, "element_css")))
  • To wait for element_to_be_clickable:
    element = WebDriverWait(driver, 20).until(expected_conditions.element_to_be_clickable((By.LINK_TEXT, "element_link_text")))