How to Read Values From a Webpage Using Selenium
Finding Web Elements
Selenium goes beyond automating browsers to load a website: it can retrieve the required data and even perform actions on the website. Here, we walk through a practical use case, using Selenium to extract information from a website.
Setting up
For the successful implementation of browser automation using Selenium, the software WebDriver needs to be set up. The name WebDriver is generic: there are versions of WebDriver for all major browsers. We will now go through the steps to set up WebDriver for Google Chrome, which is called ChromeDriver. Other major browsers have similar steps:
- Install Selenium using a third-party installer such as pip, from the command line: pip install selenium.
- Download the latest stable release of ChromeDriver from this website, selecting the appropriate version for your operating system.
- Unzip the downloaded chromedriver*.zip file. An application file named chromedriver.exe should appear. It is recommended that we place the .exe file in the main folder containing the code.
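As a quick sanity check that the pieces are wired together, a minimal sketch like the following can be run (the path is a placeholder for your local chromedriver.exe location):

from selenium import webdriver

# Start a Chrome session via the local ChromeDriver executable.
driver = webdriver.Chrome('YOUR_PATH_TO_chromedriver.exe_FILE')
print(driver.name)  # should print "chrome" if the setup works
driver.quit()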
Now that we have completed the setup steps, we can proceed. We have created this dynamic complete search form webpage to run our scraper against. We can start by loading the example page.
To begin, we import WebDriver from Selenium and set a path to chromedriver.exe. Selenium does not contain its own web browser; it requires integration with third-party browsers to run. The selenium.webdriver module is used to drive various browsers, in this case Google Chrome. The webdriver.Chrome() method is provided with the path to chromedriver.exe so that it creates an object of the selenium.webdriver.chrome.webdriver.WebDriver class, called "driver" in this example, which will provide access to the various attributes and methods of WebDriver. The executable chromedriver.exe is launched upon creation of the driver object. A terminal screen and an empty new Google Chrome window will now be loaded.
from selenium import webdriver

driver = webdriver.Chrome('YOUR_PATH_TO_chromedriver.exe_FILE')

The new Google Chrome window is then provided with a URL using the get() function from WebDriver. The get() method accepts the URL that is to be loaded in the browser. We provide our example website address as an argument to get(), and the browser will start loading the URL:
form_url = "https://iqssdss2020.pythonanywhere.com/tutorial/form/search"
driver.get(form_url)

This can be seen in the following screenshot:
As we can see above, a notice is displayed just below the address bar with the message "Chrome is being controlled by automated test software". This message confirms the successful execution of the selenium.webdriver action, and the loaded page can now be acted on or automated with additional code.
Following successful execution of the code, it is recommended that we close and quit the driver to free up system resources. The close() method terminates the loaded browser window. The quit() method ends the WebDriver application.
driver.close()
driver.quit()

Locating web elements
After the new Google Chrome window is loaded with the URL provided, we can find the elements that we need to act on. We first need to find the selector or locator information for those elements of interest. The easiest way to identify this information is to inspect pages using developer tools. Place the cursor anywhere on the webpage, right-click to open a pop-up menu, then select the Inspect option. In the Elements window, move the cursor over the DOM structure of the page until it reaches the desired element. We then need to note information such as which HTML tag is used for the element, its defined attributes and their values, and the structure of the page.
Next, we need to tell Selenium how to find a particular element or set of elements on a web page programmatically and simulate user actions on these elements. We just need to pass the information we identified in the first step to Selenium. Selenium provides various find_element_by methods to find an element based on the attribute/value criteria or selector value that we supply in our script. If a matching element is found, an instance of WebElement is returned; if Selenium cannot find any element matching the search criteria, a NoSuchElementException is thrown. Selenium also provides various find_elements_by methods to locate multiple elements. These methods search for and return a list of all elements that match the supplied values.
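As a quick illustration of the difference between the two method families, here is a minimal sketch; the id value "search" comes from the search-button example later in this section, and the rest is generic:

from selenium.common.exceptions import NoSuchElementException

# find_element_* returns a single WebElement, or raises NoSuchElementException.
try:
    search_button = driver.find_element_by_id("search")
    print(search_button.tag_name)
except NoSuchElementException:
    print("no element with id='search' on this page")

# find_elements_* returns a list, which is simply empty when nothing matches.
rows = driver.find_elements_by_tag_name("tr")
print(len(rows))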
Here, we provide an overview of the various find_element_by_* and find_elements_by_* methods, with some examples of their use.
- find_element_by_id() and find_elements_by_id() methods:
Return an element or a set of elements that have matching ID attribute values. The find_elements_by_id() method returns all the elements that have the same ID attribute value. Let's try finding the search button on the example website. Here is the HTML code for the search button, with an ID attribute value defined as search. We can find this code if we Inspect the site and reach this element in its DOM.

<input type="submit" id="search" value="Search" name="q" class="button" />

Here is an example that uses the find_element_by_id() method to find the search button. We pass the ID attribute's value, search, to the find_element_by_id() method:

search_button = driver.find_element_by_id("search")
- find_element_by_name() and find_elements_by_name() methods:
Return element(s) that have matching name attribute value(s). The find_elements_by_name() method returns all the elements that have the same name attribute value. Using the previous example, we can instead find the search button by its name attribute value rather than its ID attribute value, in the following way:

search_button = driver.find_element_by_name("q")
- find_element_by_class_name() and find_elements_by_class_name() methods:
Return element(s) that have matching class attribute value(s). The find_elements_by_class_name() method returns all the elements that have the same class attribute value. Using the previous example, we can instead find the search button by its class attribute value in the following way:

search_button = driver.find_element_by_class_name("button")
- find_element_by_tag_name() and find_elements_by_tag_name() methods:
Find element(s) by their HTML tag name. The example page displays a search form which has several form fields to fill in. Each form field name is implemented as a <th> (table header cell) tag inside a <tr> (table row) tag. We will use the find_elements_by_tag_name() method to get all the form field names. In this case, we first find the table body, implemented as <tbody>, using the find_element_by_tag_name() method, and then get all the <tr> (table row) elements by calling the find_elements_by_tag_name() method on the table body object. For each of the first four table rows, we then get its form field name using the <th> tag:

table = driver.find_element_by_tag_name("tbody")
entries = table.find_elements_by_tag_name("tr")
for i in range(4):
    header = entries[i].find_element_by_tag_name("th").text
    print(header)
- find_element_by_xpath() and find_elements_by_xpath() methods:
Return element(s) that are found by the specified XPath query. XPath is a query language used to search and locate nodes in an XML document, and all major web browsers support it. Selenium can leverage powerful XPath queries to find elements on a web page. One advantage of XPath is that it works when we can't find a suitable ID, name, or class attribute value for the element. We can use XPath to find the element either in absolute terms or relative to an element that does have an ID or name attribute. We can also use defined attributes other than ID, name, or class in XPath queries, and we can match on partial attribute values using XPath functions such as starts-with(), contains(), and ends-with() (a short example using one of these functions appears after this list).
For example, suppose we want to get the second form field name, "Grade". This element is defined as a <th> tag, but does not have ID, name, or class attributes defined. Also, we cannot use the find_element_by_tag_name() method, as there are multiple <tr> and <th> tags defined on the page. In this case, we can use the find_element_by_xpath() method. To find the XPath of this element, we Inspect the example site, move the cursor over its DOM structure in the Elements window until we find the desired element, then right-click and choose Copy XPath from the pop-up menu. We obtain the following XPath for this element:

//*[@id="table"]/tbody/tr[2]/th

This XPath indicates that the path to our desired element starts from the root, proceeds to an element with a unique id (id="table"), and then continues until it reaches the desired element. Please note that XPath indices always start at 1 rather than 0, unlike those of built-in Python data structures. We then pass this XPath to the find_element_by_xpath() method as an argument:

second_header = driver.find_element_by_xpath('//*[@id="table"]/tbody/tr[2]/th').text

We typically use the XPath method when there is an element with a unique id on the path to the desired element. Otherwise, this method is not reliable.
- find_element_by_css_selector() and find_elements_by_css_selector() methods:
Return element(s) that are found by the specified CSS selector. CSS is a style sheet language used by web designers to describe the look and feel of an HTML document. CSS is used to define various style classes that can be applied to elements for formatting. CSS selectors find HTML elements based on their IDs, classes, types, attributes, values, and much more, in order to apply the defined CSS rules. As with XPath, Selenium can leverage CSS selectors to find elements on a web page. In our previous example, in which we wanted to get the search button on the example site, we can use the following selector, where the selector is defined as the element tag along with the class name. This will find an <input> element with the "btn-default" class name. We then test it by automating a click on the search button object we found and seeing whether it starts the search successfully:

search_button = driver.find_element_by_css_selector("input.btn-default")
search_button.click()
- find_element_by_link_text() and find_elements_by_link_text() methods:
Find link(s) using the text displayed for the link. The find_elements_by_link_text() method gets all the link elements that have matching link text. For example, we may want to get the privacy policy link displayed on the example site. Here is the HTML code for the privacy policy link, implemented as an <a> (anchor) tag with the text "privacy policy.":

This is the <a id="privacy_policy" href="/tutorial/static/views/privacy.html">privacy policy.</a><br/>

Let's create a test that locates the privacy policy link using its text and checks that it works by clicking it:

privacypolicy_link = driver.find_element_by_link_text("privacy policy.")
privacypolicy_link.click()
- find_element_by_partial_link_text() and find_elements_by_partial_link_text() methods:
Find link(s) using partial text. For example, on the example site, two links are displayed: one is the privacy policy link with "privacy policy" as its text, and the other is the term conditions policy link with "term conditions policy" as its text. Let us use this method to find these links via the shared "policy" text and check that both links are available on the page:

policy_links = driver.find_elements_by_partial_link_text("policy")
print(len(policy_links))
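As promised in the XPath item above, here is a short sketch of an XPath function used for partial attribute matching. It assumes the search button with id="search" from the earlier examples; note that ends-with() is an XPath 2.0 function and is not available in the XPath 1.0 engines that browsers implement:

# Matches any <input> whose id attribute contains the substring "sear".
search_button = driver.find_element_by_xpath('//input[contains(@id, "sear")]')

# starts-with() works similarly, anchored at the beginning of the value.
search_button = driver.find_element_by_xpath('//input[starts-with(@id, "sear")]')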
Demo
This section highlights two use cases to demonstrate the use of the various find_elements_by methods. Most often we want to scrape data from tables or from article text. The two demos therefore cover these use cases.
Scrape tables
Let's examine this dynamic table webpage. This page uses JavaScript to write a table to a <div> element of the page. If we were to scrape this page's table using traditional methods, we'd just get the loading page, without actually getting the data that we want. Suppose that we want to scrape all cells of this table. The first thing we need to do is complete the setup steps detailed in the Setting up section. We then proceed to load the example page in our program, as shown below. We go through this loading procedure together once more so that you get used to it.
from selenium import webdriver
import time

driver = webdriver.Chrome('YOUR_PATH_TO_chromedriver.exe_FILE')
table_url = "https://iqssdss2020.pythonanywhere.com/tutorial/default/dynamic"
driver.get(table_url)
time.sleep(2)

We make our program pause for some time, in this case 2 seconds, after the get() call instead of immediately executing the next command, because we need to ensure that the webpage has been fully downloaded before the next command in the program runs.
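As an aside, a fixed sleep wastes time when the page loads quickly and can still fail when it loads slowly. An explicit wait is a common alternative; here is a minimal sketch, assuming the table is rendered inside the element with id "result" (the same id used in the XPath below):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the dynamically written table body to appear.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="result"]/table/tbody'))
)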
Before we actually start, we have to think about how to store the scraped data in a nice format, like a .csv file. We can use Python's file operation methods to achieve this. Here is an implementation that creates a file object to write data to:
file = open('C:\\Users\\JLiu\\Desktop\\Web_Tutorial\\table.csv', "w", encoding="utf-8")
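A note on this design choice: building comma-separated strings by hand, as the loops below do, breaks if a cell's text itself contains a comma. Python's standard csv module handles the quoting automatically; here is a sketch of the same file setup using it (the path is shortened for illustration):

import csv

# newline='' prevents blank lines between rows on Windows.
csv_file = open('table.csv', 'w', encoding='utf-8', newline='')
writer = csv.writer(csv_file)
# Then, for each scraped row: writer.writerow([cell_1, cell_2, ...])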
In order to scrape the cells, we need to locate them in the DOM structure of the example table webpage. If we Inspect this page, we can see that the table is defined with a <tbody> tag inside a <table> tag. Each table row is defined with a <tr> tag, and there are multiple table rows. So, in the program, we scrape all the table rows and store them in a list called "entries":

table_body = driver.find_element_by_xpath('//*[@id="result"]/table/tbody')
entries = table_body.find_elements_by_tag_name('tr')

The first table row is the table header row; each of its fields is defined with a <th> (header cell) tag. So, we loop over each cell of this first row and append its data to a string, separating cells with commas and ending with a newline. When the loop is over, we write this string to the .csv file as one row:
headers = entries[0].find_elements_by_tag_name('th')
table_header = ''
for i in range(len(headers)):
    header = headers[i].text
    if i == len(headers) - 1:
        table_header = table_header + header + "\n"
    else:
        table_header = table_header + header + ","
file.write(table_header)

In each of the other table rows, there are multiple data cells, and each data cell is defined with a <td> tag. We have to find each cell using find_elements_by methods and get its data; there is no way to directly scrape the whole table at once. Given this, the natural logic is to loop row by row and, in each row, loop cell by cell. So, we need a double for loop in our script:
for i in range(1, len(entries)):
    cols = entries[i].find_elements_by_tag_name('td')
    table_row = ''
    for j in range(len(cols)):
        col = cols[j].text
        if j == len(cols) - 1:
            table_row = table_row + col + "\n"
        else:
            table_row = table_row + col + ","
    file.write(table_row)

Finally, we close the driver and the file:
driver.close()
file.close()

Scrape text
Let us examine this live website of an online article. The article on this page has many subsections, each of which has multiple paragraphs and even bullet points. Suppose that we want to scrape the whole text of the article. One interesting way to do it is to scrape all the subsections separately first and then concatenate them together. The advantage of doing it this way is that we also get each subsection's text.
Let us Inspect this website and move the cursor to the element of its DOM that defines the article content area. Under this <div> element, we can see that subsection headers have tag names all starting with "h", paragraphs have a <p> tag name, and bullet-point sections have a <ul> tag name. The elements with these tag names all sit parallel to one another rather than nested in a hierarchical structure. This pattern dictates that we should not write a nested loop in our script to access them, for example, to access each paragraph under a subsection. Another point to note is that we use a Python dictionary to store each subsection's text: for each key-value pair, the key stores the subsection title and the value stores its paragraphs of text. This is a convenient data structure for this use case. The following program implements our strategy to scrape the whole text of the article:
# same as the setup chunk of code ...
journalAddress = "https://www.federalregister.gov/documents/2013/09/24/2013-21228/affirmative-action-and-nondiscrimination-obligations-of-contractors-and-subcontractors-regarding"
# same as the setup chunk of code ...
time.sleep(2)
articleObjects = driver.find_elements_by_xpath('//div[@id="fulltext_content_area"]/*')
articleDictionary = dict()
myKey = ""
myValue_total = ""

The program above puts all web elements related to the article content into a list called "articleObjects". Since all of these web elements are parallel to each other rather than nested, we use a single for loop to visit each web element in the list and scrape its content into the right place in the dictionary as we go. If the tag name of a web element starts with "h", its content should be a subsection title; we scrape that content into a string variable "myKey". If the tag name starts with "p" or "ul", its content should be either a paragraph or a set of bullet points under the current subsection title; we scrape that content and append it to a string variable "myValue_total". Once we meet the next subsection title, the program has appended all paragraphs and bullet points under the current subsection title into "myValue_total". At that point, we insert the key-value pair (the current subsection title as the key, and all the paragraphs and bullet points under it as the value) into the dictionary. After this, we replace the key, which is the current subsection title, with the next subsection title, and repeat the above steps.
for i in range(len(articleObjects)):
    tagName = articleObjects[i].tag_name
    if tagName.startswith("h"):
        if myKey:
            articleDictionary[myKey] = myValue_total
            myKey = ""
            myValue_total = ""
        myKey = articleObjects[i].get_attribute("innerText")
    if tagName.startswith("p"):
        myValue = articleObjects[i].get_attribute("innerText")
        myValue_total = myValue_total + myValue
    if tagName.startswith("ul"):
        myBullets = articleObjects[i].find_elements_by_tag_name('li')
        for j in range(len(myBullets)):
            myBullet = myBullets[j].get_attribute("innerText")
            myValue_total = myValue_total + myBullet
if myKey:
    # store the last subsection, which no following header flushes into the dictionary
    articleDictionary[myKey] = myValue_total
driver.close()

After the loop is done, we have scraped all the subsections separately and stored them in a dictionary. Finally, we just need to loop over each key-value pair in this dictionary and concatenate the contents together as we go:
article = ''
for key, value in articleDictionary.items():
    article = article + key + '\n\n' + value + '\n\n***************\n\n'
print(article)

NoSuchElementException
When a web element is not found, Selenium throws a NoSuchElementException. The reason for a NoSuchElementException can be any of the following:
- The way of locating a web element we have adopted doesn't identify any element in the HTML DOM.
- The way of locating a web element we have adopted doesn't uniquely identify the desired element in the HTML DOM and currently finds some other hidden/invisible element.
- The way of locating a web element we have adopted is unable to identify the desired element because it is not within the browser's viewport.
- The way of locating a web element we have adopted identifies the element, but the element is invisible due to the presence of the attribute style="display: none;".
- The WebElement we are trying to locate is within an <iframe> tag.
- The WebDriver instance is looking for the WebElement before the element is present/visible within the HTML DOM.
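When it is unclear which of these causes applies, a quick diagnostic sketch like the following can help narrow things down (the XPath and file name are placeholders):

from selenium.common.exceptions import NoSuchElementException

try:
    elem = driver.find_element_by_xpath("element_xpath")
except NoSuchElementException:
    # Snapshot what the browser actually sees at this moment.
    driver.save_screenshot("debug.png")
    print(driver.current_url)        # confirms where the driver actually navigated
    print(len(driver.page_source))   # a tiny page source suggests the page has not finished loading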
The solution to the NoSuchElementException can be any of the following:
- When the element we want to locate may not exist in the DOM, use a try-except exception handler to avoid termination of the program:

from selenium.common.exceptions import NoSuchElementException

try:
    elem = driver.find_element_by_xpath("element_xpath")
    elem.click()
except NoSuchElementException:
    pass

This solution addresses inconsistency in the DOM among seemingly identical pages.
- When the page loads, we may for some reason be taken to the bottom of the page, while the element we need to scrape is at the top of the page and thus out of view. In this situation, we can locate the element in the DOM first and then use the execute_script() method to scroll the element into view:
elem = driver.find_element_by_xpath("element_xpath")
driver.execute_script("arguments[0].scrollIntoView();", elem)

- In case the element has the attribute style="display: none;", remove the attribute through the execute_script() method:
elem = driver.find_element_by_xpath("element_xpath")
driver.execute_script("arguments[0].removeAttribute('style')", elem)
elem.send_keys("text_to_send")
- Adopt a way of locating the web element that uniquely identifies the desired WebElement. The preferable method is find_element_by_id(), since the id attribute uniquely identifies a web element.
- To check whether the element is within an <iframe>, traverse up the HTML to locate the corresponding <iframe> tag and use the switch_to.frame() method to shift to the desired iframe through any of the following approaches:

driver.switch_to.frame("iframe_name")
driver.switch_to.frame("iframe_id")
driver.switch_to.frame(1)  # 1 represents the frame index

We can switch back to the main frame by using one of the following methods:
driver.switch_to.default_content()
driver.switch_to.parent_frame()

A better way to switch frames is to use WebDriverWait() for the availability of the intended frame, with expected_conditions set to frame_to_be_available_and_switch_to_it, as in the following examples:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "id_of_iframe")))            # through frame ID
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, "name_of_iframe")))        # through frame name
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "xpath_of_iframe")))      # through frame XPath
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "css_of_iframe"))) # through frame CSS selector

- If the element is not present/visible in the HTML DOM immediately, use WebDriverWait with expected_conditions set to the proper method, as follows:
- To wait for presence_of_element_located:

element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "element_xpath")))
- To wait for visibility_of_element_located:

element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "element_css")))
- To wait for element_to_be_clickable:

element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "element_link_text")))

Source: https://iqss.github.io/dss-webscrape/finding-web-elements.html