HTML Parsing in Python 3.4 using LXML

LXML is a compact library for parsing HTML and XML without resorting to regular expressions. It installs easily with pip and works on both Python 2 and Python 3. As an example, let's extract the token and expires form values from the NYTimes login page.

Installation of LXML


# Install lxml using pip3
pip3 install lxml

# Verify it
pip3 list
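
Scanning the pip3 list output works, but importing the module from Python is a more direct check. A minimal sketch, assuming the install above succeeded; the variable name version is only illustrative:

```python
# Confirm that lxml can be imported and report its version.
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed release.
version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in version))
```

If the import raises ImportError, lxml was most likely installed for a different Python interpreter than the one running the script.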

Using LXML


# Import the LXML HTML parser and the requests library
import lxml.html
import requests

# Use the requests library to fetch the login page
response = requests.get('https://myaccount.nytimes.com/auth/login/?URI=http://www.nytimes.com/2014/09/13/opinion/on-long-island-a-worthy-plan-for-coastal-flooding.html?partner=rss')

# Create an HTML tree from the raw response bytes
htmltree = lxml.html.document_fromstring(response.content)

# xpath() returns a list of matching attribute values.
# Take the first match, or None if the field is missing.
token_matches = htmltree.xpath("//input[@name='token']/@value")
token_val = token_matches[0] if token_matches else None

# Use XPath to get the expires value in the same way
expires_matches = htmltree.xpath("//input[@name='expires']/@value")
expires_val = expires_matches[0] if expires_matches else None

# Print both values
print(token_val)
print(expires_val)

Result

If all went well, you should see something like this in your terminal (the exact values will differ):


0f5d2c48c813aeaaccf1bc3e68fbda53dd691bca99fc8d27e864b041e534cc9f1c8a837cab3f9e70a5fc1852097f23ecd67cc58b29a2b654ea7b925e91b0addf4726ed43bbe82baf6e8c0f179a2198362fa55dc724cebb9f41f794bee6ec767410aafdfba9495716e059d649ee2c68edc82131f1f5b08681024d881fe38920c7ea8ca44c4b4a190122718f2123238b76d758825d422aeda868942f0d17c331d157e2130e58c97d61a5aa24399b88bcedfa910000c68fd66415f96aea74f44731a1e8c92cadb747bc77bdeacdbc943fa483aa1708617400ee2255f63f6a768f5d701444db2fa484928719c52bb943a5264ec96175e9f06572717343282f89d9de
1414572834
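
The live login page may no longer contain these fields, so it can help to try the same XPath lookups against a small inline HTML string. The following is a minimal sketch under stated assumptions: the form markup and its values are made-up placeholders (not real NYTimes data), and first_value is a hypothetical helper introduced only for illustration.

```python
import lxml.html

# Made-up HTML standing in for the login page; the values are placeholders.
SAMPLE_HTML = """
<html><body>
  <form method="post">
    <input type="hidden" name="token" value="abc123">
    <input type="hidden" name="expires" value="1234567890">
  </form>
</body></html>
"""


def first_value(tree, field_name):
    """Return the value attribute of the first <input> with this name, or None."""
    # $name is an XPath variable; lxml fills it in from the keyword argument.
    matches = tree.xpath("//input[@name=$name]/@value", name=field_name)
    return matches[0] if matches else None


tree = lxml.html.document_fromstring(SAMPLE_HTML)
token_val = first_value(tree, "token")
expires_val = first_value(tree, "expires")
missing_val = first_value(tree, "no_such_field")

print(token_val)
print(expires_val)
print(missing_val)
```

Returning None for a missing field makes it possible to tell "the page layout changed" apart from a network failure, instead of hitting an undefined variable later on.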

About Ali Gajani

Hi. I am Ali Gajani. I started Mr. Geek in early 2012 as a result of my growing enthusiasm and passion for technology. I love sharing my knowledge and helping out the community by creating useful, engaging and compelling content. If you want to write for Mr. Geek, just PM me on my Facebook profile.


By Ali Gajani
