
Python Scrape Value Between Static Html Tags Containing Static Text

This is my first post in this forum, and I believe the forum can answer my basic question. My requirement consists of two steps. In the first step, I need to ext

Solution 1:

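import re

# Sample input: label/value pairs wrapped in SPAN tags, all on one line.
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
           """
# Non-greedy (.*?) stops at the first closing </SPAN> instead of the last one.
pattern = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">(.*?)</SPAN>'
# Strip the surrounding asterisks from each captured value.
print([a.strip("*") for a in re.findall(pattern, data)])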

Output:

['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
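
When all the tags sit on one line, the capture group in a pattern like this behaves differently depending on whether it is greedy (.*) or non-greedy (.*?). The sketch below compares the two on a made-up one-line input modelled on the sample data. The names prefix, greedy and lazy are only for illustration.

```python
import re

# Hypothetical one-line input with two DOCUMENT-TYPE entries.
data = (
    '<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>'
    '<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>'
    '<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>'
)

# Shared start of both patterns: the label span plus the opening value tag.
prefix = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">'

# Greedy: .* runs to the last </SPAN> on the line, so everything is one match.
greedy = re.findall(prefix + r'(.*)</SPAN>', data)

# Non-greedy: .*? stops at the first </SPAN> after each value.
lazy = re.findall(prefix + r'(.*?)</SPAN>', data)

print(len(greedy))  # 1
print(lazy)         # ['**Paid Death Notice**', 'Paid Notice: Deaths THORNTON, ROBERT']
```

The greedy version returns a single string that swallows the intermediate tags, which is why the non-greedy form is usually the safer choice for this kind of input.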

Solution 2:

Code:

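from bs4 import BeautifulSoup

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
       <SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""

# The 'lxml' parser requires the lxml package; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(data, 'lxml')
# find() returns only the first span with class c8, i.e. the first label.
doc = soup.find('span', class_='c8')
print(doc.text)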

Result:

DOCUMENT-TYPE:

Solution 3:

You can use the findall function from the re module with a regular expression.

Example:

import re

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
       <SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
       <SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
       """
# Join the lines so a label/value pair split across lines still matches.
data = data.replace('\n', ' ')
# Group 1 captures the label (text before the colon), group 2 the value.
res = re.findall(
    r'<SPAN *CLASS="c8"> *([^:<]+): *</SPAN> *<SPAN *CLASS="c2">([^<]*)</SPAN>',
    data,
    re.IGNORECASE,
)
print(res)
print("\n".join("%s: %s" % (key, value) for key, value in res))

Output:

[('DOCUMENT-TYPE', '**Paid Death Notice**'), ('PUBLICATION-TYPE', 'Newspaper'), ('DOCUMENT-TYPE', 'Paid Notice: Deaths THORNTON, ROBERT')]
DOCUMENT-TYPE: **Paid Death Notice**
PUBLICATION-TYPE: Newspaper
DOCUMENT-TYPE: Paid Notice: Deaths THORNTON, ROBERT

The res variable holds a list of (key, value) tuples, so you can read every key and value directly from it. If you would like to convert the result to a dictionary, you can use this code:

res_dict = dict(res)
print(res_dict)

In that case, however, the first 'DOCUMENT-TYPE' occurrence will be overwritten by the last one:

{'DOCUMENT-TYPE': 'Paid Notice: Deaths THORNTON, ROBERT', 'PUBLICATION-TYPE': 'Newspaper'}
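
If every value for a repeated key needs to survive, one option is to collect the values into lists rather than calling dict() on the pairs. The sketch below reuses the regular expression from this solution. The helper name group_pairs and the use of collections.defaultdict are illustrative choices, not part of the original answer.

```python
import re
from collections import defaultdict

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>"""

# Same pattern as above: group 1 is the label, group 2 is the value.
PAIR_RE = re.compile(
    r'<SPAN *CLASS="c8"> *([^:<]+): *</SPAN> *<SPAN *CLASS="c2">([^<]*)</SPAN>',
    re.IGNORECASE,
)

def group_pairs(html):
    """Map each label to the list of all values found for it (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in PAIR_RE.findall(html.replace("\n", " ")):
        grouped[key].append(value)
    return dict(grouped)

grouped = group_pairs(data)
print(grouped["DOCUMENT-TYPE"])
# ['**Paid Death Notice**', 'Paid Notice: Deaths THORNTON, ROBERT']
```

Each key then maps to a list, so both 'DOCUMENT-TYPE' values are kept in the order they appear in the input.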

Solution 4:

Do not mix regexes and BeautifulSoup. BS has enough methods of its own to navigate the DOM tree:

# doc is a c8 span, e.g. the result of soup.find('span', class_='c8').
if doc.text.startswith('DOCUMENT-TYPE'):
    print(doc.find_next_sibling().text)

# prints **Paid Death Notice**

You can also iterate over all tags with a particular attribute:

for tag in soup.find_all('span', class_='c8'):
    print(tag.text)

# DOCUMENT-TYPE:
# PUBLICATION-TYPE:
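
Putting the two ideas together, a minimal sketch of a regex-free extraction could walk every c8 label and read the c2 span that follows it. The function name extract_pairs is hypothetical, and the built-in 'html.parser' is used here only to avoid the extra lxml dependency. Note that find_next_sibling returns the next matching sibling, so this assumes each label is followed by its own value span.

```python
from bs4 import BeautifulSoup

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>"""

def extract_pairs(html):
    """Return (label, value) tuples, pairing each c8 span with the next c2 sibling."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for label_tag in soup.find_all("span", class_="c8"):
        value_tag = label_tag.find_next_sibling("span", class_="c2")
        if value_tag is None:
            continue  # label without any following value span
        # Drop surrounding whitespace and the trailing colon from the label.
        label = label_tag.get_text().strip().rstrip(":")
        # Drop surrounding whitespace and the asterisks around the value.
        value = value_tag.get_text().strip().strip("*")
        pairs.append((label, value))
    return pairs

pairs = extract_pairs(data)
# Keep only the values whose label is DOCUMENT-TYPE.
document_types = [value for label, value in pairs if label == "DOCUMENT-TYPE"]
print(document_types)
# ['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
```

Filtering on the label afterwards keeps the parsing step generic, so the same pairs list can serve other labels such as PUBLICATION-TYPE.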
