How Do I Extract A Dictionary From A CDATA Section Embedded In HTML?

I used Python to scrape an HTML file, but the data I really need is embedded in a CDATA section. My code: import requests from bs4 import BeautifulSoup url='https://www.website.com' p

Solution 1:

This example prints the string inside the <script> tag, then parses the data with the re and json modules:

import re
import json
from bs4 import BeautifulSoup


txt = '''<div class="react-container" id="react-container">
<script type="text/javascript">
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>
</script>
</div>
'''

soup = BeautifulSoup(txt, 'html.parser')

# select desired <script> tag
script_tag = soup.select_one('#react-container script')

# print contents of the <script> tag:
print(script_tag.string)

# parse the json data inside <script> tag to variable 'data'
data = json.loads( re.search(r'window\.REACT_OPTS = ({.*})', script_tag.string).group(1) )

# print data to screen:
print(json.dumps(data, indent=4))

Prints:

//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>

{
    "components": [
        {
            "component_name": "",
            "props": {},
            "router": true,
            "redux": true,
            "selector": "#react-container",
            "ignoreMissingSelector": false
        }
    ]
}
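
The greedy pattern {.*} works here because the object is the only thing on its line. If the real page has more JavaScript after the assignment on the same line, the match can run past the object and json.loads will fail. A minimal sketch of one alternative, assuming the assigned value is valid JSON: find the assignment with re, then let json.JSONDecoder.raw_decode parse exactly one value and ignore what follows. The helper name extract_js_object and the sample script_text are hypothetical; script_text stands in for script_tag.string from the code above.

```python
import json
import re


def extract_js_object(script_text, variable_name):
    """Return the JSON value assigned to variable_name in script_text, or None."""
    # Locate "<variable_name> = " and stop right before the value starts.
    match = re.search(re.escape(variable_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    try:
        # raw_decode parses a single JSON value beginning at the given index
        # and returns it together with the index where the value ended.
        value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return value


# Hypothetical sample: same shape as the question's data, but with extra
# JavaScript on the same line, which would break the greedy {.*} pattern.
script_text = '''
//<![CDATA[
window.REACT_OPTS = {"components":[{"selector":"#react-container","router":true}]}; var other = {};
// ]]>
'''

opts = extract_js_object(script_text, "window.REACT_OPTS")
print(opts["components"][0]["selector"])
```

The helper returns None when the variable is missing or the value is not valid JSON, so the caller can check the result instead of catching an AttributeError from a failed re.search.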
