Python: Why does bs4 treat "image:loc" as "loc"? Or, how to turn on "strict mode".
HotBurek ::
Good morning.
A new day, a new challenge.
This time it involves bs4 (Beautiful Soup, Python).
From an XML document I want to get all elements named "loc". The problem is that bs4 treats (that is, finds) elements named "image:loc" as if they were "loc".
In short, how do I make bs4 work in "strict mode" for name properties?
Sample Python:
import bs4

# xml_text holds the sitemap XML (see the sample XML below)
soup = bs4.BeautifulSoup(xml_text, features="xml")
# Print the name and text of every element in the document
for item in soup.find_all():
    print(item.name, item.text)
Sample XML:
https://www.lindtusa.com/media/sitemap/...
Also as an image, for the last example from the XML above:
hidetr ::
from bs4 import BeautifulSoup


def main():
    with open('test.xml', 'r', encoding='utf8') as _f:
        xml_content = _f.read()
    soup = BeautifulSoup(xml_content, 'xml')
    url_elements = soup.find_all('url')
    for url_element in url_elements:
        loc_element = url_element.find('loc')
        if loc_element:
            print(loc_element.text)


if __name__ == "__main__":
    main()
I couldn't find a strict-mode option anywhere. This should work: iterate over the url elements first, then over their locs. (I assume you only want the locs of the urls.)
HotBurek ::
OK, this solution works in some contexts. The find() function returns only the first element that is "like loc".
If the source is the example below, it does not work correctly.
XML source sample:
xml_text = """
<urlset>
  <url>
    <lastmod>2023-12-09T05:01:24+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.2</priority>
    <image:image>
      <image:loc>https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1_1.jpg</image:loc>
      <image:title>LINDOR Brownies</image:title>
    </image:image>
    <PageMap xmlns="http://www.google.com/schemas/sitemap-pagemap/1.0">
      <DataObject type="thumbnail">
        <Attribute name="name" value="LINDOR Brownies"/>
        <Attribute name="src" value="https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1.jpg"/>
      </DataObject>
    </PageMap>
    <loc>https://www.lindtusa.com/recipes/lindor-brownies</loc>
  </url>
</urlset>"""
Zigerion ::
from bs4 import BeautifulSoup


def main():
    with open('test2.xml', 'r', encoding='utf8') as _f:
        xml_content = _f.read()
    soup = BeautifulSoup(xml_content, features="xml")
    url_elements = soup.find_all("url")
    for url_element in url_elements:
        for child in url_element:
            if child.name == 'loc':
                print(child.text)
                break


if __name__ == "__main__":
    main()
HotBurek ::
This solution works, but the same thing can also be done by passing recursive=False to find().
It's true that, in principle, all XML files should be well formed, and that this (having "loc" and "image:loc" at the same level) should never happen, but still.
The ultimate XML, from which the value of the "loc" element has to be extracted (in this case it is 2):
xml_text = """
<urlset>
  <url>
    <image:loc>1</image:loc>
    <loc>2</loc>
    <image:loc>3</image:loc>
  </url>
</urlset>"""
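For comparison, a minimal sketch of a namespace-aware lookup using the standard library's xml.etree.ElementTree instead of bs4 (a swapped-in technique, not something proposed in the thread). It assumes the XML declares its namespaces, which the trimmed samples above do not; the two namespace URIs are the ones sitemap files commonly declare, and the "sm" prefix and the page_locs() helper are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (commonly declared in sitemap files).
# The "sm" prefix is an arbitrary local alias for the default sitemap namespace.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

# Same shape as the "ultimate" XML above, but with namespaces declared (assumption).
xml_text = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <image:loc>1</image:loc>
    <loc>2</loc>
    <image:loc>3</image:loc>
  </url>
</urlset>"""


def page_locs(text):
    """Return the text of every sitemap <loc> that is a direct child of <url>."""
    root = ET.fromstring(text)
    # "sm:loc" only matches elements in the sitemap namespace,
    # so <image:loc> elements are never returned here.
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]


print(page_locs(xml_text))  # ['2']
```

Because elements are matched on namespace plus local name, "loc" and "image:loc" stay distinct even when they sit at the same level. Note that ElementTree rejects XML that uses an undeclared prefix, so this only applies to sources that include the xmlns declarations.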
