Python: Why does bs4 treat "image:loc" as "loc"? Or, how to enable "strict mode".
HotBurek ::
Good morning.
So, new day, new challenge.
This time with bs4 (Beautiful Soup, Python).
I want to get all elements named "loc" from an XML document. The problem is that bs4 treats (that is, finds) elements named "image:loc" as if they were "loc".
In short, how do I make bs4 work in "strict mode" for element names?
Sample Python:
import bs4

soup = bs4.BeautifulSoup(xml_text, features="xml")
# Print the name and text of every element in the document
for item in soup.find_all():
    print(item.name, item.text)
Sample XML:
https://www.lindtusa.com/media/sitemap/...
Also as an image, for the last entry in the XML above:
hidetr ::
from bs4 import BeautifulSoup

def main():
    with open('test.xml', 'r', encoding='utf8') as _f:
        xml_content = _f.read()
    soup = BeautifulSoup(xml_content, 'xml')
    url_elements = soup.find_all('url')
    for url_element in url_elements:
        loc_element = url_element.find('loc')
        if loc_element:
            print(loc_element.text)

if __name__ == "__main__":
    main()
I couldn't find a strict mode option anywhere. This should work: iterate over the url elements first, then over the locs. (I assume you only want the locs that belong to the urls.)
HotBurek ::
OK, this solution works in some contexts. The find() function returns only the first element that is "like loc".
If the source is the sample below, it does not work correctly.
XML source sample:
xml_text = """
<urlset>
  <url>
    <lastmod>2023-12-09T05:01:24+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.2</priority>
    <image:image>
      <image:loc>https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1_1.jpg</image:loc>
      <image:title>LINDOR Brownies</image:title>
    </image:image>
    <PageMap xmlns="http://www.google.com/schemas/sitemap-pagemap/1.0">
      <DataObject type="thumbnail">
        <Attribute name="name" value="LINDOR Brownies"/>
        <Attribute name="src" value="https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1.jpg"/>
      </DataObject>
    </PageMap>
    <loc>https://www.lindtusa.com/recipes/lindor-brownies</loc>
  </url>
</urlset>"""
Zigerion ::
from bs4 import BeautifulSoup

def main():
    with open('test2.xml', 'r', encoding='utf8') as _f:
        xml_content = _f.read()
    soup = BeautifulSoup(xml_content, features="xml")
    url_elements = soup.find_all("url")
    for url_element in url_elements:
        # Look only at direct children of <url>, not deeper descendants
        for child in url_element:
            if child.name == 'loc':
                print(child.text)
                break

if __name__ == "__main__":
    main()
HotBurek ::
This solution works, but the same can also be achieved by passing recursive=False to find().
Granted, in principle all XML files should be properly formed, and this (having "loc" and "image:loc" at the same level) should never happen, but still.
The ultimate XML, from which the value of the "loc" element has to be extracted (in this case, 2):
xml_text = """
<urlset>
  <url>
    <image:loc>1</image:loc>
    <loc>2</loc>
    <image:loc>3</image:loc>
  </url>
</urlset>"""
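For the sibling case above, one possible route outside bs4 is the standard library's xml.etree.ElementTree, which matches elements by namespace-qualified name rather than by local name. The following is only a minimal sketch, not a bs4 "strict mode". It assumes the real sitemap declares the usual sitemap and image namespaces on <urlset>; the two URIs below are assumptions, since the samples in this thread omit the declarations, and page_locs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the samples in this thread do not show them.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# The "ultimate" sample, with the assumed namespace declarations added.
xml_text = f"""<urlset xmlns="{SITEMAP_NS}" xmlns:image="{IMAGE_NS}">
  <url>
    <image:loc>1</image:loc>
    <loc>2</loc>
    <image:loc>3</image:loc>
  </url>
</urlset>"""


def page_locs(xml_text):
    """Return the text of every sitemap <loc> that is a direct child of <url>."""
    root = ET.fromstring(xml_text)
    ns = {"sm": SITEMAP_NS}
    # "sm:loc" expands to {SITEMAP_NS}loc, so image:loc is never matched.
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]


print(page_locs(xml_text))  # ['2']
```

Note that ElementTree refuses to parse the samples exactly as posted here, because the image prefix is not bound to any namespace in them; with the full sitemap, which should carry the declarations, that is not expected to be a problem.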