
Python: Why does bs4 treat "image:loc" as "loc"? Or: how to turn on a "strict mode".

HotBurek ::

Good morning.

So, a new day, a new challenge.

This time it involves bs4 (Beautiful Soup, Python).

I want to get all elements named "loc" out of an XML document. The problem is that bs4 treats (i.e. finds) elements named "image:loc" as if they were "loc".

In short, how do I make bs4 work in a "strict mode" for the name property?

Sample Python:

import bs4

# xml_text holds the sitemap XML (see the sample below)
soup = bs4.BeautifulSoup(xml_text, features="xml")

# find_all() with no arguments returns every tag in the document
for item in soup.find_all():
    print(item.name, item.text)

Sample XML:

https://www.lindtusa.com/media/sitemap/...

And as a screenshot, for the last entry in the XML above:


hidetr ::

from bs4 import BeautifulSoup

def main():
    with open('test.xml', 'r', encoding='utf8') as _f:
        xml_content = _f.read()

    soup = BeautifulSoup(xml_content, 'xml')

    url_elements = soup.find_all('url')

    for url_element in url_elements:
        # find() returns only the first matching descendant of this <url>
        loc_element = url_element.find('loc')

        if loc_element:
            print(loc_element.text)

if __name__ == "__main__":
    main()

I couldn't find a strict mode option anywhere. This should work: first iterate over the url elements, then over the locs (I assume you only want the locs that belong to the urls).

HotBurek ::

OK, this solution works in some contexts. The find() function returns only the first element that is "like loc".

If the source is the sample below, it does not work correctly.

XML source sample:
xml_text = """
<urlset>
    <url>
        <lastmod>2023-12-09T05:01:24+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.2</priority>
        <image:image>
            <image:loc>https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1_1.jpg</image:loc>
            <image:title>LINDOR Brownies</image:title>
        </image:image>
        <PageMap xmlns="http://www.google.com/schemas/sitemap-pagemap/1.0">
            <DataObject type="thumbnail">
                <Attribute name="name" value="LINDOR&#x20;Brownies"/>
                <Attribute name="src" value="https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1.jpg"/>
            </DataObject>
        </PageMap>
        <loc>https://www.lindtusa.com/recipes/lindor-brownies</loc>
    </url>
</urlset>""";
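
One way to get strict name matching without bs4 is sketched below; it is a minimal illustration, not something posted in the thread. Python's standard xml.sax parser leaves namespace processing off by default, so a handler sees each tag name exactly as written and "image:loc" never collapses to "loc". The names ExactNameCollector and exact_elements are hypothetical, and the XML is a trimmed version of the sample above.

```python
import xml.sax


class ExactNameCollector(xml.sax.ContentHandler):
    """Collect the text of elements whose raw qualified name equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.values = []
        self._buffer = None  # text chunks while inside a wanted element

    def startElement(self, name, attrs):
        # With namespace processing off, `name` is the tag name as written,
        # so "image:loc" arrives as "image:loc", not "loc".
        if name == self.wanted:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.wanted and self._buffer is not None:
            self.values.append("".join(self._buffer).strip())
            self._buffer = None


def exact_elements(xml_text, wanted="loc"):
    """Return the text of every element named exactly `wanted`."""
    handler = ExactNameCollector(wanted)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.values


SAMPLE = """
<urlset>
    <url>
        <image:image>
            <image:loc>https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1_1.jpg</image:loc>
        </image:image>
        <loc>https://www.lindtusa.com/recipes/lindor-brownies</loc>
    </url>
</urlset>"""

if __name__ == "__main__":
    for value in exact_elements(SAMPLE):
        print(value)
```

The sketch assumes a wanted element never contains another element of the same name; nested matches would need a stack instead of a single buffer.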

Zigerion ::

from bs4 import BeautifulSoup

def main():
    with open('test2.xml', 'r', encoding='utf8') as _f:
        xml_content = _f.read()

    soup = BeautifulSoup(xml_content, features="xml")

    url_elements = soup.find_all("url")

    for url_element in url_elements:
        for child in url_element:
            if child.name == 'loc':
                print(child.text)
                break  

if __name__ == "__main__":
    main()

HotBurek ::

This solution works, but the same result can also be achieved by passing recursive=False to find().

It's true that, in principle, all XML files should be well formed and that this (a "loc" and an "image:loc" on the same level) should never happen, but still.
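
A minimal sketch of the recursive=False variant, assuming bs4 and lxml are installed. The helper name direct_locs is hypothetical and the XML is a trimmed version of the earlier sample. The search is limited to the direct children of each url element, so a "loc" nested under image:image is skipped; by the thread's own account it does nothing for the case where both names sit on the same level.

```python
from bs4 import BeautifulSoup

SAMPLE = """
<urlset>
    <url>
        <image:image>
            <image:loc>https://www.lindtusa.com/media/recipe/lindt-the-season-hero-make-lindor-goody-bar_1_1.jpg</image:loc>
        </image:image>
        <loc>https://www.lindtusa.com/recipes/lindor-brownies</loc>
    </url>
</urlset>"""


def direct_locs(xml_text):
    """Return the text of the first direct <loc> child of every <url>."""
    soup = BeautifulSoup(xml_text, features="xml")
    found = []
    for url in soup.find_all("url"):
        # recursive=False restricts the search to direct children of <url>
        loc = url.find("loc", recursive=False)
        if loc is not None:
            found.append(loc.text)
    return found


if __name__ == "__main__":
    for value in direct_locs(SAMPLE):
        print(value)
```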

The ultimate XML, from which the value of the "loc" element has to be extracted (in this case, 2):

xml_text = """
<urlset>
    <url>
        <image:loc>1</image:loc>
        <loc>2</loc>
        <image:loc>3</image:loc>
    </url>
</urlset>""";
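
For the sibling case, a namespace-aware parser can tell the two names apart. The sketch below swaps bs4 for the standard library's xml.etree.ElementTree and rests on one loud assumption: the namespace declarations on urlset are not in the sample above and were added here, using the URIs that sitemap files normally declare. Without them ElementTree rejects the undeclared "image" prefix with a ParseError. The names page_locs, image_locs and the "sm" prefix are hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used only for the lookups below; "sm" is an arbitrary label.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

# Assumption: same content as the sample above, plus namespace declarations.
DECLARED_XML = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <image:loc>1</image:loc>
        <loc>2</loc>
        <image:loc>3</image:loc>
    </url>
</urlset>"""


def page_locs(xml_text):
    """Text of <loc> elements in the sitemap namespace, directly under <url>."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]


def image_locs(xml_text):
    """Text of <image:loc> elements directly under <url>."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/image:loc", NS)]


if __name__ == "__main__":
    print(page_locs(DECLARED_XML))
    print(image_locs(DECLARED_XML))
```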
