
Cloudflare explains the causes of the outage

Slo-Tech - In a lengthy post, Cloudflare has explained what went wrong yesterday, when a good part of the internet ground to a halt. The outage began yesterday at 12:20 Slovenian time, when numerous sites that use Cloudflare's infrastructure for protection went down. Cloudflare stressed that the outage was not the result of an attack or any other deliberate activity; it was a technical problem. It was caused by a permissions update in a database, which affected the results of queries. Most of the issues were resolved by 15:30, and the system was fully operational again at 18:06.

At first the problems looked like the result of a large DDoS attack, but the real cause was soon found. The culprit was the Bot Management service, which, among other things, scores every request on the network with the probability that it comes from a bot. For this it needs a file containing characteristic bot features and other information the system has learned so far. The file is refreshed every few minutes and then propagated through the network. Because of the botched permissions change, however, the file grew to twice its size, as it contained duplicated rows. That set off all the subsequent trouble, because the system did not expect a file that large. The servers started returning 5xx errors.
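
As a rough illustration of the failure mode just described, here is a minimal sketch, with assumed names, limits and file format rather than Cloudflare's actual code: the consuming service allows for a fixed number of bot features, so a file with duplicated rows blows past that limit and every request handled with the bad configuration ends up answered with a 5xx error.

    // Minimal sketch of the described failure mode; the limit, file format and
    // names are assumptions, not Cloudflare's internals.
    const MAX_FEATURES: usize = 200; // assumed hard limit on bot features

    #[derive(Debug)]
    struct TooManyFeatures {
        got: usize,
        max: usize,
    }

    fn load_feature_file(contents: &str) -> Result<Vec<String>, TooManyFeatures> {
        let features: Vec<String> = contents
            .lines()
            .filter(|l| !l.trim().is_empty())
            .map(|l| l.to_string())
            .collect();
        if features.len() > MAX_FEATURES {
            // In the proxy, an error on this path ends up surfacing to clients as HTTP 5xx.
            return Err(TooManyFeatures { got: features.len(), max: MAX_FEATURES });
        }
        Ok(features)
    }

    fn main() {
        // Duplicated rows roughly double the file, as in the incident.
        let doubled = "feature_a\nfeature_b\n".repeat(150);
        match load_feature_file(&doubled) {
            Ok(f) => println!("loaded {} features", f.len()),
            Err(e) => eprintln!("config rejected: {:?}", e),
        }
    }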

A detailed description and timeline are available on the website.


12 comments

manic ::

Error code 522, just like right now? :O
Techno inside!!!

V-i-p ::

Just this morning at 09:00 I got a notification: "We would like to inform you that we have received notice from Bankart that there are disruptions in processing payments with Flik at online merchants. The reason for the disruptions is an issue at the external provider Cloudfare. The cause is being investigated and resolved."
What you can do today, don't put off until tomorrow. Better yet, say you already did it yesterday!

sbawe64 ::

The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a "feature file" used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.


"Somebody" was playing around with permissions in the DB?

Why?
2020 is new 1984
Corona World order

crniangeo ::

sbawe64 said:


"Somebody" was playing around with permissions in the DB?

Why?

Probably somebody was skimping on staff, so they handed someone the permissions, and then that person worked on the database with the help of ChatGPT :)

And when ChatGPT went down, it couldn't help him with the troubleshooting anymore :D
Convictions are more dangerous foes of truth than lies.

Glugy ::

First one corporation's cloud buckled, and now another corporation's cloud, barely a month later. Coincidence?

codeMonkey ::

So Rust isn't to blame after all :))

I briefly caught some attacks on Rust enthusiasts over this outage, along the lines of how come Rust didn't prevent it; as I understood it, someone didn't think it was worth handling one error case and simply used unwrap().
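
Roughly the kind of pattern that criticism is aimed at, as a hypothetical sketch (the names and the limit are made up, this is not the actual FL2 source): unwrap() aborts the worker when the file is larger than expected, whereas handling the Result would let the process keep serving with the previous feature set.

    const MAX_FEATURES: usize = 200; // assumed limit, for illustration only

    fn load(lines: &[&str]) -> Result<Vec<String>, String> {
        if lines.len() > MAX_FEATURES {
            return Err(format!("{} features, limit is {}", lines.len(), MAX_FEATURES));
        }
        Ok(lines.iter().map(|l| l.to_string()).collect())
    }

    fn main() {
        let oversized = vec!["feature"; 2 * MAX_FEATURES];

        // The criticized style: the error case is treated as unreachable.
        // let features = load(&oversized).unwrap(); // would panic here

        // More defensive: handle the error and keep the old feature set.
        match load(&oversized) {
            Ok(features) => println!("loaded {} features", features.len()),
            Err(why) => eprintln!("keeping previous feature set: {}", why),
        }
    }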


Oberyn ::

So, one lousy file with duplicated rows took down a third of the world's internet?
And that in a world that churns out a new petabyte of data every second?
Well, I honestly tried, but it seems I didn't manage it... the damn thing still works...

LiquidAI ::

This event exposed failures at multiple levels within the corporation.

    The permissions change in ClickHouse. Testing and QA on a staging environment would have caught the error.


In production-quality code, most Rustaceans choose expect rather than unwrap and give more context about why the operation is expected to always succeed. That way, if your assumptions are ever proven wrong, you have more information to use in debugging.


    The CEO also said that they manually inserted the file into the queue, because they have no automated process that could roll the "feature file" back to the previous known-safe version.

As with any incident response there were a number of theories of the cause we were working in parallel. The feature file failure was one identified as potential in the first 30 minutes. However, the theory that seemed the most plausible based on what we were seeing (intermittent, initially concentrated in the UK, spike in errors for certain API endpoints) as well as what else we'd been dealing with (a bot net that had escalated DDoS attacks from 3Tbps to 30Tbps against us and others like Microsoft over the last 3 months). We worked multiple theories in parallel. After an hour we ruled out the DDoS theory. We had other theories also running in parallel, but at that point the dominant theory was that the feature file was somehow corrupt. One thing that made us initially question the theory was nothing in our changelogs seemed like it would have caused the feature file to grow in size. It was only after the incident that we realized the database permissions change had caused it, but that was far from obvious. Even after we identified the problem with the feature file, we did not have an automated process to role the feature file back to a known-safe previous version. So we had to shut down the reissuance and manually insert a file into the queue. Figuring out how to do that took time and waking people up as there are lots of security safeguards in place to prevent an individual from easily doing that. We also needed to double check we wouldn't make things worse. The propagation then takes some time especially because there are tiers of caching of the file that we had to clear. Finally we chose to restart the FL2 processes on all the machines that make up our fleet to ensure they all loaded the corrected file as quickly as possible. That's a lot of processes on a lot of machines. So I think best description was it took us an hour for the team to coalesce on the feature file being the cause and then another two to get the fix rolled out.
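
For what it's worth, a sketch of the kind of safeguard the quote says was missing (the names and the validation rule are assumptions, not Cloudflare's tooling): validate each newly generated feature file before it enters the propagation queue, and if it fails, automatically re-enqueue the last known-good version instead of someone having to insert one by hand.

    use std::collections::VecDeque;

    const MAX_FEATURES: usize = 200; // assumed limit, for illustration only

    struct Publisher {
        last_known_good: Option<String>,
        queue: VecDeque<String>, // files waiting to be propagated to the fleet
    }

    impl Publisher {
        // Assumed validation rule: reject files with more rows than expected.
        fn is_valid(file: &str) -> bool {
            file.lines().filter(|l| !l.trim().is_empty()).count() <= MAX_FEATURES
        }

        fn publish(&mut self, candidate: String) {
            if Self::is_valid(&candidate) {
                self.last_known_good = Some(candidate.clone());
                self.queue.push_back(candidate);
            } else if let Some(good) = self.last_known_good.clone() {
                // Automatic rollback: re-queue the previous known-good file instead
                // of requiring a person to insert one into the queue manually.
                self.queue.push_back(good);
            }
        }
    }

    fn main() {
        let mut publisher = Publisher {
            last_known_good: Some("feature_a\nfeature_b\n".to_string()),
            queue: VecDeque::new(),
        };
        // An oversized (duplicated) file is rejected; the old one is queued instead.
        publisher.publish("feature_a\n".repeat(500));
        println!("files queued: {}", publisher.queue.len());
    }
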
65001


LiquidAI ::

    the query is written incorrectly, since it does not explicitly specify the database, so you simply get duplicates from the default and r0 databases
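
If that is right, the fix is mostly a matter of scoping the metadata query, something along these lines (a sketch; the table and column names are assumptions, not the actual Cloudflare schema):

    // ClickHouse's system.columns has one row per (database, table, column), so once
    // the same tables become visible under a second database ("r0"), the unscoped
    // query returns every column twice.
    const UNSCOPED_QUERY: &str =
        "SELECT name, type FROM system.columns \
         WHERE table = 'http_requests_features' ORDER BY name";

    // Constraining the query to a single database keeps the feature list stable even
    // when additional databases are granted to the same account.
    const SCOPED_QUERY: &str =
        "SELECT name, type FROM system.columns \
         WHERE database = 'default' AND table = 'http_requests_features' \
         ORDER BY name";

    fn main() {
        println!("before: {}", UNSCOPED_QUERY);
        println!("after:  {}", SCOPED_QUERY);
    }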


The irony is that they recently rewrote the code in Rust
Over 100 engineers have worked on FL2, and we have over 130 modules. And we're not quite done yet. We're still putting the final touches on the system, to make sure it replicates all the behaviours of FL1.
65001

bm1973 ::

Probably some Indian with vibe coding and AI.

Get used to it, there's going to be more and more of this.

c3p0 ::

"After an hour we ruled out the DDoS theory"

If you need an hour to figure out it's not a DDoS, you have a big problem.

Cr00k ::

they rewrote part of the codebase in Rust, and there you go... :D

