
What software (or SQL solution) for faster "full text search"?

HotBurek ::

Good morning.

Here we go, guys and gals: new day, new challenge.

This time it's about the search functionality itself.

Currently I'm on MariaDB with Full-Text Search and this config: innodb_ft_min_token_size = 1, innodb_ft_enable_stopword = 0.

The search itself isn't great, because the input "10.2" gets treated as "10 2". The documentation says this can be "fixed"; I tweaked and tested a few things, but didn't manage to get it working.

https://dev.mysql.com/doc/refman/8.0/en...

The second issue is RAM. I have innodb_buffer_pool_size = 4G and max_connections = 150 set, but usage still climbs to more than twice those 4 gigs.

The third issue is speed. The table currently holds a little under 3 million records, and searching for shorter strings takes multiple seconds. Which isn't good.


Now I'm wondering what to do next. For each row I have a searchstring.

geschenkkarton für 2 x 250 ml
kochbuch vergnügen in der küche
der kern premium lavendel

For each row I also have the combinations:

100
100 ml
für
für 2
für 2 x
für 2 x 100
für 2 x 100 ml
geschenkkarton
geschenkkarton für
geschenkkarton für 2
geschenkkarton für 2 x
geschenkkarton für 2 x 100
geschenkkarton für 2 x 100 ml

And this gets built on the fly. Something could be made out of it. If I've read correctly, the next step would be an "inverted index" or a "postings list".

But I haven't tackled that yet.
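
Just to make that idea concrete for myself, a minimal in-memory sketch in Python could look something like this (rows taken from the examples above; nothing about the real table or schema is assumed):

from collections import defaultdict

rows = {
    0: "geschenkkarton für 2 x 250 ml",
    1: "kochbuch vergnügen in der küche",
    2: "der kern premium lavendel",
}

# inverted index: token -> set of row ids containing it (the "postings list")
index = defaultdict(set)
for row_id, text in rows.items():
    for token in text.split():
        index[token].add(row_id)

def search(query):
    # count how many query tokens each row matches
    hits = defaultdict(int)
    for token in query.split():
        for row_id in index.get(token, ()):
            hits[row_id] += 1
    # most matched tokens first; shorter searchstring breaks ties
    return sorted(hits, key=lambda r: (-hits[r], len(rows[r])))

print(search("der küche"))   # -> [1, 2]

So instead of scanning the searchstrings themselves, a lookup would only touch the postings lists of the entered words.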

I also went and had a look around the internet.

ELASTICSEARCH

Probably the most common choice. Whether it's a good fit, I don't know. It can be driven from Python, which is fine. The fact that it drags fakin' JAVA along with it is something I could maybe even overlook.

https://dylancastillo.co/elasticsearch-...

WHOOSH

Written entirely in Python, which is good. But it looks like development stopped two years ago.

https://github.com/mchaput/whoosh/

https://whoosh.readthedocs.io/en/latest...

Bart de Goede blog post

Found this one along the way. Might be interesting as a helper.

https://bart.degoe.de/building-a-full-t...


That's it for now. In short, I currently have no idea how to proceed - how to get search running fast, returning results in under 1 second. I know that when I had innodb_ft_min_token_size set to 3, it ran like a charm. But then you can't search for shorter words, which seems like nonsense to me.

Basically I'm not looking for a complicated solution. If I had a system of "inverted indexes" and searched only through that, that would already do it. Or rather: the row with the most matched words should be at the top.
I've thought about this before, but I don't yet know how to write an SQL query that adapts to the number of entered words.

So, if the rows were the following:

0 a
1 ab a d
2 bb c
3 c d a

and the search input were "ab a c", the results should come in the following order:

3 c d a
1 ab a d
0 a
2 bb c

The first two have two matches each, with row 3 ranked higher because its string is shorter. Likewise for the last two, each with one match, where row 0 ranks higher because it's shorter.

On top of that I'd want the input "für" to count a match on "für" as 1 and a match on "fur" as 0.8.

And that's it. No complications around languages, stop words, stemming... At least not for now, because speed matters much more than those advanced features.
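
To make that wished-for ranking concrete, a rough pure-Python sketch (no SQL yet; handling the 0.8 case by stripping diacritics is just my assumption of how "für" vs "fur" would be compared):

import unicodedata

def fold(word):
    # "für" -> "fur"
    return unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode()

def score(query, text):
    tokens = text.split()
    folded = [fold(t) for t in tokens]
    total = 0.0
    for q in query.split():
        if q in tokens:
            total += 1.0      # exact match counts 1
        elif fold(q) in folded:
            total += 0.8      # accent-stripped match counts 0.8
    return total

rows = {0: "a", 1: "ab a d", 2: "bb c", 3: "c d a"}
query = "ab a c"
ranked = sorted(rows, key=lambda r: (-score(query, rows[r]), len(rows[r])))
print(ranked)   # -> [3, 1, 0, 2], the same order as in the example above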

kr?en ::

No.

DeeJay ::

I used to use Sphinx for this kind of thing... but I haven't followed what's happening in this area for a long time.
https://sphinxsearch.com/

DamijanD ::

You could look at what's currently happening with Lucene - if I'm not mistaken, that's the engine behind Elasticsearch. I know there's a C# port (Lucene.NET) and you can set it all up yourself. ES, on the other hand, can be used directly as an out-of-the-box service in a Docker container, so Java and everything else stays completely hidden from you...

krho ::

https://github.com/meilisearch/meilisea...
https://github.com/typesense/typesense
https://github.com/manticoresoftware/ma... is a proper fork of the former Sphinx from when they went fully commercial, and AFAIR it speaks the MySQL protocol.

All of these are in C/Rust, which means you won't be burning 500 MB of RAM right at startup like you will with Elastic.


JanBrezov ::

Have you tried innodb_ft_min_token_size = 2? Why would you search by a single character anyway - that returns practically everything.

FireSnake ::

Jesus Christ!

Swimming off in completely the wrong direction again!
Elasticsearch is for when you have tens of terabytes of data.
It is meant purely for searching. Updates are resource-hungry and slow as well - and if you're in the cloud paying for compute, expensive too.
A few million records is nothing.

There is no silver bullet. Full-text search will never be instantaneous.
Caching can be used, with the caveat that the data then isn't necessarily up to date - it depends on the implementation. You also have to know whether that is even a solution at all (depends on the problem).
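
By caching I mean something as simple as this sketch (results can then be up to TTL seconds stale; none of it is tied to any particular database):

import time

TTL = 60          # how many seconds a cached result counts as fresh
_cache = {}       # query string -> (timestamp, results)

def cached_search(query, run_query):
    now = time.time()
    entry = _cache.get(query)
    if entry and now - entry[0] < TTL:
        return entry[1]               # serve the possibly stale copy
    results = run_query(query)        # fall through to the real search
    _cache[query] = (now, results)
    return results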

The opening post is once again full of superfluous ballast, while at the same time missing the kind of detail that would actually explain the problem well enough to understand it.
In short: yes, asking questions is a skill too.

DeeJay ::

Years ago I used Sphinx on a huge database of song lyrics, and search and sorting were instant. Granted, it wasn't terabytes of data, but it was a few hundred GB.

sbawe64 ::

How do Google & the other search engines manage to return results within 1 s (and they have a far bigger 'database' of records)?


sbawe64 ::

This is what phind.com says:


To achieve search results as quickly as Google, which often returns results in under a second, you would need to employ advanced search technologies and optimization techniques. Traditional databases like MySQL are not optimized for the speed required for real-time search engines. Instead, you should consider using specialized search engines or technologies designed for fast data retrieval.

1. **Elasticsearch**: Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning-fast search, fine-tuned relevancy, and powerful analytics that scale with ease. Elasticsearch is widely used for its speed and efficiency in handling large volumes of data and complex queries.

2. **Apache Solr**: Apache Solr is a powerful, open-source search platform built on Apache Lucene. It's designed to provide fast full-text search, hit highlighting, faceted search, dynamic clustering, and rich document handling. Solr is highly scalable and can handle very large datasets, making it suitable for applications that require fast search responses.

3. **Algolia**: Algolia is a search-as-a-service platform that provides developers with the tools to implement search functionality into their applications. It is known for its speed, scalability, and ease of use. Algolia is designed to handle complex search requirements, including filtering, faceting, and geo-search, making it a powerful alternative for applications needing real-time search capabilities.

4. **Sphinx Search**: Sphinx is a full-text search server, designed with performance, relevance (search quality), and integration simplicity in mind. It is incredibly fast and can index and search through millions of documents in near real-time. Sphinx is particularly well-suited for applications that require full-text search functionality with minimal setup and maintenance.

5. **Vespa**: Vespa is an open-source big data processing and serving engine. It is designed to handle large-scale data streams and provides fast and scalable search, recommendation, and personalization. Vespa is particularly useful for applications that require real-time data processing and serving, such as personalized content delivery, recommendation engines, and AI-driven search experiences.

To achieve search results as fast as Google, you would need to:

- **Index Your Data**: Use a search engine or technology that supports indexing your data efficiently. Indexing allows the search engine to quickly locate and retrieve relevant information without having to scan through every record in your database.

- **Optimize Your Queries**: Ensure your queries are optimized for the search engine you are using. This might involve using specific query syntax or features provided by the search engine to improve search performance.

- **Scale Your Infrastructure**: As your data grows, ensure your search engine or infrastructure can scale to handle the increased load. This might involve adding more servers, optimizing your search engine configuration, or using a cloud-based search service that can automatically scale resources based on demand.

- **Use Caching**: Implement caching to store frequently accessed data in memory, reducing the need to query the database for every search request. This can significantly speed up search response times.

- **Leverage Hardware and Network Optimization**: Ensure your servers are optimized for search performance, including fast CPUs, sufficient memory, and high-speed network connections. Additionally, consider using Content Delivery Networks (CDNs) to reduce latency for geographically distributed users.

By adopting these strategies and technologies, you can significantly improve the speed of your search functionality, bringing it closer to the performance of leading search engines like Google.


sbawe64 ::

An alternative to MySQL

Percona Server for MySQL and MySQL are both popular choices for database management systems, but they offer different features and performance characteristics that may make one more suitable than the other for specific use cases.

**Performance and Scalability:**
- Percona Server for MySQL is designed to offer superior performance, scalability, and instrumentation compared to MySQL. It is trusted by many enterprises for its ability to handle demanding workloads efficiently.
- Percona Server includes enhancements such as InnoDB Resource Groups, configurable page sizes, contention-aware transaction scheduling, and improved scalability by splitting mutexes, which are not available in MySQL.
- It also introduces features like improved MEMORY storage engine, configurable fast index creation, and per-column compression for VARCHAR/BLOB and JSON, which are not present in MySQL.

**Security Features:**
- Both MySQL and Percona Server for MySQL support features like SQL roles, SHA-2 based password hashing, and password rotation policy. However, Percona Server provides additional security features such as the ability to enforce encryption and an audit logging plugin, which are not available in MySQL.

**Operational Improvements:**
- Percona Server introduces operational improvements such as atomic DDL, transactional data dictionary, instant DDL, and SET PERSIST, which are not present in MySQL. It also offers features like backup locks, invisible indexes, and the ability to kill idle transactions, enhancing operational efficiency.

**Instrumentation and Troubleshooting:**
- Percona Server extends MySQL's capabilities with more extensive instrumentation and troubleshooting features, including INFORMATION_SCHEMA tables, global performance and status counters, optimizer histograms, and enhanced SHOW INNODB ENGINE STATUS.

**Compatibility and Open Source:**
- Both MySQL and Percona Server for MySQL are open-source and compatible with each other, making it relatively easy to switch between them.

**Considerations:**
- While Percona Server offers several enhancements over MySQL, it's important to note that some of its features are not available in MySQL and vice versa. The choice between the two should be based on the specific requirements of your project, including performance needs, scalability, security, and operational efficiency.
- Percona Server's additional features and enhancements may require additional configuration and tuning to fully leverage their benefits.

In summary, if your project requires enhanced performance, scalability, and operational efficiency, Percona Server for MySQL could be a better choice than MySQL. However, if you're looking for a more straightforward, widely-used solution, MySQL might be more suitable.


---


For faster alternatives to MySQL, several options are available, each with its own set of advantages and use cases. Here's a summary of some of the most notable alternatives:

1. **PostgreSQL**: Known for its high performance and ability to make predefined fast queries, PostgreSQL is a powerful, open-source object-relational database system. It's available on Windows, Linux, and macOS, making it a versatile choice. PostgreSQL is particularly noted for its robust feature set, including support for complex queries and a wide range of data types [1].

2. **Cassandra**: This is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra is particularly well-suited for applications that can't afford to lose data, even in the event of a machine failure. It's also familiar to developers used to SQL, making the transition easier [1].

3. **MongoDB**: A popular NoSQL database, MongoDB is known for its rapid writing and lightning-fast reading capabilities. It's particularly well-suited for handling large amounts of unstructured data due to its master and slave replication capabilities. MongoDB is an open-source database that offers flexibility in querying and data storage, making it a good choice for applications that require high write loads [2].

4. **MariaDB**: MariaDB is a fork of MySQL and is designed to maintain high compatibility with MySQL while also offering new features and performance improvements. It supports a variety of storage engines and is known for its operational agility, full SQL compliance, and ACID compliance. MariaDB is particularly beneficial for analytical, transactional, or hybrid use cases and is used by many major corporations [3].

When choosing an alternative to MySQL, consider the specific requirements of your project, such as the need for scalability, the type of data you're working with (structured vs. unstructured), and the performance characteristics you require (read vs. write performance). Each of these alternatives has its strengths and may be more suitable for certain types of applications or use cases than others.


sbawe64 ::

Something on topic that just came out:

Making It FAST - 1 Billion Row Challenge in Go
a billion rows of data - 1:49


Spura ::

Two years ago I wrote to you that it's pointless to implement your own buggy search engine and that you should use an already-made product. And here we are today.

FireSnake ::

Is it possible to get a link to that claim of yours?

HotBurek ::

Good morning.

Here we go, guys and gals: I solved it with Typesense. krho, thank you for the link. The software looks top of the top.

I did run into one challenge, though. For the input query I have the string "kaviar kreuzung".

If I set the search parameters like this:

search_parameters = {
    "q": "kaviar kreuzung",
    "query_by": "searchname",
    "sort_by": "_text_match:desc,searchname_length:asc"
};

I get only one result, with a score of 1157451471441100921.

But if I set the search parameters like this (filter_by added):

search_parameters = {
    "q": "kaviar kreuzung",
    "query_by": "searchname",
    "sort_by": "_text_match:desc,searchname_length:asc",
    "filter_by": "country:us"
};

I get 14 results (these are the "caviar" results), the highest of which has a score of 578730054645710969.

So:
1157451471441100921
578730054645710969


In SQL I had a simple system: I read the max score, multiplied it by 0.6, and cut off everything with a lower score before output.

I could do something similar here. First a query without the filter, read the score, multiply it by 0.6, then run a second query with the filter applied and, while reading the results, check whether each score reaches 60% of the unfiltered max.

Well.... that's how it could be done.
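
With the Python client it would look roughly like this (the collection name "products" and the connection details are made up, and I'm assuming the per-hit score is exposed as text_match, which is where those big numbers come from):

import typesense

client = typesense.Client({
    "api_key": "xyz",
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 2,
})

base = {
    "q": "kaviar kreuzung",
    "query_by": "searchname",
    "sort_by": "_text_match:desc,searchname_length:asc",
}

# 1) unfiltered query -> reference score
unfiltered = client.collections["products"].documents.search(base)
max_score = max((hit["text_match"] for hit in unfiltered["hits"]), default=0)

# 2) filtered query, then cut everything below 60% of the unfiltered max
filtered = client.collections["products"].documents.search(
    {**base, "filter_by": "country:us"})
kept = [hit["document"] for hit in filtered["hits"]
        if hit["text_match"] >= 0.6 * max_score]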

Otherwise the software runs like a charm. Strange that Debian doesn't ship it on their apt source servers, and likewise that there's no python3-typesense available straight via apt-get install...

A few small things remain. One is generating a key for search-only functionality. The other is backing up the database.
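
Both look fairly simple with the Python client, if I'm reading the docs right (the description and the snapshot path below are just placeholders):

import typesense

client = typesense.Client({
    "api_key": "xyz",   # admin key
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 2,
})

# scoped key that can only search, not write; the value is returned only once
search_key = client.keys.create({
    "description": "search-only key",
    "actions": ["documents:search"],
    "collections": ["*"],
})

# on-demand snapshot of the data directory, usable as a backup
client.operations.perform("snapshot", {"snapshot_path": "/tmp/typesense-snapshot"})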

And one more not-so-small thing. After restarting the server (service typesense-server restart) it takes two or three minutes before the server actually starts returning data. In the meantime the client gets an error after roughly ~6 seconds:

typesense.exceptions.ServiceUnavailable: [Errno 503] Not Ready or Lagging

Why it reports not-ready during those first 2-3 minutes, while the service is already running, I don't know.

I'd also like to set a lower timeout on the client. Maybe 1.5 seconds.
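
If connection_timeout_seconds is the right knob for that (which is what I'm assuming), it would just be:

import typesense

client = typesense.Client({
    "api_key": "xyz",
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 1.5,   # fail after 1.5 s instead of ~6 s
})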


Spura ::

FireSnake said:

Is it possible to get a link to that claim of yours?

Hard to, since slo-tech's search is only either by text or by all of a user's posts; anything more advanced isn't possible.

FireSnake ::

HotBurek said:

So:
1157451471441100921
578730054645710969

Score? Does that tell anyone who also read the opening post (I deliberately read it several times) anything extra (or rather: anything USEFUL)?

HotBurek said:

I read the max score, multiplied it by 0.6

WTF? Does that 0.6 mean anything to anyone?

HotBurek said:

Strange that Debian doesn't ship it on their apt source servers

Even more ballast. What does that have to do with the questions the troll is asking?

HotBurek said:

and likewise that there's no python3-typesense available straight via apt-get install...

And?

And that's far from all - I haven't listed all of the nonsense. Something is really going on in this guy's head.
Can anyone tell me whether they think this troll writing has any rhyme or reason? Everything is thrown together at random. The stream of mental outpourings is quite unbelievable.

Can someone please convince me that I'm wrong and burek isn't a troll?

HotBurek ::

The fact that they moved you to the semi-open ward is a professional error on psychiatry's part.

I told you clearly not to write that I'm a troll.

FireSnake, you're a bum. Get lost.


FireSnake ::

I asked a few technical questions. Where's the problem?

HotBurek ::

The problem is that the psychiatric ward let you out.

Bum.

FireSnake ::

Besides being an uneducated lout, you're also a primitive, I see.
Your words don't offend me, in case you happen to think otherwise.
If you dared say them to my face and I did nothing, then it would be a different story.

So I'll give you just one piece of advice: a little less trolling on the forum and things will be better for everyone.
Have a nice day!

