We are moving our forum in GitHub Discussions. For questions about Phalcon v3/v4 you can visit here and for Phalcon v5 here.

Phalcon Sharding VS MySQL Cluster


Dear community, I have a question about high-load development with Phalcon.

Phalcon PHP provides instruments to set the correct sources for the Model's data-tabels. But, internally, Phalcon does not work with "JOIN" when the tables for the SQL-query located on different database-servers ( http://forum.phalcon.io/discussion/398/splitting-data-over-multiple-databases ). So, to resolve "miltiple databases JOIN" problem, I need to use (for example) 2 queries for 2 tables and then join queries results with php-code. It works nice and fast.

But there is nice thing called "MySQL Cluster", and it can accept 1 big-join-query and collect data from multiple database nodes. http://www.clusterdb.com/mysql-cluster/70x-faster-joins-with-aql-in-mysql-cluster-7-2/

Which method is scalable and preferable?


Can someone give me their opinion on this?


I think MySQL Cluster is a more mature, faster and optimized feature than any solution developed by us (as well as MongoDB sharding) since I assume this reduce the I/O bound of accessing every node one by one by the application, however it clearly only works on MySQL if it's expected to support only this database system in your application then is ok. Phalcon's approach is intended to allow implement this logic from the database layer to the application layer benefiting from more freedom and less specific depending on database system features.

100% agree with Andres, don't reinvent the wheel


Phalcon, thanx a lot for the reply!

So, to create fast and scalable app I need to do the following:

  • Determine the parts of the data that must be stored (in relational database /memory / somewhere else), depending on storage engine features. For example to use power of MySQL Cluster for the one part of the data and PostgreSQL for the other.
  • Create separate queries for the every "isolated" part of the data.
  • Join the "isolated" queries results via php-code.

So, I can get freedom to change storage engine of "isolated" parts and get realy fast access to the data.

The truth lies somewhere in the middle. Does it?


doit 76, as I know "reinvention of the wheel" (separate queries in the application layer for the every table) was realized in very high-load projects. So, I tried to understand why they did so, if there is clusterization of storage "from the box".

Do it when you need it, don't think too much before you need it. Analyze and prepare when you know you get snowball. Look at http://highscalability.com/ there are lots of case studies. Learn from case studies like this http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances-dozens-of

"Keep it very simple Don’t re-invent the wheel Go with proven and solid technologies when you can"


doit76, 37signals with ReWork and so on told me the same :)

But I'm creating a complex business-oriented systems. And I'm doing this a long time ago. Always my key of success was to create scalable model and architecture before coding (even before creating small protoypes).

Maybe in small service there is no matter what architecture you used in the first release (later, you can improve it step by step). But, in complex systems, the model plays very big part, because in the next steps you will need to improve busines-logic, that based on your architecture (and here is he: "snowball"). Sure there is way to update an architecture and to improve a logic in parallel, but it is very expensive and risky way. // I know a very popular platform that now cannot be enchanced, because there are already many systems realized based on it. There was one try to update an architecture level - it was very painfull for the logic-developers and very-very expensive for the company that owns this platform. Now, this platform, has a dubious future.

Tell me what kind of system do you work on? Is it only DB problems? Master/Slaves can't handle your application? Maybe it will work for you http://www.percona.com/software/percona-xtradb-cluster Maybe you should invest more time to design API for your system and add your partners access to API, and handle business logic behind API? I don't know, because i don't know what you system do. In case of API best resource of information i know is http://apigee.com/


Now I'm working on very specific ERP/CRM system with the strong real-time billing. It must works in many Europe and Russia regions. It must be available 24/7. It even must have no dependencies from the internet connection (so there is will be a lot of replicated data-centers). This system will be used by employees, by suppliers, by peoples, by other integrated services. There is a lot-lot-lot of logic. But business-logic and API almost described, now we need to design fast and scalable storage architecture for the data. Sure I don't want to implement storage-dependencies in the application level, but if the speed will need such thing - one more level will be created.

P.S. Sure API - is main part of the system and it almost designed too (yes, it is the "clear REST"), there will be a lot-lot of different interfaces (not only web). But API and so on - it is not about my today's task - it is about development at all.


One more off-top thing about forum: When I pressed "Add comment" - forum told me that "user id required", but my post was not shown me - I thought that I'm "lost" my post (sure, for one second). I opened chrome's "Developer tools", refreshed the page (sent the POST-query), and in the panel "Network" in the tab "Headers" found the text of my post. Maybe there is way save the text of the post in the session or in the browser's local-storage?

didn't you think about using Amazon, Google infrastructure for your job?


Amazon and Google has pretty good solutions, I'm using them in my other projects. But there is 3 key factors for using own infrastructure:

  1. System must work without any internet connection in the branches of the company.
  2. System will work with real money traffic, so top-management don't want to use "cloud" solutions for datacenters.
  3. We already produced such systems with own instruments. Now I have some time and I'm just expiriementing with new storage infrastrucnure to increase speed characteristics.
  1. As I understand your system is network of LAN systems, which works inside LAN if no internet connection? When internet is on this systems exchange data with main system?
  2. Yes it is problem for real money.
  3. Probably better would be use clone of Google BigTable like HBase?

  1. Yes it is. It is huge network of LAN systems. When Internet is on - all devices (workstations, terminals and so on) are connected to the main datacenter, when internet is off - they are connected to the local datacenter (the data are replicated between datacenters all time when internet is on). The goal is to get the actual data in real-time and to provide an opportunity for local branches work all time (without internet-dependency).
  2. Yes, the company has enough money for construct new datacenters (they already have good ones).
  3. Maybe you are right. We already worked with MySQL Cluster (and with other "similar" RDBMS). But we didn't worked yet with GoogleBigTable and modern (Not only SQL) data-storages for ordinary ERPs architectures (sure, we used NoSQL for the communications or for the fast-storage and so on, but main "sceleton" always was the relational database model).
From HBase docs:
9.1.2. When Should I Use HBase?
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.

But we really need all this typed columns, indexes, transactions, queries and all ordinary RDBMS features. So I think - to move our main development principles to the NoSQL - it is like move all our ideology in production buisnes-oriented systems to the something new. Isn't it?

If you have got working system better to stay with mysql, i am sure that always is a way to solve any problem with you have own. First i would try mysqlproxy + LUA scripts, you can achieve what you ask here probably with this solution. It is bit old article but o'reilly (2007) http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html Second i would try to consult with Percona specialists, they have good reputation, and i am sure that they find solution.

You have got big system, we can't find solution here. Better to pay specialists for consultation before armagedon :)


Thank you, a lot! In my practice there wasn't any armageddons. And, as I told earlier, I'm already produced similar systems witn MySQL (and MsSQL and PostreSQL and so on) and many-many other features :)

My first question was about using "glue-level" inside the application. Some experienced people told me that it may works better than native RDBMS clusterization (by scalability reasons). In this project I want try to use some features of "phalcon" (such as ORM and controller) because it looks pretty fast and simple for maintaince. Then I saw that "Phalcon" has multidatabase sharding features by default... So, I asked here, in phosforum, opinion about RDMS-native-clusterization and about app-level-clusterization.