Web filesystem layout change

Discussion in 'Developers' Forum' started by ispcomm, Aug 26, 2010.

  1. ispcomm

    ispcomm New Member

    I am trying to imagine a large web host running a five-figure number of sites on top of ispconfig and clustered filesystem storage (don't do this at home!).

    Normally hosts solve this problem by installing more servers, each hosting a number of sites and perhaps 2-server clusters for high availability and processing power.

    What if the whole directory tree was available to all servers of the host, and each server could serve any of the hoster's sites? Load balancing on the fly would be possible, and low-traffic sites could be pushed to weaker machines (by means of linux-vserver or firewall load-balancing tricks).

    A problem is that such a host would need to be very efficient in resource usage and have a high degree of redundancy.

    I would also need to share the web tree across the "workers" cluster of servers.

    I'm wondering if the current filesystem layout of the web directories will hit its limit at the symlink level when a directory approaches the 100K+ mark of symlinks. For each access to a file this giant directory will need to be searched (btree/hash is fast... but), and there will be millions of small images to serve.

    Normally this kind of stuff is solved by just using 2+ levels of indirection, effectively using the filesystem as a search tree:

    /var/sites/f/fi/fir/first-domain.com
    /var/sites/s/se/sec/second-domain.com

    You get the idea...
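
    A minimal sketch of that mapping, in PHP since that is what ispconfig is written in. The function name sharded_site_path, the /var/sites base directory and the three-level default are assumptions taken from the example paths above, not an existing ispconfig function:

```php
<?php
// Hypothetical helper: map a domain name to a nested directory whose
// levels are growing prefixes of the name, e.g. f/fi/fir/first-domain.com.
function sharded_site_path(string $domain, int $levels = 3, string $base = '/var/sites'): string
{
    $parts = [];
    for ($i = 1; $i <= $levels; $i++) {
        // substr() returns the whole string once $i exceeds its length
        $parts[] = substr($domain, 0, $i);
    }
    $parts[] = $domain;
    return rtrim($base, '/') . '/' . implode('/', $parts);
}

echo sharded_site_path('first-domain.com'), "\n";   // /var/sites/f/fi/fir/first-domain.com
echo sharded_site_path('second-domain.com'), "\n";  // /var/sites/s/se/sec/second-domain.com
```

    Each directory then only holds the entries that share its prefix. How evenly the sites spread out still depends on how the domain names are distributed.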

    There are a number of advantages to adopting such a structure, which should be obvious.

    I'd like to open a discussion on the implications of such a structure and the possible implementation for ispconfig.

    ispcomm
     
  2. till

    till Super Moderator

    I don't see any major problems with using such a path in the current ispconfig system. To implement it, you would basically do the following steps.

    1) Define a new placeholder for the "f/fi/fir/first-domain.com" part of the domain path that you can use in the paths and symlinks in the server config settings, e.g. name it [website_splitted_domain].
    2) Implement the logic to replace the placeholder when the website path is set in the interface/web/sites/web_domain_edit.php file and the interface/lib/plugins/sites_web_domain_plugin.inc.php file.
    3)
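
    Steps 1 and 2 could look roughly like this. The placeholder name is the one suggested above; the function name expand_path_template and the template value are assumptions for illustration, not existing ispconfig code:

```php
<?php
// Hypothetical sketch: expand [name] placeholders in a path template.
function expand_path_template(string $template, array $values): string
{
    $map = [];
    foreach ($values as $name => $value) {
        $map['[' . $name . ']'] = (string) $value;
    }
    // strtr() with an array replaces every key in a single pass
    return strtr($template, $map);
}

// Assumed server config value; a real install defines its own website path.
$website_path = '/var/www/sites/[website_splitted_domain]';

echo expand_path_template($website_path, [
    'website_splitted_domain' => 'f/fi/fir/first-domain.com',
]), "\n"; // /var/www/sites/f/fi/fir/first-domain.com
```

    A template that does not contain the new placeholder comes back unchanged, so existing path settings would not be affected.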
     
  3. ispcomm

    ispcomm New Member

    ok.. it's on my todo list now :)
     
  4. ispcomm

    ispcomm New Member

    Implemented: patch attached

    I did some coding this morning and implemented the "splitter" placeholder. There are four placeholders, called [website_domain_1], [website_domain_2], [website_domain_3] and [website_domain_4].

    They split the path from t/testsite.com through t/te/tes/test/testsite.com.
    4 levels should be enough for anybody (uhm... I have a déjà vu feeling here).

    I discovered that there's a module interface/lib/plugins/sites_web_domain_plugin.inc.php which basically overrides all settings (but duplicates them nonetheless). I have implemented the patch there too, duplicating some code. I added TODO notes as I don't understand (yet) the need for that module.
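
    As a rough sketch of the single "token replacement" function the TODO notes ask for: domain_name_split below follows the logic of the patch (prefixes come from the part before the first dot), while replace_domain_placeholders is a hypothetical name that is not part of the patch or of ispconfig:

```php
<?php
// Follows the patch's logic for ordinary domain names: prefixes are
// taken from the first label only (the part before the first dot).
function domain_name_split(string $domain_name, int $levels): string
{
    $first_label = explode('.', $domain_name)[0];
    $path = '';
    for ($i = 1; $i <= $levels; $i++) {
        $path .= substr($first_label, 0, $i) . '/';
    }
    return $path . $domain_name;
}

// Hypothetical single helper for [website_domain_1] .. [website_domain_4].
function replace_domain_placeholders(string $path, string $domain_name): string
{
    for ($level = 1; $level <= 4; $level++) {
        $path = str_replace(
            '[website_domain_' . $level . ']',
            domain_name_split($domain_name, $level),
            $path
        );
    }
    return $path;
}

echo replace_domain_placeholders('/var/www/sites/[website_domain_3]', 'testsite.com'), "\n";
// /var/www/sites/t/te/tes/testsite.com
```

    Two things to note: a first label shorter than the level simply repeats its last prefix (ab.com at level 4 gives a/ab/ab/ab/ab.com), and a path without any of the four placeholders is returned unchanged.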

    The code is tested in my installation and works fine. Apache is also correctly configured so I guess it's OK.

    Please advise me when/if this is included in the trunk (patch is against latest trunk as of today).

    Code:
    # svn diff
    Index: interface/lib/plugins/sites_web_domain_plugin.inc.php
    ===================================================================
    --- interface/lib/plugins/sites_web_domain_plugin.inc.php	(revision 1947)
    +++ interface/lib/plugins/sites_web_domain_plugin.inc.php	(working copy)
    @@ -10,6 +10,22 @@
     	var $plugin_name        = 'sites_web_domain_plugin';
     	var $class_name         = 'sites_web_domain_plugin';
     
    +	// TODO: This function is a duplicate from the one in interface/web/sites/web_domain_edit.php
    +	//       There should be a single "token replacement" function to be called from modules and
    +	//	 from the main code.
    +        // Returns a "w/we/web/website.ext" path from "website.ext" domain name
    +        function domain_name_split($domain_name,$levels) {
    +                $domain = strtok( $domain_name, "." ) ;         // get first part only
    +                $path = "";
    +                for ( $i = 1 ; $i <= $levels ; $i ++ ) {
    +                        $path .= substr( $domain, 0, $i ) . "/"  ;
    +                }
    +                $path .= $domain_name ;
    +                return $path ;
    +        }
    +
    +
    +
         /*
                 This function is called when the plugin is loaded
         */
    @@ -38,7 +54,13 @@
             // Get configuration for the web system
             $app->uses("getconf");        
             $web_config = $app->getconf->get_server_config(intval($page_form->dataRecord['server_id']),'web');            
    +	// TODO: This code is a duplicate from interface/web/sites/web_domain_edit.php (there should be only 1).
             $document_root = str_replace("[website_id]",$page_form->id,$web_config["website_path"]);
    +        $document_root = str_replace("[website_domain_1]",$this->domain_name_split($page_form->dataRecord['domain'],1),$document_root);
    +        $document_root = str_replace("[website_domain_2]",$this->domain_name_split($page_form->dataRecord['domain'],2),$document_root);
    +        $document_root = str_replace("[website_domain_3]",$this->domain_name_split($page_form->dataRecord['domain'],3),$document_root);
    +        $document_root = str_replace("[website_domain_4]",$this->domain_name_split($page_form->dataRecord['domain'],4),$document_root);
    +
             // get the ID of the client
             if($_SESSION["s"]["user"]["typ"] != 'admin' && !$app->auth->has_clients($_SESSION['s']['user']['userid'])) {                    
                 $client_group_id = $_SESSION["s"]["user"]["default_group"];
    @@ -53,11 +75,16 @@
             // Set the values for document_root, system_user and system_group
             $system_user 				= $app->db->quote('web'.$page_form->id);
             $system_group 				= $app->db->quote('client'.$client_id);
    +	// TODO: Isn't this a duplication of the code above???
             $document_root 				= $app->db->quote(str_replace("[client_id]",$client_id,$document_root));
    +        $document_root = str_replace("[website_domain_1]",$this->domain_name_split($page_form->dataRecord['domain'],1),$document_root);
    +        $document_root = str_replace("[website_domain_2]",$this->domain_name_split($page_form->dataRecord['domain'],2),$document_root);
    +        $document_root = str_replace("[website_domain_3]",$this->domain_name_split($page_form->dataRecord['domain'],3),$document_root);
    +        $document_root = str_replace("[website_domain_4]",$this->domain_name_split($page_form->dataRecord['domain'],4),$document_root);
             $php_open_basedir 			= str_replace("[website_path]",$document_root,$web_config["php_open_basedir"]);
             $php_open_basedir 			= $app->db->quote(str_replace("[website_domain]",$page_form->dataRecord['domain'],$php_open_basedir));
             $htaccess_allow_override 	= $app->db->quote($web_config["htaccess_allow_override"]);
             $sql = "UPDATE web_domain SET system_user = '$system_user', system_group = '$system_group', document_root = '$document_root', allow_override = '$htaccess_allow_override', php_open_basedir = '$php_open_basedir'  WHERE domain_id = ".$page_form->id;
     		$app->db->query($sql);
    -	}
    -}              	
    \ No newline at end of file
    +	}	
    +}              	
    Index: interface/web/sites/web_domain_edit.php
    ===================================================================
    --- interface/web/sites/web_domain_edit.php	(revision 1947)
    +++ interface/web/sites/web_domain_edit.php	(working copy)
    @@ -251,6 +251,17 @@
     		parent::onShowEnd();
     	}
     
    +	// Returns a "w/we/web/website.ext" path from "website.ext" domain name
    +	function domain_name_split($domain_name,$levels) {
    +		$domain = strtok( $domain_name, "." ) ;		// get first part only
    +		$path = "";
    +		for ( $i = 1 ; $i <= $levels ; $i ++ ) {
    +			$path .= substr( $domain, 0, $i ) . "/"  ;
    +		}
    +		$path .= $domain_name ; 
    +		return $path ;
    +	}
    +
     	function onSubmit() {
     		global $app, $conf;
     
    @@ -345,6 +356,10 @@
     		$web_rec = $app->tform->getDataRecord($this->id);
     		$web_config = $app->getconf->get_server_config(intval($web_rec["server_id"]),'web');
     		$document_root = str_replace("[website_id]",$this->id,$web_config["website_path"]);
    +		$document_root = str_replace("[website_domain_1]",$this->domain_name_split($web_rec['domain'],1),$document_root);
    +		$document_root = str_replace("[website_domain_2]",$this->domain_name_split($web_rec['domain'],2),$document_root);
    +		$document_root = str_replace("[website_domain_3]",$this->domain_name_split($web_rec['domain'],3),$document_root);
    +		$document_root = str_replace("[website_domain_4]",$this->domain_name_split($web_rec['domain'],4),$document_root);
     
     		// get the ID of the client
     		if($_SESSION["s"]["user"]["typ"] != 'admin' && !$app->auth->has_clients($_SESSION['s']['user']['userid'])) {
    @@ -426,6 +441,10 @@
     		$web_rec = $app->tform->getDataRecord($this->id);
     		$web_config = $app->getconf->get_server_config(intval($web_rec["server_id"]),'web');
     		$document_root = str_replace("[website_id]",$this->id,$web_config["website_path"]);
    +		$document_root = str_replace("[website_domain_1]",$this->domain_name_split($web_rec['domain'],1),$document_root);
    +		$document_root = str_replace("[website_domain_2]",$this->domain_name_split($web_rec['domain'],2),$document_root);
    +		$document_root = str_replace("[website_domain_3]",$this->domain_name_split($web_rec['domain'],3),$document_root);
    +		$document_root = str_replace("[website_domain_4]",$this->domain_name_split($web_rec['domain'],4),$document_root);
     
     		// get the ID of the client
     		if($_SESSION["s"]["user"]["typ"] != 'admin' && !$app->auth->has_clients($_SESSION['s']['user']['userid'])) {
    @@ -516,4 +535,4 @@
     $page = new page_action;
     $page->onLoad();
     
    -?>
    \ No newline at end of file
    +?>
    
    ispcomm
     
  5. till

    till Super Moderator

    Thanks for the patch. Have you tested renaming a domain? This is the only critical part that I see at the moment, e.g. when you rename the domain of a website from test.int to johndoe.org.

    This new module is needed for the remoting API and it also replaces the code in interface/web/sites/web_domain_edit.php, so the code in the web_domain_edit.php file has to be removed if the plugin is working.
     
    Last edited: Aug 31, 2010
  6. ispcomm

    ispcomm New Member

    I was not aware of this point. I tried to change the domain name via sites->edit and entered a new name in place of sitename.ext. It gets changed in the domain field of the web_domain table, but the path stays the same (old path).

    The apache vhost file is updated with a new "virtualhost" directive, but it still sits on the old path.

    I cannot find where the new path should be calculated. Perhaps a little hint will save me lots of time in searches.

    ADDED: Hmm... the apache vhost is somewhat bad. It was created with the wrong document root:
    Reloading web server config: apache2
    Warning: DocumentRoot [/var/www/newsite.com/web] does not exist

    I discovered by debugging that both places are called. First is web_domain_edit.php and then the plugin.

    ispcomm.
     
    Last edited: Aug 31, 2010
  7. till

    till Super Moderator

    The web_domain_edit.php file contains a function for onAfterInsert and one for onAfterUpdate. Have you added your code in both functions or only in the onAfterInsert part? There is no other place where the path gets changed, apart from the plugin.

    That's how the plugins are intended to work and why the code in the edit file gets removed.
     
  8. ispcomm

    ispcomm New Member

    Yes, I have inserted the code in both web_domain_edit and in the plugin.

    It also seems that web_document_root and web_document_root_www are never read from the database but generated from the domain name. If this is the case, I might have found a bug.

    I'm doing a little more of debugging right now (but I'm using echo/print statements and it's taking time).

    ispcomm.
     
  9. till

    till Super Moderator

    This might be the case, although it is not a bug, as it matches the current ispconfig implementation of paths. But it might have to be extended if other path identifiers become available in the future.

    Please be aware that there is one "real" path and one symlink (for easier shell navigation for the admin). On current setups, /var/www/domain.com is a symlink and /var/www/clients/client1/web1/ is the real path, which uses IDs to ensure that it never changes on a domain name update; otherwise it would break CMS systems and scripts that are installed in the website.
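
    A small sketch of why the ID-based real path survives a rename. It runs in a throwaway temp directory on a Unix-like system; the directory names only mimic the layout described above and nothing here is ispconfig code:

```php
<?php
// Build a miniature copy of the layout in a temp directory.
$root = sys_get_temp_dir() . '/ispc_layout_' . uniqid();
$real = $root . '/clients/client1/web1';
mkdir($real, 0755, true);

// Convenience symlink named after the domain, pointing at the ID-based path.
symlink($real, $root . '/test.int');

// Renaming the domain only renames the symlink; the real path is untouched.
rename($root . '/test.int', $root . '/johndoe.org');

$is_link     = is_link($root . '/johndoe.org');
$same_target = realpath($root . '/johndoe.org') === realpath($real);
$old_gone    = !file_exists($root . '/test.int');

// Clean up the temp directory again.
unlink($root . '/johndoe.org');
rmdir($real);
rmdir($root . '/clients/client1');
rmdir($root . '/clients');
rmdir($root);

var_dump($is_link, $same_target, $old_gone); // three times bool(true)
```

    Anything inside the website that stored the real path keeps working after the rename, because only the symlink name changed.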
     
  10. ispcomm

    ispcomm New Member

    Some more information. The plugin is never called on domain edit (rename). It's only called at creation time.

    The onAfterUpdate is called, but the flow is not really clear to me. It seems that it never reaches any of the sql update statements. The domain I'm adding is bound to a reseller and not an end client (should this matter?).

    Document root is calculated correctly in the upper part of the function, but never updated in the database.

    My guess is that this has nothing to do with the patch....

    Could you have a look on this issue?

    ispcomm.
     
  11. ispcomm

    ispcomm New Member

    Perhaps this can be broken by using a different setting in the server configuration? I'll have a look at how the URL is constructed so that I can change it to be read from the db and this will fix it once for all.

    Yes, found the real dir+(dual) links.

    The way it's done, the paths will stay the same even if the domain is renamed. But any CMS will have a setting for the base_dir (if not calculated automatically). It should not be a big issue (my guess).

    ispcomm.
     
  12. till

    till Super Moderator

    I guess we will have to add a database field for the www_path in the website database table then, which we fill with the correct path when the website is inserted or updated, like we do now for the real path.

    It would be great if all CMS systems and custom scripts did that, but I've seen a lot of scripts break when you e.g. rename a website from test.domain.com to domain.com when it gets switched to live mode. So if you'd like to keep such problems away from your support team, I can only recommend staying with the ID approach and only modifying the symlink part.
     
  13. ispcomm

    ispcomm New Member

    I agree. Please do so in trunk as I'm still a bit confused as to where to find things (and I can easily break them).

    Meanwhile, I'm installing the debugger on my test setup so I can step through code more easily. I hope to be up to speed in a few hours.

    I have to agree with you that CMSs are usually not well written and a source of continuous headaches. client_id/web_id/ stays the same, and as long as the web server is instructed not to follow symlinks it should be fine. It does make it harder at the beginning to find the correct path and to move sites between clusters.

    But whatever reasoning there is behind the path, I'll try to fix my patch so it works.

    ispcomm
     
  14. till

    till Super Moderator

    I will add the field.

    That depends on how you plan your cluster. ISPConfig is able to manage several clusters of replicated servers from one control panel, and as long as you use the same master ispconfig control panel instance for all clusters, the IDs are unique.
     
  15. ispcomm

    ispcomm New Member

    I saw your change in SVN. One question is in order: Isn't it more efficient to point the site to the real directory instead of the symlink ? A symlink means an extra access for every object on the web root.

    ispcomm
     
  16. till

    till Super Moderator

    The site points in almost all cases to the real directory. Only some special cases (combinations of php modes with suexec etc.) made it necessary to point the site to the symlink. If you switch the php modes, you will see that in the vhost file.

    The problem is as follows: On older ispconfig releases, the clients directory was not inside the /var/www directory; it was /var/clients and not /var/www/clients as it is now. For security reasons, I would have preferred to keep using /var/clients, but some apache modules had problems with that. So we had to introduce the switching between the symlink and the real directory depending on the apache modules used in a specific vhost. Since we moved the clients directory to /var/www now to circumvent these problems, we would be able to always use the real path. But there are a few ten thousand installs which use the old path layout and we cannot break them.
     
  17. ispcomm

    ispcomm New Member

    Till, I got your point. I too have several constraints for not breaking sites, some of which are over 10 years old (ouch). I remember having to recompile some of the suid php interpreters (perhaps it was suphp or was it cgi?) debian packages to allow for sites out of /var/www.

    I installed a second copy of ispconfig with xdebug+eclipse and I'm getting to understand the process a little better.

    It seems that document_root_www is used for some cgi interpreters (fastcgi for example) while document_root is used for the others (plain cgi).

    With my patch, apache vhost is created in the split path (/var/www/sites/a/ab etc) while symlinks stay as default (/var/www/clients/client1/web2 etc).

    FastCGI uses DocumentRoot with a symlink /var/www/testsite3.com/web
    CGI uses the real directory. I haven't tested others.

    Renaming a site the way I proposed would mean moving the directory from the old "attach point" to the new one, which means a little change in server.php is also due (should be easy).

    A problem would be not breaking the old installs. Perhaps it would suffice to create the symlinks they expect to the real directory. An "fopen" from an old site would open the file through the symlink. php_open_basedir would have this symlink in the path.

    So old sites would work untouched, while new ones could start with the full path.

    There's a problem in onAfterUpdate: the $document_root is calculated correctly (and I added my patch) but it's never written to the database, so server.php cannot do anything about it. Incidentally, the new site name is placed in the log file.

    I can modify onAfterUpdate to update the site with the new path. However, since there's no place to record the old path it will be impossible for server.php to know which site to move around.

    Any idea about a solution (other than disabling site renaming after it's created)?

    ispcomm.
     
  18. till

    till Super Moderator

    That's not the case: the plugin always "knows" the old and new path, as ISPConfig has its own event-based system that tracks the changes (see the serialized objects in sys_datalog). Every event function in the plugins receives a $data array which contains the old and new values. Everything that is changed in the database in the onAfterUpdate event is part of the current transaction. So you can safely change the path in that event in the interface, and the plugin automatically gets the old and new path values. This is how all move operations, e.g. for maildirs in the email module, are implemented.
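
    To illustrate the idea, here is a sketch of how an event function could decide on a move using only the values handed to it. The 'old' and 'new' keys and the function name plan_document_root_move are assumptions for this example; check the real plugins for the exact array layout:

```php
<?php
// Hypothetical event handler logic: decide whether a site directory has to
// be moved, using only the old and new record values passed with the event.
function plan_document_root_move(array $data): ?array
{
    $old = $data['old']['document_root'] ?? null;
    $new = $data['new']['document_root'] ?? null;
    if ($old === null || $new === null || $old === $new) {
        return null; // nothing to move
    }
    return ['from' => $old, 'to' => $new];
}

// Example payload for a rename from test.int to johndoe.org (made-up paths).
$data = [
    'old' => ['domain' => 'test.int',    'document_root' => '/var/www/sites/t/te/tes/test.int'],
    'new' => ['domain' => 'johndoe.org', 'document_root' => '/var/www/sites/j/jo/joh/johndoe.org'],
];

print_r(plan_document_root_move($data));
```

    Because both values travel with the event, nothing has to remember the old path separately.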

    Regarding the overall changes you proposed, I fear this will break too many setups, as not everything will work through a symlink, so we cannot implement this as the new default. The solution I would propose is that we add some kind of layout selector in the server settings to keep the current layout as default and allow the new scheme as an option, plus maybe other schemes later.

    The problem with the layout is that some apache modules and settings are really picky, and even a small change makes specific combinations fail. It took several months and a lot of patch releases to get this all to a stable point where all combinations work, and we cannot risk this without an easy option to switch back. So a selector might be the best solution: leave the current setup as it is and implement a new layout beside the current setup in the apache plugin.
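
    A layout selector along these lines might look like the sketch below. The option values 'default' and 'split_domain', the function name and the /var/www/sites base path are all invented for illustration; no such setting exists in the code discussed here:

```php
<?php
// Hypothetical layout selector: the current ID-based layout stays the
// default, the split-domain layout is opt-in.
function website_path_for_layout(string $layout, int $client_id, int $website_id, string $domain): string
{
    switch ($layout) {
        case 'split_domain':
            // Optional new scheme: three prefix levels of the first label.
            $label = explode('.', $domain)[0];
            return sprintf(
                '/var/www/sites/%s/%s/%s/%s',
                substr($label, 0, 1),
                substr($label, 0, 2),
                substr($label, 0, 3),
                $domain
            );
        case 'default':
        default:
            // Unknown values fall back to the current layout on purpose.
            return sprintf('/var/www/clients/client%d/web%d', $client_id, $website_id);
    }
}

echo website_path_for_layout('default', 1, 2, 'testsite.com'), "\n";      // /var/www/clients/client1/web2
echo website_path_for_layout('split_domain', 1, 2, 'testsite.com'), "\n"; // /var/www/sites/t/te/tes/testsite.com
```

    Falling back to the current layout for any unknown value keeps existing installs on the paths they already use.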

    Just as a side note, please do not change server.php. ISPConfig is event and plugin based, so always make changes in the plugins only; a change in the main script, which only contains the process management and plugin loading code, should not be necessary :)
     
  19. ispcomm

    ispcomm New Member

    Till,

    We're missing something (and I don't want to break things). The new structure would be in effect only if someone changes the web server config paths.

    If the web server setting never uses website_domain_x there will be no change across the whole setup, so in reality there is no real problem.

    Also, if the symlink and the web_dir lead to the same "path" the symlink call will silently fail (I hope so) and there will be no difference between the various CGI/FastCGI/SuPHP etc modules.

    I'll need to read the code a little more to understand the plugin/events structure. Is there any doc/thread where this is explained?

    If it's the frontend that changes things and server.php that applies the changes, what is the glue? server.php is called from server.sh, but where is the state retrieved from?

    Sorry for so many questions... I hope I can be useful once I understand how things are laid out.

    ispcomm.
     
  20. till

    till Super Moderator

    This would only be the case if no changes in the server path are necessary, but as we will already have to add a field for the www path, we start to change the current system layout. But we will see.

    There are no docs besides the info here in the dev forum.

    server.php is executed once a minute. If there are pending changes to be processed, it loads the module and plugin framework, which then loads all plugins in alphabetical order and attaches them to the event listeners. Then it loads all record changes from the sys_datalog, processes them and calls the event-based functions in the plugins. Every event-based function receives the old and new state of the record, so the function uses only the values from that array and never reads directly from the database, as this would break the atomicity of the actions.
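
    The flow described above can be pictured with a heavily simplified dispatcher. This is not the real server.php: the EventBus class, the event naming and the shape of the pending rows are assumptions made up for this sketch:

```php
<?php
// Hypothetical, simplified event dispatcher.
class EventBus
{
    private array $listeners = [];

    public function register(string $event, string $plugin, callable $handler): void
    {
        $this->listeners[$event][$plugin] = $handler;
    }

    public function raise(string $event, array $data): void
    {
        $handlers = $this->listeners[$event] ?? [];
        ksort($handlers); // call plugins in alphabetical order of their name
        foreach ($handlers as $handler) {
            $handler($event, $data);
        }
    }
}

$bus = new EventBus();
$log = [];
$bus->register('web_domain_update', 'b_plugin', function ($event, $data) use (&$log) {
    $log[] = 'b:' . $data['new']['domain'];
});
$bus->register('web_domain_update', 'a_plugin', function ($event, $data) use (&$log) {
    $log[] = 'a:' . $data['new']['domain'];
});

// Pending changes as they might come out of a datalog table (assumed shape).
$pending = [
    [
        'dbtable' => 'web_domain',
        'action'  => 'u',
        'data'    => serialize([
            'old' => ['domain' => 'test.int'],
            'new' => ['domain' => 'johndoe.org'],
        ]),
    ],
];

foreach ($pending as $row) {
    $event = $row['dbtable'] . '_' . ($row['action'] === 'u' ? 'update' : 'insert');
    // Handlers get the old and new state and never query the database.
    $bus->raise($event, unserialize($row['data']));
}

print_r($log); // a:johndoe.org first, then b:johndoe.org
```

    The handlers only see the snapshot stored with the change, which is the point made above about not reading the database directly.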
     
