Thursday, June 7, 2007

How much data can SPSS and Stata actually handle?

 
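During the May Day holiday I downloaded SAS 9.1.3 and got it installed, but lacking a proper SID and any experience with SAS, I soon removed it from my machine. I was actually quite reluctant to give up its powerful data processing and business application capabilities, but that can wait. For now I should get properly fluent in the SPSS and Stata I already have. A while ago I had to process a dataset with over a hundred million cases, and without thinking it through I ruled out SPSS and Stata, subconsciously assuming that they could not handle data that large, or could do so only with great difficulty. In the end I had to resort to VB to get the job done. But how much data can SPSS and Stata actually handle? Here is the answer: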
 
For SPSS, even version 10.0 can handle [2^15 = 32,768 variables; 2^31 = 2,147,483,648 cases]. From SPSS 10.0 onward, the limits on both variables and cases became 2^31 or 2^31-1. Here is an explanation from someone else: The below is by Jon Peck of SPSS, Inc., and applies to all recent versions of SPSS. There are several points to make regarding very wide files and huge datasets.

        2. The overhead of reading and writing extremely wide cases when you are doubtless not using more than a small fraction of them will limit performance.  And you don't want to be paging the variable dictionary.  If you have lots of RAM, you can probably reach between 32,000 and 100,000 variables before memory paging degrades performance seriously. (I would never use that many variables anyway.)

        5. These points apply mainly to the number of variables.  The number of cases is not subject to the same problems, because the cases are not generally all mapped into memory by SPSS (although Windows may cache them).  However, there are some procedures that because of their computational requirements do have to hold the entire dataset in memory, so those would not scale well up to immense numbers of cases. (My guess is that even running a frequency distribution in SPSS on data with over a hundred million cases would be very difficult.)

        Modern database practice would be to break up your variables into cohesive subsets and combine these with join (MATCH FILES in SPSS) operations when you need variables from more than one subset.  SPSS is not a relational database, but working this way will be much more efficient and practical with very large numbers of variables. (Subsets take care of a large number of variables, but what about a large number of cases?)
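To make the "cohesive subsets plus join" idea concrete, here is a minimal Python sketch of two variable subsets that share a case ID and are combined only when needed. It is not SPSS code: the function name `match_files`, the key name `id`, and the sample data are all made up for illustration, and keeping cases that appear in only one subset is a choice of this sketch.

```python
# Hypothetical illustration: two variable subsets sharing a case ID,
# joined on demand (loosely analogous to a keyed MATCH FILES in SPSS).

def match_files(left, right, key="id"):
    """Join two lists of row dicts on a shared key (one-to-one match).

    Cases found in only one subset are kept; the variables from the
    other subset are simply absent for them (a choice of this sketch).
    """
    merged = {}
    for row in left:
        merged[row[key]] = dict(row)
    for row in right:
        merged.setdefault(row[key], {key: row[key]}).update(row)
    # Return the cases ordered by key
    return [merged[k] for k in sorted(merged)]


# Made-up data: demographics and survey answers stored separately.
demographics = [
    {"id": 1, "age": 34, "sex": "F"},
    {"id": 2, "age": 51, "sex": "M"},
]
answers = [
    {"id": 1, "q1": 4},
    {"id": 2, "q1": 2},
    {"id": 3, "q1": 5},
]

combined = match_files(demographics, answers)
for row in combined:
    print(row)
```

Each subset stays narrow, so day-to-day work touches only the variables it needs, and the join is paid for only when variables from both subsets are required together.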

For Stata, in a word: it depends on available memory. For details, see the FAQs by Kevin S. Turner, StataCorp:

      1. Under all current 32-bit Windows operating systems (Windows 95, 98, ME, NT, 2000, XP, Vista), the total available address space for any application is 2.1 GB. If you have a dataset larger than 2.1 GB, you will not be able to load it on Stata for Windows.

      2. Unfortunately, even if your dataset is under the 2.1 GB limit, you may run into difficulty when loading it into Stata. The fault again lies with how Windows manages the 2.1 GB address space. You may be surprised to find that a 1.4 GB dataset loaded fine one time, but failed to load a subsequent time. This is simply an unfortunate side effect of Windows memory management.

      3. The 64-bit platform will enable you to work with very large datasets. Depending on your operating system, you should be able to allocate as much memory as you have on the machine minus the system requirements. To take advantage of this technology, you will need 64-bit compatible hardware, a 64-bit operating system, and, of course, a 64-bit version of Stata.

      4. As a last resort, you may consider trimming any unnecessary data from your dataset or dividing the dataset into two files. Depending on your data and analysis this may not be feasible, and is only offered as a suggestion.
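For an analysis as simple as a frequency distribution, dividing the data into pieces does work, because the counts from each piece can just be added together. Below is a minimal Python sketch of that idea. It is neither Stata nor SPSS code; the names `chunks` and `frequency_by_chunks`, the chunk size, and the generated data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import islice


def chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def frequency_by_chunks(values, chunk_size):
    """Build a frequency distribution one chunk at a time.

    Only one chunk is held in memory at once; the per-chunk counts are
    added into a running total.
    """
    total = Counter()
    for chunk in chunks(values, chunk_size):
        total.update(Counter(chunk))
    return total


# Made-up data: a generator stands in for a file too large to load whole.
cases = (i % 3 for i in range(1_000_000))
freq = frequency_by_chunks(cases, chunk_size=100_000)
print(sorted(freq.items()))
```

Memory use here is bounded by the chunk size and the number of distinct values, not by the number of cases. Analyses that need all cases at once cannot be split up this easily.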
