About site: Knowledge Management/Knowledge Discovery - Mining Customer Data
  About site: http://www.db2mag.com/db_area/archives/1998/q3/98fsaar.shtml

Title: Knowledge Management/Knowledge Discovery - Mining Customer Data By Gary Saarenvirta. A step-by-step look at a powerful clustering and segmentation methodology.
This is now2007.com cache of m/ as retrieved on 2008.11.21 now2007.com's cache is the snapshot that we took of the page as we crawled the web. The page may have changed since that time.
IBM Database Magazine

Mining Customer Data

A step-by-step look at a powerful clustering and segmentation methodology.

By Gary Saarenvirta
Fall 1998

Customer clustering and segmentation are two of the most important data mining methodologies used in marketing and customer-relationship management. They use customer-purchase transaction data to track buying behavior and create strategic business initiatives. Businesses can use this data to divide customers into segments based on such "shareholder value" variables as current customer profitability, some measure of risk, a measure of the lifetime value of a customer, and retention probability. Creating customer segments based on such variables highlights obvious marketing opportunities.

For example, high-profit, high-value, and low-risk customers are the ones a company wants to keep. This segment typically represents the 10 to 20 percent of customers who create 50 to 80 percent of a company's profits. A company would not want to lose these customers, and the strategic initiative for the segment is obviously retention. A low-profit, high-value, and low-risk customer segment is also an attractive one, and the obvious goal here would be to increase profitability for this segment. Cross-selling (selling new products) and up-selling (selling more of what customers currently buy) to this segment are the marketing initiatives of choice.

Within behavioral segments, a business may create demographic subsegments.
Customer demographic data does not typically correlate to customer shareholder value, which is why you don't use it together with behavioral data to create segments. However, demographic segmenting can steer marketers in selecting appropriate advertising, marketing channels, and campaigns to satisfy the strategic behavioral segment initiatives.

For example, imagine a bank with a high-profit and a low-profit behavioral customer segment, both of which have a demographic subsegment of young-family, high-income professionals. The marketer would want to ask the following question: Why do these similar demographic segments behave differently, and how do I turn the low-profit group into a high-profit group? It is difficult if not impossible to answer why, but data mining provides an answer to the how. Affinity analysis may reveal that the high-profit group of young, wealthy professionals has a distinct product pattern -- mortgages, mutual funds, and credit cards. The lower-profit group may have a product pattern that partially fills that of the high-profit group -- mutual funds and credit cards. The marketing campaign to increase the profitability of the low-profit segment would thus be to market mortgages to them.

In summary, behavioral clustering and segmentation help derive strategic marketing initiatives by using the variables that determine customer shareholder value. By conducting demographic clustering and segmentation within the behavioral segments, you can define tactical marketing campaigns and select the appropriate marketing channel and advertising for the tactical campaign. It is then possible to target those customers most likely to exhibit the desired behavior (such as buying a mortgage product, in our bank example) by creating predictive models (see Figure 1).

THEORY TO PRACTICE

I recently worked as part of a group conducting a customer segmentation study for The Loyalty Group in Canada.
The Loyalty Group runs an AIR MILES Reward Program (AMRP) for a coalition of more than 125 Canadian companies in all industry sectors, including finance, credit card, retail, grocery, gas, and telecommunications. Some of the sponsors in these industries are Bank of Montreal, Hudson Bay, Safeway Canada, Great Atlantic & Pacific Co., and Shell Canada. More than 60 percent of Canadian households are enrolled in the program and can shop at more than 10,000 participating locations. The AMRP is a frequent-shopper program; that is, the consumer can collect AIR MILES Travel Miles (AMTM) for making purchases at the coalition sponsors. Customers can then redeem the Travel Miles collected for rewards, which include not only air travel, but hotel accommodation, rental cars, theatre tickets, tickets for professional sporting events, a family night at the movies, and merchandise.

The various coalition partners capture consumer transactions and transmit them to The Loyalty Group, which stores these transactions and uses the data for database marketing initiatives on behalf of the coalition partners. The Loyalty Group data warehouse currently contains more than 6.3 million household records and 1 billion transaction records.

In meeting the database marketing needs of the coalition partners, the Reward Program has employed standard analytical techniques. The business analysts use Recency, Frequency, Monetary value (RFM) analysis, online analytic processing tools, and linear statistical methods to mine the data for marketing opportunities and analyze the success of the various marketing initiatives undertaken by the coalition and its partners.

In the recent project I was involved in, the specific objectives were to create a customer segmentation using a data mining tool and compare the results to an existing segmentation developed using RFM analysis. The goal of the project was to prove the value of data mining and recommend follow-up data mining initiatives.
We extracted the data required for the AMRP data warehouse and loaded it onto our data mining platform -- DB2 Universal Database Enterprise-Extended Edition parallelized over a five-node RS/6000 SP massively parallel system. We chose Intelligent Miner for Data as our data mining tool for several reasons: most important, because its algorithms have existed for some time and are well tested; it has categorical clustering and product association algorithms which are not available in most other tools; and it's highly scalable.

Figure 2 shows the data model we used as the primary source of data for this study. The data we used consisted of approximately 50,000 customers and their associated transaction data for a 12-month period. The "shareholder value" variables chosen for this study included revenue, customer tenure, number of sponsor companies shopped at over the customer tenure, number of sponsor companies shopped at over the last 12 months, and recency (in months) of the last transaction, as well as several business-specific variables.

The revenue data in the data model of Figure 2 was contained in the transaction table. Each transaction record contained a revenue figure that we used to estimate profitability by applying a gross profit margin. (You can develop more sophisticated profit models, but they were outside the scope of our work.) We calculated the other shareholder value variables by aggregating the transaction data and then joining it to each customer record.

The AIR MILES Reward Program categorized the sponsor companies into 14 categories (such as banking, credit card, department store, and so on). The category labels in the results presented here are denoted by category numbers to protect the confidentiality of the coalition partners.

Listing 1 is a fragment of SQL code used to join the transaction data to the customer file to create the clustering input record required by both the demographic and neural clustering algorithms we used.
This code illustrates how the transaction data is pivoted, aggregated, and inserted into each customer record. The code sample was executed for all 14 categories for the first two quarters of 1997.

DATA CLEANSING

After all the data variables were created on each customer record (14 categories × 3 variables per category × 2 quarters = 84 variables), the missing values needed to be treated. Customer demographic data usually has a high percentage of missing values. Because this type of data is usually categorical, you can code the missing values with an unknown value. If a large enough portion of the field is missing, it may be discarded. The decision to discard a field depends not only on the size of the missing value portion but also on why the field contains a missing value. If, for example, the missing value was generated by an unanswered question on a survey where other questions were answered, the result is not necessarily missing. Coding the missing value as unanswered may provide some insight into those customers who chose not to answer a particular question. In this example, it may be acceptable to have 50 percent or more of the data coded as unanswered. However, if a value appears to be randomly missing, you may want to set a lower cutoff point.

The treatment of missing values in numeric fields is more difficult because the assignment of a numeric value to missing fields will change the distribution and statistics of the field. For transaction data, missing values are easily assigned the value zero. If a customer has no credit card transactions or a null value is inserted into the customer record for the "number of credit card transactions" field, this means that the customer has zero credit card transactions. All null values of this type should be coded with a zero. If a customer age field has a missing value, choosing an appropriate numeric value is much more difficult. Assigning a numeric value will likely alter the mean and mode of the distribution.
This fact is much more critical in statistical analysis than in data mining because statistical methods typically have more strict criteria on the variable distributions. Data mining algorithms do not typically require data to be normally distributed, although normal distribution does allow the algorithms to work better in most cases. Nonetheless, you still must take some care.

There are several ways to handle missing values in numeric data fields, including:

• Assign the average, mean, modal value, or other value. This is the simplest method but the one that will probably have the greatest impact on the variable's distribution. You should use it with caution and only when the effect is minimal.
• Distribute the data using the probability distribution of the nonmissing records. This method is not very difficult to apply and will not change the distribution of the data very much, but it may cause your data mining results to be in error if the assigned variable is important to the model you're building. Assigning value using the probability distribution will introduce quasirandom errors into the model.
• Segment the data using the distribution of another variable, and assign segment averages to missing records in each segment. This method is not very difficult to apply. The results can be very good if the variable used for segmentation is chosen because it is correlated to the variable being treated. In this case, segmentation, even arbitrary, will allow reasonable averages to be assigned per segment. The larger the number of segments used, the more accurate the results. As the number of segments chosen approaches the number of records, this is equivalent to building a model. You can also create a segmentation using a number of different variables and then apply the segment averages.
• Distribute the data using the probability distribution of the nonmissing records using segments based on other variables.
If the number of segments chosen is large, the quasirandom effect will be minimal and will likely provide a better result than the previous method. This method is more difficult than the others.
• Build a classification model and impute the missing values. This is the best method by far, but it is time consuming to build a model. Prior to using this or any of the other methods, consider the trade-off between the time you'll spend modeling the missing values and the potential benefit.

In the AIR MILES study all the transaction variables (revenue, Travel Miles collected, and number of transactions) were assigned zero when null using the following SQL code:

    update target set cat1_miles_q2=0 where cat1_miles_q2 is null;
    update target set cat1_trans_q2=0 where cat1_trans_q2 is null;
    update target set cat1_revenue_q2=0 where cat1_revenue_q2 is null;

All categorical variables were made consistent, and unknown values were assigned a code with SQL similar to the following:

    update target set gender='U' where gender is null or gender=' ';

When coding missing values, it's useful to create a binary variable that identifies the coded records. We created a binary field for all missing categorical fields and numeric fields using SQL code similar to the following:

    alter table target add unknowngender smallint;
    update target set unknowngender=1 where gender='U';
    update target set unknowngender=0 where unknowngender is null;

We also used Intelligent Miner's statistical functions to profile the data to ensure that all fields had been properly cleaned.

DATA TRANSFORMATION

After you've cleaned your data, treated all missing and invalid values, and made the known valid values consistent, you're ready to transform the data. The data in its original form is valuable, but transformations will maximize the information content that you can retrieve. There are two types of data transformation:

• Data distribution transformation.
This type of transformation involves mathematically altering the distribution of the variable.
• Data creation. This type of transformation involves the creation of new variables by combining existing variables to form ratios, differences, and so forth.

For statistical analysis, the data transformation phase is critical because some statistical methodologies require that the data be linearly related to an objective variable, normally distributed, and free of outliers. Artificial intelligence and machine learning methods do not strictly require the data to be normal or linear, and some methods -- the decision tree, for example -- don't even require you to deal with outliers. This is a major difference between statistical analysis and data mining. The machine learning algorithms have the capability to automatically deal with the nonlinearity and nonnormal distributions, although the algorithms will work better in many cases if these criteria are met.

For machine learning and artificial intelligence methods the reasons for changing the variable distributions (type 1 data transformation) include:

To remove the effect of outliers. An outlier, or outlying value, is a record whose value for a particular field is outside the field's normal range, as defined by ~99 percent of the other possible values. If the outlying value is extreme, it could seriously alter the accuracy of a model that is built. For algorithms that minimize root mean square error or similar criteria, an outlier will cause the algorithm to minimize the error in modeling outlying records at the expense of the accuracy of the remaining ~99 percent of the records. Sometimes outlying values are useful and should not be removed. When doing fraud detection or deviant detection, the outliers may be the very records you're searching for and should be left untouched.

To make the data more interpretable. Many transaction variables, such as revenue and number of transactions, have a highly skewed distribution.
Using the data in this format makes some visualizations difficult to interpret. Using a discretization scheme or taking the log transform of such a variable will normally distribute the data, making the results easier to interpret and sometimes also improving the quality of the results.

There are several data creation transformations that are very useful and can dramatically improve the result of any data mining project. These include:

Ratio variables. Creating ratios is critical for all data mining projects, including clustering, segmentation, and prediction. Consider the following example: Customer 1 has contributed $100 profit to a business, distributed as $50 for product A, $30 for product B, and $20 for product C. This customer has been active for 10 months. Customer 2 has contributed $10 profit to the business, distributed as $5 for product A, $2 for product B, and $3 for product D. This customer has been active for one month.

An algorithm comparing the customers' product A spending would conclude that customer 1 is much more important to product A profitability than customer 2 because customer 1 contributed 10 times the profitability of customer 2. In this situation, an algorithm would have difficulty reaching the proper conclusions without the use of ratios. At first glance, customer 1 appears to be much more profitable. In reality, the two customers have very similar profitability contributions, but customer 1 has been around 10 times longer, accounting for the overall profitability difference. Because customer 2 has the same potential value over 10 months, he or she should be considered just as valuable as customer 1.
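The two-customer comparison above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not code from the study; the `Customer` record and the `monthly_profit` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Hypothetical record holding the figures from the example above."""
    name: str
    profit_a: float      # profit from product A, in dollars
    total_profit: float  # profit across all products, in dollars
    tenure_months: int   # months the customer has been active


def monthly_profit(amount: float, tenure_months: int) -> float:
    """Normalize a profit figure by tenure; return 0.0 when tenure is zero."""
    return amount / tenure_months if tenure_months > 0 else 0.0


customers = [
    Customer("customer 1", profit_a=50.0, total_profit=100.0, tenure_months=10),
    Customer("customer 2", profit_a=5.0, total_profit=10.0, tenure_months=1),
]

for c in customers:
    print(
        c.name,
        "| raw product A profit:", c.profit_a,
        "| product A profit per month:", monthly_profit(c.profit_a, c.tenure_months),
        "| total profit per month:", monthly_profit(c.total_profit, c.tenure_months),
    )
```

On the raw figures, customer 1 looks ten times as valuable for product A. Divided by tenure, both customers contribute $5 per month for product A and $10 per month overall, which is the similarity the ratio variable is meant to expose.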
For the AIR MILES project, we used the following code to calculate total variables to be used for normalization:

    update target set total_miles =
        cat1_miles_q1 + cat1_miles_q2 +
        cat2_miles_q1 + cat2_miles_q2 +
        cat3_miles_q1 + cat3_miles_q2 +
        cat4_miles_q1 + cat4_miles_q2 +
        ...
        cat14_miles_q1 + cat14_miles_q2;
    update target set cat1_miles_q2 = cat1_miles_q2/total_miles where total_miles > 0;

All transaction variables were normalized with the appropriate total transaction variable.

Time-derivative terms. Creating time-derivative terms is important in prediction because the variation of data over time is fundamental to that activity. It is also important in clustering and segmentation if you want to track changes in variables over time (for example, if your business wants to know which customer variables are growing or shrinking over time). For prediction, you should calculate at least the first and second time derivatives. To maximize the time content included in the model, all derivative terms should be calculated up to order n-1, where n is the number of time periods in the data set. The partial derivatives can be estimated using finite differences. The derivative terms are not required for time-series prediction algorithms (such as the recurrent feed-forward neural network), which capture the time delay effects in the data. For this study we created a time-series variable to capture the time difference in the transaction variables between quarters, using the following code:

    update target set cat1_miles_d = cat1_miles_q2 - cat1_miles_q1;

Discretization using quantiles. Discretization of numeric data using quantiles is a very good way to normalize data. Creating discrete ranges using quantile break points makes the data easy to interpret.
The quantile break points we used to discretize the data for our study were 10, 25, 50, 75, and 90. The values of the variable at these break points were determined, and the data was broken into six possible values. Following is a SQL code sample that illustrates how the data was discretized:

    update target set AMTM12_=6;
    update target set AMTM12_=5 where AMTM12 <= 1824;
    update target set AMTM12_=4 where AMTM12 <= 1151;
    update target set AMTM12_=3 where AMTM12 <= 514;
    update target set AMTM12_=2 where AMTM12 <= 39;
    update target set AMTM12_=1 where AMTM12 <= 1;
    update target set AMTM12_=0 where AMTM12 = 0;

The previous SQL code example achieves the following if-then-else block of logic:

    if (AMTM12 = 0) then AMTM12_ = 0;
    else if (AMTM12 <= 1) then AMTM12_ = 1;
    else if (AMTM12 <= 39) then AMTM12_ = 2;
    else if (AMTM12 <= 514) then AMTM12_ = 3;
    else if (AMTM12 <= 1151) then AMTM12_ = 4;
    else if (AMTM12 <= 1824) then AMTM12_ = 5;
    else AMTM12_ = 6;

The selection of the quantiles for the discretization is arbitrary. The values we chose have proven useful in our experience. The buckets can be interpreted as zero, very low, low, low side of average, high side of average, high, and very high. The quantile breaks were generated automatically. The resulting distributions were then profiled and manually adjusted to be unimodal (whereby the data distribution has only one peak) or at least monotonic (whereby the distribution has either a positive or a negative slope, but not both). In some cases the data was left bimodal if both peaks were significant, for example, if one peak was for zero values.

Refer to Figure 3 and Figure 4 for a pre- and postdiscretized view of the data in Intelligent Miner's statistics visualizer. The bar graphs in the visualizations represent the histograms for floating point and integer variables. The pie charts represent the distribution of values for categorical fields. A maximum of 99 buckets was chosen for presentation.

Discretization using ranges.
Sometimes it's necessary to manually prescribe the ranges for a particular discretization for comparison purposes. Existing data or external data -- such as government census, survey, or list-brokered data -- may contain variables that are collected in buckets. In order to compare your internal data to an external source, you must discretize the data using the same ranges. You can do this discretization using Intelligent Miner's data processing functions.

Mathematical transforms. Mathematical functions applied to transform the data are useful to standardize nonnormal distributions and are also useful when attempting to linearize a variable. Some mathematical functions applied include log transforms, range transforms, and polynomial transforms.

Log transformations. To normalize a variable with a highly skewed distribution, you'd typically use a log transformation. Log transforms also tend to reduce the effect of outliers. To prepare our data for clustering with the neural clustering algorithm, we standardized some of the continuous variables using a logarithmic transform. Following is a code example of how we completed this:

    if AMTM6 <= 0 then LAMTM6=0;
    else LAMTM6=LOG(AMTM6+1);

This transformation was done for variables with very few negative values. For variables with many negative values, we used the following transformation:

    if AMTM3DFF < 0 then LAMTM3DF=-LOG(-(AMTM3DFF-1));
    else if AMTM3DFF=0 then LAMTM3DF=0;
    else LAMTM3DF=LOG(AMTM3DFF+1);

This transformation results in a distribution with three modes: one normal mode below zero, one mode at zero, and one normal mode above zero. This transformation may not be valid for a prediction or classification model, but the neural clustering algorithm can distinguish between negative, positive, zero, and differing levels of positive and negative values better than with the original distribution.

Figure 5 shows the log transformed data.

Range transformations.
Range transforms are useful for removing the effect of outliers and constricting the range of a variable to between 0 and 1. The original variable is normalized using the range of possible values.

Polynomial transformations. Polynomial transformations are useful when linearizing the data if the data is continuously distributed. Looking at squares, cubes, and higher powers as well as square roots, cube roots, and so on of variables is part of the exploratory data analysis process. Any linearization of the data by manual intervention, where possible, will lead to better data mining results.

After selecting, preparing, and transforming the data, you're ready to run the data mining algorithms.

DATA MINING

The workflow diagram in Figure 6 illustrates the flow path that a typical clustering/segmentation project follows. The first steps in the clustering process involve selecting the data set and the algorithm you want to use. In our study, we were interested in using the demographic clustering algorithm. Because this algorithm works best when the data is discretized, we chose a data set in which all continuous variables were discretized. (The benefit of using the demographic clustering algorithm with discretized input data is that the results are easy to interpret.) The next step in the process is to choose the basic run parameters for the algorithm. The basic parameters available for demographic clustering include:

• Maximum number of clusters. This feature is unique to Intelligent Miner. You specify the maximum number of clusters allowed; the algorithm may come up with fewer. Most other clustering algorithms require that you specify an exact number of clusters.
• Maximum number of passes through the data. This parameter indicates the maximum number of times the algorithm will read the data. The higher this number and the lower the accuracy criterion, the longer the algorithm will run, and the more accurate the result will be. This parameter is a stopping criterion for the algorithm.
If the algorithm has not satisfied the accuracy criterion after the maximum number of passes, it will stop.
• Accuracy. This number is a stopping criterion for the algorithm. If the change in the Condorcet criterion between data passes is smaller than the accuracy (as a percentage), the algorithm will terminate. (The Condorcet criterion is a value between 0 and 1, where 1 indicates a perfect clustering -- that is, all clusters are homogeneous and entirely different from all other clusters.)
• Similarity threshold. This parameter defines the similarity threshold between two values in distance units. The default distance unit is the absolute number. If the similarity threshold is 0.5, then two values are considered equal if their absolute difference is less than or equal to 0.5.

For our first clustering run, we selected a maximum number of clusters larger than the number we desired at the end of the project. By selecting more, we allowed the algorithm to choose fewer clusters if necessary. If the algorithm had come back with the maximum, we would have known that there were likely more clusters. The number of clusters you choose should be driven by how many clusters your business can manage. In our experience, this number is smaller than 10 for most companies.
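The two stopping rules and the similarity threshold described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Intelligent Miner's implementation: the function names `values_similar` and `passes_needed` are hypothetical, the Condorcet values fed in are invented, and reading "accuracy (as a percentage)" as a change measured in percentage points is an assumption.

```python
def values_similar(a, b, threshold=0.5):
    """Treat two values as equal when their absolute difference is within the threshold."""
    return abs(a - b) <= threshold


def passes_needed(condorcet_by_pass, max_passes=5, accuracy_pct=0.1):
    """Return how many passes run before either stopping rule fires.

    condorcet_by_pass: hypothetical Condorcet values (0..1), one per data pass.
    Assumption: the accuracy setting is compared with the change in the
    criterion expressed in percentage points.
    """
    previous = None
    passes = 0
    # Rule 1: never read the data more than max_passes times.
    for value in condorcet_by_pass[:max_passes]:
        passes += 1
        # Rule 2: stop once the criterion changes by less than the accuracy setting.
        if previous is not None and abs(value - previous) * 100 < accuracy_pct:
            break
        previous = value
    return passes
```

With the settings used in the study (five passes, accuracy 0.1), a run whose criterion moves from 0.70 to 0.7005 between passes would stop early under this reading, while a run that keeps improving by large steps would stop only at the pass limit.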
For our study, we chose a maximum of nine clusters, a maximum of five passes through the data, and an accuracy of 0.1.

The input variables we selected for the first run included:

• Number of products the customer purchased over lifetime
• Number of products the customer purchased in the last 12 months
• Customer's revenue contribution over lifetime
• Customer tenure in months
• Ratio of revenue to tenure (Ratio 1)
• Ratio of number of products to tenure (Ratio 3)
• Region
• Recency
• Tenure (number of months since customer first enrolled in the program).

These were the shareholder value variables for the project. All other discrete and categorical variables and some interesting continuous variables were input as supplementary variables that we used to profile the clusters but not to define them. The ability to add supplementary variables at the outset of clustering is a very useful feature of Intelligent Miner that allows quick and easy interpretation of clusters using data other than the input variables.

We had the entire data set output with the cluster information appended to the end of each record. We wanted the entire data set so that we could directly compare the results of other clustering runs (using both the demographic clustering and neural clustering algorithms) by cross-tabulating the cluster IDs from the various schemes. Because Intelligent Miner has multiple algorithms, it lets you use the output of one algorithm as the input to another. In our experience, the algorithms are more powerful when used in combination than when applied alone.

Figure 7 shows the results of our first clustering run in Intelligent Miner's cluster visualizer, which is used by both demographic and neural clustering. Each of the horizontal strips represents a cluster (with its ID number on the right). The clusters are ordered from top to bottom in order of size, with the number to the left indicating the size of a cluster as a percentage of the universe.
Variables are ordered from left to right in order of importance to the cluster, based on a chi-square test between variable and cluster ID. This is the default metric. Other metrics include entropy, Condorcet criterion, and database order. The variables used to define clusters are without brackets, while the supplementary variables appear within brackets. Numeric (integer), discrete numeric (small integer), binary, and continuous variables have their frequency distribution or histogram shown as a bar graph. The red bars in the foreground indicate the distribution of the variable within the current cluster. The gray solid bars in the background indicate the distribution of the variable in the whole universe. The more the cluster distribution differs from the average, the more interesting or distinct the cluster.

Categorical variables are shown as pie charts. The inner pie represents the distribution of the categories for the current cluster, and the outer ring represents the distribution of the variable for the entire universe. Again, the more the distribution of the variable for the current cluster differs from the average distribution, the more interesting or distinct the cluster.

After running the cluster algorithm, the next step is to characterize the clusters qualitatively.

Following are characterizations of some of the interesting clusters from Figure 7: The Gold98 variable is a binary variable that indicates the best customers in the database. The criteria for best customers were created previously by the business using RFM analysis. Our clustering model seems to agree very well with this existing definition: Most of the clusters seem to have almost all Gold or no Gold customers. As a first pass, this is a very exciting result.
We've confirmed the current Gold segment with little effort.

To be confident in your data mining results, you should always be able to observe the current business knowledge in the results. Observing the current business knowledge provides confidence that the data selection and data preparation efforts have been valid. If you then observe results that were previously unknown, you can have confidence in them.

Our clustering results not only validate the existing concept of Gold customers, they extend the idea of the Gold customers by creating clusters within the Gold98 customer category. Perhaps this builds a case to create a "platinum" customer group!

Cluster 6 can be interpreted as almost all Gold98 customers, whose revenue, AMTM collected lifetime to date, AMTM collected in the last 12 months, revenue per month, and AMTM lifetime to date per month are all in the 50th to 75th percentile.

Cluster 3 can be interpreted as containing almost no Gold98 customers. Its customer revenue, AMTM collected lifetime to date, AMTM collected in the last 12 months, revenue per month, and AMTM lifetime to date per month are all in the 25th to 50th percentile.

Another interesting cluster is number 5. This cluster represents about 9 percent of the population. The customer revenue, AMTM lifetime to date, and AMTM lifetime to date per month are all in the 75th percentile and above, skewed to almost all greater than the 90th percentile. The variables Gold95, Gold96, and Gold97 represent the status of the customers in the calendar years 1995, 1996, and 1997. The fraction of customers who were Gold was increasing by year! This looks like a very profitable cluster. Figure 8 gives a detailed view of cluster 5.

Following the demographic clustering process, we conducted a neural clustering run for comparison purposes. Any clusters found by both methods are validated and can be applied confidently.
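Because each record carries the cluster ID from every run, checking whether two runs found the same clusters comes down to cross-tabulating the two ID columns. The following Python sketch shows one way to do that. The helper names `cross_tabulate` and `best_overlap` are hypothetical, and the eight cluster IDs are invented for illustration; they are not data from the study.

```python
from collections import Counter


def cross_tabulate(ids_a, ids_b):
    """Count customers for every (cluster in scheme A, cluster in scheme B) pair."""
    if len(ids_a) != len(ids_b):
        raise ValueError("both schemes must label the same records")
    return Counter(zip(ids_a, ids_b))


def best_overlap(table, cluster_a):
    """For one cluster of scheme A, return (best-matching B cluster, share of A's members in it)."""
    row = {b: n for (a, b), n in table.items() if a == cluster_a}
    total = sum(row.values())
    match = max(row, key=row.get)
    return match, row[match] / total


# Hypothetical cluster IDs for eight customers under two clustering runs.
demographic_ids = [5, 5, 5, 3, 3, 6, 6, 6]
neural_ids = [2, 2, 2, 0, 0, 1, 1, 0]
table = cross_tabulate(demographic_ids, neural_ids)
```

A share close to 1.0 for a cluster suggests that both methods isolated roughly the same group of customers; a row spread across several clusters of the other scheme suggests the two methods disagree on that segment.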
If you don't find similar clusters (this is rarely the case), analyze each method independently and choose the model that seems to be scientifically "better" or that you can implement more effectively. Discretization is artificial and may cause a difference between the neural clustering and demographic clustering results. This is one reason to compare the results of both clustering methods.

Figure 9 shows our neural clustering results. This result was created using the log transformed variables from Figure 5. All the discretized data was used in the model as supplementary variables. The results in Figure 9 illustrate one of the values of discretization: It is much easier to interpret the neural clusters using a combination of the discretized and continuous (log transformed) variables. If the difference in the resulting models is due to the discretization, the neural clustering result can be applied and the discretization used only to interpret the results.

PROFILING CLUSTERS

The next step in the clustering process is to profile the clusters by executing SQL queries. The purpose of profiling is to assess the potential business value of each cluster quantitatively by profiling the aggregate values of the shareholder value variables by cluster.

Table 1 provides an example of a profile of revenue, number of products purchased, and customer tenure. The leverage column is a ratio of revenue to customer. The table shows that cluster 5 is the most profitable cluster, representing about 35 percent of the revenue yet only 9 percent of the customers. The leverage ratio is thus the highest for this cluster. We can also see that as profitability increases, so does the average number of products purchased. The product index is the ratio of the average number of products purchased by the customers in the cluster to the average number of products purchased overall.
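The two derived columns can be computed directly from per-customer records. The sketch below uses Python rather than the SQL queries used in the study, and it rests on two assumptions: leverage is read as a cluster's share of revenue divided by its share of customers (consistent with the 35 percent versus 9 percent example), and the function name `profile_clusters` and the four sample records are hypothetical.

```python
from collections import defaultdict


def profile_clusters(records):
    """Profile clusters from (cluster_id, revenue, products) tuples, one per customer."""
    stats = defaultdict(lambda: {"customers": 0, "revenue": 0.0, "products": 0})
    for cluster_id, revenue, products in records:
        s = stats[cluster_id]
        s["customers"] += 1
        s["revenue"] += revenue
        s["products"] += products

    total_customers = sum(s["customers"] for s in stats.values())
    total_revenue = sum(s["revenue"] for s in stats.values())
    total_products = sum(s["products"] for s in stats.values())
    overall_avg_products = total_products / total_customers

    profile = {}
    for cluster_id, s in stats.items():
        customer_share = s["customers"] / total_customers
        revenue_share = s["revenue"] / total_revenue
        profile[cluster_id] = {
            "customer_pct": 100 * customer_share,
            "revenue_pct": 100 * revenue_share,
            # Assumed reading of leverage: revenue share over customer share.
            "leverage": revenue_share / customer_share,
            # Average products in the cluster relative to the overall average.
            "product_index": (s["products"] / s["customers"]) / overall_avg_products,
        }
    return profile


# Hypothetical customers: one high-value customer in cluster 5, three in cluster 8.
records = [(5, 700.0, 6), (8, 100.0, 1), (8, 100.0, 2), (8, 100.0, 3)]
profile = profile_clusters(records)
```

Under this reading, a leverage above 1 means a cluster contributes more than its proportional share of revenue, and a product index above 1 means its customers buy more products than the average customer.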
It is also interesting to note that customer profitability increases as tenure increases.

The profile of the clusters shows that there is a business opportunity in increasing the number of products purchased by customers.

From this simple result, it is possible to derive some high-level business strategies. Considering only the data contained in the table, the best customers are in clusters 2, 5, and 7. These customers have a higher revenue per person than the customers of other clusters, as indicated by the leverage column. Some possible strategies include:

• A retention strategy for best customers (those in clusters 2, 5, and 7)
• A cross-sell strategy for clusters 2, 6, and 0 by contrasting with clusters 5 and 7. Clusters 2, 6, and 0 have a product index close to those of clusters 5 and 7, which have the highest number of products purchased. Because the clusters are close in number of products purchased, it shouldn't be a big stretch to convert customers from clusters 2, 6, and 0 to clusters 5 and 7. By comparing which products are bought by the best customers to those purchased by those in clusters 2, 6, and 0, we can find products that are candidates for cross-selling.
• We can similarly cross-sell between clusters 3 and 4 and clusters 2, 6, and 0 because they are close in value.
• The strategy for cluster 1 would be to wait and see. It appears to be a group of new customers for which we have not yet collected sufficient data to determine what behaviors they may exhibit. We might want to inform cluster 1 of the AIR MILES program's products and services to make them profitable quickly.
• The strategy for cluster 8 may be to refrain from spending any significant marketing dollars on them. Cluster 8 appears to be the worst cluster, with a very low revenue percentage. These customers purchase very few products even though they have been with th[...]

NEXT STEPS

The results of our study drew several reactions from The Loyalty Group executives.
The excellent visualization of results allowed for meaningful and actionable analysis. The executives were able to see that the original segmentation methodology was validated, but that refinements to the original segmentation could prove valuable.

Based on the study results, The Loyalty Group decided to undertake several further data mining projects, including several predictive models for direct mail targeting, further work on segmentation using more detailed behavioral data, and opportunity identification using association algorithms within the segments discovered.

I owe credit to Dr. Hammou Messatfa from the IBM European Center of Applied Mathematics lab in Paris for the discretization scheme described in this article.

Gary Saarenvirta has worked in the business intelligence industry for the past eight years. He is currently principal consultant at Loyalty Consulting, providing data mining and data warehousing services to Global 2000 corporations. Prior to this he was a senior consultant at METEX Systems Inc., a consultancy specializing in object-oriented client/server application development. You can reach him at gary@airmiles.ca.