{"id":6247,"date":"2026-02-10T12:35:57","date_gmt":"2026-02-10T11:35:57","guid":{"rendered":"https:\/\/datascience.unifi.it\/?page_id=6247"},"modified":"2026-02-26T16:25:51","modified_gmt":"2026-02-26T15:25:51","slug":"tip-1","status":"publish","type":"page","link":"https:\/\/datascience.unifi.it\/index.php\/tip-1\/","title":{"rendered":"Tip 1 \u2014 Take it easy with preprocessing"},"content":{"rendered":"<style type=\"text\/css\">\nbody {\n  font-family: 'Raleway', sans-serif;\n  font-size: 16px;\n  line-height: 1.4;\n  color: #2b3242;        \/* higher contrast *\/\n  font-weight: 500;      \/* thicker paragraphs without changing font *\/\n}\n<\/style>\n<p><img decoding=\"async\" style=\"width: 100%; height: auto; display: block;\" src=\"https:\/\/datascience.unifi.it\/wp-content\/uploads\/2025\/11\/tips-banner.jpeg\" alt=\"Banner\" \/><br \/>\n<!--\n\n\n<h1 style=\"position: absolute; top: 30%; left: 67%; transform: translate(-50%, -50%); font-size: 50px; font-weight: 600; color: #981824; text-shadow: 3px 3px 8px rgba(0, 0, 0, 0.6); background: none; padding: 0;\">Tips &amp; Tricks<\/h1>\n\n\n--><\/p>\n<p style=\"text-align: right; font-size: 18px; font-style: italic; color: #48546e; margin-top: 5px; margin-right: 10px;\"><a href=\"https:\/\/datascience.unifi.it\/index.php\/think-tank-group\/\" target=\"_blank\" rel=\"noopener\">FDS\u2019<br \/>\nThink tank<\/a><\/p>\n<p>&nbsp;<\/p>\n<p><strong>Welcome to our Tips &amp; Tricks page!<\/strong><br \/>\nA space where we share quick insights, methods, and good habits with data scientists.<\/p>\n<p>Good data science isn\u2019t just about models or fast algorithms. It\u2019s about the choices we make before, during, and after the analysis, and about the questions we ask along the way. 
Asking the right question is often the hardest and most valuable part of the work.<\/p>\n<p>Good analysis comes from curiosity, attention, and flexibility.<\/p>\n<p>Enjoy your reading!<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/datascience.unifi.it\/index.php\/tips-page\/\"><u>Go back to the list<\/u><\/a><\/p>\n<h1>Tip 01 \u2014 Take it easy with preprocessing<\/h1>\n<p><img decoding=\"async\" style=\"display: block; margin: 1rem auto; box-shadow: 0 6px 20px rgba(0,0,0,.12), 0 2px 6px rgba(0,0,0,.08); border-radius: 10px; background: #fff;\" src=\"https:\/\/datascience.unifi.it\/wp-content\/uploads\/2025\/11\/gufetto.gif\" width=\"40%\" \/><\/p>\n<p>Sometimes data scientists go a little too far with preprocessing! A frequent mistake is to transform the response or the covariates just because their histograms don\u2019t look Gaussian.<\/p>\n<p>That\u2019s wrong, and it can even be misleading.<\/p>\n<p>The normality assumption in regression does not concern the marginal variables, but the distribution of the <strong>errors<\/strong> and, therefore, of <span class=\"math inline\">\\(Y\\mid X_1, \\ldots, X_p\\)<\/span>, where <span class=\"math inline\">\\(p\\)<\/span> is the number of predictors.<\/p>\n<p>What truly matters is whether the <strong>residuals<\/strong> behave approximately like Gaussians once the model is fitted. A non-Gaussian <span class=\"math inline\">\\(Y\\)<\/span> or <span class=\"math inline\">\\(X_j\\)<\/span> (<span class=\"math inline\">\\(j=1,\\ldots,p\\)<\/span>) tells us very little about it.<\/p>\n<p>Preventive transformations rarely help and often distort what matters. 
<strong>Fit the model first, inspect the residuals, and then decide if a transformation makes sense.<\/strong><\/p>\n<p>&nbsp;<\/p>\n<p>Let\u2019s take a look at an example.<\/p>\n<p>Consider the following Data Generating Process:<\/p>\n<p><span class=\"math display\">\\[<br \/>\n\\left\\{<br \/>\n\\begin{aligned}<br \/>\nX_{1i} &amp;\\sim \\mathrm{Ber}(\\pi),\\\\<br \/>\nX_{2i} &amp;\\sim \\chi_1^2,\\\\<br \/>\nY_i &amp;= \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} +<br \/>\n\\varepsilon_i,\\qquad \\varepsilon_i \\sim N\\!\\left(0,\\tfrac{1}{4}\\right),<br \/>\n\\; i=1,\\dots,n.<br \/>\n\\end{aligned}<br \/>\n\\right.<br \/>\n\\]<\/span><\/p>\n<p>with <span class=\"math inline\">\\(\\beta_0 = 0\\)<\/span>, <span class=\"math inline\">\\(\\beta_1 = -3\\)<\/span>, and <span class=\"math inline\">\\(\\beta_2 = 1.5\\)<\/span>.<\/p>\n<p>All the assumptions for a linear regression model are perfectly satisfied here, as the errors are normally distributed and independent of the predictors.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin: auto;\" src=\"https:\/\/datascience.unifi.it\/wp-content\/uploads\/2025\/11\/tip1-chunk-2-1.png\" width=\"70%\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Scary, isn\u2019t it? Don\u2019t worry: just fit a linear model and check the residuals! Here is what comes out:<\/p>\n<p>&nbsp;<\/p>\n<div style=\"box-shadow: 0 6px 20px rgba(0,0,0,.12), 0 2px 6px rgba(0,0,0,.08); border-radius: 10px; background: #fff; padding: 12px 16px; margin: 1rem auto; max-width: 80%;\">\n<pre style=\"margin: 0; white-space: pre-wrap; font-family: 'Fira Code',ui-monospace,SFMono-Regular,Menlo,Consolas,'Liberation Mono',monospace; font-size: 0.8em; color: #111; background-color: #bfcbea;\">Call:\r\nlm(formula = Y ~ X1 + X2)\r\nResiduals:\r\n     Min       1Q   Median       3Q      Max \r\n-1.33657 -0.31524  0.01153  0.30351  1.56493 \r\nCoefficients:\r\n            Estimate Std. Error t value Pr(&gt;|t|)    \r\n(Intercept)  0.03545    0.03938    0.90    0.369    \r\nX1          -3.05142    0.04826  -63.23   &lt;2e-16 ***\r\nX2           1.48158    0.01627   91.09   &lt;2e-16 ***\r\n---\r\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\r\nResidual standard error: 0.482 on 397 degrees of freedom\r\nMultiple R-squared:  0.9687,    Adjusted R-squared:  0.9685 \r\nF-statistic:  6140 on 2 and 397 DF,  p-value: &lt; 2.2e-16\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The summary looks fine, and look at the plot of the residuals! They are Gaussian, as expected from the Data Generating Process.<\/p>\n<p>&nbsp;<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin: auto;\" src=\"https:\/\/datascience.unifi.it\/wp-content\/uploads\/2025\/11\/tip1-chunk-4-1.png\" width=\"70%\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>A common mistake would have been to transform <span class=\"math inline\">\\(Y\\)<\/span> or <span class=\"math inline\">\\(X_2\\)<\/span> because, given the skewness of their marginal histogram\/density plots, they did not appear normally distributed.<\/p>\n<p>It would have been a mess!<\/p>\n<p>Let\u2019s test whether this is true. Imagine we had panicked after seeing the skewed histograms of <span class=\"math inline\">\\(Y\\)<\/span> and <span class=\"math inline\">\\(X_2\\)<\/span>, and decided to <em>fix<\/em> them with a log transformation. 
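<\/p>\n<p>The whole experiment can be sketched in R as follows. The random seed and the Bernoulli parameter <span class=\"math inline\">\\(\\pi = 0.5\\)<\/span> are our own assumptions here; the post does not fix them.<\/p>

```r
## Sketch of the experiment: simulate the DGP, fit the raw model,
## then apply the "panic" log transformations and refit.
## The seed and pi = 0.5 are assumptions, not fixed by the post.
set.seed(1)
n  <- 400                                      # 397 residual df + 3 coefficients
X1 <- rbinom(n, size = 1, prob = 0.5)          # X1 ~ Ber(pi)
X2 <- rchisq(n, df = 1)                        # X2 ~ chi-squared with 1 df
Y  <- -3 * X1 + 1.5 * X2 + rnorm(n, sd = 0.5)  # beta0 = 0, Var(eps) = 1/4

fit_raw <- lm(Y ~ X1 + X2)                     # model on the original scale

Y_log   <- log(Y - min(Y) + 1)                 # shift to positivity, then log
X2_log  <- log(X2)
fit_log <- lm(Y_log ~ X1 + X2_log)             # model on the transformed scale

summary(fit_raw)$r.squared                     # high: the model is correctly specified
summary(fit_log)$r.squared                     # lower: linearity has been broken
```

<p>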
That\u2019s a typical preprocessing overreaction. Here\u2019s the result, where <span class=\"math inline\">\\(Y^*_i = \\log(Y_i - \\min(Y) + 1)\\)<\/span> and <span class=\"math inline\">\\(X_{2i}^* = \\log(X_{2i})\\)<\/span>.<\/p>\n<p>&nbsp;<\/p>\n<div style=\"box-shadow: 0 6px 20px rgba(0,0,0,.12), 0 2px 6px rgba(0,0,0,.08); border-radius: 10px; background: #fff; padding: 12px 16px; margin: 1rem auto; max-width: 80%;\">\n<pre style=\"margin: 0; white-space: pre-wrap; font-family: 'Fira Code',ui-monospace,SFMono-Regular,Menlo,Consolas,'Liberation Mono',monospace; font-size: 0.8em; color: #111; background-color: #bfcbea;\">Call:\r\nlm(formula = Y_log ~ X1 + X2_log)\r\nResiduals:\r\n     Min       1Q   Median       3Q      Max \r\n-0.81084 -0.14946 -0.04069  0.12247  0.96413 \r\nCoefficients:\r\n             Estimate Std. Error t value Pr(&gt;|t|)    \r\n(Intercept)  2.044469   0.019919  102.64   &lt;2e-16 ***\r\nX1          -0.622621   0.026167  -23.79   &lt;2e-16 ***\r\nX2_log       0.142818   0.006458   22.11   &lt;2e-16 ***\r\n---\r\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\r\nResidual standard error: 0.2611 on 397 degrees of freedom\r\nMultiple R-squared:  0.7356,    Adjusted R-squared:  0.7343 \r\nF-statistic: 552.3 on 2 and 397 DF,  p-value: &lt; 2.2e-16\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The fit is actually worse: the R-squared decreased from 0.9687 to 0.7356!<\/p>\n<p>Let\u2019s see what happens to the residuals.<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin: 1rem auto; box-shadow: 0 6px 20px rgba(0,0,0,.12), 0 2px 6px rgba(0,0,0,.08); border-radius: 10px; background: #fff;\" src=\"https:\/\/datascience.unifi.it\/wp-content\/uploads\/2025\/11\/tip1-chunk-6-1.png\" width=\"70%\" \/><\/p>\n<p>They are so much worse, right? After the transformation, the residuals are no longer Gaussian and the model coefficients are not close to the <em>true<\/em> ones. 
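<\/p>\n<p>The residual check itself takes only a couple of lines. Here is a self-contained R sketch; the sample size, coefficients, and seed below are illustrative assumptions, not values from the post.<\/p>

```r
## Check the residuals, not the marginals: a heavily skewed predictor
## (and hence a skewed response) is fine as long as the errors are Gaussian.
## All numbers here are illustrative assumptions.
set.seed(2)
x <- rchisq(300, df = 1)                 # strongly skewed marginal
y <- 2 + 1.5 * x + rnorm(300, sd = 0.5)  # Gaussian errors
fit <- lm(y ~ x)

qqnorm(resid(fit)); qqline(resid(fit))   # points should hug the line
shapiro.test(resid(fit))                 # formal normality test on the residuals
```

<p>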
By transforming <span class=\"math inline\">\\(Y\\)<\/span> and <span class=\"math inline\">\\(X_2\\)<\/span> we\u2019ve broken the linear relationship present in the Data Generating Process.<\/p>\n<p>&nbsp;<\/p>\n<blockquote><p><strong>Takeaway:<\/strong><\/p>\n<p>If it\u2019s not broken, don\u2019t fix it. Check the residuals, not the marginals.<\/p><\/blockquote>\n<p>Curious about what to check after residuals? Stay tuned for future posts!<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>FDS\u2019 Think tank &nbsp; Welcome to our Tips &amp; Tricks page! A space where we share quick insights, methods, and good habits with data scientists. Good data science isn\u2019t just &#8230;<\/p>\n","protected":false},"author":46,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"class_list":["post-6247","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v2\/pages\/6247","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v2\/users\/46"}],"replies":[{"embeddable":true,"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v2\/comments?post=6247"}],"version-history":[{"count":69,"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v2\/pages\/6247\/revisions"}],"predecessor-version":[{"id":6562,"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v2\/pages\/6247\/revisions\/6562"}],"wp:attachment":[{"href":"https:\/\/datascience.unifi.it\/index.php\/wp-json\/wp\/v
2\/media?parent=6247"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}