/*************************************************
*      Perl-Compatible Regular Expressions       *
*************************************************/

/* PCRE is a library of functions to support regular expressions whose syntax
and semantics are as close as possible to those of the Perl 5 language.

                       Written by Philip Hazel
     Original API code Copyright (c) 1997-2012 University of Cambridge
          New API code Copyright (c) 2016-2023 University of Cambridge

-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of the University of Cambridge nor the names of its
      contributors may be used to endorse or promote products derived from
      this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
-----------------------------------------------------------------------------
*/

/* This module contains functions for scanning a compiled pattern and
collecting data (e.g. minimum matching length). */


#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#include "pcre2_internal.h"

/* The maximum remembered capturing brackets minimum. */

#define MAX_CACHE_BACKREF …

/* Set a bit in the starting code unit bit map. */

#define SET_BIT(c) …

/* Returns from set_start_bits() */

enum { … };


/*************************************************
*  Find the minimum subject length for a group   *
*************************************************/

/* Scan a parenthesized group and compute the minimum length of subject that
is needed to match it. This is a lower bound; it does not mean there is a
string of that length that matches. In UTF mode, the result is in characters
rather than code units. The field in a compiled pattern for storing the minimum
length is 16 bits long (on the grounds that anything longer than that is
pathological), so we give up when we reach that amount. This also means that
integer overflow for really crazy patterns cannot happen.

Backreference minimum lengths are cached to speed up multiple references. This
function is called only when the highest back reference in the pattern is less
than or equal to MAX_CACHE_BACKREF, which is one less than the size of the
caching vector. The zeroth element contains the number of the highest set
value.

Arguments:
  re              compiled pattern block
  code            pointer to start of group (the bracket)
  startcode       pointer to start of the whole pattern's code
  utf             UTF flag
  recurses        chain of recurse_check to catch mutual recursion
  countptr        pointer to call count (to catch over complexity)
  backref_cache   vector for caching back references.

This function is no longer called when the pattern contains (*ACCEPT); however,
the old code for returning -1 is retained, just in case.
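
As a rough illustration of the capped lower bound described above, here is a
minimal sketch over a toy pattern tree. It is not the code in this module:
the node kinds, the names min_length, sat_add and sat_mul, and the cap value
65535 (the largest value a 16-bit field can hold) are all assumptions made
for the example. Only // comments are used so that the sketch can sit inside
this comment.

```c
#include <assert.h>
#include <stddef.h>

#define MINLEN_CAP 65535   // assumed cap: largest value of a 16-bit field

// Hypothetical node kinds for a toy pattern tree (not PCRE2 opcodes).
typedef enum { LIT, SEQ, ALT, REP } kind;

typedef struct node {
  kind k;
  int len;                   // LIT: number of characters (non-negative)
  int min_rep;               // REP: minimum repeat count (non-negative)
  const struct node *a, *b;  // children; b is unused for REP
} node;

// Saturating helpers: no intermediate result ever exceeds the cap, so
// integer overflow cannot occur however large the pattern is.
static int sat_add(int x, int y)
{
  return (x > MINLEN_CAP - y) ? MINLEN_CAP : x + y;
}

static int sat_mul(int x, int n)
{
  if (x == 0 || n == 0) return 0;
  return (x > MINLEN_CAP / n) ? MINLEN_CAP : x * n;
}

// Lower bound on the subject length needed to match the tree rooted at n.
static int min_length(const node *n)
{
  assert(n != NULL);
  switch (n->k) {
    case LIT:
      return (n->len > MINLEN_CAP) ? MINLEN_CAP : n->len;
    case SEQ:   // both parts must match: lengths add
      return sat_add(min_length(n->a), min_length(n->b));
    case ALT: { // either branch may match: take the shorter
      int x = min_length(n->a), y = min_length(n->b);
      return (x < y) ? x : y;
    }
    case REP:   // at least min_rep copies of the child
      return sat_mul(min_length(n->a), n->min_rep);
  }
  return 0;
}
```

With an alternation of a 3-character and a 5-character literal, min_length()
gives 3; repeating that alternation at least 4 times gives 12; a repeat count
large enough to exceed the cap simply yields MINLEN_CAP.
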
Returns:   the minimum length
           -1 \C in UTF-8 mode
              or (*ACCEPT)
              or pattern too complicated
           -2 internal error (missing capturing bracket)
           -3 internal error (opcode not listed)
*/

static int
find_minlength(const pcre2_real_code *re, PCRE2_SPTR code,
  PCRE2_SPTR startcode, BOOL utf, recurse_check *recurses, int *countptr,
  int *backref_cache)
{ … }


/*************************************************
*     Set a bit and maybe its alternate case     *
*************************************************/

/* Given a character, set its first code unit's bit in the table, and also the
corresponding bit for the other version of a letter if we are caseless.

Arguments:
  re            points to the regex block
  p             points to the first code unit of the character
  caseless      TRUE if caseless
  utf           TRUE for UTF mode
  ucp           TRUE for UCP mode

Returns:        pointer after the character
*/

static PCRE2_SPTR
set_table_bit(pcre2_real_code *re, PCRE2_SPTR p, BOOL caseless, BOOL utf,
  BOOL ucp)
{ … }


/*************************************************
*     Set bits for a positive character type     *
*************************************************/

/* This function sets starting bits for a character type. In UTF-8 mode, we can
only do a direct setting for bytes less than 128, as otherwise there can be
confusion with bytes in the middle of UTF-8 characters. In a "traditional"
environment, the tables will only recognize ASCII characters anyway, but in at
least one Windows environment, bits for some higher bytes were set in the
tables. So we deal with that case by considering the UTF-8 encoding.

Arguments:
  re             the regex block
  cbit_type      the type of character wanted
  table_limit    32 for non-UTF-8; 16 for UTF-8

Returns:         nothing
*/

static void
set_type_bits(pcre2_real_code *re, int cbit_type, unsigned int table_limit)
{ … }


/*************************************************
*     Set bits for a negative character type     *
*************************************************/

/* This function sets starting bits for a negative character type such as \D.
In UTF-8 mode, we can only do a direct setting for bytes less than 128, as
otherwise there can be confusion with bytes in the middle of UTF-8 characters.
Unlike in the positive case, where we can set appropriate starting bits for
specific high-valued UTF-8 characters, in this case we have to set the bits for
all high-valued characters. The lowest is 0xc2, but we overkill by starting at
0xc0 (192) for simplicity.

Arguments:
  re             the regex block
  cbit_type      the type of character wanted
  table_limit    32 for non-UTF-8; 16 for UTF-8

Returns:         nothing
*/

static void
set_nottype_bits(pcre2_real_code *re, int cbit_type,
  unsigned int table_limit)
{ … }


/*************************************************
*      Create bitmap of starting code units      *
*************************************************/

/* This function scans a compiled unanchored expression recursively and
attempts to build a bitmap of the set of possible starting code units whose
values are less than 256. In 16-bit and 32-bit modes, values above 255 all
cause the 255 bit to be set. When calling set[_not]_type_bits() in UTF-8 (sic)
mode we pass a value of 16 rather than 32 as the final argument. (See comments
in those functions for the reason.)

The SSB_CONTINUE return is useful for parenthesized groups in patterns such as
(a*)b where the group provides some optional starting code units but scanning
must continue at the outer level to find at least one mandatory code unit. At
the outermost level, this function fails unless the result is SSB_DONE.

We restrict recursion (for nested groups) to 1000 to avoid stack overflow
issues.
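
The idea of collecting optional starting code units until a mandatory one is
found can be sketched on a flat list of single-unit items. This is a toy, not
the scan performed by this function: toy_item, toy_set_bit, toy_test_bit,
toy_start_bits and the TOY_ result codes are hypothetical names, and the bit
layout within the 32-byte map is an assumption. Only // comments are used so
that the sketch can sit inside this comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical result codes; they only mirror the idea of SSB_DONE and
// SSB_FAIL and are not PCRE2's values.
enum { TOY_FAIL, TOY_DONE };

// One pattern item: a single code unit that is mandatory or optional.
typedef struct { uint32_t unit; int optional; } toy_item;

// Set the bit for code unit c in a 256-bit map. Units above 255, which can
// occur with 16-bit or 32-bit code units, all share bit 255.
static void toy_set_bit(uint8_t map[32], uint32_t c)
{
  if (c > 255) c = 255;
  map[c / 8] |= (uint8_t)(1u << (c & 7));
}

static int toy_test_bit(const uint8_t map[32], uint32_t c)
{
  if (c > 255) c = 255;
  return (map[c / 8] >> (c & 7)) & 1;
}

// Scan items left to right, recording every unit that could start a match.
// Stop at the first mandatory item. If only optional items are seen, no
// usable bitmap exists, because a match could start with anything.
static int toy_start_bits(const toy_item *items, size_t n, uint8_t map[32])
{
  assert(items != NULL || n == 0);
  memset(map, 0, 32);
  for (size_t i = 0; i < n; i++) {
    toy_set_bit(map, items[i].unit);
    if (!items[i].optional) return TOY_DONE;
  }
  return TOY_FAIL;
}
```

For items modelling (a*)b followed by c, the scan sets the bits for 'a' and
'b' and returns TOY_DONE without looking at 'c'; for a* alone it returns
TOY_FAIL.
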
Arguments:
  re           points to the compiled regex block
  code         points to an expression
  utf          TRUE if in UTF mode
  ucp          TRUE if in UCP mode
  depthptr     pointer to recurse depth

Returns:       SSB_FAIL     => Failed to find any starting code units
               SSB_DONE     => Found mandatory starting code units
               SSB_CONTINUE => Found optional starting code units
               SSB_UNKNOWN  => Hit an unrecognized opcode
               SSB_TOODEEP  => Recursion is too deep
*/

static int
set_start_bits(pcre2_real_code *re, PCRE2_SPTR code, BOOL utf, BOOL ucp,
  int *depthptr)
{ … }


/*************************************************
*          Study a compiled expression           *
*************************************************/

/* This function is handed a compiled expression that it must study to produce
information that will speed up the matching.

Argument:
  re       points to the compiled expression

Returns:   0 normally; non-zero should never normally occur
           1 unknown opcode in set_start_bits
           2 missing capturing bracket
           3 unknown opcode in find_minlength
*/

int
PRIV(study)(pcre2_real_code *re)
{ … }

/* End of pcre2_study.c */